Can monitoring replace testing?


By Mikk Soone, Oct 8, 2020


How do you catch unforeseen errors in your system as quickly as possible, for example when your API stops working? Testing reduces the number of issues you will encounter, but the most complex failures are the ones you cannot test for. Be prepared, and avoid the illusion that more automated tests mean better quality - sometimes resources are better invested in a different strategy: improving your monitoring setup, which is effectively tooling for testing in production. This article gives some tips on how to combine monitoring and testing.

Use balanced testing

When delivering high-quality software, traditional testing is not enough. You need balanced testing, where some edge cases are monitored instead of tested. Why is writing tests alone not enough?

  • 100% code coverage is time-consuming and yields little real benefit
  • 100% of edge cases are impossible to anticipate
  • 100% issue-free software does not exist

You can write unit tests for complex logic, integration tests for your API, perhaps even some UI tests, and occasional manual testing is fine as well. However, software systems are complex, and in some cases resources are better spent on detecting errors than on testing for them up front.

Example: higher customer satisfaction

Consider a case where an external partner suddenly changes an API you have integrated with, and your requests start to fail. If they did not notify you ahead of time (which happens), you cannot really test for that. Or suppose a user has a faulty social security number that does not match the standard - your software might handle this case fine, but if it doesn't, you should at least detect the problem. You cannot avoid every problem, but when something does happen you can let the user know, for example by sending an email saying that you are aware of the issue and that it will be fixed in 30 minutes. Users are impressed by this kind of personal communication because it is not common.
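The idea above can be sketched in a few lines: wrap the external call so that a failure is detected, logged, and turned into a user-facing notification instead of failing silently. This is a minimal illustration, not a production pattern; `notify_users` is a hypothetical hook standing in for a real email or status-page integration.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("partner-api")

def notify_users(message):
    # Hypothetical hook: a real system might send an email or update
    # a status page; here it just records the message.
    log.warning("user notification: %s", message)
    return message

def call_partner_api(fetch):
    """Wrap an external call so unexpected failures are detected and
    reported to the user rather than swallowed silently."""
    try:
        return fetch()
    except Exception as exc:  # zero-exception policy: nothing fails silently
        log.error("partner API call failed: %s", exc)
        notify_users("We are aware of a problem and are working on a fix.")
        return None

# Simulate the partner changing their API so requests start to fail.
def broken_fetch():
    raise RuntimeError("410 Gone: endpoint removed")

result = call_partner_api(broken_fetch)
print(result)  # None: the failure was caught and reported
```

In a real system the notification side would be rate-limited and deduplicated, but the shape is the same: catch, log, inform.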

Four steps you should prepare

Test what is reasonable to test, since the unthinkable will happen anyway - and when it does, you want to catch it before the user reports come in. How to be prepared:

  1. Apply a zero-exception policy (no more silent failures) and detect all application problems. Alert yourself immediately and find the root cause. Most logging stacks can send notifications based on a search string such as "exception"; Sentry has built a service around this - try it. Catching intermittent issues early saves you from bigger downtime later, so investigate each and every one of them.
  2. Monitor performance in the different layers of your system, with emphasis on the bottlenecks - database monitoring can be difficult, but it is essential.
  3. Add end-to-end tracing and check for anomalies.
  4. Use synthetic API/interface monitoring to detect happy-flow problems. These checks test whether users can log in, view pages, and click the main buttons. If they fail, you know there is a critical issue with the system.
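Step 4 can be illustrated with a tiny synthetic-check runner: each happy flow is a named check, and any failure (or unexpected exception) is collected for alerting. The check names and lambdas below are hypothetical stand-ins for real login/click flows driven by an HTTP client or browser automation.

```python
def run_synthetic_checks(checks):
    """Run each named happy-flow check and collect the failures.
    `checks` maps a check name to a zero-argument callable that
    returns True when the flow works."""
    failures = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A crashing check counts as a failing flow too.
            ok = False
        if not ok:
            failures.append(name)
    return failures

# Hypothetical checks standing in for real synthetic flows.
checks = {
    "login": lambda: True,
    "view_dashboard": lambda: True,
    "main_button": lambda: False,  # simulated critical failure
}

failing = run_synthetic_checks(checks)
if failing:
    print("ALERT: critical flows failing:", failing)
```

A real setup would run this on a schedule from outside your infrastructure and feed the failures into your alerting tool, but the detection logic is this simple at its core.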

Constantly improve your monitoring

Make sure you have every possible means to debug different types of issues. For every issue that is tricky to debug, analyze the hacks you needed to find the root cause and think about what could be improved for next time. If you found yourself wishing for a specific type of monitoring graph, now is the time to implement it.

Implement a release process that lets you fix production problems quickly and with confidence. Use feature flags to turn new functionality on and off on demand. If your mean time to recovery is fast enough, the service desk will not get hammered and customer love continues to grow.

In conclusion, testing reduces the number of issues you will encounter, but the most complex failures are the ones you cannot test for. Be prepared, and avoid the illusion that more automated tests mean better quality - sometimes it pays to invest in a different strategy: improving your monitoring setup, which is effectively tooling for testing in production.

A quick list of the monitoring and testing software we use in our Concise team:

  • For central logging - Splunk, ELK, Humio
  • For distributed tracing (OpenTelemetry-compatible) - Jaeger, Zipkin
  • For cloud monitoring - Datadog, Prometheus, Grafana
  • For application errors - Sentry, Bugsnag
  • For alerting - Opsgenie, Alertmanager
  • For synthetic API monitoring - Datadog, Apica, Postman
  • For frontend flow monitoring - Google Analytics, Mixpanel, Firebase
  • For app flow monitoring - Firebase