Have you ever released something and immediately after deployment, sneaked into the production server to check the logs for any exceptions your bulletproof code might have produced?
Have you stared in suspense at the logs and after finally realising they are clean of exceptions, breathed a huge sigh of relief and finally got a good night’s sleep?
Sadly, in my experience, this is an all-too-familiar scenario among developers. Yet it does not need to be the case! There are tons of tools out there nowadays that make it easy to set up a framework that will catch any exceptions that occur and notify you according to your schedule and preferences.
As a starting point, let’s focus on the following motto: “No production error shall go unnoticed!” Picture almighty Gandalf waving his staff at all the little exceptions flying by, yelling “You shall not pass!” at a big, greasy error-boss that could potentially mess up all your data if not stopped right away.
Of course, your primary goal should be catching all potential errors before the release. But sometimes, even 100% test coverage does not give you the immortality that you seek. All too often there is a nasty corner case or a 3rd-party integration that behaves erratically.
Production errors do occur. The question is, what are you going to do with them?
For our current project, we’re using Splunk as a log management and analysis tool, but there are many other applications that will do the job just fine. You just need the ability to create alerts based on text phrases, regular expressions, or any other situations that can be discovered by reading the logs. We usually start by creating an alert that goes off with every error-level log record that occurs.
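To make the "alert on every error-level record" idea concrete, here is a minimal in-process sketch using Python's standard logging module. It is not how Splunk does it (Splunk works on the collected logs), and the handler name, the "shop" logger and the `alerts` list are all made up for illustration; the list simply stands in for whatever channel you notify.

```python
import logging


class AlertOnErrorHandler(logging.Handler):
    """Forwards every ERROR-or-higher record to a notify callback."""

    def __init__(self, notify):
        # The handler level is the whole rule: nothing below ERROR gets through.
        super().__init__(level=logging.ERROR)
        self.notify = notify

    def emit(self, record):
        self.notify(self.format(record))


alerts = []  # hypothetical stand-in for a chat channel or paging tool
logger = logging.getLogger("shop")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example isolated from root handlers
logger.addHandler(AlertOnErrorHandler(alerts.append))

logger.info("order 17 created")              # ignored by the alert handler
logger.error("payment failed for order 17")  # ends up in `alerts`
```

The point of the sketch is the direction of the rule: everything at error level alerts by default, and anything you want to silence has to be excluded explicitly.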
Yes, there could be cases where we don’t want an alert for every exception imaginable, but we’ll dig into that a bit later. The main thing here is to start with all of them and, if needed, filter some out. This is different from the norm, where alerts are created only for specific, anticipated situations and anything unexpected fails silently.
So, you have a specific scenario where there is an error in the logs, but it is expected and nothing needs to be done about it. My first question would be, why is it an error in the first place? Maybe it was a validation exception of some sort. Of course, we expect the user to produce invalid input from time to time, so it might be more reasonable to log such exceptions at warning level.
The other solution would be to set up exclusions for an alert. Let’s say the exception comes from some 3rd-party integration being down, but we know that this can happen and have a mechanism in place to try again later. In this case, we definitely do not want to be woken up in the middle of the night because the first try failed, only for it to succeed the next time. An exception in the logs is fine here; a 3rd party being down is, in essence, an exceptional situation. A good way to exclude such situations is to catch the failure in your code and throw a more specific exception like “ThirdPartyRequestFailedException” or “RetryableRequestException”. You can then exclude this specific exception from your alert, rather than some framework’s general HttpException or similar that can come from hundreds of other places.
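As a rough illustration of logging expected failures below error level, here is a small Python sketch. The `ValidationError` class, the "signup" logger and both functions are hypothetical names invented for this example.

```python
import logging


class ValidationError(Exception):
    """Hypothetical exception for input the user got wrong."""


logger = logging.getLogger("signup")


def parse_age(raw):
    if not raw.isdigit():
        raise ValidationError(f"age must be a number, got {raw!r}")
    return int(raw)


def handle_signup(raw_age):
    try:
        return parse_age(raw_age)
    except ValidationError as exc:
        # Expected user mistake: log it as a warning so that an alert
        # on error-level records stays quiet.
        logger.warning("invalid signup input: %s", exc)
        return None
```

Anything that is not a `ValidationError` still propagates, so genuinely unexpected failures keep their chance to be logged as errors further up.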
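Here is one way the wrapping could look, sketched in Python. `RetryableRequestException` is the name from the text; `fetch_rates`, `send_request`, `EXCLUDED` and `should_alert` are assumptions for the sake of the example, and in practice the exclusion itself would live in your log tool's alert query rather than in application code.

```python
class RetryableRequestException(Exception):
    """Raised only by the 3rd-party wrapper below, so it is safe to exclude."""


def fetch_rates(send_request):
    # send_request is a hypothetical callable that performs the actual call.
    try:
        return send_request()
    except ConnectionError as exc:
        # Re-raise as a narrow type; `from exc` keeps the original cause
        # in the traceback for whoever reads the logs later.
        raise RetryableRequestException("rates provider unreachable") from exc


# Names excluded from the catch-all alert (illustrative only).
EXCLUDED = {"RetryableRequestException"}


def should_alert(exception_name):
    return exception_name not in EXCLUDED
```

Because only the wrapper raises the specific type, a plain `ConnectionError` thrown from anywhere else in the application still triggers the alert.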
You should be very careful when excluding some errors from an alert, as you have to be sure you won’t miss a situation where the exception was thrown from a different place than you were expecting. By excluding errors, you might also exclude situations where your release actually messed something up. Let’s take a look at the previous example — your code throws an exception when the 3rd party is down… but it’s okay, the code will retry and we don’t want to be notified of every single failure. But what if this time it is not the 3rd party’s fault, and we messed something up with the integration ourselves?
We have found a good balance between excluding errors and still being able to catch our own mistakes. One failed connection to a 3rd party is OK, 4–5 might be all right, but when this error happens, say, 100 times per minute, it’s probably a sign that something isn’t right. Or perhaps the threshold is 1000 times in 60 minutes; it depends on the situation. The important thing is that whenever we exclude an error from the alert, we try to create another alert that goes off when it is thrown “more times than reasonable”.
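The "more times than reasonable" rule boils down to counting occurrences in a sliding time window. A log tool would normally do this for you; the Python sketch below only shows the logic, and the `RateAlert` class and its parameters are invented for this illustration.

```python
from collections import deque


class RateAlert:
    """Fires when an excluded error occurs too often within a time window."""

    def __init__(self, max_events, window_seconds):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def record(self, now):
        """Register one occurrence at time `now` (in seconds).

        Returns True when the count inside the window exceeds max_events.
        """
        self.timestamps.append(now)
        # Drop occurrences that have fallen out of the window.
        while self.timestamps[0] <= now - self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) > self.max_events
```

With `RateAlert(max_events=100, window_seconds=60)`, a handful of failed connections stays silent, while the 101st failure within the same minute returns True. The right numbers depend entirely on your situation.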
It’s good practice to separate one-time expected failures from bigger problems.
Your log-management tool can usually notify you in several ways. It can send an email or an instant message to your chat, create a new ticket, call you or send an SMS. I’d suggest avoiding plain emails, as they are usually not processed quickly enough and it’s hard to know if anyone is already handling the situation. For my current project, we’re using Slack as an IM provider and all alerts from Splunk end up in a dedicated Slack channel. From there, you have a choice. For us it works on a voluntary basis: any developer who sees an alert picks it up and marks it with a specific icon so that others know it is being handled or fixed. Alternatively, you can have a dedicated person each week who takes responsibility for processing them.
The main thing is, they’re all there, staring you right in the face, asking to be dealt with.
Fixing problems that present themselves on a Slack channel is neat, but sometimes bad things happen outside office hours that require immediate attention. For prioritization, we use a great tool called Opsgenie, where we have an on-call rotation configured. So, for applications that are more critical, Splunk sends alerts to Opsgenie, which decides each alert’s priority based on our configuration. Opsgenie can call your phone immediately, or it can even escalate the situation and notify the entire team. You can configure it to deal with lower priorities too. For example, it can be set to make a call during office hours, send an SMS in the evening and stay silent at night, waiting until morning to notify you.
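To show what such a routing policy amounts to, here is a tiny Python sketch. It is not Opsgenie's API or configuration format; the function name, the priority labels and the hour boundaries (office hours 9:00 to 18:00, evening until 22:00) are all assumptions picked for the example.

```python
from datetime import time


def choose_notification(priority, at):
    """Return 'call', 'sms' or 'hold' for an alert raised at local time `at`."""
    if priority == "critical":
        return "call"  # critical alerts always ring the phone
    if time(9, 0) <= at < time(18, 0):
        return "call"  # office hours (assumed boundaries)
    if time(18, 0) <= at < time(22, 0):
        return "sms"   # evening (assumed boundaries)
    return "hold"      # night: keep quiet and deliver in the morning
```

For instance, `choose_notification("low", time(19, 0))` returns "sms", while a critical alert at three in the morning still results in a call.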
I think the benefits of having an alert for every production (and why not staging?) error are quite obvious. We’re fans of continuous deployment and #superagile, so getting immediate feedback when something is broken is essential for us. We can also do a lot to help out end-users: imagine you’ve run into an error in the application you were using. Your IM pops up and a developer lets you know that they’re already on it, instead of you having to report it to the service desk and possibly wait ages for a fix. One important reason to react quickly is that an error can create a lot of damaged data, which is quite hard to fix, and it only gets harder as days and weeks go by.
Let’s wrap it up: you should not have any errors in production that no one is notified about and that are not taken care of right away. Start by setting up the necessary tools for analyzing the logs and creating alerts based on them. You can exclude some stuff, but be sure the rules for that are as specific as they can be, and make sure you still catch the excluded errors if they pile up or behave in some other unexpected way. If you have a legacy system on your hands, you can start by creating alerts for all errors and fixing them gradually before you begin alerting the whole team. As a team, agree on priorities and how you will handle the errors, making sure nothing escapes.
In the end, this job will become quite easy, as you won’t have more than x alerts per week, where x is quite a small number, if not zero!