Our initial idea to migrate away from Splunk was driven by the fact that automating the deployment of the forwarding agent is cumbersome.
You have to go to their web interface, download the installer (a 200 MB package) and store it in your artifactory to enable automatic installation when provisioning new infrastructure.
This has to be done every single time a new version is released, which is cumbersome and time-consuming and directly contradicts the ideas of IaC (Infrastructure as Code).
There had to be a better way.
When we started to look for alternatives, we agreed on a few basic requirements that were non-negotiable.
First, it had to be a SaaS (Software as a Service) offering, similar to Splunk - we didn't want the added burden of having to provide tamper-proofing evidence for the PCI-DSS audit, or of having another system to keep online, monitored and secured.
Secondly, it had to have good support for regular-expression-based searches and for extracting new custom data fields from plaintext logs - matching Splunk’s capabilities in this regard was going to be a tall order, and even looking for a close second was going to narrow the list of suitable candidates to very few items.
Thirdly, the cost - if you're looking for something that doesn't quite match Splunk's feature set, you should also be able to pay less for it.
These three simple requirements yielded a short list of candidates indeed - one item - Humio.
Most of the candidates that we considered were some modification of the classic ELK stack (Elasticsearch, Logstash, Kibana); Humio was the exception, and relying on Kibana became the downfall of the others.
Our tests showed that Kibana's feature set (read: second requirement) was far below Splunk's capabilities and Humio's abilities fell somewhere between the two.
Saying that Humio is a close second to Splunk would be a stretch, but in the test cases that we used, it outperformed all the other offerings that were using Kibana.
Cost-wise, I won't get into specifics and your use-case may yield different numbers, but for us, with the same usage, Humio would cost about 40% of what we were paying Splunk.
The ace up Humio's sleeve was that (similar to other ELK based offerings) you could configure the OS to have an additional software repository for the forwarding agent installation and set up your IaC to drop in a configuration file for it - configure once, update/upgrade forever, with no manual intervention required.
Having a clear winner and the prospect of cutting 60% off the monthly bill for logging was an easy sell to the upper management - so we migrated.
As previously stated, installing the forwarders is a breeze - just add the Elasticsearch package repository, install the relevant beats and add the configuration file(s) to forward logs to Humio's ingestion service.
It took us less than a day to configure all hosts to forward their systemd-journald logs to Humio via journalbeat. System logs sorted, we moved on to application logs - this is where we ran into some trouble. Most of the logging implementations of the applications pre-dated systemd-journald and hence logged directly into their own log-files.
Oddly enough, this was also true for the new applications deployed to Kubernetes - logging to systemd-journald has since become available for pods, but we don't use it for a few other reasons that I won't get into.
On the surface, the fix is simple - install filebeat and configure it to parse the event timestamp from the log-line and forward it to Humio.
In reality, it's not quite that simple - it turns out that each application has a slightly different logging format, and extracting additional data fields from it is either impossible with filebeat's configuration language or will invalidate the claim of not tampering with the logs.
The solution - extract additional fields after Humio has received the logs. Once again - simple in theory, but not in practice.
For simple applications, it's not too painful to create a regular expression and have Humio's parser extract all the interesting data fields; for more complex applications (that sometimes entail multiple sub-applications), the regular expression gets ugly, fast.
We had greatly under-appreciated the things that Splunk does behind the scenes - it had automagically done a lot of the field extraction for which we now had to write regular expressions before our logs made sense.
Humio's web interface also has a lot of room for improvement. The ways in which our end-users work don’t quite match what Humio intended - this makes some workflows needlessly cumbersome and in some cases causes the web interface to become a resource hog. At this point in time, we’re not certain whether this is something that will be addressed in an upcoming release or whether we’ll need to look for a new tool.
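To give a feel for the kind of expression involved, here is a minimal sketch of regex-based field extraction using named capture groups. Python's re module stands in for Humio's parser here, and the log line, the field names and the extract_fields helper are all hypothetical - our real application formats are messier and are not shown in this post.

```python
import re

# Hypothetical log line; not taken from any real application.
LINE = ("2021-03-04T12:34:56Z INFO payment-service "
        "request_id=ab12 status=200 duration_ms=87")

# Fixed prefix: timestamp, level and service name, then the free-form rest.
PREFIX = re.compile(
    r"(?P<timestamp>\S+)\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<service>[\w-]+)\s+"
    r"(?P<rest>.*)"
)

# key=value pairs in the remainder of the line.
KEY_VALUE = re.compile(r"(?P<key>\w+)=(?P<value>\S+)")


def extract_fields(line):
    """Return a dict of extracted fields, or None if the prefix does not match."""
    match = PREFIX.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    rest = fields.pop("rest")
    for pair in KEY_VALUE.finditer(rest):
        fields[pair.group("key")] = pair.group("value")
    return fields


fields = extract_fields(LINE)
```

Even this toy format needs two expressions; every sub-application with its own prefix or quoting rules adds another branch, which is where things get ugly.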
Mistakes were made and in hindsight we could've done things better. My main regret is that we made two rookie mistakes:
Firstly, prior to the migration, we didn't perform an EST (Enterprise Scream Test - in an enterprise environment, people don't give you valid feedback if you just ask for it; once you unplug something, people will come screaming). Before switching, we merely asked our developers to check out Humio and its usability - only after the migration, when people were forced to use Humio, did most of the issues with UX come to light. For our next migration, we'll disable access to the current system a lot earlier and force the end-users to actually try out the new solution, before committing to it fully.
Secondly, we looked at the monthly bill and not the TCO (Total Cost of Ownership). The monthly bill is only a part of the expense that a system brings to your organisation - yes, we reduced the monthly bill and eliminated my team's frustrations with updates, but at the same time the work-hours spent by developers searching for logs have skyrocketed. Taking into account that each developer spends multiple hours every week looking at logs and that developers outnumber my team 10-to-1, the cost-benefit ratio of my team spending 0 hours maintaining the log forwarding setup (as opposed to the previous ~1h/month) is skewed towards Splunk being the better proposition. This comparison doesn't even take into account the cost of running two systems in parallel for the migration period, on-boarding, training and other similar expenses that are part of each move to a new system.
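The TCO argument can be sketched as a back-of-the-envelope calculation. Only two inputs come from this post: the new bill being about 40% of the old one, and the ~1h/month of maintenance my team no longer spends. Everything else (the old bill, the developer headcount, the extra search time per developer and the hourly rate) is a made-up placeholder, so treat the result as an illustration of the shape of the problem, not as our actual numbers.

```python
def monthly_tco_change(old_bill, new_bill_ratio, ops_hours_saved,
                       dev_count, extra_dev_hours_per_week, hourly_rate,
                       weeks_per_month=52 / 12):
    """Net monthly change in cost after the migration.

    Positive means the migration saves money, negative means it costs money.
    """
    bill_saving = old_bill * (1 - new_bill_ratio)
    ops_saving = ops_hours_saved * hourly_rate
    dev_cost = (dev_count * extra_dev_hours_per_week
                * weeks_per_month * hourly_rate)
    return bill_saving + ops_saving - dev_cost


net_change = monthly_tco_change(
    old_bill=10_000,             # hypothetical monthly Splunk bill
    new_bill_ratio=0.4,          # from the post: about 40% of the old bill
    ops_hours_saved=1,           # from the post: ~1h/month of maintenance
    dev_count=50,                # hypothetical: 10x a team of five
    extra_dev_hours_per_week=1,  # hypothetical extra search time per developer
    hourly_rate=60,              # hypothetical cost of one work-hour
)
```

With these placeholder inputs the net change comes out negative: one extra hour per developer per week is already enough to outweigh both the smaller bill and the maintenance time saved.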
Looking back, I can't say that moving to Humio was a good decision, nor can I say it was a bad one.
We learned and evolved quite a bit with this journey - a standardised logging format is starting to get implemented, we now know to include our developers (read: end-users) early on in the process of validating new tools and we have a better understanding of what we need from a central logging solution - it looks like neither Humio nor Splunk is a perfect match.
All in all, if you perform a large systematic change and make no mistakes in the process of getting there, have you made any progress towards greatness at all or just some busywork?