Modernizing the Data Movement

juuni, 26, 2023 • 5 minutit lugemist

Erkki Pärna

Engineering Manager for Data teams

One of the prominent tasks of Data Engineers is to connect new sources and build new data pipelines between a source and data warehouse. It is also about maintaining resources and updating configurations. In this blog we discuss some benefits that we have experienced when moving over from custom and open source ETL (extract-load-transform) solutions to managed alternatives.

Presented in Data Innovation Summit 11th of May 2023

There have been numerous articles written about how the modern data platform should look and what could be the adequate choices, to build a scalable, reliable and governed platform. Various choices can be made according to the size and data intensity level of your organization.
At one of our major projects related to EasyPark Group, we have preemptively followed the path that has become mainstream today. Our central data warehouse is in Snowflake Data Cloud and transformations and modeling are done in the dbt layer. The platform is governed via Terraform scripts and orchestration is handled by Apache Airflow. Erkki Pärna at Data Innovation Summit

For quite a while we have been using singer.io based open source tool PipelineWise or custom python scripts for ETL processes. On one hand, it has been cost effective as we don’t have to spend any money on the essential tools, but on the other hand, these solutions are difficult to scale and require active maintenance. In 2021 Easypark acquired ParkNow Group which meant that the company entered the fast track migration phase. The Data Platform team felt the pressure of stakeholders to deliver faster and more reliably the data from different sources. Increased data volumes in our most critical databases caused problems in our data pipelines as they started to break more often, needing lengthy full historic syncs taking up to 11 - 12 hours. The pipelines were also prone to errors when schema or structure change wasn’t communicated by the product team, causing the sync break and stakeholders to face downtime in analytics. We witnessed a growing amount of maintenance overhead because some specific knowledge was needed on how to manage Kubernetes resources and it was also time consuming to identify errors. Onboarding new people to the data platform needed DevOps skills and was a lengthy procedure.

Erkki Pärna's presentation

Rather than just increasing the headcount and adding more people to look after our ETL process, the team started looking for alternative solutions. We had 8 different managed ETL providers under investigation and testing phases. Fivetran stood out for us because it offered multiple methods for ETL that weren’t based on singer.io but seemed more effective. We had tested Fivetran a year ago when we tried to connect our Mixpanel data to Snowflake. Back then we were unsuccessful because Fivetran wasn’t able to deal with historic data sync that contained approximately 60 billion events. However, we decided to give it another try because our new colleagues from the US team were very positive about the way the tool works for them. Of course we also had some concerns that surfaced during that process. For example how Fivetran can effectively support us in two to three different realms (EU, US, UK, Australia) while the main account is in the EU? How secure is it to pass data through the third party service provider? How much load would it put on our operational databases during the replication process? And of course the main question is how much it's going to cost and how we can justify this expense?

All the concerns we had were alleviated. Fivetran promptly supported us more or less around the clock, even though the main account is in the EU. All the connections to the sources and destinations are SSL (Secure Socket Layer) encrypted by default and they also have transparent customer data retention policy. Different sources have different replication methods available and the ones we have been using on our Aurora Postgres, MS SQL Server and MySQL databases haven’t put any significant load on the operational databases that could affect their performance.

There have been several benefits we have experienced over the last six months by moving over to the managed ETL tool.

Data insight

Quicker integrations

We have over 80 data sources connected using 15 different connectors. We can handle all the connections globally from the one UI by any team member and onboarding to Fivetran is quick because documentation is well prepared.

Satisfied stakeholders

90% of the stakeholders’ use cases can be served with reliable, less than 5 min, near real time sync frequency. In our organization there aren’t many analytical use cases where we need to manage real time data streams via a data warehouse yet.

Savings on Snowflake compute

Our comparisons are showing that we are using 3x less compute with a managed ETL solution compared to the open source solution as we have now more ways to replicate the data. Moving over from full table sync to CDC (change data capture), CT (change tracking) or incremental method is giving us significant savings, before we could rely on these methods.

Maintenance

Another major change we have seen is that there is no overhead for keeping the lights on maintenance. Alert system is notifying us about the errors that occur and most of the time the system is smart enough to recover itself from these errors. We don’t need to have any special developer knowledge for maintaining the tool itself so onboarding new people has become a much easier process than it was before.

Starting to use more advanced managed ETL solutions such as Fivetran has given the team an opportunity to be more confident on the new data integrations. We sleep better because we know the pipelines are resilient to most of the problems and we are alerted timely if something goes wrong. More importantly we can feel supported and there is a prompt action taken on any of our requests around the world. Our stakeholders are happier that new integrations are not set up within weeks or months anymore but within days and there has been a significant decrease of downtime. In general there has been a mindset shift in the whole organization where the objective is not to grow the teams while business is growing but use the tooling and automation. We have specially seen the benefit of it in cloud engineering and infrastructure work. If we are able to reduce time spent solving ETL issues the time can be used creating value in some other more lucrative areas such as developing ML platforms for the Data Science team. Managed services cost money, but in our experience the benefits that the organization gains are far more important and valuable. For continued growth and ease of scale this investment is a wise choice.

Modernizing the Data Movement

Presented in Data Innovation Summit 11th of May 2023

Quicker integrations

Satisfied stakeholders

Savings on Snowflake compute

Maintenance

eelnevad postitused

Embracing Sustainable Approach in Software Development

autor Priit Pihus, Hannu Kikkas

Cooperation of operational database owners and analytical data teams in modern data platform

autor Terje Russka, Erkki Pärna

My developer wants to restructure it

autor Tiit Kuuskmäe