The SeatGeek Data Team is proud to present…
We couldn’t be more excited about our newest open source release. Today marks the coming of age of a tool that we’ve used and appreciated for years.
If you speak Russian or Ukranian, you may have guessed at Druzhba’s purpose just from the name. Druzhba is the slavic word for “friendship,” in addition to being the name of the word’s largest oil pipeline.
Our Druzhba efficiently carries data from production databases to data warehouses. At SeatGeek, it serves a crucial role in our data systems, extracting data from hundreds of tables in our transactional databases and loading them into our analytical data warehouse where they can be leveraged by the business.
We’ve prioritized simplicity and usability in building Druzhba, and the result is a tool that is easy to deploy and maintain. A Druzhba pipeline is defined by a directory of YAML files that specify the connections between the source and target databases. Over the years, we’ve added features that we found ourselves wishing for. Some tools, for example, will require you to enumerate all the fields you want to copy, but Druzhba configurations specify columns not to copy. Not only does this reduce upkeep and lines of code, it allows us to succinctly exclude PII and other fields we’d rather leave out of our DWH.
Worried about copying from various source databases? Don’t be. Druzhba can unify your extract and load processes even if your production databases use more than one RDBMS, and will help you persist data from production databases with limited retention.
Running Druzhba is simple as well. Executing
druzhba from your command line
will pull data from all your tables in all your sources databases. Running
druzhba --database my_db --tables my_tables
would allow you to run just a subset of your pipelines. There are of course additional CLI options, and the instructions
for both configuring and running Druzhba can be found on our Read the Docs page.
You may have noticed that we used the words “extract” and “load” above, but there has as yet been no mention of “transform.” Though Druzhba does support limited in-flight transformations, it isn’t primarily built according to an “Extract-Transform-Load” (ETL) pattern. Instead, it’s designed to provide the first two steps in an “ELT” paradigm. Like many other organizations, we have come to strongly prefer ELT to ETL.
Recent years have seen tremendous improvements to database technology. Modern columnar Data Warehouse products are incredibly powerful, and if a transform step can be expressed in SQL, data warehousing products are likely to achieve better performance than any other tool.
In the absence of technical limitations, we’ve found that the right place for a transform step is almost always as close to the end of the pipeline as possible. Transformation logic, especially business logic, tends to change. Having as few pieces of the puzzle live dowstream of the transformation makes these changes easier and safer to implement. So we perform our transformation steps - with the help of DBT - in our analytical data warehouse after Druzhba’s work has been done.
Our data pipeline uses Druzhba for copying relational data from production to analytics while also leveraging Luigi to load data from cloud services and manage our more complex dependency graphs. This is a pairing that’s worked quite well for us. If you’re just starting out, or if the maturation of your stack has left you wishing for a clean alternative to your homemade SQL ETL, Druzhba could work well for you too. For more on how best to use Druzhba, check out the Readme.
Now that Druzhba is available for general use, we are looking forward to learning from your extensions and implementations. If this is the sort of product you’d want to spend more time with, consider applying to our open Data Engineer role, and stay tuned for more exciting SeatGeek engineering updates here on ChairNerd!