ETL Tracers
Firing visible data at a target to see if you hit
ETL (Extract-Transform-Load) pipelines are notoriously complex and opaque. I've always wanted something to pierce the opaqueness I've experienced on many ETL projects. I'll take a wild guess I'm not the only one. What follows is an idea I had around five years ago to pierce that opaqueness along with the experience I've had implementing it.
If you've watched any war or action movies, you've seen tracer rounds. These are rounds designed to light up so that the person firing knows where their rounds are going downrange. Typical rounds are invisible to the eye. For more info, see wikipedia. This is an idea that we can apply to ETL pipelines to give a sense of whether the invisible data is "on target."
We can apply the same concept to give us visibility into the inter pipeline. Using traditional monitoring solutions it's easy to see if services are up and running or the underlying box is struggling. It's much harder to see whether the data flowing through is correct.
Why not make all records visible?
Not all rounds in a given magazine are tracer rounds for a few reasons. Tracer rounds are less lethal because they contain less mass than a standard round. They're more complex (more expensive) to manufacture. It's also not necessary. From what little reading I've done, around every 5th round is a tracer round. That's enough to give the gunner a sense of where they're shooting without the downsides of every round being a tracer.
A similar effect happens with data packets (record, a batch of records, row, etc) in an ETL pipeline. Logs/metrics have extra overhead, especially if you're emitting an entire data packet. That overhead comes in the forms of network, processing, storage, and cognitive load when trying to interpret the results. It's also not necessary as long as you have a representative sample of the data flowing through. All we need to do is tag every n-th packet of data with a "tracer" flag and make sure all our code emits those tracers when they're seen. Otherwise, process them as normal.
Zoom and Enhance!
What if we could take the idea one step further? If we emit data as it's already flowing through the pipeline, we have to do extra analysis when we examine those packets to determine if there are any errors in the pipeline, etc.
Instead, if we hand-craft tracer packets it gives us absolute control over the content of the packet. We can cover edge and corner cases and we know what the result should be at every step. Think of this with the same mindset you craft unit tests.
Also, we can encode this information for every step in the pipeline. The results of each of the extract, transform, and load steps should look very different from one another. If we do this, then each step in the pipeline can emit a diff of what the output should be versus what it actually is.
Wait, this is just tests?!
Yep. But it's testing you have the option of running in any environment (including production). It's tests you can use in development if you want to throw a new data source into the pipeline. It's regression tests that automatically tell you if you're now casting something to an int where it was a string before. If you're running it in production, it can be a monitoring system for your ETL pipeline. I.e., if you haven't seen tracers in n-minutes at the loader, either the transformer is having problems or the transport mechanism between the transformer and loader has failed. It narrows down the number of places you need to check to look for failures.
Benefits of Tracers
In my own implementation of the idea, we found that local development of new data sources was orders of magnitude faster. In addition, we had regression tests we could run at a moment's notice to confirm that the 40+ data sources we were using hadn't broken because we fixed a bug with the 41st source.
We had several data sources change their format without telling us. Using the tracer concept we were able to create a new tracer based on the new format and restore the source within a few hours.
So... how do I create a tracer?
My process is pretty simple. I take a representative record from the source data and then hand-craft what I expect it to look like at the extract, transform, and load steps. Like any good test I then take that representative record and change the input to explore the testing space. Any good set of tests should hit happy paths, edge cases, corner cases, and known failure cases. Tracers are no different.
What are the components of a tracer?
Source Data
This is as close as you can get to what the source data looks like. Remember, this record is going to be flowing through your system like a regular record would.
Extract/Transform/Load Expected Data
This is the expected output of each of these three stages. Include only the ones you feel are necessary. In my case, the transformer is the most critical step and as such, we crafted tracers around identifying transformer problems.
Example Tracer
tracer_happy_path:
input_record: "id,name,cost\n1234,test name with single' quotes,1234.5"
transform_output_record:
id: 1234
title_case_name: Test Name With Single' Quotes
float_price: 1234.50
Awesomesauce, where can I get this code?
The idea isn't well-formed enough to turn into a library yet. It requires the ETL pipeline know too much about the tracer implementation for the two concepts to decouple them at the moment. I was, however, able to implement the idea in less than 200 lines of Python code for my particular case.
Lessons Learned in Implementation
- Hand-crafting these data packets can be tedious and annoying. They're hard to get right.
- You need to drop tracer packets need to just before they'd end up in permanent storage. I.e., if you're inserting into a relational database, run all the inserts/updates/etc, and then roll back the transaction.
- Depending on the language and implementation of this idea, the code needed to handle the tracers can add a lot of visual noise to the code. Obviously, this is something that can be refactored and fixed, but it is a hazard. In my case I used Python context managers and function decorators.
Key Ideas
- Tracers are intentionally crafted data packets that the ETL pipeline knows to emit a diff between expected and actual.
- Tracers should flow through as much of the ETL pipeline machinery as possible including storage.
- Tracers should either be ignored by downstream processes or they should be deleted/rolled back from storage.
- Think of tracers as constant end-to-end tests. They should cover as many edge and corner cases as possible.
- Tracers act as an early-warning system of deviated behavior, code regressions, data integrity issues.
- Tracers act, as a side effect, as a monitoring system for the ETL pipeline (if you're not seeing tracers every n-minutes, something's wrong).