airbyte engineering
2024-04-04 - Originally posted at https://airbyte.com/blog/record-change-history
↞ See all posts
Airbyte’s job is to move data between systems in the best possible way. But what exactly does "best" mean in this context? It embodies the balance of many, often competing goals. One of those balancing acts is the compromise between ensuring the highest level of data precision and achieving compatibility across various systems.
In the realm of data movement, one of the most important aspects we deal with is data compatibility. The Airbyte Protocol describes a series of data types that all sources must serialize their content into while in transit, and we require that all destinations be able to store every one of these types. This creates a common language and it’s how destinations can be interoperable with as many sources as possible.
This setup not only enables compatibility across diverse systems but also allows our platform to effectively validate data from any source and offer features like column selection for every sync process.
However, these data types, while deliberately chosen for maximum compatibility, are limited. For instance, you might notice the absence of a decimal type. This choice is a strategic one, aiming to ensure that every source and destination can work with the available types. Yet, even within this well-thought-out system, we sometimes encounter challenges.
At Airbyte, we’ve been working to ensure that one bad record won’t break your sync, and that lead us to the addition of Record Change History. This new feature offers a way to inform you that a record was modified in transit, to prevent such record from otherwise being un-syncable.
When we announced Destinations V2, we called out that this opened the door to new error-handling capabilities, and today we are happy to share one of them!
Record Change History can best be demonstrated with an example. Imagine a record like this in a Postgres database:
1{ 2 type: "RECORD", 3 record: { 4 stream: "users", 5 emitted_at: 123456789, 6 data: { 7 id: 1, 8 first_name: "Evan", 9 last_name: "Tahler", 10 description: "Hello, my name is Evan, and I like long walks on the beach, but also computers and then also... (25MB of text follows)" 11 } 12 } 13}
In this case, the description column from the users table in this Postgres database holds a very large text entry. Before Record Change History, the source-postges database connector would try to serialize this record and probably succeed. But then, the Snowflake destination would have trouble since Snowflake semi-structured data can only be 16MB, causing the entire sync to fail due to this single oversized record.
Previously, the only workaround would be to use the column selection feature to skip the description column entirely, but then all other rows, with reasonably sized content, would be skipped as well.
Now, with Record Change History, we have the tools to allow Airbyte to modify a record in-transit to solve certain classes of problems which we know won’t be able to make it all the way to the destination. In the previous example, the excessively large description would be nulled, and this modification would be transparently communicated to the users. These changes are recorded in a query-friendly manner in the destination, keeping you informed and your data syncs uninterrupted.
The record with modifications, by the time the destination will store it, becomes:
1{ 2 type: "RECORD", 3 record: { 4 stream: "users", 5 emitted_at: 123456789, 6 data: { 7 id: 1, 8 first_name: "Evan", 9 last_name: "Tahler", 10 description: null // <--- changed! 11 }, 12 meta: { 13 changes: [ 14 { 15 field: "description", 16 change: "NULLED", 17 reason: "DESTINATION_FIELD_SIZE_LIMITATION" 18 } 19 ] 20 } 21 } 22}
And that means that you’ll now have new information about each record’s change history in your V2 Destinations’s final tables:
This approach has a number of advantages:
1// Check for changed records 2SELECT COUNT(1) from users_final_tale where length(_airbyte_meta.changes) > 0
Today, Record Change History is used within Airbyte for the following 3 classes of problem:
Each of these types of problems will have a unique error reason in your destination data warehouses.
Over the next few months, we will be adding support for Record Change History to all certified sources and destinations at Airbyte. This is just one of many projects underway to dramatically improve our reliability, even in the face of strange data!
Do you have any questions or feedback for us? You can keep in touch by joining our Slack channel. If you would like to keep up to date with new Airbyte features, subscribe to our newsletter.
I write about Technology, Software, and Startups. I use my Product Management, Software Engineering, and Leadership skills to build teams that create world-class digital products.
Get in touch