Datafold, a data reliability company, today announced data-diff, a new open source cross-database diffing package. This new product is an open source extension to Datafold’s original Data Diff tool for comparing data sets. Open source data-diff validates the consistency of data across databases using high-performance algorithms.
In the modern data stack, companies extract data from sources, load that data into a warehouse, and transform that data so that it can be used for analysis, activation, or data science use cases. Datafold has been focused on automated testing during the transformation step with Data Diff, ensuring that any change made to a data model does not break a dashboard or cause a predictive algorithm to have the wrong data. With the launch of open source data-diff, Datafold can now help with the extract and load part of the process. Open source data-diff verifies that the data that has been loaded matches the source of that data where it was extracted. All parts of the data stack need testing for data engineers to create reliable data products, and Datafold now gives them coverage throughout the extract, load, transform (ELT) process.
“data-diff fulfills a need that wasn’t previously being met. “Every data-savvy business today replicates data between databases in some way, for example, to integrate all available data in a warehouse or data lake to leverage it for analytics and machine learning. Replicating data at scale is a complex and often error-prone process, and although multiple vendors and open source tools provide replication solutions, there was no tooling to validate the correctness of such replication. As a result, engineering teams resorted to manual one-off checks and tedious investigations of discrepancies, and data consumers couldn’t fully trust the data replicated from other systems.
Gleb Mezhanskiy, Datafold founder and CEO
Mezhanskiy continued, “data-diff solves this problem elegantly by providing an easy way to validate consistency of data sets across databases at scale. It relies on state-of-the art algorithms to achieve incredible speed: e.g., comparing one-billion-row data sets across different databases takes less than five minutes on a regular laptop. And, as an open source tool, it can be easily embedded into existing workflows and systems.”
Answering an Important Need
Today’s organizations are using data replication to consolidate information from multiple sources into data warehouses or data lakes for analytics. They’re integrating operational systems with real-time data pipelines, consolidating data for search, and migrating data from legacy systems to modern databases.
Thanks to amazing tools like Fivetran, Airbyte and Stitch, it’s easier than ever to sync data across multiple systems and applications. Most data synchronization scenarios call for 100% guaranteed data integrity, yet the practical reality is that in any interconnected system, records are sometimes lost due to dropped packets, general replication issues, or configuration errors. To ensure data integrity, it’s necessary to perform validation checks using a data diff tool.
Datafold’s approach constitutes a significant step forward for developers and data analysts who wish to compare multiple databases rapidly and efficiently, without building a makeshift diff tool themselves. Currently, data engineers use multiple comparison methods, ranging from simple row counts to comprehensive row-level analysis. The former is fast but not comprehensive, whereas the latter approach is slow but guarantees complete validation. Open source data-diff is fast and provides complete validation.
Open Source data-diff for Building and Managing Data Quality
Available today, data-diff uses checksums to verify 100% consistency between two different data sources quickly and efficiently. This method allows for a row-level comparison of 100 million records to be done in just a few seconds, without sacrificing the granularity of the resulting comparison.
Datafold has released data-diff under the MIT license. Currently, the software includes connectors for Postgres, MySQL, Snowflake, BigQuery, Redshift, Presto and Oracle. Datafold plans to invite contributors to build connectors for additional data sources and for specific business applications.
Datafold is a data reliability platform that helps data teams deliver reliable data products faster. It has a unique ability to identify, prioritize and investigate data quality issues proactively before they affect production. Founded in 2020 by veteran data engineers, Datafold has raised $22 million from investors including NEA, Amplify Partners, and YCombinator. Customers include Thumbtack, Patreon, Truebill, Faire, and Dutchie.