BIG DATA MANAGEMENT
Datafold | June 23, 2022
Datafold, a data reliability company, today announced data-diff, a new open source cross-database diffing package. This new product is an open source extension to Datafold’s original Data Diff tool for comparing data sets. Open source data-diff validates the consistency of data across databases using high-performance algorithms.
In the modern data stack, companies extract data from sources, load that data into a warehouse, and transform that data so that it can be used for analysis, activation, or data science use cases. Datafold has been focused on automated testing during the transformation step with Data Diff, ensuring that any change made to a data model does not break a dashboard or cause a predictive algorithm to have the wrong data. With the launch of open source data-diff, Datafold can now help with the extract and load part of the process. Open source data-diff verifies that the data that has been loaded matches the source of that data where it was extracted. All parts of the data stack need testing for data engineers to create reliable data products, and Datafold now gives them coverage throughout the extract, load, transform (ELT) process.
“data-diff fulfills a need that wasn’t previously being met. Every data-savvy business today replicates data between databases in some way, for example, to integrate all available data in a warehouse or data lake to leverage it for analytics and machine learning. Replicating data at scale is a complex and often error-prone process, and although multiple vendors and open source tools provide replication solutions, there was no tooling to validate the correctness of such replication. As a result, engineering teams resorted to manual one-off checks and tedious investigations of discrepancies, and data consumers couldn’t fully trust the data replicated from other systems.”
Gleb Mezhanskiy, Datafold founder and CEO
Mezhanskiy continued, “data-diff solves this problem elegantly by providing an easy way to validate consistency of data sets across databases at scale. It relies on state-of-the-art algorithms to achieve incredible speed: e.g., comparing one-billion-row data sets across different databases takes less than five minutes on a regular laptop. And, as an open source tool, it can be easily embedded into existing workflows and systems.”
Answering an Important Need
Today’s organizations are using data replication to consolidate information from multiple sources into data warehouses or data lakes for analytics. They’re integrating operational systems with real-time data pipelines, consolidating data for search, and migrating data from legacy systems to modern databases.
Thanks to amazing tools like Fivetran, Airbyte and Stitch, it’s easier than ever to sync data across multiple systems and applications. Most data synchronization scenarios call for 100% guaranteed data integrity, yet the practical reality is that in any interconnected system, records are sometimes lost due to dropped packets, general replication issues, or configuration errors. To ensure data integrity, it’s necessary to perform validation checks using a data diff tool.
Datafold’s approach constitutes a significant step forward for developers and data analysts who wish to compare multiple databases rapidly and efficiently, without building a makeshift diff tool themselves. Currently, data engineers use multiple comparison methods, ranging from simple row counts to comprehensive row-level analysis. The former is fast but not comprehensive, whereas the latter approach is slow but guarantees complete validation. Open source data-diff is fast and provides complete validation.
Open Source data-diff for Building and Managing Data Quality
Available today, data-diff uses checksums to verify 100% consistency between two different data sources quickly and efficiently. This method allows for a row-level comparison of 100 million records to be done in just a few seconds, without sacrificing the granularity of the resulting comparison.
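The checksum approach can be illustrated with a short sketch. This is not Datafold's actual implementation; the function names, segment cutoff, and in-memory "tables" are all illustrative. The idea is that matching checksums rule out entire key ranges cheaply, so row-level comparison is only needed inside segments that disagree:

```python
import hashlib

def checksum(rows):
    """Hash a segment of (key, value) rows into a single digest."""
    h = hashlib.md5()
    for row in rows:
        h.update(repr(row).encode())
    return h.hexdigest()

def diff_segments(a, b, lo, hi, min_size=4):
    """Recursively narrow mismatches down to small key ranges via checksums."""
    seg_a = [r for r in a if lo <= r[0] < hi]
    seg_b = [r for r in b if lo <= r[0] < hi]
    if checksum(seg_a) == checksum(seg_b):
        return []  # checksums match: the whole segment is identical, skip it
    if hi - lo <= min_size:
        # segment is small: fall back to an exact row-level comparison
        return sorted(set(seg_a) ^ set(seg_b))
    mid = (lo + hi) // 2
    return (diff_segments(a, b, lo, mid, min_size)
            + diff_segments(a, b, mid, hi, min_size))

source  = [(i, f"val-{i}") for i in range(100)]
replica = [(i, f"val-{i}") for i in range(100)]
replica[42] = (42, "corrupted")  # simulate a replication error

print(diff_segments(source, replica, 0, 100))
# → [(42, 'corrupted'), (42, 'val-42')]
```

In a real cross-database setting the checksums would be computed by each database itself (e.g., aggregating hashes over primary-key ranges in SQL), so only digests, not rows, cross the network until a mismatch is localized.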
Datafold has released data-diff under the MIT license. Currently, the software includes connectors for Postgres, MySQL, Snowflake, BigQuery, Redshift, Presto and Oracle. Datafold plans to invite contributors to build connectors for additional data sources and for specific business applications.
Datafold is a data reliability platform that helps data teams deliver reliable data products faster. It has a unique ability to identify, prioritize and investigate data quality issues proactively before they affect production. Founded in 2020 by veteran data engineers, Datafold has raised $22 million from investors including NEA, Amplify Partners, and YCombinator. Customers include Thumbtack, Patreon, Truebill, Faire, and Dutchie.
Alluxio | May 05, 2022
Alluxio, the developer of the open source data orchestration platform for data driven workloads such as large-scale analytics and AI/ML, today announced the immediate availability of version 2.8 of its Data Orchestration Platform. This new release features enhanced interface support for the Amazon S3 REST API; security improvements for sensitive applications with strict encryption compliance and regulatory requirements; and strengthened automated data movement functionality across heterogeneous storage systems without the need to manually move or copy the data.
Alluxio 2.8 enhances S3 API compatibility, so onboarding and managing Alluxio on existing large data pipelines becomes much easier. As a major enhancement to enterprise-grade security, Alluxio 2.8 adds features to support server-side encryption capabilities for securing and governing data. Data migration is a major challenge for organizations whose data is stored across vendors, clouds, or regions. Alluxio 2.8 improves the reliability of data movement across heterogeneous storage systems with enhanced usability of policy-based data management and high availability.
“At Uber, we run Alluxio to accelerate all sorts of business-critical analytics queries at a large scale,” said Chen Liang, Senior Software Engineer, Uber’s interactive analytics team. “Alluxio provides consistent performance in our big data processing use cases. As compute-storage separation continues to be the trend along with containerization in big data, we believe a unified layer that bridges the compute and storage like Alluxio will continue to play a key role.”
“The support for the S3 API is significant, making Alluxio much more accessible to a huge number of customers. Together, our rich set of S3, HDFS and POSIX APIs enables true storage-agnostic and multi-cloud deployments. The data migration capabilities further eliminate vendor lock-in and give organizations the flexibility to choose where to run their applications and where to store their data.”
Adit Madan, Director of Product Management, Alluxio
Alluxio 2.8 Community and Enterprise Edition features new capabilities, including:
Enhanced S3 API with Metadata Tagging
Alluxio 2.8 enhances the support for the S3 RESTful API with metadata tagging capabilities. With S3 API, applications can communicate with Alluxio without a custom driver or any additional configuration. By using S3 API, data-driven applications, end-users, and admins can seamlessly and rapidly onboard Alluxio for new uses. Metadata operations can be achieved through the S3 object and bucket tagging APIs.
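For context, object tags in any S3-compatible API travel as a small XML `Tagging` document in the body of a PutObjectTagging request. The sketch below builds only that payload, not the HTTP call itself, and the tag keys and values are made up for illustration:

```python
import xml.etree.ElementTree as ET

def build_tagging_xml(tags):
    """Build the XML body that S3-compatible PutObjectTagging endpoints expect."""
    root = ET.Element("Tagging")
    tagset = ET.SubElement(root, "TagSet")
    for key, value in tags.items():
        tag = ET.SubElement(tagset, "Tag")
        ET.SubElement(tag, "Key").text = key
        ET.SubElement(tag, "Value").text = value
    return ET.tostring(root, encoding="unicode")

# Example: tag an object with an owner and a retention hint
body = build_tagging_xml({"owner": "analytics", "ttl": "7d"})
print(body)
```

Because Alluxio speaks this same wire format, standard S3 clients and SDKs can attach and read such tags without an Alluxio-specific driver.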
Data Encryption for Enterprise-grade Security
The Enterprise Edition of Alluxio 2.8 supports encryption of data at rest in Alluxio-managed storage as an essential security feature. In conjunction with SSL, this new feature supports server-side encryption, ensuring data security. Alluxio now offers multiple encryption zones for the data it manages to meet security demands. Data stored on Alluxio workers is encrypted, and is decrypted on the server before being sent to a client.
Data Movement Across Heterogeneous Storage Systems
Alluxio 2.8 Enterprise Edition also improves the policy-based data management features, which facilitate data access and movement between heterogeneous storage systems for better performance and lower cost. Alluxio manages data placement between the different storage systems with predefined policies. With transparent storage options, organizations can choose whichever storage best suits their needs without undergoing complex manual data migrations.
Proven at global web scale in production for modern data services, Alluxio is the developer of open source data orchestration software for the cloud. Alluxio moves data closer to data analytics and machine learning compute frameworks across clusters, regions, and clouds, providing memory-speed data access to files and objects. Intelligent data tiering and caching deliver greater performance and reliability to customers in financial services, high tech, retail and telecommunications. Alluxio is in production use today at eight of the top ten internet companies. Alluxio is venture-backed by Andreessen Horowitz, Seven Seas Partners, Volcanic Ventures, and Hillhouse Capital, and was founded at UC Berkeley’s AMPLab by the creators of the Tachyon open source project.
Palantir Technologies | February 14, 2022
Palantir Technologies Inc., a leading builder of operating systems for the modern enterprise, today announced a new one-year extended partnership with Ferrari to bring its world-class data and analytics technology to Scuderia Ferrari. In addition to Scuderia Ferrari continuing to use Palantir’s Foundry platform to propel data-driven performance decisions across the team’s Power Unit, Palantir will become an official Team Partner of Scuderia Ferrari, significantly expanding its past relationship.
Specifically, Scuderia Ferrari’s Power Unit department ensures that the race cars deliver optimal performance without exceeding their structural capabilities. Foundry will enable power unit engineers on the racetrack and in the Maranello Factory to optimize performance, and rapidly analyze data from sources such as Grand Prix data, test bench results, and part information.
Since 2016, Scuderia Ferrari has used the Foundry platform to help its technical teams make faster, data-driven decisions around car performance, development, and reliability. Under this new, expanded agreement, Scuderia Ferrari Power Unit engineers will use the Foundry platform to rapidly integrate and analyze large volumes of information from a wide array of sources, allowing team members to swiftly make critical decisions and instantly share their analyses across the organization.
“We are pleased to be extending and expanding our partnership with Palantir, with whom we share common values of relentlessly pursuing technological innovation and the desire to continually improve,” said Mattia Binotto, Team Principal and Managing Director for Scuderia Ferrari. “Data analysis plays a vital role in Formula 1 and being able to count on an excellent partner such as Palantir can make all the difference. Tasks that just a few years ago would take several minutes of calculation can now be carried out in a few seconds, thanks to solutions that have been in use through this partnership.”
A typical Formula 1 race season can often generate as much as 1.5 trillion data points. This information needs to be quickly integrated and tested to confirm hypotheses for the team to make adjustments during the race itself in response to component performance, track conditions, weather, maintenance requirements, and potential part failures.
“Scuderia Ferrari and Palantir share a commitment to operational excellence. What they have accomplished as a company and a team is legendary,” said Palantir Executive Josh Harris. “We look forward to continuing to work alongside Scuderia Ferrari, one of the most successful teams in the history of Formula 1 racing.”
As part of the partnership, the Palantir brand will be featured on the Scuderia Ferrari Formula 1 race cars and on Charles Leclerc and Carlos Sainz’s driver race suits.
About Palantir Technologies
Palantir Technologies is a technology company that builds enterprise data platforms for use by organizations with complex and sensitive data environments. From building safer cars and planes, to discovering new drugs and combating terrorism, Palantir helps customers across the public, private, and nonprofit sectors transform the way they use their data.
Arcion | April 21, 2022
Arcion today announced a partnership to bring the world’s only cloud-native, CDC-based data replication platform to Databricks. Arcion is the first partner to offer preconfigured, validated data replication for users of Databricks through that company’s new Partner Connect program.
Arcion’s product enables faster, more agile analytics and AI/ML by empowering enterprises to integrate mission-critical transactional systems with their Databricks Lakehouse in real time, at scale, and with guaranteed transactional integrity. It is the only fully managed, distributed data replication as a service on the market today, offering zero-code, zero-maintenance change data capture (CDC) pipelines that can be deployed in just minutes. It empowers data teams to move high-volume data from transactional databases like Oracle and MySQL, without a single line of code.
Partner Connect makes it possible for customers to implement Arcion’s technology directly within their Databricks Lakehouse. With just a few clicks, Partner Connect automatically configures the resources necessary to begin using streaming data pipelines. Customers can enable real-time data ingestion with pipelines from Oracle, MySQL, and Snowflake (additional sources coming soon) into the Databricks Lakehouse.
“Through Partner Connect, Arcion and Databricks are deepening our thriving relationship and working together to deliver a unified experience for our customers that offers simplicity, security, rock-solid reliability, and scale. Companies across the globe are using ML and advanced analytics to turn raw data into tangible business value, but they need the right tools to help them get there. Arcion helps companies unify their data by delivering it to Databricks, where everything is available in one place, with zero delay.”
Arcion’s CEO Gary Hagmueller
Arcion Cloud uses CDC to identify and track changes to data in transactional systems, whether they are deployed on-premises, in the cloud, or across a hybrid landscape. Arcion detects any changes made within those systems and replicates them to Databricks in real time. Capable of handling petabyte-scale integration, Arcion handles high transaction volumes easily, without adversely impacting the source system’s performance.
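Conceptually, CDC-based replication replays an ordered log of change events against the target instead of re-copying whole tables. The minimal sketch below illustrates that idea only; the event shape and helper are invented for the example and are not Arcion's API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeEvent:
    """One entry from a source database's change log."""
    op: str                      # "insert", "update", or "delete"
    key: int
    value: Optional[str] = None  # new value; unused for deletes

def apply_changes(target, events):
    """Replay CDC events, in order, against a target table (dict key -> value)."""
    for e in events:
        if e.op in ("insert", "update"):
            target[e.key] = e.value
        elif e.op == "delete":
            target.pop(e.key, None)
    return target

log = [
    ChangeEvent("insert", 1, "alice"),
    ChangeEvent("insert", 2, "bob"),
    ChangeEvent("update", 2, "bobby"),
    ChangeEvent("delete", 1),
]
print(apply_changes({}, log))  # → {2: 'bobby'}
```

Because only the deltas are shipped and applied, the source system does little extra work beyond what its transaction log already records, which is why CDC scales to high transaction volumes without burdening the source.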
“Arcion’s replication for Databricks’ Lakehouse provides extraordinarily rapid time to value for analytics and AI/ML,” said Adam Conway, SVP of Products at Databricks. “By making Arcion available via Partner Connect, we’re enabling thousands of Databricks customers to discover and take advantage of Arcion’s highly scalable, efficient and flexible CDC technology. With just a few clicks, users can set up a trial account and start streaming real-time data from transactional systems to their Lakehouse.”
Fortune 500 companies around the world rely on Arcion’s distributed, CDC-based data replication solution to drive fast and accurate data insights. Arcion helps enterprises eliminate slow, brittle data pipelines and high maintenance overheads, breaking down data silos through high-volume, scalable change data capture pipelines with guaranteed transactional integrity.