BIG DATA MANAGEMENT

Datafold Launches Open Source data-diff to Compare Tables of Any Size Across Databases

Datafold | June 23, 2022

Datafold, a data reliability company, today announced data-diff, a new open source cross-database diffing package. This new product is an open source extension to Datafold’s original Data Diff tool for comparing data sets. Open source data-diff validates the consistency of data across databases using high-performance algorithms.

In the modern data stack, companies extract data from sources, load that data into a warehouse, and transform it so that it can be used for analysis, activation, or data science use cases. Datafold has focused on automated testing during the transformation step with Data Diff, ensuring that a change made to a data model does not break a dashboard or feed a predictive algorithm the wrong data. With the launch of open source data-diff, Datafold can now help with the extract and load part of the process. Open source data-diff verifies that the data that has been loaded matches the source from which it was extracted. All parts of the data stack need testing for data engineers to create reliable data products, and Datafold now gives them coverage throughout the extract, load, transform (ELT) process.

“data-diff fulfills a need that wasn’t previously being met,” said Gleb Mezhanskiy, Datafold founder and CEO. “Every data-savvy business today replicates data between databases in some way, for example, to integrate all available data in a warehouse or data lake to leverage it for analytics and machine learning. Replicating data at scale is a complex and often error-prone process, and although multiple vendors and open source tools provide replication solutions, there was no tooling to validate the correctness of such replication. As a result, engineering teams resorted to manual one-off checks and tedious investigations of discrepancies, and data consumers couldn’t fully trust the data replicated from other systems.”

Mezhanskiy continued, “data-diff solves this problem elegantly by providing an easy way to validate consistency of data sets across databases at scale. It relies on state-of-the-art algorithms to achieve incredible speed: e.g., comparing one-billion-row data sets across different databases takes less than five minutes on a regular laptop. And, as an open source tool, it can be easily embedded into existing workflows and systems.”

Answering an Important Need

Today’s organizations are using data replication to consolidate information from multiple sources into data warehouses or data lakes for analytics. They’re integrating operational systems with real-time data pipelines, consolidating data for search, and migrating data from legacy systems to modern databases.

Thanks to amazing tools like Fivetran, Airbyte and Stitch, it’s easier than ever to sync data across multiple systems and applications. Most data synchronization scenarios call for 100% guaranteed data integrity, yet the practical reality is that in any interconnected system, records are sometimes lost due to dropped packets, general replication issues, or configuration errors. To ensure data integrity, it’s necessary to perform validation checks using a data diff tool.

Datafold’s approach constitutes a significant step forward for developers and data analysts who wish to compare multiple databases rapidly and efficiently, without building a makeshift diff tool themselves. Currently, data engineers use multiple comparison methods, ranging from simple row counts to comprehensive row-level analysis. The former is fast but not comprehensive, whereas the latter approach is slow but guarantees complete validation. Open source data-diff is fast and provides complete validation.
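To illustrate the trade-off between the two methods, here is a minimal Python sketch. It is not part of data-diff; the function names and the in-memory dictionaries standing in for database tables are assumptions made for illustration. It shows a case where the row counts match even though one row differs, which only the row-level comparison detects.

```python
# Hypothetical illustration: tables are modeled as dicts of primary key -> row.
# Neither function is part of data-diff.

def row_counts_match(source, target):
    """Fast check: compares only the number of rows."""
    return len(source) == len(target)


def row_level_diff(source, target):
    """Exhaustive check: compares every row by primary key."""
    missing = sorted(k for k in source if k not in target)
    extra = sorted(k for k in target if k not in source)
    changed = sorted(
        k for k in source if k in target and source[k] != target[k]
    )
    return {"missing": missing, "extra": extra, "changed": changed}


# Same number of rows, but the row with key 2 was altered during replication.
source = {1: ("alice", 10), 2: ("bob", 20), 3: ("carol", 30)}
target = {1: ("alice", 10), 2: ("bob", 25), 3: ("carol", 30)}

print(row_counts_match(source, target))  # the count check passes
print(row_level_diff(source, target))    # the row-level check reports key 2
```

The count check costs one aggregate per table, while the row-level check has to read every row from both sides, which is why it becomes slow across databases at scale.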

Open Source data-diff for Building and Managing Data Quality

Available today, data-diff uses checksums to verify 100% consistency between two different data sources quickly and efficiently. This method allows for a row-level comparison of 100 million records to be done in just a few seconds, without sacrificing the granularity of the resulting comparison.
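The announcement does not describe the algorithm in detail, so the following is only a sketch of one common way checksums can narrow a comparison down to individual rows: checksum a key range on both sides, and split the range in half only when the checksums disagree. The function names, the integer primary keys, the `threshold` parameter and the use of MD5 are all assumptions for illustration, not data-diff's actual implementation.

```python
import hashlib

# Hypothetical sketch of checksum-based diffing with bisection.
# Tables are modeled as dicts of integer primary key -> row value.

def segment_checksum(table, lo, hi):
    """Checksum of all rows whose key falls in the half-open range [lo, hi)."""
    digest = hashlib.md5()
    for key in sorted(k for k in table if lo <= k < hi):
        digest.update(f"{key}|{table[key]}".encode())
    return digest.hexdigest()


def diff_by_checksum(src, dst, lo, hi, threshold=4):
    """Return (key, source_row, target_row) for every row that differs."""
    # Matching checksums: treat the whole segment as identical and stop.
    if segment_checksum(src, lo, hi) == segment_checksum(dst, lo, hi):
        return []
    # Small mismatching segment: compare its rows one by one.
    if hi - lo <= threshold:
        return [
            (key, src.get(key), dst.get(key))
            for key in range(lo, hi)
            if src.get(key) != dst.get(key)
        ]
    # Large mismatching segment: split in half and recurse into each side.
    mid = (lo + hi) // 2
    return (
        diff_by_checksum(src, dst, lo, mid, threshold)
        + diff_by_checksum(src, dst, mid, hi, threshold)
    )


src = {i: f"row-{i}" for i in range(1000)}
dst = dict(src)
dst[417] = "changed"   # one modified row
del dst[900]           # one row lost in replication

print(diff_by_checksum(src, dst, 0, 1000))
```

In a real database the per-segment checksum would be computed by the database itself over an indexed key range, so only checksums, not rows, cross the network until a mismatching segment is small enough to fetch.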

Datafold has released data-diff under the MIT license. Currently, the software includes connectors for Postgres, MySQL, Snowflake, BigQuery, Redshift, Presto and Oracle. Datafold plans to invite contributors to build connectors for additional data sources and for specific business applications.

About Datafold
Datafold is a data reliability platform that helps data teams deliver reliable data products faster. It has a unique ability to identify, prioritize and investigate data quality issues proactively before they affect production. Founded in 2020 by veteran data engineers, Datafold has raised $22 million from investors including NEA, Amplify Partners, and Y Combinator. Customers include Thumbtack, Patreon, Truebill, Faire, and Dutchie.

Other News
BIG DATA MANAGEMENT

Nordisk Film Adopts Qlik Cloud Analytics for Operational Efficiencies and Reduced Costs

Qlik | August 10, 2022

Qlik® today announced that Nordisk Film, the Nordic region's leading creator and distributor of films, has adopted Qlik Cloud® Analytics to realize operational efficiencies, reduce costs associated with data preparation and analysis, and ultimately expand data-driven decision-making throughout the organization.

Nordisk Film is known as an industry innovator at the forefront of adopting modern technology and solutions that foster new, improved ways of working. Over the last five years, Nordisk Film has been on a journey to deploy strategies that meet modern consumer demands while leveraging fact-based decisions.

Nordisk Film’s prior business intelligence systems required significant maintenance, service, and dedicated storage space and memory. This made the system difficult for users to work with and limited Nordisk’s ability to scale the use of data for decisions. Nordisk Film was also looking to increase collaboration and streamline its different computer systems and data warehouses into a single cloud platform. Nordisk Film chose to move to Qlik Cloud for importing, clearing and analyzing data in order to make more informed business decisions while leveraging Qlik Sense® SaaS.

“Previously we spent a lot of time with maintenance and making sure our internal systems worked,” said Mikkel Hecht Hansen, Head of BI at Nordisk Film. “We migrated our old local platform to a Qlik environment, saving us time and allowing us to take advantage of the latest technical developments and improve our data structure for a more mature approach to analytics.”

For Nordisk Film, it is important to have a cost-effective and scalable platform, while also being able to leverage modern analytics capabilities such as augmented analytics and mobile access.

“Qlik has given us a completely different dimension of new knowledge and opportunities through, among other things, Insight Advisor. Along with an easy and simple security login through Azure AD, Qlik gives us many insights and data-driven facts that help us make better decisions,” said Hansen.

Insight Advisor is the AI assistant built into Qlik Sense that generates advanced analytics and insights using natural language interaction for Nordisk analytics users. Another key innovation in Qlik’s platform that is bringing value to Nordisk is Collaborative Notes, which allows employees to comment or write longer reports directly in the analytics environment. Because Qlik is easy to learn and applicable to many different business areas, it has also helped Nordisk Film expand analytics adoption across the business.

“Nordisk Film is a great example of an incredible brand that is leveraging Qlik Cloud to accelerate its transformation into a data-driven business,” said Francisco Mateo-Sidron, Senior Vice President EMEA for Qlik. “We look forward to helping Nordisk continue to expand its ability to leverage cloud analytics for impact throughout the entire organization.”

About Nordisk Film
Nordisk Film is a leading Nordic entertainment and experience company focused on storytelling across platforms. We produce, market and distribute films and series, operate a leading Nordic cinema chain, are behind global game studios and PlayStation in the Nordics, and deliver digital gift card solutions to the world. Nordisk Film is a part of the leading Nordic media group Egmont, together with Story House Egmont, TV 2 in Norway, Lindhardt og Ringhof, and Cappelen Damm. Egmont is a foundation, and all profits are used to develop media, to help children and young people, and to support film talents. We bring stories to life.

About Qlik
Qlik’s vision is a data-literate world, where everyone can use data and analytics to improve decision-making and solve their most challenging problems. A private company, Qlik offers real-time data integration and analytics solutions, powered by Qlik Cloud®, to close the gaps between data, insights and action. By transforming data into Active Intelligence, businesses can drive better decisions, improve revenue and profitability, and optimize customer relationships. Qlik serves more than 38,000 active customers in over 100 countries.


DATA SCIENCE

ForMotiv Joins Guidewire Insurtech Vanguards Program to Bring Behavioral Data Science to Guidewire Carriers

ForMotiv | June 13, 2022

ForMotiv, a leading behavioral data science and intent scoring platform, announced that the company has joined Guidewire’s Insurtech Vanguards program, an initiative led by property and casualty (P&C) cloud platform provider Guidewire (NYSE: GWRE), to help insurers learn about the newest insurtechs and how to best leverage them.

“We’re excited about this new partnership with Guidewire to continue to expand our reach in insurance,” said Bill Conners, CEO of ForMotiv. “We have worked hard for four-plus years establishing a footprint with leading carriers – and are excited to be a part of Guidewire’s Insurtech Vanguards program.”

Insurtech Vanguards is a community of select startups and technology providers that are bringing novel solutions to the P&C industry. As part of the program, Guidewire provides strategic guidance to and advocates for the participating insurtechs, while connecting them with Guidewire’s P&C customers.

“ForMotiv’s digital behavioral intelligence solution leverages machine learning to produce data analytics and scoring, enabling insurers to quickly view users’ clickstreams on their apps, which can accelerate underwriting and claims analytics,” said Laura Drabik, chief evangelist, Guidewire. “We are thrilled to welcome ForMotiv and its innovative technology to our program so our mutual customers can raise the bar in leveraging their user data.”

With its industry-leading behavioral data capture and intent scoring solution, ForMotiv works with carriers to help them analyze and monitor customer and agent digital behavior while accurately predicting user intent in real time. ForMotiv provides robust behavioral reporting and analytics as well as a granular behavioral dataset that can be leveraged across multiple departments.

Armed with instant intent scoring and deterministic behavioral signaling, carriers can confidently predict buying intent, identify risk and nondisclosure, and expand their accelerated underwriting offerings to genuine users while dynamically intervening on applications requiring further qualification. ForMotiv’s real-time predictive behavioral analytics enable next-generation dynamic experiences, or SmartApps, that adapt to individual users based on their behavior. Its suite of products ranges from robust data capture and behavioral analytics to signaling and intent scoring. Carriers can leverage ForMotiv’s expansive behavioral dataset for both offline and real-time use cases.

About ForMotiv
ForMotiv is the only Behavioral Science Platform on the market that enables leading insurance companies to accurately and instantly predict user intent in real time. Our solution helps carriers improve digital customer & agent experiences, increase conversions, reduce risk & fraud, and more by analyzing users' digital body language (consisting of thousands of behavioral micro-expressions such as keystrokes, mouse movements, hesitation, corrections, copy/paste, and 150+ additional user engagement signals) while users engage with digital applications and claims forms to identify genuine, confused, risky or other behavior. Armed with real-time intent intelligence, ForMotiv carrier customers create next-generation dynamic experiences that adapt to individual users based on their intent. ForMotiv works with Marketing, Risk, Fraud, Data Science, Underwriting, Digital Strategy, and Claims teams.


BIG DATA MANAGEMENT

Komprise Automates Unstructured Data Discovery with Smart Data Workflows

Komprise | May 20, 2022

Komprise, the leader in analytics-driven unstructured data management and mobility, today announced Komprise Smart Data Workflows, a systematic process to discover relevant file and object data across cloud, edge and on-premises datacenters and feed data in native format to AI and machine learning (ML) tools and data lakes.

Industry analysts predict that at least 80% of the world’s data will be unstructured by 2025. This data is critical for AI and ML-driven applications and insights, yet much of it is locked away in disparate data storage silos. This creates an unstructured data blind spot, resulting in billions of dollars in missed big data opportunities.

Komprise has expanded Deep Analytics Actions to include copy and confine operations based on Deep Analytics queries, added the ability to execute external functions such as running natural language processing functions via API, and expanded global tagging and search to support these workflows. Komprise Smart Data Workflows allow you to define and execute a process with as many of these steps as needed in any sequence, including external functions at the edge, datacenter or cloud. Komprise Global File Index and Smart Data Workflows together reduce the time it takes to find, enrich and move the right unstructured data by up to 80%.

“Komprise has delivered a rapid way to visualize our petabytes of instrument data and then automate processes such as tiering and deletion for optimal savings,” says Jay Smestad, senior director of information technology at PacBio. “Now, the ability to automate workflows so we can further define this data at a more granular level and then feed it into analytics tools to help meet our scientists’ needs is a game changer.”

Komprise Smart Data Workflows are relevant across many sectors. Here’s an example from the pharmaceutical industry:

1) Search: Define and execute a custom query across on-prem, edge and cloud data silos to find all data for Project X with Komprise Deep Analytics and the Komprise Global File Index.
2) Execute & Enrich: Execute an external function on Project X data to look for a specific DNA sequence for a mutation and tag such data as "Mutation XYZ".
3) Cull & Mobilize: Move only Project X data tagged with "Mutation XYZ" to the cloud using Komprise Deep Analytics Actions for central processing.
4) Manage Data Lifecycle: Move the data to a lower storage tier for cost savings once the analysis is complete.

Other Smart Data Workflow use cases include:

Legal Divestiture: Find and tag all files related to a divestiture project, move sensitive data to an object-locked storage bucket, and move the rest to a writable bucket.
Autonomous Vehicles: Find crash test data related to abrupt stopping of a specific vehicle model and copy this data to the cloud for further analysis. Execute an external function to identify and tag data with Reason = Abrupt Stop and move only the relevant data to the cloud data lakehouse to reduce the time and cost associated with moving and analyzing unrelated data.

“Whether it’s massive volumes of genomics data, surveillance data, IoT, GDPR or user shares across the enterprise, Komprise Smart Data Workflows orchestrate the information lifecycle of this data in the cloud to efficiently find, enrich and move the data you need for analytics projects,” said Kumar Goswami, CEO of Komprise. “We are excited to move to this next phase of our product journey, making it much easier to manage and mobilize massive volumes of unstructured data for cost reduction, compliance and business value.”

About Komprise
Komprise is a provider of unstructured data management and mobility software that frees enterprises to easily analyze, mobilize, and monetize the right file and object data across clouds without shackling data to any vendor. With Komprise Intelligent Data Management, you can cut 70% of enterprise storage, backup and cloud costs while making data easily available to cloud-based data lakes and analytics tools.
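As a rough illustration of the four-step pharmaceutical workflow described in the Komprise announcement (search, execute and enrich, cull and mobilize, manage data lifecycle), the steps can be modeled as a simple pipeline. This Python sketch does not use any Komprise API; `FileRecord`, the four functions, the sample files and the DNA string are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field

# Hypothetical model of a file in a global index; not a Komprise data type.
@dataclass
class FileRecord:
    path: str
    project: str
    content: str
    location: str = "on-prem"
    tier: str = "hot"
    tags: set = field(default_factory=set)


def search(index, project):
    """Step 1: find all files belonging to a project."""
    return [f for f in index if f.project == project]


def enrich(files, predicate, tag):
    """Step 2: run an external function on each file and tag the matches."""
    for f in files:
        if predicate(f):
            f.tags.add(tag)
    return files


def mobilize(files, tag, destination):
    """Step 3: move only the tagged files to the destination."""
    moved = [f for f in files if tag in f.tags]
    for f in moved:
        f.location = destination
    return moved


def archive(files, tier="cold"):
    """Step 4: move files to a lower storage tier once analysis is done."""
    for f in files:
        f.tier = tier
    return files


index = [
    FileRecord("/lab/a.seq", "Project X", "ACGTGATTACAGG"),
    FileRecord("/lab/b.seq", "Project X", "ACGTACGTACGT"),
    FileRecord("/lab/c.seq", "Project Y", "GATTACA"),
]

project_files = search(index, "Project X")
enrich(project_files, lambda f: "GATTACA" in f.content, "Mutation XYZ")
moved = mobilize(project_files, "Mutation XYZ", "cloud")
archive(moved)

print([f.path for f in moved])
```

Only the tagged Project X file is moved and later tiered down; the untagged Project X file and the Project Y file stay where they were.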


BUSINESS STRATEGY

Atlan Named a Leader in Enterprise Data Catalogs for DataOps Evaluation by Independent Research Firm

Atlan | June 24, 2022

Atlan, the active metadata platform for modern data teams, today announced that it has been recognized as a leader in The Forrester Wave™: Enterprise Data Catalogs for DataOps, Q2 2022. Atlan received the highest score in the current offering and strategy categories.

According to the report, “Atlan is the tool of choice for DataOps and data product deployment. Atlan’s vision is to create frictionless data product deployment through a single metadata and data automation platform. The tool was built by data engineers and for data engineers… As a result, Atlan maintains a strong focus for continued innovation in this metadata-driven data and application ecosystem.”

The Forrester Wave™ assessment evaluated 14 of the most significant enterprise data catalogs across 26 evaluation criteria. The evaluation is a culmination of rigorous, fact-based research, with a view of the relative positions and key differentiation of the top vendors in the market. Atlan received the highest score possible in 17 criteria, including Product Vision, Market Approach, Innovation Roadmap, Performance, and Connectivity, Interoperability, and Portability.

“We’re excited to be named a Leader in this Forrester Wave report,” said Prukalpa Sankar, Co-founder of Atlan. “We believe this validates our place as a pioneer of active metadata and a leader in this space.”

The Forrester Wave™ report states: “Atlan is more than metadata and data governance, standing out from the competition… Extensive integration makes data sharing easy, flexible, and scalable within hybrid distributed ecosystems for analytics and operational use cases.”

Just three years since its launch, Atlan is the tool of choice for DataOps and data product deployment for a growing list of customers, including WeWork, Plaid, Postman, Scripps Health, TechStyle, SnapCommerce, Delhivery, and Belcorp.

“Atlan is pushing the boundary from a DataOps standpoint,” said Venkat Gopalan, Chief Digital Officer at Belcorp. “You’re disrupting the industry by thinking differently. That’s the core essence of Atlan.”

“Our platform has been designed as more than a traditional data cataloging or a data governance tool,” said Varun Banka, Co-founder of Atlan. “It was built by a data team for the evolving needs of data teams, including transparent data flow and delivery, easy-to-use experience for every data user, and open infrastructure. We can’t wait to continue innovating and bringing the latest in active metadata to modern companies around the world.”

This recognition comes on the heels of other big news for Atlan: becoming the first data catalog validated as a Snowflake Ready Technology Partner; winning the MDS Rocketship Award for Data Discovery; and being named as a Gartner Cool Vendor in DataOps and in the inaugural Market Guide for Metadata Management.

About Atlan
Built by a data team for data teams, Atlan is the active metadata platform for modern data teams. Atlan creates a single source of truth by acting as a collaborative workspace for data teams and bringing context back into the tools where data teams live. Atlan features deep integrations across the modern data stack, including Slack, Snowflake, dbt, Redshift, Looker, Sisense, and Tableau. A pioneer in the space, Atlan was recognized by Gartner seven times in 2021, including as a Cool Vendor in DataOps and in the Inaugural Market Guide for Active Metadata Management.
