Databricks, the data and AI company that pioneered the data lakehouse concept, has unveiled data lineage for Unity Catalog, significantly expanding the lakehouse's data governance capabilities. Data lineage describes how data moves and changes within an organization. With this new Unity Catalog functionality, customers can see where data in their lakehouse originated, who created it and when, how it has been modified over time, how it is being used, and more. Unity Catalog data lineage is now available in preview on AWS and Microsoft Azure.
Organizations deal with a flood of data from numerous sources, and it is difficult to understand where that data originated, how it is moving and changing, who has access to it, and how it is being used. That knowledge, however, is critical for establishing trust and assessing risk. With data lineage for Unity Catalog, data teams can view all of the downstream consumers affected by a data change (for example apps, dashboards, machine learning models, or data sets), quickly gauge the extent of the impact, and notify the appropriate stakeholders.
Data lineage enables data consumers, such as data scientists, data engineers, and data analysts, to perform context-aware analysis, leading to higher-quality results. Data stewards, in turn, can identify data sets that are no longer used or have become outdated and retire them, lowering risk and ensuring that end users consume only high-quality data. The new Unity Catalog features give enterprises a comprehensive picture of the full data lifecycle, allowing data executives to understand how data was gathered, whether it has been updated, and which methods were used.
"Governance capabilities such as data lineage are critical as we work to build the industry's most robust lakehouse platform. Without good data lineage, it is challenging to track the business and verification processes that data-driven organizations need to be successful. Our goal is to ensure our customers can focus on insights, and move toward proactive data management practices through a unified, transparent view of their entire data ecosystem."
Matei Zaharia, Co-Founder and Chief Technologist at Databricks
Key features of Unity Catalog data lineage include automated run-time lineage, which captures all lineage generated in Databricks and is more accurate and efficient than manually tagging data. Lineage is captured for tables, views, and columns to provide a detailed picture of upstream and downstream data flows. It also works across all Databricks workloads, including those written in SQL, Python, R, and Scala, so that any data persona can enrich their tools with data intelligence and better insights. Lineage is tracked across entities such as notebooks, workflows, and dashboards as well.
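To make the idea of upstream and downstream data flows concrete, the following is a minimal, illustrative sketch of how lineage can be modeled as a directed graph and queried for impact analysis. It is not the Unity Catalog API and makes no claim about how Databricks implements lineage; the `LineageGraph` class and the table, dashboard, and model names are hypothetical and exist only for this example.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy in-memory lineage graph (hypothetical; not the Unity Catalog API)."""

    def __init__(self):
        self._downstream = defaultdict(set)  # source -> direct consumers
        self._upstream = defaultdict(set)    # target -> direct sources

    def add_edge(self, source, target):
        """Record that `target` is derived from `source`."""
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    @staticmethod
    def _walk(start, edges):
        """Breadth-first traversal: every node reachable from `start`."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        seen.discard(start)  # exclude the start node if a cycle leads back to it
        return seen

    def downstream(self, node):
        """All direct and indirect consumers of `node`."""
        return self._walk(node, self._downstream)

    def upstream(self, node):
        """All direct and indirect sources of `node`."""
        return self._walk(node, self._upstream)


# Hypothetical assets, invented for illustration only.
graph = LineageGraph()
graph.add_edge("raw.orders", "clean.orders")
graph.add_edge("clean.orders", "reporting.daily_revenue")
graph.add_edge("clean.orders", "ml.churn_features")
graph.add_edge("reporting.daily_revenue", "dashboard:revenue_overview")
graph.add_edge("ml.churn_features", "model:churn_predictor")

# Impact analysis: what is affected if clean.orders changes?
affected = sorted(graph.downstream("clean.orders"))
# Provenance: where does the churn model's data come from?
sources = sorted(graph.upstream("model:churn_predictor"))
print(affected)
print(sources)
```

In this sketch, a change to the hypothetical `clean.orders` table is traced forward to every dependent table, dashboard, and model, while the model's inputs are traced back to the raw source. The same two questions, asked at table or column granularity, are what the article describes as downstream impact assessment and upstream provenance.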