Dremio Continues to Reduce the Zone of Confusion Between Data Lakes and Data Warehouses with New Dart Initiative Release

Dremio, the SQL Lakehouse Platform company, today achieved another milestone in closing the gap between cloud data lakes and cloud data warehouses. Today’s release marks the second delivery in the company’s Dart Initiative, which enables customers to run all mission-critical SQL workloads directly on the cloud data lake.

Dremio embarked on the Dart Initiative in June 2021 to help companies run a greater range of mission-critical BI workloads directly on the data lake, delivering more than 2x faster performance and drastically improved resource efficiency compared with previous Dremio versions. This subsequent Dart Initiative release introduces several more enhancements, including SQL expression processing that is more than 5x faster than in previous versions.

According to the 2020 Gartner® Market Guide for Analytics Query Accelerators report, “Analytics query accelerators seek to shrink the performance impact of the zone of confusion. Put another way, they are trying to move the ‘line of good enough’ to the point where the data lake can provide sufficient optimization on the data to make it suitable for an increasing percentage of workloads.”1 With the Dart Initiative, Dremio seeks to leapfrog a “good enough” notion of data lakes and make them the clear and obvious choice for BI and analytics workloads in the enterprise.

“It’s clear that the data lake can already support BI workloads of the most mission-critical nature. Three of the Fortune Five companies that already run Dremio in production today are doing just that. We want to push the boundaries of what’s possible in the data lakehouse and deliver the best BI experience for our customers. To that end, the Dart Initiative has been chipping away at the Zone of Confusion between data lakes and warehouses in critical areas such as query performance and acceleration, SQL coverage, and transactionality.”

Tomer Shiran, founder and Chief Product Officer at Dremio

Here are some of the key innovations of the Dremio Dart Initiative Fall 2021 release.

Scale-out Metadata Collection and Storage

Achieving near-instantaneous query startup times has been out of reach for traditional query engines, which must perform a significant amount of work to parse, plan, and gather dataset metadata for each query before it can be executed. In contrast, Dremio enables interactive performance directly on data lake storage by drastically reducing the amount of computation required at runtime. Dremio’s ability to efficiently compute, store, and leverage metadata plays a major role in enabling this.

This Dart Initiative release delivers near real-time metadata refresh for datasets, so users query the most current version of the data and gain timely visibility into recent schema and data changes. Dremio achieved this data freshness by refactoring metadata processing into a parallel, executor-based process, with metadata now stored and managed in Apache Iceberg tables.

Parallelizing metadata processing across executors and leveraging capabilities and best practices from Iceberg makes all metadata operations much faster and more scalable. This enhanced metadata management approach enables Dremio to deliver metadata refresh times up to 20x faster than previous versions of Dremio, while governing refreshes with the same workload management capabilities as queries, such as engine routing, priority, and concurrency controls. As demonstrated in Figure 1, performance improves as the dataset size increases. Data freshness in turn leads to more accurate insights and business decisions for enterprises across a variety of use cases, including customer experience and loyalty, marketing campaign optimization, operational efficiency, and customer 360.
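To make the idea of parallel, executor-based metadata collection concrete, here is a minimal Python sketch. It is not Dremio's implementation: the `FILES` dictionary, the file names, and the functions `collect_file_stats` and `refresh_metadata` are hypothetical stand-ins, and a thread pool plays the role of the executors. The sketch gathers per-file statistics concurrently and then merges them into a table-level summary, loosely in the spirit of the per-file statistics that table formats such as Apache Iceberg keep.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for data files in lake storage:
# file path -> values of a single column in that file.
FILES = {
    "part-0.parquet": [5, 9, 2],
    "part-1.parquet": [7, 1],
    "part-2.parquet": [8, 8, 3, 4],
}

def collect_file_stats(path):
    """Compute row count and min/max for one file (the per-file unit of work)."""
    values = FILES[path]
    return {"path": path, "rows": len(values), "min": min(values), "max": max(values)}

def refresh_metadata(paths, workers=4):
    """Collect per-file stats in parallel, then merge them into table-level stats."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_file = list(pool.map(collect_file_stats, paths))
    return {
        "files": per_file,
        "rows": sum(f["rows"] for f in per_file),
        "min": min(f["min"] for f in per_file),
        "max": max(f["max"] for f in per_file),
    }

metadata = refresh_metadata(sorted(FILES))
print(metadata["rows"], metadata["min"], metadata["max"])
```

Because each file is processed independently, adding workers shortens the refresh as the number of files grows, which is the general reason a parallel design scales better than a single coordinator scanning files one by one.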

Hardware-Optimized Query Processing

Dremio is an in-memory engine powered by Apache Arrow2, an open source columnar standard for in-memory computing that was co-created by Dremio. Gandiva, a component of Arrow, is an LLVM-based toolkit that enables vectorized execution directly on in-memory Arrow buffers by generating code to evaluate SQL expressions in a way that fully leverages the pipelining and SIMD capabilities of modern CPUs. This Dart Initiative release enables Dremio to accelerate expression processing by more than 5x, ultimately providing a significant performance increase for end users.
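The following toy Python sketch illustrates only the general idea of expression code generation over columnar batches. It does not use Gandiva, LLVM, or Arrow, and it does not produce SIMD code; the expression-tree format and the functions `to_source`, `columns_in`, and `compile_expression` are assumptions made for illustration. The expression is translated to source code once, compiled once, and the resulting kernel is then applied to whole columns rather than re-interpreting the expression tree for every row.

```python
from array import array

OPS = {"add": "+", "sub": "-", "mul": "*"}

def to_source(expr):
    """Translate a tiny expression tree into a Python expression over index i."""
    kind = expr[0]
    if kind == "col":
        return f"{expr[1]}[i]"
    if kind == "lit":
        return repr(float(expr[1]))
    return f"({to_source(expr[1])} {OPS[kind]} {to_source(expr[2])})"

def columns_in(expr):
    """Return the set of column names referenced by the expression."""
    if expr[0] == "col":
        return {expr[1]}
    if expr[0] == "lit":
        return set()
    return columns_in(expr[1]) | columns_in(expr[2])

def compile_expression(expr):
    """Generate and compile a kernel that evaluates expr over whole columns."""
    cols = sorted(columns_in(expr))
    params = ", ".join(cols + ["n"])
    src = (
        f"def kernel({params}):\n"
        f"    return array('d', [{to_source(expr)} for i in range(n)])\n"
    )
    namespace = {"array": array}
    exec(src, namespace)  # code generation happens once per expression
    return namespace["kernel"], cols

# price * qty + 1.5, evaluated over a two-row columnar batch
expr = ("add", ("mul", ("col", "price"), ("col", "qty")), ("lit", 1.5))
kernel, cols = compile_expression(expr)
batch = {"price": array("d", [2.0, 3.0]), "qty": array("d", [4.0, 5.0])}
result = kernel(*(batch[c] for c in cols), n=2)
print(list(result))  # [9.5, 16.5]
```

A real engine would emit machine code that processes many values per instruction; the sketch only shows the compile-once, evaluate-per-batch structure.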

Expanded SQL Coverage and Data Lakehouse Support

The Summer 2021 Dremio Dart Initiative empowered companies to run an even broader set of enterprise SQL workloads on Dremio by vastly expanding SQL coverage to include additional functions, operators, and SQL grammar constructs. The Fall 2021 Dart Initiative release extends the SQL coverage introduced through the prior Dart release, with functions such as Pivot/Unpivot and filtered aggregates. Risk analysis in insurance, maximizing revenue in travel and transportation, improving clinical trials in pharma, and enabling credit risk assessment in banking are among the many use cases that benefit from the expanded SQL coverage via this Dart release.
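As an illustration of what a filtered aggregate looks like, the sketch below runs the standard SQL `FILTER (WHERE ...)` clause through Python's built-in `sqlite3` module. It uses SQLite rather than Dremio, so it demonstrates the SQL construct only, and Dremio's exact syntax may differ. The `claims` table, its columns, and its rows are hypothetical, and the query assumes the bundled SQLite is version 3.30 or newer. Placing one filtered aggregate per status also yields a simple pivot-style result, with one column per status value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (region TEXT, status TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [
        ("east", "open", 100.0),
        ("east", "closed", 250.0),
        ("west", "open", 75.0),
        ("west", "open", 25.0),
    ],
)

# One pass over the table; each aggregate sees only the rows matching its filter.
rows = conn.execute(
    """
    SELECT region,
           SUM(amount) FILTER (WHERE status = 'open')   AS open_amount,
           SUM(amount) FILTER (WHERE status = 'closed') AS closed_amount,
           COUNT(*)                                     AS total_claims
    FROM claims
    GROUP BY region
    ORDER BY region
    """
).fetchall()

for row in rows:
    print(row)
```

Note that a filtered `SUM` with no matching rows returns NULL (Python `None`), as happens for closed claims in the `west` region here.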

Aside from broadening the scope of SQL workloads, this Dart release also expands Dremio’s support for open-source table formats. Table formats, such as Apache Iceberg and Delta Lake, enable companies to perform inserts, updates, and deletes with transactional consistency, and time travel, directly on data lake storage. Table formats have surged in popularity as these features were previously only supported by data warehouses. With this release, companies can now run interactive BI workloads on both of the leading lakehouse table formats, Apache Iceberg and Delta Lake.

