Alluxio, the developer of open source data orchestration software for large-scale workloads, today announced the immediate availability of version 2.7 of its Data Orchestration Platform. The new release delivers 5x improved I/O efficiency for machine learning (ML) training at significantly lower cost by parallelizing data loading, data preprocessing, and training pipelines. Alluxio 2.7 also provides enhanced performance insights and support for open table formats such as Apache Hudi and Iceberg, making it easier to scale access to data lakes for faster Presto- and Spark-based analytics.
“Alluxio 2.7 further strengthens Alluxio’s position as a key component for AI, Machine Learning, and deep learning in the cloud. In an age of growing datasets and increased computing power from CPUs and GPUs, machine learning and deep learning have become popular techniques for AI. The rise of these techniques advances the state of the art for AI, but also exposes some challenges for the access to data and storage systems.”
Haoyuan Li, Founder and CEO, Alluxio
“We deployed Alluxio in a cluster of 1000 nodes to accelerate the data preprocessing of model training on our game AI platform. Alluxio has proven to be stable, scalable and manageable,” said Peng Chen, Engineer Manager in the big data team at Tencent. “As more and more big data and AI applications are containerized, Alluxio is becoming the top choice for large organizations as an intermediate layer to accelerate data analytics and model training.”
“Data teams with large-scale analytics and AI/ML computing frameworks are under increasing pressure to make a growing number of data sources more easily accessible, while also maintaining performance levels as data locality, network IO, and rising costs come into play,” said Mike Leone, Analyst, ESG. “Organizations want to use more affordable and scalable storage options like cloud object stores, but they want peace of mind knowing they don’t have to make costly application changes or experience new performance issues. Alluxio is helping organizations address these challenges by abstracting away storage details while bringing data closer to compute, especially in hybrid cloud and multi-cloud environments.”
Alluxio 2.7 Community and Enterprise Editions feature new capabilities, including:
Alluxio and NVIDIA’s DALI for ML
NVIDIA’s Data Loading Library (DALI) is a commonly used Python library that supports CPU and GPU execution for data loading and preprocessing to accelerate deep learning. With release 2.7, the Alluxio platform has been optimized to work with DALI for Python-based ML applications that include a data loading and preprocessing step as a precursor to model training and inference. By accelerating I/O-heavy stages and allowing them to run in parallel with the compute-intensive training that follows, end-to-end training on the Alluxio data platform achieves significant performance gains over traditional solutions. Unlike alternatives suited only to smaller datasets, the solution scales out.
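The overlap of loading, preprocessing, and training described above can be illustrated with a minimal sketch in plain Python. It does not use the Alluxio or DALI APIs; the functions load_and_preprocess, producer, and train are hypothetical names chosen for illustration, and the "training step" is a placeholder sum. The sketch only shows the general idea: a background thread prepares batches into a bounded queue while the main thread consumes them.

```python
import queue
import threading


def load_and_preprocess(sample_id):
    # Hypothetical stand-in for reading one sample from storage
    # and preprocessing it.
    return sample_id * 2


def producer(sample_ids, batches, batch_size):
    # Runs on a background thread: builds batches while the consumer trains.
    batch = []
    for sample_id in sample_ids:
        batch.append(load_and_preprocess(sample_id))
        if len(batch) == batch_size:
            batches.put(batch)  # blocks when the prefetch queue is full
            batch = []
    if batch:
        batches.put(batch)  # final partial batch
    batches.put(None)  # sentinel: no more data


def train(sample_ids, batch_size=4, prefetch=2):
    # The bounded queue lets loading run ahead of training by at most
    # `prefetch` batches.
    batches = queue.Queue(maxsize=prefetch)
    worker = threading.Thread(
        target=producer, args=(sample_ids, batches, batch_size)
    )
    worker.start()
    steps = []
    while True:
        batch = batches.get()
        if batch is None:
            break
        steps.append(sum(batch))  # hypothetical stand-in for a training step
    worker.join()
    return steps


if __name__ == "__main__":
    print(train(range(10)))
```

In this sketch the prefetch parameter bounds how far data preparation may run ahead, which keeps memory use predictable while the consumer is kept busy.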
Data Loading at Scale
At the heart of Alluxio’s value proposition are data management capabilities that complement caching and the unification of disparate data sources. As the use of Alluxio has grown for compute and storage spanning multiple geographical locations, the software continues to evolve to keep scaling, now using a new technique for batching data management jobs. Batching jobs such as data loading, performed using an embedded execution engine, reduces the resource requirements for the management controller and lowers the cost of provisioned infrastructure.
Ease of Use on Kubernetes
Alluxio now supports a native Container Storage Interface (CSI) Driver for Kubernetes, as well as a Kubernetes operator for ML, making it easier than ever to operate ML pipelines on the Alluxio platform in containerized environments. The Alluxio volume type is now natively available for Kubernetes environments. Agility and ease of use are a constant focus in this release.
Insight Driven Dynamic Cache Sizing for Presto
An intelligent new capability called Shadow Cache makes it easy to strike a balance between high performance and cost by dynamically delivering insights that measure the impact of cache size on response times. For multi-tenant Presto environments at scale, this self-managing feature significantly reduces management overhead.
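To illustrate the kind of question such insights answer (how hit ratio changes with cache size), here is a minimal sketch that replays an access trace against simulated LRU caches of several capacities. This is not Alluxio's Shadow Cache implementation, which this announcement does not describe; the function names lru_hit_ratio and hit_ratio_by_size and the choice of LRU eviction are assumptions made for illustration.

```python
from collections import OrderedDict


def lru_hit_ratio(trace, capacity):
    # Replay the access trace against a simulated LRU cache holding at most
    # `capacity` keys and return the fraction of accesses that were hits.
    cache = OrderedDict()
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(trace) if trace else 0.0


def hit_ratio_by_size(trace, capacities):
    # Estimate the hit ratio for each candidate cache size.
    return {c: lru_hit_ratio(trace, c) for c in capacities}


if __name__ == "__main__":
    trace = ["a", "b", "c", "a", "b", "c"]
    print(hit_ratio_by_size(trace, [2, 3]))
```

On the small cyclic trace in the example, a cache of two entries never hits while a cache of three hits on every repeat access, which shows why measuring the working set matters before choosing a cache size.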
“Data platform teams utilize Alluxio to streamline data preprocessing and loading phases in a world where storage is separated from ML computation,” said Adit Madan, Senior Product Manager, Alluxio. “This simplicity enables maximum utilization of GPUs with frameworks such as Spark ML, TensorFlow and PyTorch. The Alluxio solution is available on multiple cloud platforms such as AWS, GCP, and Azure Cloud, and now also on Kubernetes in private data centers or public clouds.”
Proven at global web scale in production for modern data services, Alluxio is the developer of open source data orchestration software for the cloud. Alluxio moves data closer to data analytics and machine learning compute frameworks in any cloud across clusters, regions, clouds and countries, providing memory-speed data access to files and objects. Intelligent data tiering and data management deliver consistent high performance to customers in financial services, high tech, retail and telecommunications. Alluxio is in production use today at eight out of the top ten internet companies. Venture-backed by Andreessen Horowitz, Seven Seas Partners and Volcanics Ventures, Alluxio was founded at UC Berkeley’s AMPLab by the creators of the Tachyon open source project.