Ballista is a proof-of-concept distributed compute platform, primarily implemented in Rust and powered by Apache Arrow. It is built on an architecture that allows other programming languages (such as Python, C++, and Java) to be supported as first-class citizens without paying a serialization penalty.
The release of Apache Arrow 3.0.0 introduced many breaking changes in the Rust implementation (for good reason). As a result, the Rust executor is being re-implemented, and this work is ongoing.
The current plan is to release version 0.4.0 once the following items are completed:
- Compile the Rust implementation against Arrow 3.0.0
- Get TPC-H benchmarks working against a single Ballista executor
- Get TPC-H benchmarks working with distributed execution against a Ballista cluster
To follow the progress of this work, please refer to the "This Week in Ballista" blog.
For the latest stable version of Ballista, see branch-0.3.
The foundational technologies in Ballista are:
- Apache Arrow memory model and compute kernels for efficient processing of data (see the sketch after this list).
- Apache Arrow Flight Protocol for efficient data transfer between processes.
- Google Protocol Buffers for serializing query plans.
- Docker for packaging up executors along with user-defined code.
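To make the Arrow memory model and compute kernels item concrete, here is a minimal sketch, assuming the `arrow` crate at roughly the 3.0 line (module paths may differ slightly between releases). It builds a small columnar `RecordBatch` and applies a vectorized aggregate kernel to one of its columns; this is plain Arrow code, not Ballista-specific.

```rust
use std::sync::Arc;

use arrow::array::{Int32Array, StringArray};
use arrow::compute::kernels::aggregate::sum;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> arrow::error::Result<()> {
    // Define a schema with a string column and an integer column.
    let schema = Arc::new(Schema::new(vec![
        Field::new("city", DataType::Utf8, false),
        Field::new("trips", DataType::Int32, false),
    ]));

    // Columnar data: each column is a contiguous Arrow array.
    let cities = StringArray::from(vec!["Boston", "Denver", "Austin"]);
    let trips = Int32Array::from(vec![10, 20, 30]);

    let batch = RecordBatch::try_new(schema, vec![Arc::new(cities), Arc::new(trips)])?;

    // Apply a vectorized compute kernel directly over the column.
    let total = sum(batch
        .column(1)
        .as_any()
        .downcast_ref::<Int32Array>()
        .expect("Int32Array"));

    println!("rows = {}, total trips = {:?}", batch.num_rows(), total);
    Ok(())
}
```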
Ballista can be deployed in Kubernetes, or as a standalone cluster using etcd for discovery.
The following diagram highlights some of the integrations that will be possible with this unique architecture. Note that not all components shown here are available yet.
Although Ballista is largely inspired by Apache Spark, there are some key differences.
- The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead of GC pauses.
- Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still largely row-based today.
- The combination of Rust and Arrow provides excellent memory efficiency; memory usage can be 5x-10x lower than Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of distributed compute.
- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors in any programming language with minimal serialization overhead (see the sketch below).
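As a rough illustration of the last point, the following sketch (again assuming the `arrow` crate around the 3.0 release, where `StreamReader::try_new` takes a single argument) writes a `RecordBatch` to the Arrow IPC stream format and reads it back. This is the same encoding that Arrow Flight streams between processes, so the columns cross the boundary without any per-row conversion; the local round-trip here is for illustration only and is not Ballista's executor code.

```rust
use std::io::Cursor;
use std::sync::Arc;

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::ipc::reader::StreamReader;
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

fn main() -> arrow::error::Result<()> {
    let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )?;

    // Serialize the batch using the Arrow IPC stream format into an
    // in-memory buffer (Flight carries these same IPC messages over gRPC).
    let mut buffer: Vec<u8> = Vec::new();
    {
        let mut writer = StreamWriter::try_new(&mut buffer, &schema)?;
        writer.write(&batch)?;
        writer.finish()?;
    }

    // Read the batch back; the columns are reconstructed without any
    // row-by-row serialization or deserialization.
    let reader = StreamReader::try_new(Cursor::new(buffer))?;
    for maybe_batch in reader {
        let received = maybe_batch?;
        println!("received {} rows", received.num_rows());
    }

    Ok(())
}
```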
The following example should help illustrate the current capabilities of Ballista.
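This is a minimal sketch of what a distributed query can look like from Rust, assuming a DataFrame-style `BallistaContext` API similar to DataFusion's. The names and signatures shown here (`BallistaContext::remote`, `read_parquet`, `aggregate`, `collect`, and the example host, port, and file path) are illustrative and may not match the released API exactly; please consult the user guide and the examples in this repository for working code.

```rust
// Illustrative only: assumes a DataFrame-style BallistaContext similar to
// DataFusion's API; see the user guide for the exact, released API.
use ballista::prelude::*;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to a Ballista cluster (host and port are examples).
    let ctx = BallistaContext::remote("localhost", 50051);

    // Build a logical plan against a Parquet data set and aggregate it.
    let df = ctx
        .read_parquet("/mnt/nyctaxi/yellow_tripdata_2020-01.parquet")?
        .aggregate(
            vec![col("passenger_count")],
            vec![min(col("fare_amount")), max(col("fare_amount"))],
        )?;

    // The plan is serialized (protobuf) and executed remotely; results come
    // back as Arrow record batches.
    let results = df.collect().await?;
    println!("received {} batches", results.len());
    Ok(())
}
```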
Ballista releases are now available on crates.io, Maven Central and Docker Hub. Please refer to the user guide for instructions on using a released version of Ballista.
The user guide is hosted at https://ballistacompute.org, along with the blog where news and release notes are posted.
Developer documentation can be found in the docs directory.
See CONTRIBUTING.md for information on contributing to this project.