Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Ballista roadmap #1166

Merged
merged 3 commits into from
Oct 23, 2021
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 23 additions & 3 deletions docs/source/specification/roadmap.md
Original file line number Diff line number Diff line change
Expand Up @@ -92,8 +92,28 @@ Note: There are some additional thoughts on a datafusion-cli vision on [#1096](h
- publishing to apt, brew, and possible NuGet registry so that people can use it more easily
- adopt a shorter name, like dfcli?

## Ballista
# Ballista

# Vision
Ballista is a distributed compute platform based on Apache Arrow and DataFusion. It provides a query scheduler that
breaks a physical plan into stages and tasks and then schedules tasks for execution across the available executors
in the cluster.

TBD
Having Ballista as part of the DataFusion codebase helps ensure that DataFusion remains suitable for distributed
compute. For example, it helps ensure that physical query plans can be serialized to protobuf format and that they
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for this context

remain language-agnostic so that executors can be built in languages other than Rust.

## Ballista Roadmap

## Move query scheduler into DataFusion

The Ballista scheduler has some advantages over DataFusion query execution because it doesn't try to eagerly execute
the entire query at once but breaks it down into a directionally-acyclic graph (DAG) of stages and executes a
configurable number of stages and tasks concurrently. It should be possible to push some of this logic down to
DataFusion so that the same scheduler can be used to scale across cores in-process and across nodes in a cluster.

## Implement execution-time cost-based optimizations based on statistics

After the execution of a query stage, accurate statistics are available for the resulting data. These statistics
could be leveraged by the scheduler to optimize the query during execution. For example, when performing a hash join
it is desirable to load the smaller side of the join into memory and in some cases we cannot predict which side will
be smaller until execution time.