apache · alamb · Oct 23, 2021 · Oct 22, 2021 · Oct 22, 2021 · Oct 22, 2021
diff --git a/docs/source/specification/roadmap.md b/docs/source/specification/roadmap.md
@@ -92,8 +92,28 @@ Note: There are some additional thoughts on a datafusion-cli vision on [#1096](h
 - publishing to apt, brew, and possible NuGet registry so that people can use it more easily
 - adopt a shorter name, like dfcli?
 
-## Ballista
+# Ballista
 
-# Vision
+Ballista is a distributed compute platform based on Apache Arrow and DataFusion. It provides a query scheduler that
+breaks a physical plan into stages and tasks and then schedules tasks for execution across the available executors
+in the cluster.
 
-TBD
+Having Ballista as part of the DataFusion codebase helps ensure that DataFusion remains suitable for distributed
+compute. For example, it helps ensure that physical query plans can be serialized to protobuf format and that they
+remain language-agnostic so that executors can be built in languages other than Rust.
+
+## Ballista Roadmap
+
+## Move query scheduler into DataFusion
+
+The Ballista scheduler has some advantages over DataFusion query execution because it doesn't try to eagerly execute
+the entire query at once but breaks it down into a directionally-acyclic graph (DAG) of stages and executes a
+configurable number of stages and tasks concurrently. It should be possible to push some of this logic down to
+DataFusion so that the same scheduler can be used to scale across cores in-process and across nodes in a cluster.
+
+## Implement execution-time cost-based optimizations based on statistics
+
+After the execution of a query stage, accurate statistics are available for the resulting data. These statistics
+could be leveraged by the scheduler to optimize the query during execution. For example, when performing a hash join
+it is desirable to load the smaller side of the join into memory and in some cases we cannot predict which side will
+be smaller until execution time.