From c1553dfafa09de1b2441cdb1d22a251a675419a7 Mon Sep 17 00:00:00 2001 From: Weston Pace Date: Thu, 1 Aug 2024 16:47:43 -0700 Subject: [PATCH] feat: clarify the meaning of plans (#616) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit #612 and #613 highlighted a number of ambiguities in plan interpretation. This should address those ambiguities. Co-authored-by: Ingo Müller --- site/docs/relations/basics.md | 12 +++++++++--- site/docs/serialization/_config | 6 ++++-- site/docs/serialization/basics.md | 25 +++++++++++++++++++++++++ 3 files changed, 38 insertions(+), 5 deletions(-) create mode 100644 site/docs/serialization/basics.md diff --git a/site/docs/relations/basics.md b/site/docs/relations/basics.md index 41e86f425..e940741a8 100644 --- a/site/docs/relations/basics.md +++ b/site/docs/relations/basics.md @@ -1,6 +1,14 @@ # Basics -Substrait is designed to allow a user to construct an arbitrarily complex data transformation plan. The plan is composed of one or more relational operations. Relational operations are well-defined transformation operations that work by taking zero or more input datasets and transforming them into zero or more output transformations. Substrait defines a core set of transformations, but users are also able to extend the operations with their own specialized operations. +Substrait is designed to allow a user to describe arbitrarily complex data transformations. These transformations are composed of one or more relational operations. Relational operations are well-defined transformation operations that work by taking zero or more input datasets and transforming them into zero or more output transformations. Substrait defines a core set of transformations, but users are also able to extend the operations with their own specialized operations. + +## Plans + +A plan is a tree of relations. The root of the tree is the final output of the plan. Each node in the tree is a relational operation. The children of a node are the inputs to the operation. The leaves of the tree are the input datasets to the plan. + +Plans can be composed together using reference relations. This allows for the construction of common plans that can be reused in multiple places. If a plan has no cycles (there is only one plan or each reference relation only references later plans) then the plan will form a DAG (Directed Acyclic Graph). + +## Relational Operators Each relational operation is composed of several properties. Common properties for relational operations include the following: @@ -10,8 +18,6 @@ Each relational operation is composed of several properties. Common properties f | Hints | A set of optionally provided, optionally consumed information about an operation that better informs execution. These might include estimated number of input and output records, estimated record size, likely filter reduction, estimated dictionary size, etc. These can also include implementation specific pieces of execution information. | Physical | | Constraint | A set of runtime constraints around the operation, limiting its consumption based on real-world resources (CPU, memory) as well as virtual resources like number of records produced, the largest record size, etc. | Physical | - - ## Relational Signatures In functions, function signatures are declared externally to the use of those signatures (function bindings). In the case of relational operations, signatures are declared directly in the specification. This is due to the speed of change and number of total operations. Relational operations in the specification are expected to be <100 for several years with additions being infrequent. On the other hand, there is an expectation of both a much larger number of functions (1,000s) and a much higher velocity of additions. diff --git a/site/docs/serialization/_config b/site/docs/serialization/_config index 30dd0ee96..8642e6b77 100644 --- a/site/docs/serialization/_config +++ b/site/docs/serialization/_config @@ -1,3 +1,5 @@ arrange: - - binary_serialization.md - - text_serialization.md + +- basics.md +- binary_serialization.md +- text_serialization.md diff --git a/site/docs/serialization/basics.md b/site/docs/serialization/basics.md new file mode 100644 index 000000000..093719df4 --- /dev/null +++ b/site/docs/serialization/basics.md @@ -0,0 +1,25 @@ +# Basics + +Substrait is designed to be serialized into various different formats. Currently we support a binary serialization for +transmission of plans between programs (e.g. IPC or network communication) and a text serialization for debugging and human readability. Other formats may be added in the future. + +These formats serialize a collection of plans. Substrait does not define how a collection of plans is to be interpreted. +For example, the following scenarios are all valid uses of a collection of plans: + +- A query engine receives a plan and executes it. It receives a collection of plans with a single root plan. The + top-level node of the root plan defines the output of the query. Non-root plans may be included as common subplans + which are referenced from the root plan. +- A transpiler may convert plans from one dialect to another. It could take, as input, a single root plan. Then + it could output a serialized binary containing multiple root plans. Each root plan is a representation of the + input plan in a different dialect. +- A distributed scheduler might expect 1+ root plans. Each root plan describes a different stage of computation. + +Libraries should make sure to thoroughly describe the way plan collections will be produced or consumed. + +## Root plans + +We often refer to query plans as a graph of nodes (typically a DAG unless the query is recursive). However, we +encode this graph as a collection of trees with a single root tree that references other trees (which may also +transitively reference other trees). Plan serializations all have some way to indicate which plan(s) are "root" +plans. Any plan that is not a root plan and is not referenced (directly or transitively) by some root plan +can safely be ignored.