Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: define sideband optimization hints #705

Merged
merged 13 commits into from
Oct 8, 2024
49 changes: 49 additions & 0 deletions proto/substrait/algebra.proto
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,12 @@ message RelCommon {

substrait.extensions.AdvancedExtension advanced_extension = 10;

// Save or load a system-specific computation for use in optimizing a remote operation.
// The anchor refers to the source/destination of the computation. The computation type
// and number refer to the current relation.
repeated SavedComputation saved_computations = 11;
repeated LoadedComputation loaded_computations = 12;

// The statistics related to a hint (physical properties of records)
message Stats {
double row_count = 1;
Expand All @@ -59,6 +65,49 @@ message RelCommon {

substrait.extensions.AdvancedExtension advanced_extension = 10;
}

enum ComputationType {
COMPUTATION_TYPE_UNSPECIFIED = 0;
COMPUTATION_TYPE_HASHTABLE = 1;
COMPUTATION_TYPE_BLOOM_FILTER = 2;
COMPUTATION_TYPE_UNKNOWN = 9999;
jacques-n marked this conversation as resolved.
Show resolved Hide resolved
}

message SavedComputation {
// The value corresponds to a plan unique number for that datastructure. Any particular
// computation may be saved only once but it may be loaded multiple times.
int32 computation_id = 1;
// The type of this computation. While a plan may use COMPUTATION_TYPE_UNKNOWN for all
// of its types it is recommended to use a more specific type so that the optimization
// is more portable. The consumer should be able to decide if an unknown type here
// matches the same unknown type at a different plan and ignore the optimization if they
// are mismatched.
ComputationType type = 2;
// The instance of the given computation type on this relation. For instance, a 2 here
// with computation type COMPUTATION_TYPE_BLOOM_FILTER refers to the second bloom filter.
// The local system can use any numbering system it wants but for better compatibility
// it is suggested to refer to computations in order of the input that they are derived
// from. Computation numbers start at 1.
int32 number = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I struggle with the use of this. Anchor seems sufficient to have referencing between locations and if we're trying to identify engine meaningful, I think better to just allow a hint any value on the computation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This defines a sharing group. Because there is a potentially one to many relationship for every computation we need some way of defining which group a particular reference is from.

Copy link
Contributor

@jacques-n jacques-n Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get it.

I think you're saying that a single rel could save, let's say, two hashtables. And five operators could consume all/some/none of those two hash tables. I think anchor is more than enough for that. Why do we also need number?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you know if the first hashtable or the second hashtable was supposed to be output for that computation group? The same goes for the consuming side -- how do you know if the that computation is destined for the first or the second hashtable slot? Anchor (or computation group id) tells you which relations are consuming it and not where in the relation it is used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main thought is the "how the number maps to an internal structure" is opaque to consumers, whether it is one number or two. Whether we have anchor1 and anchor2 as the left and right or anchor1,number1 and anchor1,number2 doesn't impact that the mapping is hidden/implicit.

For now I suggest that we just eliminate the second number.

}

message LoadedComputation {
// The value corresponds to a plan unique number for that datastructure. Any particular
// computation may be saved only once but it may be loaded multiple times.
int32 computation_id_reference = 1;
// The type of this computation. While a plan may use COMPUTATION_TYPE_UNKNOWN for all
// of its types it is recommended to use a more specific type so that the optimization
// is more portable. The consumer should be able to decide if an unknown type here
// matches the same unknown type at a different plan and ignore the optimization if they
// are mismatched.
ComputationType type = 2;
// The instance of the given computation type on this relation. For instance, a two here
// with computation type COMPUTATION_TYPE_BLOOM_FILTER refers to the second bloom filter.
// The local system can use any numbering system it wants but for better compatibility
// it is suggested to refer to computations in order of the input that they are derived
// from. Computation numbers start at 1.
int32 number = 3;
}
}
}

Expand Down
3 changes: 2 additions & 1 deletion site/docs/relations/_config
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
arrange:
- basics.md
- common_fields.md
- logical_relations.md
- physical_relations.md
- user_defined_relations.md
- embedded_relations.md
- embedded_relations.md
28 changes: 28 additions & 0 deletions site/docs/relations/common_fields.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
# Common Fields

Every relation contains a common section containing optional hints and emit behavior.


## Emit

A relation which has a direct emit kind outputs the relation's output without reordering or selection. A relation that specifies an emit output mapping can output its output columns in any order and may leave output columns out.

???+ info "Relation Output"

* Relations by default provide as their output the list of all of its input columns plus any generated columns as its output columns. One notable exception is aggregations which only output new columns.
EpsilonPrime marked this conversation as resolved.
Show resolved Hide resolved


## Hints

Hints provide information that can improve performance but cannot be used to control the behavior. Table statistics, runtime constraints, name hints, and saved computations all fall into this category.

???+ info "Hint Design"

* If a hint is not present or has incorrect data the consumer should be able to arrive at the correct result.
EpsilonPrime marked this conversation as resolved.
Show resolved Hide resolved


### Saved Computations

Computations can be used to save on data structure to use elsewhere. For instance, let's say we have a plan with a HashEquiJoin and an AggregateDistinct operation. The HashEquiJoin could save its hash table as part of saved computation id number 1 and the AggregateDistinct could read in computation id number 1.
EpsilonPrime marked this conversation as resolved.
Show resolved Hide resolved

Now let's try a more complicated example. We have a relation that has constructs two hash tables and we'd like one of them to go to our aggregate relation still but the other to go elsewhere. We can use the computation number to select which data structure goes where. For instance computation number 1 could be hash table number 1 and computation number 2 could be hash table number 2. The reciving entity just needs to know which of its data structures it needs to put that computation in. So if it has 5 hash table datastructures the LoadedComputation record needs to point to the number that it intends for that incoming data to go.
Loading