-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
distsqlrun: add test infra to compare results of processors and operators #36081
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nice work! I think it will be useful for lots of things, including the merge joiner.
It looks like it doesn't always pass yet though in case you hadn't noticed:
=== RUN TestSortChunksAgainstProcessor
[14:36:32]--- FAIL: TestSortChunksAgainstProcessor (0.03s)
[14:36:32] columnar_operators_test.go:78: different results on row 0;
[14:36:32] expected:
[14:36:32] [0 1 9]
[14:36:32] got:
[14:36:32] [5 6 9]
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, @jordanlewis, and @yuzefovich)
pkg/sql/distsqlrun/columnar_utils_test.go, line 159 at r1 (raw file):
"processor output:\n %s\ncolumnar operator output:\n %s", i, procRows.String(outputTypes), colOpRows.String(outputTypes)) } }
Don't you also need to verify that used
is all true? Also this algorithm is O(n^2) - are the input sizes small enough that it doesn't matter? You could also sort the two slices or use a map, but you'd have to use some kind of key encoding thing to do the sort which might be a minor pain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'm aware of the bug and will fix it shortly.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, and @jordanlewis)
pkg/sql/distsqlrun/columnar_utils_test.go, line 159 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Don't you also need to verify that
used
is all true? Also this algorithm is O(n^2) - are the input sizes small enough that it doesn't matter? You could also sort the two slices or use a map, but you'd have to use some kind of key encoding thing to do the sort which might be a minor pain.
I'm not checking used
because colOpRows
and procRows
at this point necessarily have the same number of rows - if it weren't true, it would have been caught in the reading loop when either of the "producers" outputted a row while the other didn't. (I left a comment about this.)
At the moment, I'm imagining that the input sizes will be fairly small, so I don't think a squared algorithm is a problem. If, however, we decide to use it on large inputs, we'll need to adjust it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, @jordanlewis, and @yuzefovich)
pkg/sql/distsqlrun/columnar_utils_test.go, line 122 at r2 (raw file):
break } else { if rowProc == nil {
Since both statements in the both modify control flow, this block doesn't have to be in an else statement.
Taking a second look, it seems like its an XOR, so you could probably change the boolean logic to something like
if (rowProc == nil) != (rowColOp == nil) {
//build the error string out based on the nil values of either at this point on
//different results, return error
}
if (rowProc == nil) && (rowColOp == nil) {
break
}
// both are non nil, so continue
pkg/sql/sqlbase/testutils.go, line 691 at r2 (raw file):
// MakeRandIntRowsModulus constructs a numRows * numCols table where the values // are random integers in the range [0, modulus). func MakeRandIntRowsModulus(rng *rand.Rand, numRows int, numCols int, modulus int) EncDatumRows {
What's the reasoning behind having modulus in this function name?
Would something like MakeRanIntInRange
be clearer maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just wondering, why don't we just assume all tests are anyOrder
? Correct me if I'm wrong, but I'm thinking that the clarity might outweigh the performance gained, especially on a test
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, @jordanlewis, and @yuzefovich)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking is that there are operators that are expected to return the results in a precise order (for example, sorter), so then anyOrder
is false, but there are also operators that can return the results in an arbitrary order (for example, hash joiner), so then anyOrder
is true.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, and @jordanlewis)
pkg/sql/distsqlrun/columnar_utils_test.go, line 122 at r2 (raw file):
Previously, georgeutsin (George Utsin) wrote…
Since both statements in the both modify control flow, this block doesn't have to be in an else statement.
Taking a second look, it seems like its an XOR, so you could probably change the boolean logic to something like
if (rowProc == nil) != (rowColOp == nil) { //build the error string out based on the nil values of either at this point on //different results, return error } if (rowProc == nil) && (rowColOp == nil) { break } // both are non nil, so continue
Thanks for the suggestion. I refactored the code, and it should now be a lot more comprehensible.
pkg/sql/sqlbase/testutils.go, line 691 at r2 (raw file):
Previously, georgeutsin (George Utsin) wrote…
What's the reasoning behind having modulus in this function name?
Would something likeMakeRanIntInRange
be clearer maybe?
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, and @jordanlewis)
pkg/sql/distsqlrun/columnar_utils_test.go, line 35 at r3 (raw file):
// anyOrder determines whether the results should be matched in order (when // anyOrder is false) or as sets (when anyOrder is true). func verifyColOperator(
While recently making some changes to the general sorter and sort chunks tests, I realized that there is a third possibility of order - "partial.". For example, when we have input with two columns, but it should be sorted only on one of the columns, when rows are equal on that single column, the order is arbitrary. Possibly, verifyColOperator
should take in an optional function that would check whether two sets of results from a processor and a columnar operator are "equivalent" - whether they both satisfy the same partial ordering. But maybe this is an overkill, and in any case, I think we should merge this guy and do a follow up to introduce the check for partial order (if we decide it's worth it).
2682223
to
eae9f77
Compare
…tors Adds test infrastructure that sets up a processor and the corresponding columnar operator (as well as necessary columnarizers and materializers), runs both paths, and checks whether the output matches. Also, adds tests for general sorter and sort chunks using the introduced infrastructure. Release note: None
@georgeutsin @jordanlewis PTAL at this guy. |
Also, I'm not sure what our policy on files names here: I know that we're calling the engine "vectorized" but the operators are "columnar". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good stuff, let's keep iterating to see if we can improve the usability/come up with the right abstractions
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, and @jordanlewis)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah... regarding the vector/column stuff, I would say that it's good to use the word column
when we're talking about data columns, so the naming in general is okay. Vectorized is what people tend to call the technique of operating on sql data in a column-at-a-time fashion.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, and @jordanlewis)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reviews!
bors r+
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, and @jordanlewis)
36081: distsqlrun: add test infra to compare results of processors and operators r=yuzefovich a=yuzefovich Adds test infrastructure that sets up a processor and the corresponding columnar operator (as well as necessary columnarizers and materializers), runs both paths, and checks whether the output matches. Also, adds tests for general sorter and sort chunks using the introduced infrastructure. Fixes: #35922. Sort chunks test actually found a bug when ordering columns are not "in order", i.e. when the ordering is, for example, on columns `2, 0, 1` and the prefix match len is 1, the chunker wrongly assumes that the input is already ordered on column 0 whereas it is actually ordered on column 2. I'm not sure whether `distsqlrun` package is the appropriate place, but it was the easiest package to place in. Release note: None Co-authored-by: Yahor Yuzefovich <[email protected]>
Build succeeded |
Adds test infrastructure that sets up a processor and the corresponding
columnar operator (as well as necessary columnarizers and materializers),
runs both paths, and checks whether the output matches.
Also, adds tests for general sorter and sort chunks using the introduced
infrastructure.
Fixes: #35922.
Sort chunks test actually found a bug when ordering columns are not "in order", i.e. when the ordering is, for example, on columns
2, 0, 1
and the prefix match len is 1, the chunker wrongly assumes that the input is already ordered on column 0 whereas it is actually ordered on column 2.I'm not sure whether
distsqlrun
package is the appropriate place, but it was the easiest package to place in.Release note: None