distsqlrun: add test infra to compare results of processors and operators #36081

yuzefovich · 2019-03-23T18:26:45Z

Adds test infrastructure that sets up a processor and the corresponding
columnar operator (as well as necessary columnarizers and materializers),
runs both paths, and checks whether the output matches.

Also adds tests for the general sorter and sort chunks using the new
infrastructure.

Fixes: #35922.

The sort chunks test actually found a bug that occurs when the ordering columns are not "in order". For example, when the ordering is on columns 2, 0, 1 and the prefix match length is 1, the chunker wrongly assumes that the input is already ordered on column 0, whereas it is actually ordered on column 2.
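
To make the bug description concrete, here is a minimal sketch of the difference between the two interpretations. It is not the chunker's actual code: `alreadyOrderedCols` and `buggyAlreadyOrderedCols` are hypothetical helpers, and the "buggy" variant is only an assumption about how such a mix-up could look.

```go
package main

import "fmt"

// alreadyOrderedCols returns the columns the input is pre-sorted on: the
// first matchLen entries of the ordering, taken in ordering position.
// (Hypothetical helper, not part of the PR.)
func alreadyOrderedCols(orderingCols []int, matchLen int) []int {
	return orderingCols[:matchLen]
}

// buggyAlreadyOrderedCols illustrates the mistake: it assumes the pre-sorted
// columns are simply columns 0..matchLen-1, ignoring the ordering itself.
// (Assumed shape of the bug, for illustration only.)
func buggyAlreadyOrderedCols(matchLen int) []int {
	cols := make([]int, matchLen)
	for i := range cols {
		cols[i] = i
	}
	return cols
}

func main() {
	ordering := []int{2, 0, 1}
	fmt.Println(alreadyOrderedCols(ordering, 1)) // [2]
	fmt.Println(buggyAlreadyOrderedCols(1))      // [0]
}
```

The two only agree when the ordering columns happen to be 0, 1, 2, ..., which is presumably why tests with "in order" columns did not catch it.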

I'm not sure whether the distsqlrun package is the appropriate place for this, but it was the easiest package to put it in.

Release note: None

jordanlewis

This is nice work! I think it will be useful for lots of things, including the merge joiner.

It looks like it doesn't always pass yet, though, in case you hadn't noticed:

=== RUN   TestSortChunksAgainstProcessor
[14:36:32]--- FAIL: TestSortChunksAgainstProcessor (0.03s)
[14:36:32]    columnar_operators_test.go:78: different results on row 0;
[14:36:32]        expected:
[14:36:32]           [0 1 9]
[14:36:32]        got:
[14:36:32]           [5 6 9]

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, @jordanlewis, and @yuzefovich)

pkg/sql/distsqlrun/columnar_utils_test.go, line 159 at r1 (raw file):

					"processor output:\n		%s\ncolumnar operator output:\n		%s", i, procRows.String(outputTypes), colOpRows.String(outputTypes))
			}
		}

Don't you also need to verify that every entry of used is true? Also, this algorithm is O(n^2): are the input sizes small enough that it doesn't matter? You could instead sort the two slices or use a map, but the sort would need some kind of key encoding, which might be a minor pain.
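
For reference, the map-based alternative could look roughly like the sketch below. It uses plain `[][]int` rows as a stand-in for the real row type, and `rowKey` and `sameRowsAnyOrder` are hypothetical names; a real version would need a proper datum key encoding.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// rowKey encodes a row of ints as a string so it can be used as a map key.
// (Stand-in for a real key encoding of datums.)
func rowKey(row []int) string {
	parts := make([]string, len(row))
	for i, v := range row {
		parts[i] = strconv.Itoa(v)
	}
	return strings.Join(parts, ",")
}

// sameRowsAnyOrder reports whether a and b contain the same rows with the
// same multiplicities, ignoring order. It runs in expected linear time.
func sameRowsAnyOrder(a, b [][]int) bool {
	if len(a) != len(b) {
		return false
	}
	counts := make(map[string]int, len(a))
	for _, row := range a {
		counts[rowKey(row)]++
	}
	for _, row := range b {
		k := rowKey(row)
		if counts[k] == 0 {
			return false // b has a row that a lacks (or has fewer of)
		}
		counts[k]--
	}
	return true
}

func main() {
	a := [][]int{{0, 1, 9}, {5, 6, 9}, {0, 1, 9}}
	b := [][]int{{5, 6, 9}, {0, 1, 9}, {0, 1, 9}}
	fmt.Println(sameRowsAnyOrder(a, b)) // true
}
```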

yuzefovich

Yes, I'm aware of the bug and will fix it shortly.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, and @jordanlewis)

pkg/sql/distsqlrun/columnar_utils_test.go, line 159 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Don't you also need to verify that every entry of used is true? Also, this algorithm is O(n^2): are the input sizes small enough that it doesn't matter? You could instead sort the two slices or use a map, but the sort would need some kind of key encoding, which might be a minor pain.

I'm not checking used because colOpRows and procRows at this point necessarily have the same number of rows - if it weren't true, it would have been caught in the reading loop when either of the "producers" outputted a row while the other didn't. (I left a comment about this.)

At the moment, I'm imagining that the input sizes will be fairly small, so I don't think a quadratic algorithm is a problem. If, however, we decide to use it on large inputs, we'll need to adjust it.
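
The argument above can be sketched as follows. This is an illustrative version over `[][]int` rows, not the code from the PR; `matchAnyOrder` and `equalRow` are hypothetical names.

```go
package main

import "fmt"

// equalRow reports whether two rows have the same values.
func equalRow(a, b []int) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// matchAnyOrder checks that every columnar operator row can be paired with a
// distinct processor row. It is O(n^2) in the number of rows.
func matchAnyOrder(procRows, colOpRows [][]int) error {
	if len(procRows) != len(colOpRows) {
		return fmt.Errorf("different number of rows: %d vs %d", len(procRows), len(colOpRows))
	}
	used := make([]bool, len(procRows))
	for i, colOpRow := range colOpRows {
		found := false
		for j, procRow := range procRows {
			if !used[j] && equalRow(colOpRow, procRow) {
				used[j] = true
				found = true
				break
			}
		}
		if !found {
			return fmt.Errorf("columnar operator row %d %v has no unmatched processor row", i, colOpRow)
		}
	}
	// No check of used is needed here: both slices have n rows and each of
	// the n columnar operator rows claimed a distinct entry, so all n
	// entries of used are true.
	return nil
}

func main() {
	proc := [][]int{{0, 1, 9}, {5, 6, 9}}
	colOp := [][]int{{5, 6, 9}, {0, 1, 9}}
	fmt.Println(matchAnyOrder(proc, colOp)) // <nil>
}
```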

georgeutsin

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, @jordanlewis, and @yuzefovich)

pkg/sql/distsqlrun/columnar_utils_test.go, line 122 at r2 (raw file):

			break
		} else {
			if rowProc == nil {

Since the statements in the preceding if both modify control flow, this block doesn't have to be in an else statement.

Taking a second look, it seems like it's an XOR, so you could probably change the boolean logic to something like

if (rowProc == nil) != (rowColOp == nil) {
    // build the error string based on which of the two is nil at this point;
    // different results, return error
}

if (rowProc == nil) && (rowColOp == nil) {
    break
}

// both are non nil, so continue
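
Spelled out as a runnable sketch, the suggested control flow might look like this. It assumes two in-memory `[][]int` "streams" where a nil row signals exhaustion; `compareStreams` is a hypothetical name, and the real test reads rows from the processor and the materializer instead.

```go
package main

import "fmt"

// compareStreams reads one row at a time from both sources and returns an
// error on the first difference. A nil row means the source is exhausted,
// so rows stored in the slices must be non-nil.
func compareStreams(proc, colOp [][]int) error {
	for i := 0; ; i++ {
		var rowProc, rowColOp []int
		if i < len(proc) {
			rowProc = proc[i]
		}
		if i < len(colOp) {
			rowColOp = colOp[i]
		}
		// XOR: exactly one source produced a row while the other did not.
		if (rowProc == nil) != (rowColOp == nil) {
			return fmt.Errorf("different results on row %d: processor %v, columnar operator %v",
				i, rowProc, rowColOp)
		}
		// Both are nil: both sources are exhausted.
		if rowProc == nil {
			break
		}
		// Both are non-nil, so compare the values.
		if len(rowProc) != len(rowColOp) {
			return fmt.Errorf("different row widths on row %d", i)
		}
		for c := range rowProc {
			if rowProc[c] != rowColOp[c] {
				return fmt.Errorf("different results on row %d: expected %v, got %v",
					i, rowProc, rowColOp)
			}
		}
	}
	return nil
}

func main() {
	fmt.Println(compareStreams([][]int{{0, 1, 9}}, [][]int{{5, 6, 9}}))
}
```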

pkg/sql/sqlbase/testutils.go, line 691 at r2 (raw file):

// MakeRandIntRowsModulus constructs a numRows * numCols table where the values
// are random integers in the range [0, modulus).
func MakeRandIntRowsModulus(rng *rand.Rand, numRows int, numCols int, modulus int) EncDatumRows {

What's the reasoning behind having modulus in this function name?
Would something like MakeRanIntInRange be clearer maybe?

georgeutsin

I'm just wondering, why don't we just assume all tests are anyOrder? Correct me if I'm wrong, but I'm thinking that the clarity might outweigh the performance gained, especially in a test.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, @jordanlewis, and @yuzefovich)

yuzefovich

My thinking is that there are operators that are expected to return the results in a precise order (for example, sorter), so then anyOrder is false, but there are also operators that can return the results in an arbitrary order (for example, hash joiner), so then anyOrder is true.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, and @jordanlewis)

pkg/sql/distsqlrun/columnar_utils_test.go, line 122 at r2 (raw file):

Previously, georgeutsin (George Utsin) wrote…

Since the statements in the preceding if both modify control flow, this block doesn't have to be in an else statement.

Taking a second look, it seems like it's an XOR, so you could probably change the boolean logic to something like
if (rowProc == nil) != (rowColOp == nil) {
    // build the error string based on which of the two is nil at this point;
    // different results, return error
}

if (rowProc == nil) && (rowColOp == nil) {
    break
}

// both are non nil, so continue

Thanks for the suggestion. I refactored the code, and it should now be a lot more comprehensible.

pkg/sql/sqlbase/testutils.go, line 691 at r2 (raw file):

Previously, georgeutsin (George Utsin) wrote…

What's the reasoning behind having modulus in this function name?
Would something like MakeRanIntInRange be clearer maybe?

Done.

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, and @jordanlewis)

pkg/sql/distsqlrun/columnar_utils_test.go, line 35 at r3 (raw file):

// anyOrder determines whether the results should be matched in order (when
// anyOrder is false) or as sets (when anyOrder is true).
func verifyColOperator(

While recently making some changes to the general sorter and sort chunks tests, I realized that there is a third possibility for order: "partial". For example, when the input has two columns but should be sorted on only one of them, the order of rows that are equal on that column is arbitrary. Possibly, verifyColOperator should take an optional function that checks whether the two sets of results, from a processor and from a columnar operator, are "equivalent", i.e. whether they both satisfy the same partial ordering. But maybe this is overkill, and in any case, I think we should merge this guy and do a follow-up to introduce the check for partial order (if we decide it's worth it).
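
One possible shape for such an "equivalence" check is sketched below, assuming `[][]int` rows and ascending order only. All names here (`equivalentUnderPartialOrder`, `isOrderedOn`, and so on) are hypothetical and nothing like this is part of the PR: two results are treated as equivalent if both are ordered on the ordering columns and they contain the same rows as multisets.

```go
package main

import (
	"fmt"
	"sort"
)

// compareOn compares rows a and b lexicographically on the given columns.
func compareOn(a, b []int, cols []int) int {
	for _, c := range cols {
		if a[c] < b[c] {
			return -1
		}
		if a[c] > b[c] {
			return 1
		}
	}
	return 0
}

// isOrderedOn reports whether rows are non-decreasing on cols.
func isOrderedOn(rows [][]int, cols []int) bool {
	for i := 1; i < len(rows); i++ {
		if compareOn(rows[i-1], rows[i], cols) > 0 {
			return false
		}
	}
	return true
}

// sortedCopy returns a copy of rows fully sorted on allCols.
func sortedCopy(rows [][]int, allCols []int) [][]int {
	out := append([][]int(nil), rows...)
	sort.Slice(out, func(i, j int) bool { return compareOn(out[i], out[j], allCols) < 0 })
	return out
}

// equivalentUnderPartialOrder reports whether a and b both satisfy the
// ordering on orderingCols and contain the same rows, allowing rows that tie
// on orderingCols to appear in any relative order.
func equivalentUnderPartialOrder(a, b [][]int, orderingCols []int, numCols int) bool {
	if len(a) != len(b) || !isOrderedOn(a, orderingCols) || !isOrderedOn(b, orderingCols) {
		return false
	}
	allCols := make([]int, numCols)
	for i := range allCols {
		allCols[i] = i
	}
	sa, sb := sortedCopy(a, allCols), sortedCopy(b, allCols)
	for i := range sa {
		if compareOn(sa[i], sb[i], allCols) != 0 {
			return false
		}
	}
	return true
}

func main() {
	a := [][]int{{1, 5}, {1, 3}, {2, 0}}
	b := [][]int{{1, 3}, {1, 5}, {2, 0}}
	fmt.Println(equivalentUnderPartialOrder(a, b, []int{0}, 2)) // true
}
```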

yuzefovich · 2019-04-02T00:27:24Z

@georgeutsin @jordanlewis PTAL at this guy.

yuzefovich · 2019-04-02T00:29:21Z

Also, I'm not sure what our policy on file names is here: I know that we're calling the engine "vectorized" but the operators are "columnar".

georgeutsin

This is good stuff, let's keep iterating to see if we can improve the usability/come up with the right abstractions

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, and @jordanlewis)

jordanlewis

Yeah... regarding the vector/column stuff, I would say that it's good to use the word column when we're talking about data columns, so the naming in general is okay. Vectorized is what people tend to call the technique of operating on SQL data in a column-at-a-time fashion.

Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, and @jordanlewis)

yuzefovich

Thanks for the reviews!

bors r+

Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, and @jordanlewis)

craig · 2019-04-02T23:38:06Z

Build succeeded

yuzefovich requested review from jordanlewis, georgeutsin, asubiotto and a team on March 23, 2019 18:26

jordanlewis requested changes Mar 25, 2019

yuzefovich force-pushed the test_infra branch from 35a5fbe to 3e456e9 on March 25, 2019 17:13

yuzefovich commented Mar 25, 2019

georgeutsin reviewed Mar 25, 2019

yuzefovich force-pushed the test_infra branch from 3e456e9 to 938e3e3 on March 25, 2019 19:05

yuzefovich commented Mar 25, 2019

yuzefovich force-pushed the test_infra branch from 938e3e3 to 73eaf9f on March 28, 2019 02:11

yuzefovich commented Mar 28, 2019

yuzefovich force-pushed the test_infra branch 2 times, most recently from 2682223 to eae9f77 on March 28, 2019 20:07

yuzefovich force-pushed the test_infra branch from eae9f77 to a322bf8 on April 2, 2019 00:24

georgeutsin reviewed Apr 2, 2019

jordanlewis approved these changes Apr 2, 2019

yuzefovich commented Apr 2, 2019

craig bot merged commit a322bf8 into cockroachdb:master on Apr 2, 2019

yuzefovich deleted the test_infra branch on April 17, 2019 03:29