Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distsqlrun: add test infra to compare results of processors and operators #36081

Merged
merged 1 commit into from
Apr 2, 2019

Conversation

yuzefovich
Copy link
Member

Adds test infrastructure that sets up a processor and the corresponding
columnar operator (as well as necessary columnarizers and materializers),
runs both paths, and checks whether the output matches.

Also, adds tests for general sorter and sort chunks using the introduced
infrastructure.

Fixes: #35922.

Sort chunks test actually found a bug when ordering columns are not "in order", i.e. when the ordering is, for example, on columns 2, 0, 1 and the prefix match len is 1, the chunker wrongly assumes that the input is already ordered on column 0 whereas it is actually ordered on column 2.

I'm not sure whether distsqlrun package is the appropriate place, but it was the easiest package to place in.

Release note: None

@yuzefovich yuzefovich requested review from jordanlewis, georgeutsin, asubiotto and a team March 23, 2019 18:26
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice work! I think it will be useful for lots of things, including the merge joiner.

It looks like it doesn't always pass yet though in case you hadn't noticed:

=== RUN   TestSortChunksAgainstProcessor
[14:36:32]--- FAIL: TestSortChunksAgainstProcessor (0.03s)
[14:36:32]    columnar_operators_test.go:78: different results on row 0;
[14:36:32]        expected:
[14:36:32]           [0 1 9]
[14:36:32]        got:
[14:36:32]           [5 6 9]

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, @jordanlewis, and @yuzefovich)


pkg/sql/distsqlrun/columnar_utils_test.go, line 159 at r1 (raw file):

					"processor output:\n		%s\ncolumnar operator output:\n		%s", i, procRows.String(outputTypes), colOpRows.String(outputTypes))
			}
		}

Don't you also need to verify that used is all true? Also this algorithm is O(n^2) - are the input sizes small enough that it doesn't matter? You could also sort the two slices or use a map, but you'd have to use some kind of key encoding thing to do the sort which might be a minor pain.

Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm aware of the bug and will fix it shortly.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, and @jordanlewis)


pkg/sql/distsqlrun/columnar_utils_test.go, line 159 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Don't you also need to verify that used is all true? Also this algorithm is O(n^2) - are the input sizes small enough that it doesn't matter? You could also sort the two slices or use a map, but you'd have to use some kind of key encoding thing to do the sort which might be a minor pain.

I'm not checking used because colOpRows and procRows at this point necessarily have the same number of rows - if it weren't true, it would have been caught in the reading loop when either of the "producers" outputted a row while the other didn't. (I left a comment about this.)

At the moment, I'm imagining that the input sizes will be fairly small, so I don't think a squared algorithm is a problem. If, however, we decide to use it on large inputs, we'll need to adjust it.

Copy link
Contributor

@georgeutsin georgeutsin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, @jordanlewis, and @yuzefovich)


pkg/sql/distsqlrun/columnar_utils_test.go, line 122 at r2 (raw file):

			break
		} else {
			if rowProc == nil {

Since both statements in the both modify control flow, this block doesn't have to be in an else statement.

Taking a second look, it seems like its an XOR, so you could probably change the boolean logic to something like

if (rowProc == nil) != (rowColOp == nil) {
    //build the error string out based on the nil values of either at this point on
    //different results, return error
}

if (rowProc == nil) && (rowColOp == nil) {
    break
}

// both are non nil, so continue

pkg/sql/sqlbase/testutils.go, line 691 at r2 (raw file):

// MakeRandIntRowsModulus constructs a numRows * numCols table where the values
// are random integers in the range [0, modulus).
func MakeRandIntRowsModulus(rng *rand.Rand, numRows int, numCols int, modulus int) EncDatumRows {

What's the reasoning behind having modulus in this function name?
Would something like MakeRanIntInRange be clearer maybe?

Copy link
Contributor

@georgeutsin georgeutsin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just wondering, why don't we just assume all tests are anyOrder? Correct me if I'm wrong, but I'm thinking that the clarity might outweigh the performance gained, especially on a test

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, @jordanlewis, and @yuzefovich)

Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking is that there are operators that are expected to return the results in a precise order (for example, sorter), so then anyOrder is false, but there are also operators that can return the results in an arbitrary order (for example, hash joiner), so then anyOrder is true.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, and @jordanlewis)


pkg/sql/distsqlrun/columnar_utils_test.go, line 122 at r2 (raw file):

Previously, georgeutsin (George Utsin) wrote…

Since both statements in the both modify control flow, this block doesn't have to be in an else statement.

Taking a second look, it seems like its an XOR, so you could probably change the boolean logic to something like

if (rowProc == nil) != (rowColOp == nil) {
    //build the error string out based on the nil values of either at this point on
    //different results, return error
}

if (rowProc == nil) && (rowColOp == nil) {
    break
}

// both are non nil, so continue

Thanks for the suggestion. I refactored the code, and it should now be a lot more comprehensible.


pkg/sql/sqlbase/testutils.go, line 691 at r2 (raw file):

Previously, georgeutsin (George Utsin) wrote…

What's the reasoning behind having modulus in this function name?
Would something like MakeRanIntInRange be clearer maybe?

Done.

Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, and @jordanlewis)


pkg/sql/distsqlrun/columnar_utils_test.go, line 35 at r3 (raw file):

// anyOrder determines whether the results should be matched in order (when
// anyOrder is false) or as sets (when anyOrder is true).
func verifyColOperator(

While recently making some changes to the general sorter and sort chunks tests, I realized that there is a third possibility of order - "partial.". For example, when we have input with two columns, but it should be sorted only on one of the columns, when rows are equal on that single column, the order is arbitrary. Possibly, verifyColOperator should take in an optional function that would check whether two sets of results from a processor and a columnar operator are "equivalent" - whether they both satisfy the same partial ordering. But maybe this is an overkill, and in any case, I think we should merge this guy and do a follow up to introduce the check for partial order (if we decide it's worth it).

@yuzefovich yuzefovich force-pushed the test_infra branch 2 times, most recently from 2682223 to eae9f77 Compare March 28, 2019 20:07
…tors

Adds test infrastructure that sets up a processor and the corresponding
columnar operator (as well as necessary columnarizers and materializers),
runs both paths, and checks whether the output matches.

Also, adds tests for general sorter and sort chunks using the introduced
infrastructure.

Release note: None
@yuzefovich
Copy link
Member Author

@georgeutsin @jordanlewis PTAL at this guy.

@yuzefovich
Copy link
Member Author

Also, I'm not sure what our policy on files names here: I know that we're calling the engine "vectorized" but the operators are "columnar".

Copy link
Contributor

@georgeutsin georgeutsin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: This is good stuff, let's keep iterating to see if we can improve the usability/come up with the right abstractions

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, and @jordanlewis)

Copy link
Member

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... regarding the vector/column stuff, I would say that it's good to use the word column when we're talking about data columns, so the naming in general is okay. Vectorized is what people tend to call the technique of operating on sql data in a column-at-a-time fashion.

:lgtm:

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, and @jordanlewis)

Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reviews!

bors r+

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @asubiotto, @georgeutsin, and @jordanlewis)

craig bot pushed a commit that referenced this pull request Apr 2, 2019
36081: distsqlrun: add test infra to compare results of processors and operators r=yuzefovich a=yuzefovich

Adds test infrastructure that sets up a processor and the corresponding
columnar operator (as well as necessary columnarizers and materializers),
runs both paths, and checks whether the output matches.

Also, adds tests for general sorter and sort chunks using the introduced
infrastructure.

Fixes: #35922.

Sort chunks test actually found a bug when ordering columns are not "in order", i.e. when the ordering is, for example, on columns `2, 0, 1` and the prefix match len is 1, the chunker wrongly assumes that the input is already ordered on column 0 whereas it is actually ordered on column 2.

I'm not sure whether `distsqlrun` package is the appropriate place, but it was the easiest package to place in.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
@craig
Copy link
Contributor

craig bot commented Apr 2, 2019

Build succeeded

@craig craig bot merged commit a322bf8 into cockroachdb:master Apr 2, 2019
@yuzefovich yuzefovich deleted the test_infra branch April 17, 2019 03:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

exec: add test infrastructure to compare results of running operators against processors' output
4 participants