kv: avoid panic on TestMergeQueue/non-collocated failure #64199

Merged

nvanbenschoten
Member

Informs #63009.
Informs #64056.

In #63009/#64056, we saw that this test could flake with a nil pointer panic.
I don't know quite what's going on here, but when working on a patch for #62700,
I managed to hit this panic reliably by accidentally breaking all range merges.

After a bit of debugging, it became clear that we were always hitting a panic in
the reset stage of TestMergeQueue/sticky-bit because the previous subtest,
TestMergeQueue/non-collocated, was moving the RHS range to a different node,
failing to merge the two ranges, and failing itself. This soft failure was being
drowned out by the hard failure in the next subtest.

This commit replaces the crash with a failure that looks something like the
following when range merges are completely disabled:

```
--- FAIL: TestMergeQueue (0.34s)
    test_log_scope.go:73: test logs captured to: /var/folders/8k/436yf8s97cl_27vlh270yb8c0000gp/T/logTestMergeQueue627909827
    test_log_scope.go:74: use -show-logs to present logs inline
    --- FAIL: TestMergeQueue/both-empty (0.00s)
        client_merge_test.go:4183: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/lhs-undersize (0.00s)
        client_merge_test.go:4192: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/combined-threshold (0.00s)
        client_merge_test.go:4214: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/non-collocated (0.03s)
        client_merge_test.go:4236: replica doesn't exist
    --- FAIL: TestMergeQueue/sticky-bit (0.00s)
        client_merge_test.go:4243: right-hand side range not found
    --- FAIL: TestMergeQueue/sticky-bit-expiration (0.00s)
        client_merge_test.go:4268: right-hand side range not found
```
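
As a rough illustration of the pattern (not the actual diff in this PR), the idea is to check a lookup result for nil and report an error, rather than dereferencing it and crashing the whole test binary. The `store`, `replica`, `lookupReplica`, and `resetRHS` names below are hypothetical stand-ins, not the helpers used in `client_merge_test.go`; only the error text is taken from the output above.

```go
package main

import (
	"errors"
	"fmt"
)

// replica is a hypothetical stand-in for a store's replica of a range.
type replica struct {
	startKey string
}

// store is a hypothetical stand-in that maps start keys to replicas.
type store struct {
	replicas map[string]*replica
}

// lookupReplica returns nil when no replica exists for the key, which is
// the situation a failed earlier subtest can leave behind.
func (s *store) lookupReplica(key string) *replica {
	return s.replicas[key]
}

// resetRHS sketches a reset step that guards against a missing replica.
// Without the nil check, repl.startKey would panic with a nil pointer
// dereference and hide the earlier, softer failure.
func resetRHS(s *store, rhsKey string) (string, error) {
	repl := s.lookupReplica(rhsKey)
	if repl == nil {
		return "", errors.New("right-hand side range not found")
	}
	return repl.startKey, nil
}

func main() {
	s := &store{replicas: map[string]*replica{"a": {startKey: "a"}}}
	if _, err := resetRHS(s, "b"); err != nil {
		// In a real test this would be t.Fatal(err) instead of a print.
		fmt.Println("reset failed:", err)
	}
}
```

In a test, returning the error and calling `t.Fatal` on it fails only the current subtest, so the output of the preceding subtest stays visible.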

I expect that under stress on master, we will see the TestMergeQueue/non-collocated subtest fail.

The fact that TestMergeQueue/non-collocated is the test failing means that we may want to have @aayushshah15 take over this investigation, since he's made changes in that area recently. What do you two think?

Contributor

@irfansharif irfansharif left a comment

Happy to hand it over to Aayush.

@nvanbenschoten
Member Author

bors r+

@nvanbenschoten
Member Author

bors r+

@craig
Contributor

craig bot commented Apr 27, 2021

Build failed:

@nvanbenschoten
Member Author

Unrelated flake.

bors r+

@craig
Contributor

craig bot commented Apr 27, 2021

Build succeeded:

@craig craig bot merged commit f3286c1 into cockroachdb:master Apr 27, 2021
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/rangeMergeTestFlake branch April 28, 2021 03:27
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request May 2, 2021
Informs cockroachdb#63009.
Informs cockroachdb#64056.

In cockroachdb#64199, we found that the flake was likely due to the non-collocated
subtest, so this commit un-skips the parent test and skips only this
single subtest.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jul 8, 2021
Informs cockroachdb#63009.
Informs cockroachdb#64056.

In cockroachdb#64199, we found that the flake was likely due to the non-collocated
subtest, so this commit un-skips the parent test and skips only this
single subtest.