-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: nil pointer dereference in TestMergeQueue #63009
Labels
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Comments
irfansharif
added
the
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
label
Apr 2, 2021
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Apr 26, 2021
Informs cockroachdb#63009. Informs cockroachdb#64056. In cockroachdb#63009/cockroachdb#64056, we saw that this test could flake with a nil pointer panic. I don't know quite what's going on here, but when working on a patch for cockroachdb#62700, I managed to hit this panic reliably by accidentally breaking all range merges. After a bit of debugging, it became clear that we were always hitting a panic in the `reset` stage of `TestMergeQueue/sticky-bit` because the previous subtest, `TestMergeQueue/non-collocated`, was moving the RHS range to a different node, failing to merge the two range, and failing itself. This soft failure was being drowned out by the hard failure in the next subtest. This commit replaces the crash with a failure that looks something like the following when range merges are completely disabled: ``` --- FAIL: TestMergeQueue (0.34s) test_log_scope.go:73: test logs captured to: /var/folders/8k/436yf8s97cl_27vlh270yb8c0000gp/T/logTestMergeQueue627909827 test_log_scope.go:74: use -show-logs to present logs inline --- FAIL: TestMergeQueue/both-empty (0.00s) client_merge_test.go:4183: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/lhs-undersize (0.00s) client_merge_test.go:4192: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/combined-threshold (0.00s) client_merge_test.go:4214: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/non-collocated (0.03s) client_merge_test.go:4236: replica doesn't exist --- FAIL: TestMergeQueue/sticky-bit (0.00s) client_merge_test.go:4243: right-hand side range not found --- FAIL: TestMergeQueue/sticky-bit-expiration (0.00s) client_merge_test.go:4268: right-hand side range not found ``` I expect that under stress on master, we will see the `TestMergeQueue/non-collocated` subtest fail.
craig bot
pushed a commit
that referenced
this issue
Apr 27, 2021
64199: kv: avoid panic on TestMergeQueue/non-collocated failure r=nvanbenschoten a=nvanbenschoten Informs #63009. Informs #64056. In #63009/#64056, we saw that this test could flake with a nil pointer panic. I don't know quite what's going on here, but when working on a patch for #62700, I managed to hit this panic reliably by accidentally breaking all range merges. After a bit of debugging, it became clear that we were always hitting a panic in the `reset` stage of `TestMergeQueue/sticky-bit` because the previous subtest, `TestMergeQueue/non-collocated`, was moving the RHS range to a different node, failing to merge the two range, and failing itself. This soft failure was being drowned out by the hard failure in the next subtest. This commit replaces the crash with a failure that looks something like the following when range merges are completely disabled: ``` --- FAIL: TestMergeQueue (0.34s) test_log_scope.go:73: test logs captured to: /var/folders/8k/436yf8s97cl_27vlh270yb8c0000gp/T/logTestMergeQueue627909827 test_log_scope.go:74: use -show-logs to present logs inline --- FAIL: TestMergeQueue/both-empty (0.00s) client_merge_test.go:4183: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/lhs-undersize (0.00s) client_merge_test.go:4192: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/combined-threshold (0.00s) client_merge_test.go:4214: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/non-collocated (0.03s) client_merge_test.go:4236: replica doesn't exist --- FAIL: TestMergeQueue/sticky-bit (0.00s) client_merge_test.go:4243: right-hand side range not found --- FAIL: TestMergeQueue/sticky-bit-expiration (0.00s) client_merge_test.go:4268: right-hand side range not found ``` I expect that under stress on master, we will see the `TestMergeQueue/non-collocated` subtest fail. The fact that `TestMergeQueue/non-collocated` is the test failing means that we may want to have @aayushshah15 take over this investigation, since he's made changes in that area recently. What do you two think? Co-authored-by: Nathan VanBenschoten <[email protected]>
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
May 2, 2021
Informs cockroachdb#63009. Informs cockroachdb#64056. In cockroachdb#64199, we found that the flake was likely due to the non-collocated subtest, so this commit un-skips the parent test and skips only this single subtest.
aayushshah15
added a commit
to aayushshah15/cockroach
that referenced
this issue
Jun 1, 2021
The test started flaking after ce73bc1. The issue was that there was a race between when the merge queue was enabled via an `Override()` call and when these tests assert that a given range actually got merged away. In other words, these tests were sometime getting to their assertion before the relevant server learned that the merge queue was enabled. This patch modifies the test to instead enable the merge queue via SQL and not through the `Override()` call. Resolves cockroachdb#64866 Resolves cockroachdb#64056 Resolves cockroachdb#63009 Release note: None
aayushshah15
added a commit
to aayushshah15/cockroach
that referenced
this issue
Jun 3, 2021
The test started flaking after ce73bc1. The issue was that there was a race between when the merge queue was enabled via an `Override()` call and when these tests assert that a given range actually got merged away. In other words, these tests were sometime getting to their assertion before the relevant server learned that the merge queue was enabled. This patch modifies the test to instead enable the merge queue via SQL and not through the `Override()` call. Resolves cockroachdb#64866 Resolves cockroachdb#64056 Resolves cockroachdb#63009 Release note: None
craig bot
pushed a commit
that referenced
this issue
Jun 3, 2021
65912: kvserver: deflake TestMergeQueue r=aayushshah15 a=aayushshah15 The test started flaking after ce73bc1. The issue was that there was a race between when the merge queue was enabled via an `Override()` call and when these tests assert that a given range actually got merged away. In other words, these tests were sometime getting to their assertion before the relevant server learned that the merge queue was enabled. This patch modifies the test to instead enable the merge queue via SQL and not through the `Override()` call on the cluster setting. Resolves #64866 Resolves #64056 Resolves #63009 Release note: None 66039: backupccl: fixing flaky TestCleanupIntentsDuringBackupPerformanceRegression test r=rickystewart,erikgrinaker a=aliher1911 Increasing test timeout from 10 to 20 sec to stop CI failing. This value is safe as regressions take 60 sec when they start showing quadratic behaviours. Release note: None Co-authored-by: Aayush Shah <[email protected]> Co-authored-by: Oleg Afanasyev <[email protected]>
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Jul 8, 2021
Informs cockroachdb#63009. Informs cockroachdb#64056. In cockroachdb#63009/cockroachdb#64056, we saw that this test could flake with a nil pointer panic. I don't know quite what's going on here, but when working on a patch for cockroachdb#62700, I managed to hit this panic reliably by accidentally breaking all range merges. After a bit of debugging, it became clear that we were always hitting a panic in the `reset` stage of `TestMergeQueue/sticky-bit` because the previous subtest, `TestMergeQueue/non-collocated`, was moving the RHS range to a different node, failing to merge the two range, and failing itself. This soft failure was being drowned out by the hard failure in the next subtest. This commit replaces the crash with a failure that looks something like the following when range merges are completely disabled: ``` --- FAIL: TestMergeQueue (0.34s) test_log_scope.go:73: test logs captured to: /var/folders/8k/436yf8s97cl_27vlh270yb8c0000gp/T/logTestMergeQueue627909827 test_log_scope.go:74: use -show-logs to present logs inline --- FAIL: TestMergeQueue/both-empty (0.00s) client_merge_test.go:4183: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/lhs-undersize (0.00s) client_merge_test.go:4192: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/combined-threshold (0.00s) client_merge_test.go:4214: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/non-collocated (0.03s) client_merge_test.go:4236: replica doesn't exist --- FAIL: TestMergeQueue/sticky-bit (0.00s) client_merge_test.go:4243: right-hand side range not found --- FAIL: TestMergeQueue/sticky-bit-expiration (0.00s) client_merge_test.go:4268: right-hand side range not found ``` I expect that under stress on master, we will see the `TestMergeQueue/non-collocated` subtest fail.
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Jul 8, 2021
Informs cockroachdb#63009. Informs cockroachdb#64056. In cockroachdb#64199, we found that the flake was likely due to the non-collocated subtest, so this commit un-skips the parent test and skips only this single subtest.
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Jul 16, 2021
Informs cockroachdb#63009. Informs cockroachdb#64056. In cockroachdb#63009/cockroachdb#64056, we saw that this test could flake with a nil pointer panic. I don't know quite what's going on here, but when working on a patch for cockroachdb#62700, I managed to hit this panic reliably by accidentally breaking all range merges. After a bit of debugging, it became clear that we were always hitting a panic in the `reset` stage of `TestMergeQueue/sticky-bit` because the previous subtest, `TestMergeQueue/non-collocated`, was moving the RHS range to a different node, failing to merge the two range, and failing itself. This soft failure was being drowned out by the hard failure in the next subtest. This commit replaces the crash with a failure that looks something like the following when range merges are completely disabled: ``` --- FAIL: TestMergeQueue (0.34s) test_log_scope.go:73: test logs captured to: /var/folders/8k/436yf8s97cl_27vlh270yb8c0000gp/T/logTestMergeQueue627909827 test_log_scope.go:74: use -show-logs to present logs inline --- FAIL: TestMergeQueue/both-empty (0.00s) client_merge_test.go:4183: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/lhs-undersize (0.00s) client_merge_test.go:4192: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/combined-threshold (0.00s) client_merge_test.go:4214: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/non-collocated (0.03s) client_merge_test.go:4236: replica doesn't exist --- FAIL: TestMergeQueue/sticky-bit (0.00s) client_merge_test.go:4243: right-hand side range not found --- FAIL: TestMergeQueue/sticky-bit-expiration (0.00s) client_merge_test.go:4268: right-hand side range not found ``` I expect that under stress on master, we will see the `TestMergeQueue/non-collocated` subtest fail.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Describe the problem
See this build run against the 21.1 branch.
+cc @tbg for routing.
The text was updated successfully, but these errors were encountered: