storage: TestStoreRangeMergeSlowWatcher times out under race [skipped] #37191
I reproed on dff4132 with:
|
👍 thanks for crossing the t, cc @andreimatei |
I believe this test is likely what's causing our mysterious 1h timeouts for master TC builds. I've reproed it after running for 40 min with the package in parallel with other packages (as it is on TC). |
The badness starts happening around Apr 10th. Exactly when, I can't tell yet; I've tricked myself bisecting once already because the amount of time it takes to get a failure seems to vary. In fact, I have yet to find any commit where the stress can survive 5 min. Every time it times out, this is the trace for the test's goroutine. It's blocked on a channel in a test knob the test adds.
|
Does the failure rate decrease as you increase the capacity of the channel that we're getting stuck on? (cockroach/pkg/storage/client_merge_test.go, line 2861 at 468b93b)
Perhaps the increased load from running the test in parallel is leading to lots of retries of the merge transaction. |
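For readers without the test in front of them: the knob being discussed is a channel that a store filter sends on when it sees a merge attempt, and the test goroutine receives from it. The following is a minimal, self-contained sketch (all names and counts are illustrative, not the actual code at line 2861) of why a small-capacity channel there leaves extra senders blocked once the merge transaction starts retrying:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Stand-in for the test knob's channel. The question above is whether
	// failures go away as this capacity grows.
	const capacity = 0
	mergeAttempted := make(chan struct{}, capacity)

	// Each merge attempt (including txn retries under load) runs the
	// filter, which does a bare send on the channel.
	const attempts = 10
	for i := 0; i < attempts; i++ {
		go func(i int) {
			mergeAttempted <- struct{}{} // blocks once the buffer is full
			fmt.Printf("attempt %d signaled the test\n", i)
		}(i)
	}

	// The test goroutine only expects one attempt and receives once.
	<-mergeAttempted
	time.Sleep(100 * time.Millisecond)
	fmt.Println("remaining senders are still blocked; in a real test the request would hang")
}
```

With `capacity` raised to the number of attempts, none of the sends block, which is the behavior the question above is probing.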
See cockroachdb#37191 Release note: None
As @andy-kimball pointed out, this might be good for someone who's not Nathan, Andrei, or myself to look at. @andreimatei, wanna delegate? |
(PS I just sent a PR to skip the test) |
37229: storage: skip TestRangeMergeSlowWatcher r=nvanbenschoten a=tbg See #37191 Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>
@darinpp are you interested in this one? |
Yes, I'll take a look |
To repro I run something like
The failures seem to have two regimes. They happen very quickly since around |
I'm interested in taking a closer look at this one, since I have some time today. Darin's handing it over; if I can't make progress in the time I have, I can hand it back. |
I did notice that the logs had exactly 10 lines like this:
So it's very likely that it's blocking on:
Still trying to understand why we'd initiate the same merge 10 times in quick succession (roughly once a second). |
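If the merge really is being initiated ~10 times in quick succession and each attempt runs the filter, then every bare send after the one the test receives will hang that attempt. One common way such knobs avoid this class of hang is a non-blocking send; this is purely a hypothetical sketch of that pattern, not necessarily what the eventual fix does:

```go
package main

import "fmt"

// nonBlockingNotify sketches how a test knob can signal the test without
// stalling the caller: if the test hasn't drained the previous signal yet
// (e.g. because the merge txn is on its 10th retry), the send is dropped
// rather than blocking the KV request.
func nonBlockingNotify(mergeAttempted chan struct{}) {
	select {
	case mergeAttempted <- struct{}{}:
	default:
	}
}

func main() {
	mergeAttempted := make(chan struct{}, 1)
	for i := 0; i < 10; i++ { // ten retried merge attempts
		nonBlockingNotify(mergeAttempted)
	}
	<-mergeAttempted // the test still observes that a merge was attempted
	fmt.Println("observed a merge attempt; no sender ever blocked")
}
```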
Here is what I think is happening (though there are still gaps in my understanding):
|
Another thing is that I think a 1 minute timeout might give false repros, which can muddy the waters. Under heavy load, this test runs very slowly, with lots of timeouts and retries. So just because one of the parallel tests times out doesn't mean that we've reproduced the 45 minute deadlock shown in the original teamcity runs. As an illustration of this, I tried setting timeouts low and then increasing them:
Probably if I set the timeout to 1m, the test would fail after tens of minutes of heavy stress, as @andreimatei observed. This doesn't indicate it has deadlocked (i.e. a true repro), only that the timeout is too short given the load. This is an important point because the fix I'm working on still eventually fails with a timeout of 1m, but looking at the logs I believe these are "false" failures. If I increase the timeout to 5m, I don't see any failures during a 1h run I tried. |
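One way to make the "slow under load" vs. "genuinely deadlocked" distinction visible in the logs is a per-step deadline on the wait itself, so the failure message says what the test was stuck on instead of relying on the package-level timeout. A hypothetical helper along those lines (the name, the channel, and the 2-minute figure are all assumptions for illustration):

```go
package storage_test

import (
	"testing"
	"time"
)

// waitForMergeAttempt illustrates a per-step deadline: it fails fast with a
// targeted message rather than letting the package-level -timeout fire much
// later with only a goroutine dump to go on.
func waitForMergeAttempt(t *testing.T, mergeAttempted <-chan struct{}) {
	t.Helper()
	select {
	case <-mergeAttempted:
		// Progress: the merge reached the filter.
	case <-time.After(2 * time.Minute):
		t.Fatal("no merge attempt observed; likely stuck on the test knob, not just slow")
	}
}
```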
Fixed in #37477 |
https://teamcity.cockroachdb.com/viewLog.html?buildId=1265447&buildTypeId=Cockroach_UnitTests
https://teamcity.cockroachdb.com/viewLog.html?buildId=1265559&buildTypeId=Cockroach_UnitTests