sql/catalog/lease: acceptance/version-upgrade can sometimes wait a full lease duration leading to flakiness #84382
Comments
Mostly as a reminder to self: the assertion which fails is of this form,
Thus, it's the |
Looking at the logs in 5628020,
Those are the only logged messages from [1] https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/catalog/lease/lease.go#L108 |
It only logs when the count changes (cockroach/pkg/sql/catalog/lease/lease.go, line 106 in 88d3253).
|
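To illustrate why so few messages show up in the logs, here is a minimal sketch of the "log only when the count changes" pattern. It is not the code in lease.go; the `changeLogger` type, its `observe` method, and the message wording are hypothetical names made up for this example.

```go
package main

import "fmt"

// changeLogger remembers the last count it reported and stays silent
// until the count changes. Hypothetical helper, not the lease.go code.
type changeLogger struct {
	last    int
	started bool
}

// observe returns a message and true when count differs from the last
// reported value, and "", false otherwise.
func (c *changeLogger) observe(count int) (string, bool) {
	if c.started && count == c.last {
		return "", false
	}
	c.started = true
	c.last = count
	// Illustrative wording only; the real log line may differ.
	return fmt.Sprintf("outstanding leases: %d", count), true
}

func main() {
	var l changeLogger
	// The count stays at 1 across several polls, then drops to 0.
	for _, n := range []int{1, 1, 1, 0} {
		if msg, ok := l.observe(n); ok {
			fmt.Println(msg)
		}
	}
}
```

With this pattern, a lease that stays outstanding for minutes produces a single line when it is first seen and another when it goes away, which would match seeing only a couple of messages for a long wait.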
We might be able to pinpoint the leaker by running with cockroach/pkg/sql/catalog/lease/storage.go Line 144 in 88d3253
|
One change we could make in the short term is to lower the lease duration (and revert the change when this issue is fixed). Thoughts? |
Seems reasonable to me. |
What would be a reasonable value? E.g., is two minutes too short? Also, we could mess with it inside the |
I did exactly as you suggest (but using 1 minute instead of 2) when I was investigating this issue, and it does work around this bug. I just re-ran the test 250 times to be extra sure, and the issue never occurred (without the change, I'd almost always see at least one occurrence if I ran it 50 times). I'll open a PR, thanks all 👍 |
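As a rough sanity check on those run counts, the sketch below computes the chance of seeing at least one failure in a batch of runs. It assumes the roughly 3% per-run failure rate mentioned in the issue description and assumes runs are independent; both are simplifications, and `probAtLeastOneFailure` is a name invented for this example.

```go
package main

import (
	"fmt"
	"math"
)

// probAtLeastOneFailure returns the chance of seeing one or more failures
// in n independent runs when each run fails with probability p.
func probAtLeastOneFailure(p float64, n int) float64 {
	return 1 - math.Pow(1-p, float64(n))
}

func main() {
	const p = 0.03 // approximate failure rate from the issue description (assumption)
	fmt.Printf("50 runs:  %.3f\n", probAtLeastOneFailure(p, 50))
	fmt.Printf("250 runs: %.5f\n", probAtLeastOneFailure(p, 250))
}
```

Under those assumptions, a batch of 50 runs hits the failure a bit over three quarters of the time, while 250 clean runs in a row would be very unlikely (well under 0.1%) if the flake were still present at that rate.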
…ade. The `acceptance/version-upgrade` test uncovered a descriptor lease leak that can lead to the test timing out due to waiting a full lease duration (5 minutes by default), making it flaky. Once the bug is fixed, we should be able to use the default duration again. Relates to cockroachdb#84382. Release note: None.
84406: sql: make setup of leaf txn for streamer more bullet-proof r=yuzefovich a=yuzefovich
This commit makes sure that we try to use the streamer API only if we can actually create a non-nil `LeafTxn`. In some edge cases it appears that all the previous checks were insufficient, and this commit should take care of that. Note that I couldn't reproduce those edge cases manually, so there is no regression test. Fixes: #84239. Release note: None
84455: outliers: prepare for asynchronous processing r=matthewtodd a=matthewtodd
Moving a few things around in advance of the real work of #81021.
84485: ui: remove link to stmt details on sessions details r=maryliag a=maryliag
Previously, when a statement was active, it would show on the Sessions page with a link to view its details, but since the statement was not yet saved/persisted, clicking the link would crash the Statement Details page. This commit removes this link. Fixes #84462 Release note (ui change): Removal of `View Statement Details` link inside the Sessions Details page.
84488: roachtest: lower descriptor lease duration in acceptance/version-upgrade r=srosenberg a=renatolabs
The `acceptance/version-upgrade` test uncovered a descriptor lease leak that can lead to the test timing out due to waiting a full lease duration (5 minutes by default), making it flaky. Once the bug is fixed, we should be able to use the default duration again. Relates to #84382. Release note: None.
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Matthew Todd <[email protected]>
Co-authored-by: Marylia Gutierrez <[email protected]>
Co-authored-by: Renato Costa <[email protected]>
One note: I learned that some of the leaks were, at one point, caused by #91116. Given how legit that bug was, we ought to understand what's going on in these. |
The `acceptance/version-upgrade` test is known to be among the tests that fail more often in the GitHub CI build. The most common error when running that test happens when the cluster version never reaches the expected version after all nodes have been upgraded and restarted to use the new binary. The 5-minute timeout is reached, and the test fails.
Examples of this failure in TeamCity: 5633598, 5628020, and 5729515.
To Reproduce
It is possible to reproduce the error by running the test enough times. The failure rate seems to be around 3% based on the few hundred times I ran this test while trying to understand what's going on. To run the test 50 times, for example, the following command can be used:
Observed Behavior
While running the migration for 22.1-12 (`RemoveGrantPrivilege`), failing runs of this test wait for a lease on a descriptor to be released for the entire duration of the lease (5 minutes by default). I also noted that, generally, only 1 node in the (4-node) cluster fails to release the lease, although I'm not sure if this is always the case.
The following was also done to make sure the issue is related to these leaked leases:
Jira issue: CRDB-17626