Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/catalog/lease: acceptance/version-upgrade can sometimes wait a full lease duration leading to flakiness #84382

Closed
renatolabs opened this issue Jul 13, 2022 · 9 comments
Labels
A-schema-changes A-schema-descriptors Relating to SQL table/db descriptor handling. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@renatolabs
Copy link
Contributor

renatolabs commented Jul 13, 2022

The acceptance/version-upgrade test is known to be among the tests that fail more often in the GitHub CI build. The most common error when running that test happens when the cluster version never reaches the expected version after all nodes have been upgraded and restarted to use the new binary. The 5-minute timeout is reached, and the test fails.

Examples of this failure in TeamCity: 5633598, 5628020, and 5729515.

To Reproduce

It is possible to reproduce the error by running the test enough times. The failure rate seems to be around 3% based on the few hundred times I ran this test while trying to understand what's going on. To run the test 50 times, for example, the following command can be used:

$ ./bin/roachtest run --count 50 acceptance/version-upgrade

Observed Behavior

While running the migration for 22.1-12 (RemoveGrantPrivilege), failing runs of this test wait for a lease on a descriptor to be released for the entire duration of the lease (5 minutes by default). I also noted that, generally, only 1 node in the (4-node) cluster fails to release the lease, although I'm not sure if this is always the case.

The following was also done to make sure the issue is related to these leaked leases:

  • increase the lease duration to 15 mins: this has no impact on the failure rate of the test. This is expected, because -- with the default 5-minute lease duration -- when the lease leaks, the test fails with very high probability.
  • lower the lease duration to 1 minute: the failure is no longer observed in this case. This is also expected; by running the test enough times, I observed some runs where the lease was waited for 1 minute, and eventually the migration made progress. Since there were still 4 minutes left for the upgrade to run, it was able to complete successfully.

Jira issue: CRDB-17626

@renatolabs renatolabs added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-schema-descriptors Relating to SQL table/db descriptor handling. A-schema-changes labels Jul 13, 2022
@srosenberg
Copy link
Member

Mostly as a reminder to self–the assertion which fails is of this form,

expected version 22.1-26, got 22.1-10

Thus, it's the RemoveGrantPrivilege migration which immediately follows TrigramInvertedIndexes.

@srosenberg
Copy link
Member

Looking at the logs in 5628020,

2.unredacted/cockroach.7bd0e3a90904.roach.2022-07-02T04_47_27Z.010242.log:I220702 04:47:35.417719 1517 sql/catalog/lease/lease.go:146 ⋮ [n2,job=775509951122046978,upgrade=22.1-12] 143  waiting for 1 leases to expire: desc=[{‹users› 4 5}]
2.unredacted/cockroach.7bd0e3a90904.roach.2022-07-02T04_47_27Z.010242.log:I220702 04:47:35.434757 1517 sql/catalog/lease/lease.go:146 ⋮ [n2,job=775509951122046978,upgrade=22.1-12] 146  waiting for 2 leases to expire: desc=[{‹settings› 6 2}]
2.unredacted/cockroach.7bd0e3a90904.roach.2022-07-02T04_47_27Z.010242.log:I220702 04:47:35.443493 1517 sql/catalog/lease/lease.go:146 ⋮ [n2,job=775509951122046978,upgrade=22.1-12] 148  waiting for 1 leases to expire: desc=[{‹settings› 6 2}]
2.unredacted/cockroach.7bd0e3a90904.roach.2022-07-02T04_47_27Z.010242.log:I220702 04:47:35.872098 1517 sql/catalog/lease/lease.go:146 ⋮ [n2,job=775509951122046978,upgrade=22.1-12] 155  waiting for 1 leases to expire: desc=[{‹eventlog› 12 3}]

Those are the only logged messages from WaitForNoVersion [1], but I don't see any subsequent retry attempts.

[1] https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/catalog/lease/lease.go#L108

@ajwerner
Copy link
Contributor

but I don't see any subsequent retry attempts.

It only logs when the count changes

if count != lastCount {

@srosenberg
Copy link
Member

We might be able to pinpoint the leaker by running with --vmodule 2 which enables the tracing of acquired leases,

log.VEventf(ctx, 2, "storage attempting to acquire lease %v@%v", desc, expiration)

@renatolabs
Copy link
Contributor Author

One change we could make in the short term is to lower the lease duration (and revert the change when this issue is fixed). versionupgrade will still be flaky (there are other errors in there, unrelated to this one), but a lot less so.

Thoughts?

@ajwerner
Copy link
Contributor

Seems reasonable to me.

@srosenberg
Copy link
Member

What would be a reasonable value? E.g., is two minutes too short? Also, we could mess with it inside the versionupgrade test via sql.catalog.descriptor_lease_duration. @renatolabs Maybe do a few stress runs to see if lowering the duration also reduces the flakiness?

@renatolabs
Copy link
Contributor Author

What would be a reasonable value? E.g., is two minutes too short? Also, we could mess with it inside the versionupgrade test via sql.catalog.descriptor_lease_duration. @renatolabs Maybe do a few stress runs to see if lowering the duration also reduces the flakiness?

I did exactly as you suggest (but using 1 minute instead of 2) when I was investigating this issue, and it does work around this bug. I just re-ran the test 250 times to be extra sure, and the issue never occurred (without the change, I'd almost always see at least one occurrence if I ran it 50 times).

I'll open a PR, thanks all 👍

renatolabs added a commit to renatolabs/cockroach that referenced this issue Jul 15, 2022
…ade.

The `acceptance/version-upgrade` test uncovered a descriptor lease
leak that can lead to the test timing out due to waiting a full lease
duration (5 minutes by default), making it flaky.

Once the bug is fixed, we should be able to use the default duration
again.

Relates to cockroachdb#84382.

Release note: None.
renatolabs added a commit to renatolabs/cockroach that referenced this issue Jul 15, 2022
…ade.

The `acceptance/version-upgrade` test uncovered a descriptor lease
leak that can lead to the test timing out due to waiting a full lease
duration (5 minutes by default), making it flaky.

Once the bug is fixed, we should be able to use the default duration
again.

Relates to cockroachdb#84382.

Release note: None.
craig bot pushed a commit that referenced this issue Jul 15, 2022
84406: sql: make setup of leaf txn for streamer more bullet-proof r=yuzefovich a=yuzefovich

This commit makes sure that we try to use the streamer API only if we
can actually create a non-nil `LeafTxn`. In some edge cases it appears
that all the previous checks were insufficient, and this commit should
take care of that. Note that I couldn't reproduce those edge cases
manually, so there is no regression test.

Fixes: #84239.

Release note: None

84455: outliers: prepare for asynchronous processing r=matthewtodd a=matthewtodd

Moving a few things around in advance of the real work of #81021.

84485: ui: remove link to stmt details on sessions details r=maryliag a=maryliag

Previously, when a statment was active, it would show
on the Sessions page with a link to view its details,
but since the statement was not yet saved/persisted,
clicking the link it would crash the Statement Details
page.
This commit removes this link.

Fixes #84462

Release note (ui change): Removal of `View Statement Details`
link inside the Sessions Details page.

84488: roachtest: lower descriptor lease duration in acceptance/version-upgrade r=srosenberg a=renatolabs

The `acceptance/version-upgrade` test uncovered a descriptor lease
leak that can lead to the test timing out due to waiting a full lease
duration (5 minutes by default), making it flaky.

Once the bug is fixed, we should be able to use the default duration
again.

Relates to #84382.

Release note: None.

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Matthew Todd <[email protected]>
Co-authored-by: Marylia Gutierrez <[email protected]>
Co-authored-by: Renato Costa <[email protected]>
@ajwerner
Copy link
Contributor

ajwerner commented Dec 6, 2022

One note was that I learned that the cause of some of the leaks was at one point due to #91116. Given how legit that bug was, we ought to understand what's going on in these.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-schema-changes A-schema-descriptors Relating to SQL table/db descriptor handling. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
Development

No branches or pull requests

3 participants