sql: SET CLUSTER SETTING doesn't reliably wait for propagation when run on a tenant #87201
Notes from triage meeting: We don't currently guarantee that
Yeah, maybe simply documenting this is the way to go. @ajwerner do you have any thoughts?
@HonoreDB regarding the linked test, do we need this blocking behavior from
I'd say this is something of an edge case we can document away, and it is primarily painful only for testing.
I guess we do document this already:
Closing.
I did something about this here: #87564. |
87564: server/settingswatcher: track timestamps so values do not regress r=ajwerner a=ajwerner A rangefeed is allowed to send previously seen values. When it did, it would result in the observed value of a setting regressing. There's no need for this: we can track some timestamps and ensure we do not regress. Fixes #87502 Relates to #87201 Release note (bug fix): In rare cases, the value of a cluster setting could regress soon after it was set. This no longer happens for a given gateway node. Co-authored-by: Andrew Werner <[email protected]>
91565: server,settings: remove vestigial tenant-only testing knob r=ajwerner a=ajwerner This knob was being used by default to subvert the settings infrastructure in tenants on the local node. This led to hazardous interactions with the settingswatcher behavior. That library tries quite hard to synchronize updates to settings and ensure that they do not regress. By setting the setting above that layer, we could very much see them regress. As far as I can tell, this code came about before tenants could actually manage settings for themselves. In practice, this code would run prior to the transaction writing the setting, which generally meant that so long as you didn't flip settings back and forth, things would work out. Nevertheless, it was tech debt and is now removed. Fixes #87017 Informs #87201 Release note: None Co-authored-by: Andrew Werner <[email protected]>
Previously, the `tpcc/mixed-headroom` roachtests would reset the `preserve_downgrade_option` setting and then wait for the upgrade to finish by running a `SET CLUSTER SETTING version = '...'` statement. However, that is not reliable as it's possible for that statement to return an error if the resetting of the `preserve_downgrade_option` has not been propagated yet (see cockroachdb#87201). To avoid this type of flake (which has been observed in manual runs), we use a retry loop waiting for the cluster version to converge, as is done by the majority of upgrade-related roachtests. Epic: None. Release note: None
This test fails or flakes on large enough n: you can't quite rely on logic that does things like
Because even if you're the only database client, it's possible that the cluster settings won't be what you think they are after setting them.
This has come up a few times in tests, since they often exercise behavior against multiple cluster settings in the same session. In theory, though, it could also happen in real life: some cluster settings are the only way to control certain behavior, so if you want to set them "per statement", you have to change them repeatedly.
This is probably related to the implementation of waitForSettingUpdate and/or the rangefeed consumer that propagates settings, but it's not obvious to me where the race condition is. I think ideally waitForSettingUpdate would wait until it sees a setting being propagated that's tagged as being from its own unique statement id, or we should document that just as you can't set cluster settings in a transaction or multi-statement block, you can't rely on sessions being ordered with respect to them.
Jira issue: CRDB-19212