Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: unexpected "rangefeeds require kv.rangefeed.enabled" on system tables #76331

Closed
RaduBerinde opened this issue Feb 9, 2022 · 12 comments · Fixed by #76420
Closed

sql: unexpected "rangefeeds require kv.rangefeed.enabled" on system tables #76331

RaduBerinde opened this issue Feb 9, 2022 · 12 comments · Fixed by #76420
Assignees
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@RaduBerinde
Copy link
Member

RaduBerinde commented Feb 9, 2022

I am running into a problem which seems to be caused by system table rangefeeds in the tenant not working. I distilled the change I was working on to just a delay during tenant initialization.

Apply this diff:

diff --git a/pkg/server/node.go b/pkg/server/node.go
index 3ad254081e..4dd5fd236a 100644
--- a/pkg/server/node.go
+++ b/pkg/server/node.go
@@ -1390,6 +1390,7 @@ func (n *Node) TenantSettings(
 ) error {
 	ctx := n.storeCfg.AmbientCtx.AnnotateCtx(stream.Context())
 	ctxDone := ctx.Done()
+	time.Sleep(10 * time.Second)
 
 	// TODO(radu): implement this. For now we just respond with an initial event
 	// so the tenant can initialize.

Then run:

make test PKG=./pkg/ccl/logictestccl 'TESTS=TestTenantLogic/3node-tenant#01/tenant$$' TESTFLAGS='-v -show-logs' TESTTIMEOUT=30s

Sometimes the test passes, but usually it gets stuck and you see this in logs:

W220209 20:33:47.697524 5665 kv/kvclient/rangefeed/rangefeed.go:301  [nsql1,rangefeed=sql_instances] 190  rangefeed failed 0 times, restarting: kv/kvserver/replica_rangefeed.go:157: rangefeeds require the kv.rangefeed.enabled setting. See https://www.cockroachlabs.com/docs/v22.1/change-data-capture.html#enable-rangefeeds-to-reduce-latency
W220209 20:33:47.699613 2113 kv/kvclient/rangefeed/rangefeed.go:301  [nsql1,rangefeed=table-stats-cache] 191  rangefeed failed 0 times, restarting: kv/kvserver/replica_rangefeed.go:157: rangefeeds require the kv.rangefeed.enabled setting. See https://www.cockroachlabs.com/docs/v22.1/change-data-capture.html#enable-rangefeeds-to-reduce-latenc> y
W220209 20:33:47.701328 5772 kv/kvclient/rangefeed/rangefeed.go:301  [nsql1,rangefeed=settings-watcher] 192  rangefeed failed 0 times, restarting: kv/kvserver/replica_rangefeed.go:157: rangefeeds require the kv.rangefeed.enabled setting. See https://www.cockroachlabs.com/docs/v22.1/change-data-capture.html#enable-rangefeeds-to-reduce-latency
W220209 20:33:47.703405 5775 kv/kvclient/rangefeed/rangefeed.go:301  [nsql1,rangefeed=system-config-cache] 193  rangefeed failed 0 times, restarting: kv/kvserver/replica_rangefeed.go:157: rangefeeds require the kv.rangefeed.enabled setting. See https://www.cockroachlabs.com/docs/v22.1/change-data-capture.html#enable-rangefeeds-to-reduce-late> ncy
W220209 20:33:47.712925 6384 kv/kvclient/rangefeed/rangefeed.go:301  [nsql?,rangefeed=table-stats-cache] 194  rangefeed failed 0 times, restarting: kv/kvserver/replica_rangefeed.go:157: rangefeeds require the kv.rangefeed.enabled setting. See https://www.cockroachlabs.com/docs/v22.1/change-data-capture.html#enable-rangefeeds-to-reduce-latenc> y

This is on 4df7f50.

@ajwerner suspects #74555.

@RaduBerinde RaduBerinde added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Feb 9, 2022
@RaduBerinde
Copy link
Member Author

@yuzefovich ran into this as well (he hit ERROR: setting updated but timed out waiting to read new value when setting a cluster setting). I reproduced with his data directory and saw the same logs as above.

ajwerner added a commit to ajwerner/cockroach that referenced this issue Feb 11, 2022
Not doing this causes problems when we construct new tenants as they won't be
able to establish a rangefeed until their first full reconciliation completes
and then propagates.

Even this is not great. If the preceding range did not have rangefeeds enabled,
it would take a closed timestamp interval for this enablement to propagate.
Perhaps this is evidence that we should always carve out a span at the end of
the keyspace and set it to have rangefeeds enabled.

I'll leave that fix and testing of this to somebody else. Hope this is helpful
enough on its own. cc @irfansharif and @arulajmani. I believe this problem
has been blocking @RaduBerinde.

Fixes cockroachdb#76331

Release note: None
@ajwerner
Copy link
Contributor

#76420 is something and attacks the heart of the problem. There's still some rough edges to iron out here.

@onlysavior

This comment was marked as spam.

@RaduBerinde
Copy link
Member Author

@yuzefovich ran into this as well (he hit ERROR: setting updated but timed out waiting to read new value when setting a cluster setting). I reproduced with his data directory and saw the same logs as above.

Some more info here - Yahor ran into a problem and he gave me the cockroach-data dir. I started a regular node with that dir and saw these messages:

W220211 03:49:13.566659 530 kv/kvclient/rangefeed/rangefeed.go:301 ⋮ [n1,rangefeed=‹sql_instances›] 205  rangefeed failed 9 times, restarting: ‹kv/kvserver/replica_rangefeed.go›:157: rangefeeds require the kv.rangefeed.enabled setting. See ‹https://www.cockroachlabs.com/docs/v22.1/change-data-capture.html#enable-rangefeeds-to-reduce-latency›
W220211 03:49:13.635930 891 kv/kvclient/rangefeed/rangefeed.go:301 ⋮ [n1,rangefeed=‹settings-watcher›] 206  rangefeed failed 9 times, restarting: ‹kv/kvserver/replica_rangefeed.go›:157: rangefeeds require the kv.rangefeed.enabled setting. See ‹https://www.cockroachlabs.com/docs/v22.1/change-data-capture.html#enable-rangefeeds-to-reduce-latency›
I220211 03:49:13.715993 497 sql/sqlstats/persistedsqlstats/flush.go:32 ⋮ [n1] 207  flushing 2 stmt/txn fingerprints (10240 bytes) after 1.0089529s
W220211 03:49:13.861032 177 kv/kvclient/rangefeed/rangefeed.go:301 ⋮ [n1,rangefeed=‹table-stats-cache›] 208  rangefeed failed 10 times, restarting: ‹kv/kvserver/replica_rangefeed.go›:157: rangefeeds require the kv.rangefeed.enabled setting. See ‹https://www.cockroachlabs.com/docs/v22.1/change-data-capture.html#enable-rangefeeds-to-reduce-latency›
W220211 03:49:14.084262 299 kv/kvclient/rangefeed/rangefeed.go:301 ⋮ [n1,rangefeed=‹system-config-cache›] 209  rangefeed failed 9 times, restarting: ‹kv/kvserver/replica_rangefeed.go›:157: rangefeeds require the kv.rangefeed.enabled setting. See ‹https://www.cockroachlabs.com/docs/v22.1/change-data-capture.html#enable-rangefeeds-to-reduce-latency›

And I was able to reproduce ERROR: setting updated but timed out waiting to read new value when changing a cluster setting.

@ajwerner
Copy link
Contributor

@yuzefovich ran into this as well (he hit ERROR: setting updated but timed out waiting to read new value when setting a cluster setting). I reproduced with his data directory and saw the same logs as above.

Any chance I can get this data directory? On some level it's going to have to do with the span configs being incorrect. A scan of SELECT crdb_internal.pretty_key(start_key, 0), crdb_internal.pretty_key(end_key, 0), crdb_internal.pb_to_json('cockroach.roachpb.SpanConfig', config) from system.span_configurations will hold good clues.

@yuzefovich
Copy link
Member

Sent you a link on slack (out of paranoia of that directory containing some confidential information).

@ajwerner
Copy link
Contributor

Classic lack of version upgrade testing ☹️. There's a funny problem here. We watch for changes in span configs using a rangefeed. We used to decide if rangefeeds were enabled by looking at the keys of the replica but we stripped that in https://github.com/cockroachdb/cockroach/pull/74555/files#diff-b080aecf7d979c0b5a4c43973db709d0d866f137cb4e961ebfc0d979ad22c003. The problem is, in the mixed version state, we won't support the rangefeeds we need. That's bad and means we need to put the old logic back. The good news about the old logic is it's all still valid.

I'm not getting some ambient feels of concern about upgrade considerations for the tenants themselves, but I think if we just wholesale put the old logic back, we'll be fine.

@ajwerner ajwerner added branch-master Failures and bugs on the master branch. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Feb 11, 2022
@ajwerner
Copy link
Contributor

I think @RaduBerinde you'll be fine with #76420 for new tenants in new cluster and that new clusters bootstrapped off of master will be fine. We'll definitely need to do some mixed version testing. Sort of appalling that there's no mixed version test which sets a cluster setting.

@arulajmani
Copy link
Collaborator

Maybe I'm missing something, but I'm not entirely sure how we end up in this mixed version case. Looking here:

if id <= keys.MaxReservedDescID {
// We enable rangefeeds for system tables; various internal
// subsystems (leveraging system tables) rely on rangefeeds to
// function.
spanConfig.RangefeedEnabled = true
// We exclude system tables from strict GC enforcement, it's
// only really applicable to user tables.
spanConfig.GCPolicy.IgnoreStrictEnforcement = true
}

Doesn't this mean we should be supporting the rangefeeds we need? It also looks like we have this logic to support rangefeeds if a config is not explicitly set on a replica.


After typing all of that, I came across this logic where we seem to be swallowing an error and applying the default span config on the replica -- maybe this is what's causing the issue?

conf, err := sysCfg.GetSpanConfigForKey(ctx, key)
if err != nil {
if log.V(1) {
log.Infof(context.TODO(), "failed to get span config for key %s", key)
}
conf = s.cfg.DefaultSpanConfig
}
if s.cfg.SpanConfigsDisabled ||
!spanconfigstore.EnabledSetting.Get(&s.ClusterSettings().SV) ||
!s.cfg.Settings.Version.IsActive(ctx, clusterversion.EnableSpanConfigStore) {
repl.SetSpanConfig(conf)
}

If we don't want to bring back the old key stripping logic, maybe we can set RangefeedEnabled to be true on the DefaultSpanConfig that we have on the kvserver.StoreConfig? Essentially, moving this logic a layer above:

// We use the span configs infra to control whether rangefeeds are
// enabled on a given range. At the moment this only applies to
// system tables (on both host and secondary tenants). We need to
// consider two things:
// - The sql-side reconciliation process runs asynchronously. When
// the config for a given range is requested, we might not yet have
// it, thus falling back to the static config below.
// - Various internal subsystems rely on rangefeeds to function.
//
// Consequently, we configure our static fallback config to actually
// allow rangefeeds. As the sql-side reconciliation process kicks
// off, it'll install the actual configs that we'll later consult.
// For system table ranges we install configs that allow for
// rangefeeds. Until then, we simply allow rangefeeds when a more
// targeted config is not found.
fallbackConf := storeCfg.DefaultSpanConfig
fallbackConf.RangefeedEnabled = true
// We do the same for opting out of strict GC enforcement; it
// really only applies to user table ranges
fallbackConf.GCPolicy.IgnoreStrictEnforcement = true

craig bot pushed a commit that referenced this issue Feb 12, 2022
76394: sql: use IndexFetchSpec in TableReader r=RaduBerinde a=RaduBerinde

#### row: replace fetcher args with IndexFetchSpec

This commit replaces the row fetcher table args with IndexFetchSpec.

Release note: None

#### sql: use IndexFetchSpec in TableReader

This commit removes the table descriptor from TableReader and replaces
it with an IndexFetchSpec.

Eventually, we will use the same `IndexFetchSpec` to form the columnar
batch directly in KV.

Release note: None


76415: dev: address a few paper cuts r=irfansharif a=irfansharif

See individual commits, works through a few of the known paper cuts with dev/bazel (#75453).

76420: sql: initialize tenant span config with rangefeeds enabled r=irfansharif a=ajwerner

Not doing this causes problems when we construct new tenants as they won't be
able to establish a rangefeed until their first full reconciliation completes
and then propagates.

Even this is not great. If the preceding range did not have rangefeeds enabled,
it would take a closed timestamp interval for this enablement to propagate.
Perhaps this is evidence that we should always carve out a span at the end of
the keyspace and set it to have rangefeeds enabled.

I'll leave that fix and testing of this to somebody else. Hope this is helpful
enough on its own. cc `@irfansharif`  and `@arulajmani.` I believe this problem
has been blocking `@RaduBerinde.`

Fixes #76331

Release note: None

76437: clisqlshell: handle interactive query cancellation r=rafiss a=knz

Fixes #76433. 

As of this PR, there's a bug in lib/pq which forces the session to
terminate when any query gets cancelled. We find this unacceptable
and we plan to fix it later.

Release note (cli change): The interactive SQL shell (`cockroach sql`,
`cockroach demo`) now supports interrupting a currently running
query with Ctrl+C, without losing access to the shell.

76442: roachtest: try to stabilize ORM nightlies r=RichardJCai a=rafiss

fixes #76428
fixes #76426
fixes #68363
touches #76231
touches #76016
touches #76011

Ruby-pg and psycopg2 tests started passing because we now support pgwire
cancel.

Other tests (Django, Hibernate, and PGJDBC) seem flaky after the testing
settings change, so try to use slightly more conservative values.

Release note: None

76450: ci: add extra logging of `$TC_SERVER_URL` in pebble metamorphic nightly r=nicktrav a=rickystewart

Release note: None

Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
@craig craig bot closed this as completed in eee2148 Feb 12, 2022
@yuzefovich
Copy link
Member

@ajwerner should this issue be closed by #76420?

@irfansharif
Copy link
Contributor

Yes, I think so. There's a migration bug theorized above, but I don't think it holds -- sending a PR with a test to prove it. As for how #74555 could affect the non-system tenant, I'm not sure yet. Do you have a repro I could use to see what you were seeing?

irfansharif added a commit to irfansharif/cockroach that referenced this issue Feb 12, 2022
\cockroachdb#74555 starts using the span configs infrastructure to control whether
rangefeeds are enabled over a given range. Before dynamic system table
IDs (cockroachdb#76003), we used the range's key boundaries to determine whether
the range in question was for a system table ID. In mixed-version
clusters, it's possible to have both forms of this check. To ensure
things work in this form (something we suspected in cockroachdb#76331), we add a
test.

NB: The reason things still work is because in cockroachdb#74555 we modified the
system config span to hard code the relevant config fields for constant
system table IDs -- behaving identically to previous version nodes.

Release note: None
@irfansharif
Copy link
Contributor

irfansharif commented Feb 13, 2022

I took a look at the zipped store being referenced above, confirming that I saw the same as described -- rangefeeds were wedged on server startup. I also noticed that if the server process was restarted, things worked just fine. Eventually I just disabled the reconciliation job and poked at the store's contents of system.span_configurations, and noticed that even for system tables, it didn't have config.RangefeedEnabled set. I'm guessing that the store was created off of a binary that did not have #74555. Because the span config proto did not know about this field, on KV store start up when scanning from that table, we applied config.RangefeedEnabled == false for system database ranges, blocking all rangefeeds. The reason restarting the server seemed to resolve the problem is because the first time the server range, the reconciliation job will have successfully run, overwriting system.span_configuration protos with the right field set. The reason it was not picked up by process the first time time is because the per-store component that listens in on updates to the table also made use of rangefeeds, so was similarly wedged. This was not a problem post-restart.

Anyway, I don't think there's anything to do here, not a bug AFAICT. #76466 confirms that mixed-version state with rangefeeds work just fine; we were careful about hardcoding the right defaults in #74555.

RajivTS pushed a commit to RajivTS/cockroach that referenced this issue Mar 6, 2022
Fixes cockroachdb#76331.

Not doing this causes problems when we construct new tenants as they
won't be able to establish a rangefeed until their first full
reconciliation completes and then propagates. This isn't "buggy", but it
is slow.

Even this is not great. If the preceding range did not have rangefeeds
enabled, it would take a closed timestamp interval for this enablement
to propagate. Perhaps this is evidence that we should always carve out a
span at the end of the keyspace and set it to have rangefeeds enabled.
I'll leave that fix and testing of this to somebody else. Hope this is
helpful enough on its own.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this issue Mar 29, 2022
\cockroachdb#74555 starts using the span configs infrastructure to control whether
rangefeeds are enabled over a given range. Before dynamic system table
IDs (cockroachdb#76003), we used the range's key boundaries to determine whether
the range in question was for a system table ID. In mixed-version
clusters, it's possible to have both forms of this check. To ensure
things work in this form (something we suspected in cockroachdb#76331), we add a
test.

NB: The reason things still work is because in cockroachdb#74555 we modified the
system config span to hard code the relevant config fields for constant
system table IDs -- behaving identically to previous version nodes.

Release note: None
craig bot pushed a commit that referenced this issue Mar 30, 2022
76466: spanconfig: verify migration for rangefeed enablement r=irfansharif a=irfansharif

\#74555 starts using the span configs infrastructure to control whether
rangefeeds are enabled over a given range. Before dynamic system table
IDs (#76003), we used the range's key boundaries to determine whether
the range in question was for a system table ID. In mixed-version
clusters, it's possible to have both forms of this check. To ensure
things work in this form (something we suspected in #76331), we add a
test.

NB: The reason things still work is because in #74555 we modified the
system config span to hard code the relevant config fields for constant
system table IDs -- behaving identically to previous version nodes.

Release note: None


Co-authored-by: irfan sharif <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Mar 30, 2022
\#74555 starts using the span configs infrastructure to control whether
rangefeeds are enabled over a given range. Before dynamic system table
IDs (#76003), we used the range's key boundaries to determine whether
the range in question was for a system table ID. In mixed-version
clusters, it's possible to have both forms of this check. To ensure
things work in this form (something we suspected in #76331), we add a
test.

NB: The reason things still work is because in #74555 we modified the
system config span to hard code the relevant config fields for constant
system table IDs -- behaving identically to previous version nodes.

Release note: None
fqazi pushed a commit to fqazi/cockroach that referenced this issue Apr 4, 2022
\cockroachdb#74555 starts using the span configs infrastructure to control whether
rangefeeds are enabled over a given range. Before dynamic system table
IDs (cockroachdb#76003), we used the range's key boundaries to determine whether
the range in question was for a system table ID. In mixed-version
clusters, it's possible to have both forms of this check. To ensure
things work in this form (something we suspected in cockroachdb#76331), we add a
test.

NB: The reason things still work is because in cockroachdb#74555 we modified the
system config span to hard code the relevant config fields for constant
system table IDs -- behaving identically to previous version nodes.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants