Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: v23.1.2-custom: failed to decode tenant prefix from key for replica: insufficient bytes to decode uvarint value #104606

Closed
cockroach-teamcity opened this issue Jun 8, 2023 · 4 comments
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jun 8, 2023

This issue was autofiled by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.

Sentry link: https://cockroach-labs.sentry.io/issues/4236862053/?referrer=webhooks_plugin

Panic message:

replica_init.go:371: log.Fatal: failed to decode tenant prefix from key for replica [n1,s1,r&{5708/1:{-} × ×}]: insufficient bytes to decode uvarint value: ×
(1) attached stack trace
-- stack trace:
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).setDescLockedRaftMuLocked
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go:371
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).initRaftMuLockedReplicaMuLocked
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go:213
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.prepareRightReplicaForSplit
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_split.go:257
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.splitPostApply
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_split.go:181
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleSplitResult
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_result.go:299
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).handleNonTrivialReplicatedEvalResult
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:320
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).ApplySideEffects
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:187
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.mapCheckedCmdIter
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/cmd.go:210
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).applyOneBatch
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:295
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:251
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1009
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:722
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:646
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSchedulerShard).worker
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:395
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:302
| github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
| github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470
| runtime.goexit
| GOROOT/src/runtime/asm_amd64.s:1594
Wraps: (2) secondary error attachment
| insufficient bytes to decode uvarint value: ×
| (1) attached stack trace
| -- stack trace:
| | github.com/cockroachdb/cockroach/pkg/util/encoding.DecodeUvarintAscending
| | github.com/cockroachdb/cockroach/pkg/util/encoding/encoding.go:546
| | github.com/cockroachdb/cockroach/pkg/keys.DecodeTenantPrefix
| | github.com/cockroachdb/cockroach/pkg/keys/sql.go:39
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).setDescLockedRaftMuLocked
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go:369
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).initRaftMuLockedReplicaMuLocked
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go:213
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver.prepareRightReplicaForSplit
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_split.go:257
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver.splitPostApply
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_split.go:181
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleSplitResult
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_result.go:299
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).handleNonTrivialReplicatedEvalResult
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:320
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).ApplySideEffects
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:187
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.mapCheckedCmdIter
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/cmd.go:210
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).applyOneBatch
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:295
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:251
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1009
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:722
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:646
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSchedulerShard).worker
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:395
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2
| | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:302
| | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
| | github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470
| | runtime.goexit
| | GOROOT/src/runtime/asm_amd64.s:1594
| Wraps: (2) insufficient bytes to decode uvarint value: ×
| Error types: (1) *withstack.withStack (2) *errutil.leafError
Wraps: (3) log.Fatal: failed to decode tenant prefix from key for replica [n1,s1,r&{5708/1:{-} × ×}]: insufficient bytes to decode uvarint value: ×
Error types: (1) *withstack.withStack (2) *secondary.withSecondaryError (3) *errutil.leafError
-- report composition:
*errutil.leafError: log.Fatal: failed to decode tenant prefix from key for replica [n1,s1,r&{5708/1:{-} × ×}]: insufficient bytes to decode uvarint value: ×
*secondary.withSecondaryError: details for github.com/cockroachdb/errors/withstack/*withstack.withStack:::
replica_init.go:371: *withstack.withStack (top exception)

Stacktrace (expand for inline code snippets):

https://github.com/cockroachdb/cockroach/blob/429af1379325bf0521dde42dd284b2aaf6d4282c/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go#L370-L372 in pkg/kv/kvserver.(*Replica).setDescLockedRaftMuLocked
https://github.com/cockroachdb/cockroach/blob/429af1379325bf0521dde42dd284b2aaf6d4282c/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go#L212-L214 in pkg/kv/kvserver.(*Replica).initRaftMuLockedReplicaMuLocked
https://github.com/cockroachdb/cockroach/blob/429af1379325bf0521dde42dd284b2aaf6d4282c/pkg/kv/kvserver/pkg/kv/kvserver/store_split.go#L256-L258 in pkg/kv/kvserver.prepareRightReplicaForSplit
https://github.com/cockroachdb/cockroach/blob/429af1379325bf0521dde42dd284b2aaf6d4282c/pkg/kv/kvserver/pkg/kv/kvserver/store_split.go#L180-L182 in pkg/kv/kvserver.splitPostApply
https://github.com/cockroachdb/cockroach/blob/429af1379325bf0521dde42dd284b2aaf6d4282c/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_result.go#L298-L300 in pkg/kv/kvserver.(*Replica).handleSplitResult
https://github.com/cockroachdb/cockroach/blob/429af1379325bf0521dde42dd284b2aaf6d4282c/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go#L319-L321 in pkg/kv/kvserver.(*replicaStateMachine).handleNonTrivialReplicatedEvalResult
https://github.com/cockroachdb/cockroach/blob/429af1379325bf0521dde42dd284b2aaf6d4282c/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go#L186-L188 in pkg/kv/kvserver.(*replicaStateMachine).ApplySideEffects

curChecked := iter.CurChecked()
applied, err := fn(curChecked.Ctx(), curChecked)
if err != nil {
in pkg/kv/kvserver/apply.mapCheckedCmdIter
// Apply the side-effects of each command to the state machine.
appliedIter, err := mapCheckedCmdIter(stagedIter, t.sm.ApplySideEffects)
if err != nil {
in pkg/kv/kvserver/apply.(*Task).applyOneBatch
for iter.Valid() {
if err := t.applyOneBatch(ctx, iter); err != nil {
// If the batch threw an error, reject all remaining commands in the
in pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries
https://github.com/cockroachdb/cockroach/blob/429af1379325bf0521dde42dd284b2aaf6d4282c/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go#L1008-L1010 in pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked
https://github.com/cockroachdb/cockroach/blob/429af1379325bf0521dde42dd284b2aaf6d4282c/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go#L721-L723 in pkg/kv/kvserver.(*Replica).handleRaftReady
https://github.com/cockroachdb/cockroach/blob/429af1379325bf0521dde42dd284b2aaf6d4282c/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go#L645-L647 in pkg/kv/kvserver.(*Store).processReady
https://github.com/cockroachdb/cockroach/blob/429af1379325bf0521dde42dd284b2aaf6d4282c/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go#L394-L396 in pkg/kv/kvserver.(*raftSchedulerShard).worker
https://github.com/cockroachdb/cockroach/blob/429af1379325bf0521dde42dd284b2aaf6d4282c/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go#L301-L303 in pkg/kv/kvserver.(*raftScheduler).Start.func2
sp.UpdateGoroutineIDToCurrent()
f(ctx)
}()
in pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
GOROOT/src/runtime/asm_amd64.s#L1593-L1595 in runtime.goexit

pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go in pkg/kv/kvserver.(*Replica).setDescLockedRaftMuLocked at line 371
pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go in pkg/kv/kvserver.(*Replica).initRaftMuLockedReplicaMuLocked at line 213
pkg/kv/kvserver/pkg/kv/kvserver/store_split.go in pkg/kv/kvserver.prepareRightReplicaForSplit at line 257
pkg/kv/kvserver/pkg/kv/kvserver/store_split.go in pkg/kv/kvserver.splitPostApply at line 181
pkg/kv/kvserver/pkg/kv/kvserver/replica_application_result.go in pkg/kv/kvserver.(*Replica).handleSplitResult at line 299
pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go in pkg/kv/kvserver.(*replicaStateMachine).handleNonTrivialReplicatedEvalResult at line 320
pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go in pkg/kv/kvserver.(*replicaStateMachine).ApplySideEffects at line 187
pkg/kv/kvserver/apply/cmd.go in pkg/kv/kvserver/apply.mapCheckedCmdIter at line 210
pkg/kv/kvserver/apply/task.go in pkg/kv/kvserver/apply.(*Task).applyOneBatch at line 295
pkg/kv/kvserver/apply/task.go in pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries at line 251
pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go in pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked at line 1009
pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go in pkg/kv/kvserver.(*Replica).handleRaftReady at line 722
pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go in pkg/kv/kvserver.(*Store).processReady at line 646
pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go in pkg/kv/kvserver.(*raftSchedulerShard).worker at line 395
pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go in pkg/kv/kvserver.(*raftScheduler).Start.func2 at line 302
pkg/util/stop/stopper.go in pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2 at line 470
GOROOT/src/runtime/asm_amd64.s in runtime.goexit at line 1594
Tag Value
Cockroach Release v23.1.2-224-g429af137932
Cockroach SHA: 429af13
Platform linux amd64
Distribution CCL
Environment v23.1.3
Command server
Go Version ``
# of CPUs
# of Goroutines

Jira issue: CRDB-28626

@cockroach-teamcity cockroach-teamcity added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report. labels Jun 8, 2023
@yuzefovich yuzefovich changed the title sentry: replica_init.go:371: log.Fatal: failed to decode tenant prefix from key for replica [n1,s1,r&{5708/1:{-} × ×}]: insufficient bytes to decode uvarint value: × (1) attached stack trace -- stack trace:... kvserver: v23.1.2-custom: failed to decode tenant prefix from key for replica: insufficient bytes to decode uvarint value Jun 10, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Jun 10, 2023
@knz
Copy link
Contributor

knz commented Jun 12, 2023

there's a worry this could be related to #101549 and an incomplete/incorrect fix.

Alternatively, we should also double check this is not a byproduct of #103690.

@nvanbenschoten
Copy link
Member

Alternatively, we should also double check this is not a byproduct of #103690.

That PR was not backported until last week. It was not included in v23.1.2, which is one of the places where we see this failure.

@kvoli kvoli added release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 labels Jun 13, 2023
@kvoli
Copy link
Collaborator

kvoli commented Jun 14, 2023

The invalid tenant prefix should be caught in adminSplitWithDescriptor: #104796

We don't know where the split came from, however logging from splunk shows it wasn't from the splitQueue - as the bad split points were manual.

#104802 is a fix to validate tenant ID of the split key. Which should remove the release blocker once backported to 23.1 #104850.

@kvoli
Copy link
Collaborator

kvoli commented Jun 14, 2023

Closed by #104850

@kvoli kvoli closed this as completed Jun 14, 2023
arulajmani added a commit to arulajmani/cockroach that referenced this issue Jun 14, 2023
Normally, `MakeTenantPrefix(ID).PrefixEnd()` is the same as
`MakeTenantPrefix(ID + 1)`. However, this isn't true if the ID
corresponds to one of the boundary points in `EncodeUvarintAscending`.
Prior to this patch, various places in the code did not appreciate this
subtlety. This proved problematic when we started creating split points
at `MakeTenantPrefix(ID).PrefixEnd()`. Now, if the tenant ID was equal
to one of these boundary points (e.g. 109), decoding the key
`MakeTenantPrefix(ID).PrefixEnd()` does not return the expected tenant
ID (in our case, 110). Instead, it results in an error.

Worse yet, various places in KV assume doing such tenant decoding at a
range's start key is kosher. We've since disallowed KV from accepting
such splits in cockroachdb#104802.
This patch fixes the root cause of why these splits were being issued
in the first place -- over in
cockroachdb@ac54eba,
we started creating spilt points at the end of a tenant's keyspace
(in addition to the start of it). We did so using `PrefixEnd()`, which
we have since learned is not what we wanted here.

In addition to the initial split point, we also fix the span config
records we seed during tenant creation. We've also added a test to
ensure all split points created during cluster creation are kosher.
The test uses randomized tenant IDs -- I confirmed that it fails with
tenant ID = 109 (without my changes).

Lastly, there's a bit more auditing work that needs to be done here
about these assumptions. That's captured in a followup issue
cockroachdb#104928.

Prevents cockroachdb#104606 from
happening.

Release note (bug fix): Fixes a bug where tenant creation for certain
IDs would always fail because of invalid split points. Additionally,
such tenant creation failure could leave the host cluster's span
config state entirely busted -- we prevent that as well.
arulajmani added a commit to arulajmani/cockroach that referenced this issue Jun 15, 2023
Normally, `MakeTenantPrefix(ID).PrefixEnd()` is the same as
`MakeTenantPrefix(ID + 1)`. However, this isn't true if the ID
corresponds to one of the boundary points in `EncodeUvarintAscending`.
Prior to this patch, various places in the code did not appreciate this
subtlety. This proved problematic when we started creating split points
at `MakeTenantPrefix(ID).PrefixEnd()`. Now, if the tenant ID was equal
to one of these boundary points (e.g. 109), decoding the key
`MakeTenantPrefix(ID).PrefixEnd()` does not return the expected tenant
ID (in our case, 110). Instead, it results in an error.

Worse yet, various places in KV assume doing such tenant decoding at a
range's start key is kosher. We've since disallowed KV from accepting
such splits in cockroachdb#104802.
This patch fixes the root cause of why these splits were being issued
in the first place -- over in
cockroachdb@ac54eba,
we started creating spilt points at the end of a tenant's keyspace
(in addition to the start of it). We did so using `PrefixEnd()`, which
we have since learned is not what we wanted here.

In addition to the initial split point, we also fix the span config
records we seed during tenant creation. We've also added a test to
ensure all split points created during cluster creation are kosher.
The test uses randomized tenant IDs -- I confirmed that it fails with
tenant ID = 109 (without my changes).

Lastly, there's a bit more auditing work that needs to be done here
about these assumptions. That's captured in a followup issue
cockroachdb#104928.

Prevents cockroachdb#104606 from
happening.

Release note (bug fix): Fixes a bug where tenant creation for certain
IDs would always fail because of invalid split points. Additionally,
such tenant creation failure could leave the host cluster's span
config state entirely busted -- we prevent that as well.
craig bot pushed a commit that referenced this issue Jun 15, 2023
104920: sql: refrain from performing invalid splits during tenant creation  r=nvanbenschoten,ecwall a=arulajmani

Normally, `MakeTenantPrefix(ID).PrefixEnd()` is the same as
`MakeTenantPrefix(ID + 1)`. However, this isn't true if the ID
corresponds to one of the boundary points in `EncodeUvarintAscending`.
Prior to this patch, various places in the code did not appreciate this
subtlety. This proved problematic when we started creating split points
at `MakeTenantPrefix(ID).PrefixEnd()`. Now, if the tenant ID was equal
to one of these boundary points (e.g. 109), decoding the key
`MakeTenantPrefix(ID).PrefixEnd()` does not return the expected tenant
ID (in our case, 110). Instead, it results in an error.

Worse yet, various places in KV assume doing such tenant decoding at a
range's start key is kosher. We've since disallowed KV from accepting
such splits in #104802.
This patch fixes the root cause of why these splits were being issued
in the first place -- over in
ac54eba,
we started creating spilt points at the end of a tenant's keyspace
(in addition to the start of it). We did so using `PrefixEnd()`, which
we have since learned is not what we wanted here.

In addition to the initial split point, we also fix the span config
records we seed during tenant creation. We've also added a test to
ensure all split points created during cluster creation are kosher.
The test uses randomized tenant IDs -- I confirmed that it fails with
tenant ID = 109 (without my changes).

Lastly, there's a bit more auditing work that needs to be done here
about these assumptions. That's captured in a followup issue
#104928.

Prevents #104606 from
happening.

Release note (bug fix): Fixes a bug where tenant creation for certain
IDs would always fail because of invalid split points. Additionally,
such tenant creation failure could leave the host cluster's span
config state entirely busted -- we prevent that as well.

Co-authored-by: Arul Ajmani <[email protected]>
arulajmani added a commit to arulajmani/cockroach that referenced this issue Jun 15, 2023
Normally, `MakeTenantPrefix(ID).PrefixEnd()` is the same as
`MakeTenantPrefix(ID + 1)`. However, this isn't true if the ID
corresponds to one of the boundary points in `EncodeUvarintAscending`.
Prior to this patch, various places in the code did not appreciate this
subtlety. This proved problematic when we started creating split points
at `MakeTenantPrefix(ID).PrefixEnd()`. Now, if the tenant ID was equal
to one of these boundary points (e.g. 109), decoding the key
`MakeTenantPrefix(ID).PrefixEnd()` does not return the expected tenant
ID (in our case, 110). Instead, it results in an error.

Worse yet, various places in KV assume doing such tenant decoding at a
range's start key is kosher. We've since disallowed KV from accepting
such splits in cockroachdb#104802.
This patch fixes the root cause of why these splits were being issued
in the first place -- over in
cockroachdb@ac54eba,
we started creating spilt points at the end of a tenant's keyspace
(in addition to the start of it). We did so using `PrefixEnd()`, which
we have since learned is not what we wanted here.

In addition to the initial split point, we also fix the span config
records we seed during tenant creation. We've also added a test to
ensure all split points created during cluster creation are kosher.
The test uses randomized tenant IDs -- I confirmed that it fails with
tenant ID = 109 (without my changes).

Lastly, there's a bit more auditing work that needs to be done here
about these assumptions. That's captured in a followup issue
cockroachdb#104928.

Prevents cockroachdb#104606 from
happening.

Release note (bug fix): Fixes a bug where tenant creation for certain
IDs would always fail because of invalid split points. Additionally,
such tenant creation failure could leave the host cluster's span
config state entirely busted -- we prevent that as well.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team
Projects
None yet
Development

No branches or pull requests

4 participants