-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server: properly check existence of tenant record #106414
Conversation
Internal discussion: https://cockroachlabs.slack.com/archives/C02HWA24541/p1688559273013399 |
166ede8
to
ef93182
Compare
23639de
to
1bbcf3f
Compare
1bbcf3f
to
00213f1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pushing this forward. I gave it an initial read through and left some minor comments.
log.Fatalf(ctx, "unknown update type: %v", u.Type) | ||
err := errors.AssertionFailedf("unknown update type: %v", u.Type) | ||
logcrash.ReportOrPanic(ctx, &w.st.SV, "%w", err) | ||
log.Warningf(ctx, "%v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReportOrPanic appears to already log the given message at the Error level, given that, is there much value in also logging it as a warning here?
rawDataState := tree.MustBeDInt(i) | ||
if rawDataState >= 0 && rawDataState <= tree.DInt(mtinfopb.MaxDataState) { | ||
dataState = mtinfopb.TenantDataState(rawDataState) | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could see this kind of validation being pushed into a constructor method in mtinfopb. I don't feel too strongly either way though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a bit more here. I meant something like:
dataState, err := mtinfopb.NewDataState(rawDataState)
if err != nil {
log.Warning...
}
Where NewDataState
is where the bounds check lives.
But, I don't feel strongly about that.
if !found { | ||
return "not-found" | ||
} | ||
return fmt.Sprintf("%v", tenantcapabilitiestestutils.AlteredCapabilitiesString(cp)) | ||
var buf strings.Builder | ||
fmt.Fprintf(&buf, "%+v\n", pretty.Formatter(info)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this command need to print the whole info struct? Seems like the end-result is mostly having to write out the capabilities twice in the testfile.
// Fast path check. | ||
select { | ||
case <-w.startCh: | ||
return w.startErr | ||
default: | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rest of this function doesn't look too onerous, did we see problems without this fast path?
@@ -84,8 +84,8 @@ func New( | |||
// canceled or the stopper is stopped prior to the initial data being retrieved. | |||
func (w *Watcher) Start(ctx context.Context, sysTableResolver catalog.SystemTableIDResolver) error { | |||
w.startCh = make(chan struct{}) | |||
defer func() { close(w.startCh) }() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] No need for the func wrapper here I believe.
pkg/server/node.go
Outdated
// TenantSettings implements the kvpb.InternalServer interface. | ||
func (n *Node) TenantSettings( | ||
args *kvpb.TenantSettingsRequest, stream kvpb.Internal_TenantSettingsServer, | ||
) error { | ||
ctx := n.storeCfg.AmbientCtx.AnnotateCtx(stream.Context()) | ||
ctxDone := ctx.Done() | ||
|
||
w := n.tenantSettingsWatcher | ||
if err := w.WaitForStart(ctx); err != nil { | ||
w, w2, err := n.waitForTenantWatcherReadiness(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could use longer variable names here so it is easier to remember which watcher is for what.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nkodali, @renatolabs, @srosenberg, and @yuzefovich)
pkg/kv/kvclient/kvtenant/setting_overrides.go
line 75 at r5 (raw file):
@yuzefovich asks
I don't quite understand this block.
runTenantSettingsSubscription
is called in a single spot in which we derive a child context that is canceled on quiesce, so why do we need first and secondcase
s? Also why don't we just exit when the cancellation / quiesce happens? Why don't we exit when non-io.EOF
error (other thanMissingRecordError
) isRecv
ed?
pkg/kv/kvpb/api.proto
line 3138 at r4 (raw file):
@yuzefovich asks:
The second sentence in this paragraph to me implies that creators of
TenantSettingsEvent
must always do all 3 points to keep backwards-compatibility. However, IIUC this is only required until the cluster has been upgraded to 23.2, at which pointTenantSettingsEvent
can be created in any manner, contradicting any of the 3 points. Is that correct?
pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/decoder.go
line 130 at r2 (raw file):
@yuzefovich asks:
nit: it seems like
nil
settings is only because ofTestingDecoderFn
which is only called in one spot, so I'd refactor the testing fn to pass valid settings object too, in order to remove this non-nil check.
00213f1
to
d5e48ae
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nkodali, @renatolabs, @srosenberg, @stevendanna, and @yuzefovich)
pkg/kv/kvclient/kvtenant/setting_overrides.go
line 75 at r5 (raw file):
runTenantSettingsSubscription
is called in a single spot in which we derive a child context that is canceled on quiesce
I hadn't seen that. Simplified.
why don't we just exit when the cancellation / quiesce happens?
Oversight. Fixed.
Why don't we exit when non-
io.EOF
error (other thanMissingRecordError
) isRecv
ed?
This is backward compatibility with current uses. We have roachprod and ORM test suites that require the retry behavior... CI complains loudly when we do the simple exit here, which is why I kept the previous behavior.
pkg/kv/kvpb/api.proto
line 3138 at r4 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
@yuzefovich asks:
The second sentence in this paragraph to me implies that creators of
TenantSettingsEvent
must always do all 3 points to keep backwards-compatibility. However, IIUC this is only required until the cluster has been upgraded to 23.2, at which pointTenantSettingsEvent
can be created in any manner, contradicting any of the 3 points. Is that correct?
Indeed. Clarified in comment.
pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/decoder.go
line 130 at r2 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
@yuzefovich asks:
nit: it seems like
nil
settings is only because ofTestingDecoderFn
which is only called in one spot, so I'd refactor the testing fn to pass valid settings object too, in order to remove this non-nil check.
Done.
pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/watcher.go
line 162 at r2 (raw file):
Previously, stevendanna (Steven Danna) wrote…
ReportOrPanic appears to already log the given message at the Error level, given that, is there much value in also logging it as a warning here?
Done.
pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/watcher.go
line 237 at r4 (raw file):
Previously, stevendanna (Steven Danna) wrote…
The rest of this function doesn't look too onerous, did we see problems without this fast path?
I copied-and-pasted this from server/tenantsettingswatcher. simplified.
pkg/server/node.go
line 2059 at r5 (raw file):
Previously, stevendanna (Steven Danna) wrote…
Perhaps we could use longer variable names here so it is easier to remember which watcher is for what.
Let's do it.
pkg/server/tenantsettingswatcher/watcher.go
line 87 at r4 (raw file):
Previously, stevendanna (Steven Danna) wrote…
[nit] No need for the func wrapper here I believe.
again, this was a copy paste. simplified.
pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/watcher_test.go
line 251 at r3 (raw file):
Previously, stevendanna (Steven Danna) wrote…
Does this command need to print the whole info struct? Seems like the end-result is mostly having to write out the capabilities twice in the testfile.
Done.
10ca28b
to
cfd1204
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! although it'd be probably worth it to wait for Steven's approval too.
Reviewed 2 of 7 files at r2, 36 of 36 files at r6, 19 of 19 files at r7, 14 of 14 files at r8, 20 of 20 files at r9, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @knz, @nkodali, @renatolabs, @srosenberg, and @stevendanna)
pkg/kv/kvclient/kvtenant/connector.go
line 963 at r9 (raw file):
defer c.mu.Unlock() if c.mu.client == client { err := c.mu.client.conn.Close() // nolint:grpcconnclose
nit: should closing the connection be in a separate commit that we want to backport? Or do you plan to backport the whole PR anyway?
pkg/kv/kvclient/kvtenant/setting_overrides.go
line 75 at r5 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
runTenantSettingsSubscription
is called in a single spot in which we derive a child context that is canceled on quiesceI hadn't seen that. Simplified.
why don't we just exit when the cancellation / quiesce happens?
Oversight. Fixed.
Why don't we exit when non-
io.EOF
error (other thanMissingRecordError
) isRecv
ed?This is backward compatibility with current uses. We have roachprod and ORM test suites that require the retry behavior... CI complains loudly when we do the simple exit here, which is why I kept the previous behavior.
Makes sense, thanks.
pkg/server/settingswatcher/settings_watcher.go
line 258 at r6 (raw file):
err = errors.NewAssertionErrorWithWrappedErrf(err, "failed to decode settings row %v", kv.Key) logcrash.ReportOrPanic(ctx, &s.settings.SV, "%w", err) log.Warningf(ctx, "%v", err)
nit: ditto for removing redundant logging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nkodali, @renatolabs, @srosenberg, @stevendanna, and @yuzefovich)
pkg/kv/kvclient/kvtenant/connector.go
line 963 at r9 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: should closing the connection be in a separate commit that we want to backport? Or do you plan to backport the whole PR anyway?
Not a bad idea! Moving this change to #106576 for discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nkodali, @renatolabs, @srosenberg, @stevendanna, and @yuzefovich)
pkg/server/settingswatcher/settings_watcher.go
line 258 at r6 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: ditto for removing redundant logging.
Done.
cfd1204
to
0039c66
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Steven, PTAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nkodali, @renatolabs, @srosenberg, @stevendanna, and @yuzefovich)
pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/decoder.go
line 115 at r3 (raw file):
Previously, stevendanna (Steven Danna) wrote…
Adding a bit more here. I meant something like:
dataState, err := mtinfopb.NewDataState(rawDataState) if err != nil { log.Warning... }Where
NewDataState
is where the bounds check lives.But, I don't feel strongly about that.
Thanks, I added a commit to do that.
ffed569
to
d9e57d7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 33 of 33 files at r10, 13 of 13 files at r11, 9 of 9 files at r12, 15 of 15 files at r13, 7 of 7 files at r14, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz, @nkodali, @renatolabs, @srosenberg, and @stevendanna)
pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/decoder.go
line 67 at r14 (raw file):
// The remaining columns are stored in the value; we're just interested in the // info column.
nit: "just info column" seems out of date now.
Release note: None
This patch extends the infrastructure originally designed to track only the capabilities field of tenant records, into a general-purpose cache for more fields in each tenant record. In this patch, it is extended to also track changes to the tenant name, service mode and data state. Because of this change, all the APIs and package names are becoming incorrect for their purpose (the focus is not on just capabilities any more). The work to rename them will happen in a later commit. Release note: None
Context: This change is part of a larger epic to make SQL servers aware of more of their tenant metadata (incl service mode, data state, capabilities and tenant name). There are two moving pieces in this commit: - an extension to the `TenantSettings` streaming RPC endpoint. The vision here is to piggy-back on the existing `TenantSettings`, which so far was only used to communicate tenant setting overrides, to also communicate other tenant metadata. We are careful to ensure the new data appears as a no-op from the perspective of previous-version clients of the RPC. - a channel-based synchronization protocol to ensure that the streaming endpoint `TenantSettings` gets activated precisely when there are row updates in the tenants table. Here we mirror one-to-one the synchronization logic already present in the "tenant settings watcher" (`pkg/server/tenantsettingswatcher/watcher.go`). The reviewer can satisfy themselves the logic is identical by putting the source code for both watchers side-by-side. As in the previous commit, the extension in purpose means that the various things involved here are not named properly any more; - the "capabilities watcher" is hardly focused on capabilities any more. - the "TenantSettings" endpoint (and the request/response types) are hardly focused on settings any more. Later work will take care of renaming these various components to properly reflect their generalized purpose. Release note: None
Prior to this patch, the API on the tenant info watcher would return an "undefined" metadata payload if called prior to the tenant record being created. This was most visible in the following scenario: 1. new cluster starts. watcher+rangefeed start successfully (tenant table empty) 2. tenant client connects. At this time there is no metadata for its tenant ID, so the metadata payload is available but empty. 3. CREATE TENANT is executed for the new tenant. 4. only then (later) the rangefeed introduces the metadata into the cache. This is insufficient for use by the KV tenant client RPCs: there we only want to accept incoming requests from tenant clients after we actually have seen metadata for them. This patch improves the situation by checking whether the tenant record is present before returning bogus data to the SQL tenant client. Simultaneously, we add error handling logic in the SQL tenant client to handle this case gracefully. In a mixed-version deployment (with and without this patch applied), the following behaviors will occur if one starts the SQL tenant server without a tenant record defined: - Unpatched server: Late startup error in client with "database 1 does not exist". - Patched server: Client loops forever, waiting for tenant record. Behavior when the tenant record is created shortly *after* the SQL tenant server starts up: - Unpatched server: Inconsistent / non-deterministic behavior. - Patched server: Clean startup of client after tenant record appears. Release note: None
Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nkodali, @renatolabs, @srosenberg, @stevendanna, and @yuzefovich)
pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/decoder.go
line 67 at r14 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: "just info column" seems out of date now.
Fixed.
d9e57d7
to
35ea6d5
Compare
TFYR! bors r=yuzefovich,stevendanna |
Build succeeded: |
After this PR was merged, the multitenant-upgrade started hanging due to:
|
followup: #107824 |
PR extracted from #105441.
Informs #83650.
Part of #98431.
Epic: CRDB-26691
Prior to this patch, the API on the tenant info watcher would return
an "undefined" metadata payload if called prior to the tenant record
being created.
This was most visible in the following scenario:
table empty)
its tenant ID, so the metadata payload is available but empty.
into the cache.
This is insufficient for use by the KV tenant client RPCs: there we
only want to accept incoming requests from tenant clients after we
actually have seen metadata for them.
This patch improves the situation by checking whether the tenant
record is present before returning bogus data to the SQL tenant
client.
Simultaneously, we add error handling logic in the SQL tenant client
to handle this case gracefully.
In a mixed-version deployment (with and without this patch applied),
the following behaviors will occur if one starts the SQL tenant server
without a tenant record defined:
Behavior when the tenant record is created shortly after the SQL
tenant server starts up: