server: properly check existence of tenant record #106414

knz · 2023-07-07T17:15:53Z

PR extracted from #105441.

Informs #83650.
Part of #98431.
Epic: CRDB-26691

Prior to this patch, the API on the tenant info watcher would return
an "undefined" metadata payload if called prior to the tenant record
being created.

This was most visible in the following scenario:

1. new cluster starts. watcher+rangefeed start successfully (tenant
   table empty)
2. tenant client connects. At this time there is no metadata for
   its tenant ID, so the metadata payload is available but empty.
3. CREATE TENANT is executed for the new tenant.
4. only then (later) the rangefeed introduces the metadata
   into the cache.

This is insufficient for use by the KV tenant client RPCs: there we
only want to accept incoming requests from tenant clients after we
actually have seen metadata for them.

This patch improves the situation by checking whether the tenant
record is present before returning bogus data to the SQL tenant
client.

Simultaneously, we add error handling logic in the SQL tenant client
to handle this case gracefully.

In a mixed-version deployment (with and without this patch applied),
the following behaviors will occur if one starts the SQL tenant server
without a tenant record defined:

- Unpatched server: Late startup error in client with "database 1 does not exist".
- Patched server: Client loops forever, waiting for tenant record.

Behavior when the tenant record is created shortly after the SQL
tenant server starts up:

- Unpatched server: Inconsistent / non-deterministic behavior.
- Patched server: Clean startup of client after tenant record appears.
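
The behavior described above can be sketched in Go as follows. This is a minimal illustration, not the code from this PR: `infoCache`, `tenantInfo`, `errMissingRecord` and `waitForRecord` are hypothetical names, and the retry loop is bounded here only so that the example terminates (the patched client is described as waiting indefinitely).

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errMissingRecord is a hypothetical sentinel error. The review thread
// mentions a MissingRecordError, but its real definition is not shown.
var errMissingRecord = errors.New("tenant record not found")

// tenantInfo is a stand-in for the cached tenant metadata.
type tenantInfo struct {
	Name string
}

// infoCache is a stand-in for the watcher's in-memory cache.
type infoCache struct {
	mu      sync.Mutex
	entries map[uint64]tenantInfo
}

func newInfoCache() *infoCache {
	return &infoCache{entries: make(map[uint64]tenantInfo)}
}

// set simulates the rangefeed delivering a row for the tenant.
func (c *infoCache) set(id uint64, info tenantInfo) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[id] = info
}

// get returns an explicit error when no record has been seen yet,
// rather than handing back a zero-value payload.
func (c *infoCache) get(id uint64) (tenantInfo, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	info, ok := c.entries[id]
	if !ok {
		return tenantInfo{}, errMissingRecord
	}
	return info, nil
}

// waitForRecord retries while the record is missing, up to maxAttempts.
// onMiss runs after each miss; the example uses it to simulate time passing.
func waitForRecord(
	c *infoCache, id uint64, maxAttempts int, onMiss func(attempt int),
) (tenantInfo, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		info, err := c.get(id)
		if err == nil {
			return info, nil
		}
		if !errors.Is(err, errMissingRecord) {
			return tenantInfo{}, err
		}
		onMiss(attempt)
	}
	return tenantInfo{}, errMissingRecord
}

func main() {
	cache := newInfoCache()
	// Simulate CREATE TENANT landing after the client's second failed lookup.
	info, err := waitForRecord(cache, 10, 5, func(attempt int) {
		if attempt == 2 {
			cache.set(10, tenantInfo{Name: "app"})
		}
	})
	fmt.Println(info.Name, err)
}
```

The point of the sketch is the shape of `get`: the caller can tell "no record yet" apart from "record with empty fields", which is what lets the client wait cleanly instead of starting up on bogus data.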

knz · 2023-07-07T17:16:05Z

Internal discussion: https://cockroachlabs.slack.com/archives/C02HWA24541/p1688559273013399

stevendanna

Thanks for pushing this forward. I gave it an initial read through and left some minor comments.

stevendanna · 2023-07-10T07:55:38Z

pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/watcher.go

-		log.Fatalf(ctx, "unknown update type: %v", u.Type)
+		err := errors.AssertionFailedf("unknown update type: %v", u.Type)
+		logcrash.ReportOrPanic(ctx, &w.st.SV, "%w", err)
+		log.Warningf(ctx, "%v", err)


ReportOrPanic appears to already log the given message at the Error level, given that, is there much value in also logging it as a warning here?

stevendanna · 2023-07-10T08:00:25Z

pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/decoder.go

+			rawDataState := tree.MustBeDInt(i)
+			if rawDataState >= 0 && rawDataState <= tree.DInt(mtinfopb.MaxDataState) {
+				dataState = mtinfopb.TenantDataState(rawDataState)
+			} else {


I could see this kind of validation being pushed into a constructor method in mtinfopb. I don't feel too strongly either way though.

Adding a bit more here. I meant something like:

dataState, err := mtinfopb.NewDataState(rawDataState)
if err != nil {
    log.Warning...
}

Where NewDataState is where the bounds check lives.

But, I don't feel strongly about that.
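
To make the suggestion concrete, here is a self-contained sketch of a constructor that owns the bounds check. It deliberately does not use the real `mtinfopb` package: the `dataState` type, the `maxDataState` value and `newDataState` are hypothetical stand-ins, and the upper bound of 2 is a placeholder.

```go
package main

import "fmt"

// dataState is a hypothetical stand-in for mtinfopb.TenantDataState.
type dataState uint32

// maxDataState is a placeholder upper bound chosen for this example only.
const maxDataState dataState = 2

// newDataState validates a raw integer read from a table row and converts
// it to a dataState. Keeping the bounds check here means callers cannot
// forget it.
func newDataState(raw int64) (dataState, error) {
	if raw < 0 || raw > int64(maxDataState) {
		return 0, fmt.Errorf("invalid data state: %d", raw)
	}
	return dataState(raw), nil
}

func main() {
	for _, raw := range []int64{1, 7} {
		ds, err := newDataState(raw)
		if err != nil {
			// The decoder would log a warning here and keep a default value.
			fmt.Println("warning:", err)
			continue
		}
		fmt.Println("decoded data state:", ds)
	}
}
```

With this shape the decoder only has to handle the returned error, and any other caller that builds a data state from untrusted input gets the same validation.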

stevendanna · 2023-07-10T08:09:26Z

pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/watcher_test.go

 				if !found {
 					return "not-found"
 				}
-				return fmt.Sprintf("%v", tenantcapabilitiestestutils.AlteredCapabilitiesString(cp))
+				var buf strings.Builder
+				fmt.Fprintf(&buf, "%+v\n", pretty.Formatter(info))


Does this command need to print the whole info struct? Seems like the end-result is mostly having to write out the capabilities twice in the testfile.

stevendanna · 2023-07-10T08:16:41Z

pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/watcher.go

+	// Fast path check.
+	select {
+	case <-w.startCh:
+		return w.startErr
+	default:
+	}


The rest of this function doesn't look too onerous, did we see problems without this fast path?

stevendanna · 2023-07-10T08:25:15Z

pkg/server/tenantsettingswatcher/watcher.go

@@ -84,8 +84,8 @@ func New(
 // canceled or the stopper is stopped prior to the initial data being retrieved.
 func (w *Watcher) Start(ctx context.Context, sysTableResolver catalog.SystemTableIDResolver) error {
 	w.startCh = make(chan struct{})
+	defer func() { close(w.startCh) }()


[nit] No need for the func wrapper here I believe.

stevendanna · 2023-07-10T08:28:55Z

pkg/server/node.go

 // TenantSettings implements the kvpb.InternalServer interface.
 func (n *Node) TenantSettings(
 	args *kvpb.TenantSettingsRequest, stream kvpb.Internal_TenantSettingsServer,
 ) error {
 	ctx := n.storeCfg.AmbientCtx.AnnotateCtx(stream.Context())
 	ctxDone := ctx.Done()

-	w := n.tenantSettingsWatcher
-	if err := w.WaitForStart(ctx); err != nil {
+	w, w2, err := n.waitForTenantWatcherReadiness(ctx)


Perhaps we could use longer variable names here so it is easier to remember which watcher is for what.

knz

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nkodali, @renatolabs, @srosenberg, and @yuzefovich)

pkg/kv/kvclient/kvtenant/setting_overrides.go line 75 at r5 (raw file):
@yuzefovich asks:

I don't quite understand this block. runTenantSettingsSubscription is called in a single spot in which we derive a child context that is canceled on quiesce, so why do we need first and second cases? Also why don't we just exit when the cancellation / quiesce happens? Why don't we exit when non-io.EOF error (other than MissingRecordError) is Recved?

pkg/kv/kvpb/api.proto line 3138 at r4 (raw file):
@yuzefovich asks:

The second sentence in this paragraph to me implies that creators of TenantSettingsEvent must always do all 3 points to keep backwards-compatibility. However, IIUC this is only required until the cluster has been upgraded to 23.2, at which point TenantSettingsEvent can be created in any manner, contradicting any of the 3 points. Is that correct?

pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/decoder.go line 130 at r2 (raw file):
@yuzefovich asks:

nit: it seems like nil settings is only because of TestingDecoderFn which is only called in one spot, so I'd refactor the testing fn to pass valid settings object too, in order to remove this non-nil check.

knz

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nkodali, @renatolabs, @srosenberg, @stevendanna, and @yuzefovich)

pkg/kv/kvclient/kvtenant/setting_overrides.go line 75 at r5 (raw file):

runTenantSettingsSubscription is called in a single spot in which we derive a child context that is canceled on quiesce

I hadn't seen that. Simplified.

why don't we just exit when the cancellation / quiesce happens?

Oversight. Fixed.

Why don't we exit when non-io.EOF error (other than MissingRecordError) is Recved?

This is backward compatibility with current uses. We have roachprod and ORM test suites that require the retry behavior... CI complains loudly when we do the simple exit here, which is why I kept the previous behavior.

pkg/kv/kvpb/api.proto line 3138 at r4 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

@yuzefovich asks:

The second sentence in this paragraph to me implies that creators of TenantSettingsEvent must always do all 3 points to keep backwards-compatibility. However, IIUC this is only required until the cluster has been upgraded to 23.2, at which point TenantSettingsEvent can be created in any manner, contradicting any of the 3 points. Is that correct?

Indeed. Clarified in comment.

pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/decoder.go line 130 at r2 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

@yuzefovich asks:

nit: it seems like nil settings is only because of TestingDecoderFn which is only called in one spot, so I'd refactor the testing fn to pass valid settings object too, in order to remove this non-nil check.

Done.

pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/watcher.go line 162 at r2 (raw file):

Previously, stevendanna (Steven Danna) wrote…

ReportOrPanic appears to already log the given message at the Error level, given that, is there much value in also logging it as a warning here?

Done.

pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/watcher.go line 237 at r4 (raw file):

Previously, stevendanna (Steven Danna) wrote…

The rest of this function doesn't look too onerous, did we see problems without this fast path?

I copied and pasted this from server/tenantsettingswatcher. Simplified.

pkg/server/node.go line 2059 at r5 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Perhaps we could use longer variable names here so it is easier to remember which watcher is for what.

Let's do it.

pkg/server/tenantsettingswatcher/watcher.go line 87 at r4 (raw file):

Previously, stevendanna (Steven Danna) wrote…

[nit] No need for the func wrapper here I believe.

Again, this was a copy-paste. Simplified.

pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/watcher_test.go line 251 at r3 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Does this command need to print the whole info struct? Seems like the end-result is mostly having to write out the capabilities twice in the testfile.

Done.

yuzefovich

Nice work! Although it'd probably be worth waiting for Steven's approval too.

Reviewed 2 of 7 files at r2, 36 of 36 files at r6, 19 of 19 files at r7, 14 of 14 files at r8, 20 of 20 files at r9, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @knz, @nkodali, @renatolabs, @srosenberg, and @stevendanna)

pkg/kv/kvclient/kvtenant/connector.go line 963 at r9 (raw file):

	defer c.mu.Unlock()
	if c.mu.client == client {
		err := c.mu.client.conn.Close() // nolint:grpcconnclose

nit: should closing the connection be in a separate commit that we want to backport? Or do you plan to backport the whole PR anyway?

pkg/kv/kvclient/kvtenant/setting_overrides.go line 75 at r5 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

runTenantSettingsSubscription is called in a single spot in which we derive a child context that is canceled on quiesce

I hadn't seen that. Simplified.

why don't we just exit when the cancellation / quiesce happens?

Oversight. Fixed.

Why don't we exit when non-io.EOF error (other than MissingRecordError) is Recved?

This is backward compatibility with current uses. We have roachprod and ORM test suites that require the retry behavior... CI complains loudly when we do the simple exit here, which is why I kept the previous behavior.

Makes sense, thanks.

pkg/server/settingswatcher/settings_watcher.go line 258 at r6 (raw file):

		err = errors.NewAssertionErrorWithWrappedErrf(err, "failed to decode settings row %v", kv.Key)
		logcrash.ReportOrPanic(ctx, &s.settings.SV, "%w", err)
		log.Warningf(ctx, "%v", err)

nit: ditto for removing redundant logging.

knz

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nkodali, @renatolabs, @srosenberg, @stevendanna, and @yuzefovich)

pkg/kv/kvclient/kvtenant/connector.go line 963 at r9 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: should closing the connection be in a separate commit that we want to backport? Or do you plan to backport the whole PR anyway?

Not a bad idea! Moving this change to #106576 for discussion.

knz

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nkodali, @renatolabs, @srosenberg, @stevendanna, and @yuzefovich)

pkg/server/settingswatcher/settings_watcher.go line 258 at r6 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: ditto for removing redundant logging.

Done.

knz

Steven, PTAL.

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nkodali, @renatolabs, @srosenberg, @stevendanna, and @yuzefovich)

pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/decoder.go line 115 at r3 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Adding a bit more here. I meant something like:
dataState, err := mtinfopb.NewDataState(rawDataState)
if err != nil {
    log.Warning...
}
Where NewDataState is where the bounds check lives.

But, I don't feel strongly about that.

Thanks, I added a commit to do that.

yuzefovich

Reviewed 33 of 33 files at r10, 13 of 13 files at r11, 9 of 9 files at r12, 15 of 15 files at r13, 7 of 7 files at r14, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz, @nkodali, @renatolabs, @srosenberg, and @stevendanna)

pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/decoder.go line 67 at r14 (raw file):

	// The remaining columns are stored in the value; we're just interested in the
	// info column.

nit: "just info column" seems out of date now.

Release note: None

This patch extends the infrastructure originally designed to track only the capabilities field of tenant records, into a general-purpose cache for more fields in each tenant record.

In this patch, it is extended to also track changes to the tenant name, service mode and data state.

Because of this change, all the APIs and package names are becoming incorrect for their purpose (the focus is not on just capabilities any more). The work to rename them will happen in a later commit.

Release note: None

Context: This change is part of a larger epic to make SQL servers aware of more of their tenant metadata (incl service mode, data state, capabilities and tenant name).

There are two moving pieces in this commit:

- an extension to the `TenantSettings` streaming RPC endpoint.

  The vision here is to piggy-back on the existing `TenantSettings`, which so far was only used to communicate tenant setting overrides, to also communicate other tenant metadata.

  We are careful to ensure the new data appears as a no-op from the perspective of previous-version clients of the RPC.

- a channel-based synchronization protocol to ensure that the streaming endpoint `TenantSettings` gets activated precisely when there are row updates in the tenants table.

  Here we mirror one-to-one the synchronization logic already present in the "tenant settings watcher" (`pkg/server/tenantsettingswatcher/watcher.go`). The reviewer can satisfy themselves the logic is identical by putting the source code for both watchers side-by-side.

As in the previous commit, the extension in purpose means that the various things involved here are not named properly any more;

- the "capabilities watcher" is hardly focused on capabilities any more.
- the "TenantSettings" endpoint (and the request/response types) are hardly focused on settings any more.

Later work will take care of renaming these various components to properly reflect their generalized purpose.

Release note: None

Prior to this patch, the API on the tenant info watcher would return an "undefined" metadata payload if called prior to the tenant record being created.

This was most visible in the following scenario:

1. new cluster starts. watcher+rangefeed start successfully (tenant table empty)
2. tenant client connects. At this time there is no metadata for its tenant ID, so the metadata payload is available but empty.
3. CREATE TENANT is executed for the new tenant.
4. only then (later) the rangefeed introduces the metadata into the cache.

This is insufficient for use by the KV tenant client RPCs: there we only want to accept incoming requests from tenant clients after we actually have seen metadata for them.

This patch improves the situation by checking whether the tenant record is present before returning bogus data to the SQL tenant client.

Simultaneously, we add error handling logic in the SQL tenant client to handle this case gracefully.

In a mixed-version deployment (with and without this patch applied), the following behaviors will occur if one starts the SQL tenant server without a tenant record defined:

- Unpatched server: Late startup error in client with "database 1 does not exist".
- Patched server: Client loops forever, waiting for tenant record.

Behavior when the tenant record is created shortly *after* the SQL tenant server starts up:

- Unpatched server: Inconsistent / non-deterministic behavior.
- Patched server: Clean startup of client after tenant record appears.

Release note: None

Release note: None

knz

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nkodali, @renatolabs, @srosenberg, @stevendanna, and @yuzefovich)

pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/decoder.go line 67 at r14 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: "just info column" seems out of date now.

Fixed.

knz · 2023-07-23T14:21:33Z

TFYR!

bors r=yuzefovich,stevendanna

craig · 2023-07-23T15:04:32Z

Build succeeded:

Bazel Essential CI (Cockroach)

healthy-pod · 2023-07-27T18:28:10Z

After this PR was merged, the multitenant-upgrade roachtest started hanging due to:

net/http/server.go:3228 ⋮ [-] 528  ‹http: TLS handshake error from redacted_ip:redacted_port: remote error: tls: bad certificate›

knz · 2023-07-28T20:00:58Z

followup: #107824

knz requested review from stevendanna and yuzefovich July 7, 2023 17:15

knz requested review from a team as code owners July 7, 2023 17:15

knz force-pushed the 20230702-tenant-check-record branch from 166ede8 to ef93182 Compare July 7, 2023 19:05

knz requested a review from a team as a code owner July 7, 2023 19:05

knz requested review from srosenberg and renatolabs and removed request for a team July 7, 2023 19:05

knz force-pushed the 20230702-tenant-check-record branch 3 times, most recently from 23639de to 1bbcf3f Compare July 7, 2023 20:31

knz requested review from a team and nkodali and removed request for a team July 7, 2023 20:31

knz force-pushed the 20230702-tenant-check-record branch from 1bbcf3f to 00213f1 Compare July 7, 2023 20:33

stevendanna reviewed Jul 10, 2023

knz mentioned this pull request Jul 10, 2023

server: notify tenant servers of metadata changes #105441

Merged

knz commented Jul 10, 2023

knz force-pushed the 20230702-tenant-check-record branch from 00213f1 to d5e48ae Compare July 10, 2023 13:29

knz commented Jul 10, 2023

knz force-pushed the 20230702-tenant-check-record branch 2 times, most recently from 10ca28b to cfd1204 Compare July 10, 2023 14:41

knz requested a review from stevendanna July 10, 2023 14:45

knz added the A-multitenancy Related to multi-tenancy label Jul 10, 2023

yuzefovich approved these changes Jul 10, 2023

knz mentioned this pull request Jul 11, 2023

kvtenant: properly release resources when reconnecting #106576

Draft

knz commented Jul 11, 2023

knz force-pushed the 20230702-tenant-check-record branch from cfd1204 to 0039c66 Compare July 11, 2023 13:32

knz commented Jul 11, 2023

knz force-pushed the 20230702-tenant-check-record branch 2 times, most recently from ffed569 to d9e57d7 Compare July 11, 2023 17:39

yuzefovich approved these changes Jul 12, 2023

knz added 5 commits July 23, 2023 16:15

tenantcapabilitieswatcher: avoid log.Fatal

1271e4c

Release note: None

sql,multitenant: new package 'mtinfo' to share SQL decoding logic

35ea6d5

Release note: None

knz commented Jul 23, 2023

knz force-pushed the 20230702-tenant-check-record branch from d9e57d7 to 35ea6d5 Compare July 23, 2023 14:21

craig bot merged commit 164c8cb into cockroachdb:master Jul 23, 2023

knz deleted the 20230702-tenant-check-record branch July 23, 2023 15:33

knz mentioned this pull request Jul 28, 2023

roachtest: multitenant-upgrade failed #107564

Closed

knz mentioned this pull request Jul 28, 2023

server: use in-memory cache of tenant info in server controller #107827

Closed
