Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: properly check existence of tenant record #106414

Merged
merged 5 commits into from
Jul 23, 2023

Conversation

knz
Copy link
Contributor

@knz knz commented Jul 7, 2023

PR extracted from #105441.

Informs #83650.
Part of #98431.
Epic: CRDB-26691

Prior to this patch, the API on the tenant info watcher would return
an "undefined" metadata payload if called prior to the tenant record
being created.

This was most visible in the following scenario:

  1. new cluster starts. watcher+rangefeed start successfully (tenant
    table empty)
  2. tenant client connects. At this time there is no metadata for
    its tenant ID, so the metadata payload is available but empty.
  3. CREATE TENANT is executed for the new tenant.
  4. only then (later) the rangefeed introduces the metadata
    into the cache.

This is insufficient for use by the KV tenant client RPCs: there we
only want to accept incoming requests from tenant clients after we
actually have seen metadata for them.

This patch improves the situation by checking whether the tenant
record is present before returning bogus data to the SQL tenant
client.

Simultaneously, we add error handling logic in the SQL tenant client
to handle this case gracefully.

In a mixed-version deployment (with and without this patch applied),
the following behaviors will occur if one starts the SQL tenant server
without a tenant record defined:

  • Unpatched server: Late startup error in client with "database 1 does not exist".
  • Patched server: Client loops forever, waiting for tenant record.

Behavior when the tenant record is created shortly after the SQL
tenant server starts up:

  • Unpatched server: Inconsistent / non-deterministic behavior.
  • Patched server: Clean startup of client after tenant record appears.

@knz knz requested review from stevendanna and yuzefovich July 7, 2023 17:15
@knz knz requested review from a team as code owners July 7, 2023 17:15
@knz
Copy link
Contributor Author

knz commented Jul 7, 2023

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz knz force-pushed the 20230702-tenant-check-record branch from 166ede8 to ef93182 Compare July 7, 2023 19:05
@knz knz requested a review from a team as a code owner July 7, 2023 19:05
@knz knz requested review from srosenberg and renatolabs and removed request for a team July 7, 2023 19:05
@knz knz force-pushed the 20230702-tenant-check-record branch 3 times, most recently from 23639de to 1bbcf3f Compare July 7, 2023 20:31
@knz knz requested review from a team and nkodali and removed request for a team July 7, 2023 20:31
@knz knz force-pushed the 20230702-tenant-check-record branch from 1bbcf3f to 00213f1 Compare July 7, 2023 20:33
Copy link
Collaborator

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pushing this forward. I gave it an initial read through and left some minor comments.

log.Fatalf(ctx, "unknown update type: %v", u.Type)
err := errors.AssertionFailedf("unknown update type: %v", u.Type)
logcrash.ReportOrPanic(ctx, &w.st.SV, "%w", err)
log.Warningf(ctx, "%v", err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReportOrPanic appears to already log the given message at the Error level, given that, is there much value in also logging it as a warning here?

Comment on lines 112 to 114
rawDataState := tree.MustBeDInt(i)
if rawDataState >= 0 && rawDataState <= tree.DInt(mtinfopb.MaxDataState) {
dataState = mtinfopb.TenantDataState(rawDataState)
} else {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could see this kind of validation being pushed into a constructor method in mtinfopb. I don't feel too strongly either way though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a bit more here. I meant something like:

dataState, err := mtinfopb.NewDataState(rawDataState)
if err != nil {
    log.Warning...
}

Where NewDataState is where the bounds check lives.

But, I don't feel strongly about that.

if !found {
return "not-found"
}
return fmt.Sprintf("%v", tenantcapabilitiestestutils.AlteredCapabilitiesString(cp))
var buf strings.Builder
fmt.Fprintf(&buf, "%+v\n", pretty.Formatter(info))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this command need to print the whole info struct? Seems like the end-result is mostly having to write out the capabilities twice in the testfile.

Comment on lines 232 to 237
// Fast path check.
select {
case <-w.startCh:
return w.startErr
default:
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rest of this function doesn't look too onerous, did we see problems without this fast path?

@@ -84,8 +84,8 @@ func New(
// canceled or the stopper is stopped prior to the initial data being retrieved.
func (w *Watcher) Start(ctx context.Context, sysTableResolver catalog.SystemTableIDResolver) error {
w.startCh = make(chan struct{})
defer func() { close(w.startCh) }()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] No need for the func wrapper here I believe.

// TenantSettings implements the kvpb.InternalServer interface.
func (n *Node) TenantSettings(
args *kvpb.TenantSettingsRequest, stream kvpb.Internal_TenantSettingsServer,
) error {
ctx := n.storeCfg.AmbientCtx.AnnotateCtx(stream.Context())
ctxDone := ctx.Done()

w := n.tenantSettingsWatcher
if err := w.WaitForStart(ctx); err != nil {
w, w2, err := n.waitForTenantWatcherReadiness(ctx)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could use longer variable names here so it is easier to remember which watcher is for what.

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nkodali, @renatolabs, @srosenberg, and @yuzefovich)


pkg/kv/kvclient/kvtenant/setting_overrides.go line 75 at r5 (raw file):
@yuzefovich asks

I don't quite understand this block. runTenantSettingsSubscription is called in a single spot in which we derive a child context that is canceled on quiesce, so why do we need first and second cases? Also why don't we just exit when the cancellation / quiesce happens? Why don't we exit when non-io.EOF error (other than MissingRecordError) is Recved?


pkg/kv/kvpb/api.proto line 3138 at r4 (raw file):
@yuzefovich asks:

The second sentence in this paragraph to me implies that creators of TenantSettingsEvent must always do all 3 points to keep backwards-compatibility. However, IIUC this is only required until the cluster has been upgraded to 23.2, at which point TenantSettingsEvent can be created in any manner, contradicting any of the 3 points. Is that correct?


pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/decoder.go line 130 at r2 (raw file):
@yuzefovich asks:

nit: it seems like nil settings is only because of TestingDecoderFn which is only called in one spot, so I'd refactor the testing fn to pass valid settings object too, in order to remove this non-nil check.

@knz knz force-pushed the 20230702-tenant-check-record branch from 00213f1 to d5e48ae Compare July 10, 2023 13:29
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nkodali, @renatolabs, @srosenberg, @stevendanna, and @yuzefovich)


pkg/kv/kvclient/kvtenant/setting_overrides.go line 75 at r5 (raw file):

runTenantSettingsSubscription is called in a single spot in which we derive a child context that is canceled on quiesce

I hadn't seen that. Simplified.

why don't we just exit when the cancellation / quiesce happens?

Oversight. Fixed.

Why don't we exit when non-io.EOF error (other than MissingRecordError) is Recved?

This is backward compatibility with current uses. We have roachprod and ORM test suites that require the retry behavior... CI complains loudly when we do the simple exit here, which is why I kept the previous behavior.


pkg/kv/kvpb/api.proto line 3138 at r4 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

@yuzefovich asks:

The second sentence in this paragraph to me implies that creators of TenantSettingsEvent must always do all 3 points to keep backwards-compatibility. However, IIUC this is only required until the cluster has been upgraded to 23.2, at which point TenantSettingsEvent can be created in any manner, contradicting any of the 3 points. Is that correct?

Indeed. Clarified in comment.


pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/decoder.go line 130 at r2 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

@yuzefovich asks:

nit: it seems like nil settings is only because of TestingDecoderFn which is only called in one spot, so I'd refactor the testing fn to pass valid settings object too, in order to remove this non-nil check.

Done.


pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/watcher.go line 162 at r2 (raw file):

Previously, stevendanna (Steven Danna) wrote…

ReportOrPanic appears to already log the given message at the Error level, given that, is there much value in also logging it as a warning here?

Done.


pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/watcher.go line 237 at r4 (raw file):

Previously, stevendanna (Steven Danna) wrote…

The rest of this function doesn't look too onerous, did we see problems without this fast path?

I copied-and-pasted this from server/tenantsettingswatcher. simplified.


pkg/server/node.go line 2059 at r5 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Perhaps we could use longer variable names here so it is easier to remember which watcher is for what.

Let's do it.


pkg/server/tenantsettingswatcher/watcher.go line 87 at r4 (raw file):

Previously, stevendanna (Steven Danna) wrote…

[nit] No need for the func wrapper here I believe.

again, this was a copy paste. simplified.


pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/watcher_test.go line 251 at r3 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Does this command need to print the whole info struct? Seems like the end-result is mostly having to write out the capabilities twice in the testfile.

Done.

@knz knz force-pushed the 20230702-tenant-check-record branch 2 times, most recently from 10ca28b to cfd1204 Compare July 10, 2023 14:41
@knz knz requested a review from stevendanna July 10, 2023 14:45
@knz knz added the A-multitenancy Related to multi-tenancy label Jul 10, 2023
Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! :lgtm: although it'd be probably worth it to wait for Steven's approval too.

Reviewed 2 of 7 files at r2, 36 of 36 files at r6, 19 of 19 files at r7, 14 of 14 files at r8, 20 of 20 files at r9, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz, @nkodali, @renatolabs, @srosenberg, and @stevendanna)


pkg/kv/kvclient/kvtenant/connector.go line 963 at r9 (raw file):

	defer c.mu.Unlock()
	if c.mu.client == client {
		err := c.mu.client.conn.Close() // nolint:grpcconnclose

nit: should closing the connection be in a separate commit that we want to backport? Or do you plan to backport the whole PR anyway?


pkg/kv/kvclient/kvtenant/setting_overrides.go line 75 at r5 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

runTenantSettingsSubscription is called in a single spot in which we derive a child context that is canceled on quiesce

I hadn't seen that. Simplified.

why don't we just exit when the cancellation / quiesce happens?

Oversight. Fixed.

Why don't we exit when non-io.EOF error (other than MissingRecordError) is Recved?

This is backward compatibility with current uses. We have roachprod and ORM test suites that require the retry behavior... CI complains loudly when we do the simple exit here, which is why I kept the previous behavior.

Makes sense, thanks.


pkg/server/settingswatcher/settings_watcher.go line 258 at r6 (raw file):

		err = errors.NewAssertionErrorWithWrappedErrf(err, "failed to decode settings row %v", kv.Key)
		logcrash.ReportOrPanic(ctx, &s.settings.SV, "%w", err)
		log.Warningf(ctx, "%v", err)

nit: ditto for removing redundant logging.

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nkodali, @renatolabs, @srosenberg, @stevendanna, and @yuzefovich)


pkg/kv/kvclient/kvtenant/connector.go line 963 at r9 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: should closing the connection be in a separate commit that we want to backport? Or do you plan to backport the whole PR anyway?

Not a bad idea! Moving this change to #106576 for discussion.

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nkodali, @renatolabs, @srosenberg, @stevendanna, and @yuzefovich)


pkg/server/settingswatcher/settings_watcher.go line 258 at r6 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: ditto for removing redundant logging.

Done.

@knz knz force-pushed the 20230702-tenant-check-record branch from cfd1204 to 0039c66 Compare July 11, 2023 13:32
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Steven, PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nkodali, @renatolabs, @srosenberg, @stevendanna, and @yuzefovich)


pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/decoder.go line 115 at r3 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Adding a bit more here. I meant something like:

dataState, err := mtinfopb.NewDataState(rawDataState)
if err != nil {
    log.Warning...
}

Where NewDataState is where the bounds check lives.

But, I don't feel strongly about that.

Thanks, I added a commit to do that.

@knz knz force-pushed the 20230702-tenant-check-record branch 2 times, most recently from ffed569 to d9e57d7 Compare July 11, 2023 17:39
Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 33 of 33 files at r10, 13 of 13 files at r11, 9 of 9 files at r12, 15 of 15 files at r13, 7 of 7 files at r14, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz, @nkodali, @renatolabs, @srosenberg, and @stevendanna)


pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/decoder.go line 67 at r14 (raw file):

	// The remaining columns are stored in the value; we're just interested in the
	// info column.

nit: "just info column" seems out of date now.

knz added 5 commits July 23, 2023 16:15
This patch extends the infrastructure originally designed to track
only the capabilities field of tenant records, into a general-purpose
cache for more fields in each tenant record.

In this patch, it is extended to also track changes to the tenant
name, service mode and data state.

Because of this change, all the APIs and package names are becoming
incorrect for their purpose (the focus is not on just capabilities any
more). The work to rename them will happen in a later commit.

Release note: None
Context: This change is part of a larger epic to make SQL servers
aware of more of their tenant metadata (incl service mode, data state,
capabilities and tenant name).

There are two moving pieces in this commit:

- an extension to the `TenantSettings` streaming RPC endpoint.

  The vision here is to piggy-back on the existing `TenantSettings`,
  which so far was only used to communicate tenant setting overrides,
  to also communicate other tenant metadata. We are careful to ensure
  the new data appears as a no-op from the perspective of
  previous-version clients of the RPC.

- a channel-based synchronization protocol to ensure that the
  streaming endpoint `TenantSettings` gets activated precisely
  when there are row updates in the tenants table.

  Here we mirror one-to-one the synchronization logic already present
  in the "tenant settings
  watcher" (`pkg/server/tenantsettingswatcher/watcher.go`). The reviewer
  can satisfy themselves the logic is identical by putting the source
  code for both watchers side-by-side.

As in the previous commit, the extension in purpose means that
the various things involved here are not named properly any more;

- the "capabilities watcher" is hardly focused on capabilities any
  more.
- the "TenantSettings" endpoint (and the request/response types)
  are hardly focused on settings any more.

Later work will take care of renaming these various components to
properly reflect their generalized purpose.

Release note: None
Prior to this patch, the API on the tenant info watcher would return
an "undefined" metadata payload if called prior to the tenant record
being created.

This was most visible in the following scenario:

1. new cluster starts. watcher+rangefeed start successfully (tenant
   table empty)
2. tenant client connects. At this time there is no metadata for
   its tenant ID, so the metadata payload is available but empty.
3. CREATE TENANT is executed for the new tenant.
4. only then (later) the rangefeed introduces the metadata
   into the cache.

This is insufficient for use by the KV tenant client RPCs: there we
only want to accept incoming requests from tenant clients after we
actually have seen metadata for them.

This patch improves the situation by checking whether the tenant
record is present before returning bogus data to the SQL tenant
client.

Simultaneously, we add error handling logic in the SQL tenant client
to handle this case gracefully.

In a mixed-version deployment (with and without this patch applied),
the following behaviors will occur if one starts the SQL tenant server
without a tenant record defined:

- Unpatched server: Late startup error in client with "database 1 does not exist".
- Patched server: Client loops forever, waiting for tenant record.

Behavior when the tenant record is created shortly *after* the SQL
tenant server starts up:

- Unpatched server: Inconsistent / non-deterministic behavior.
- Patched server: Clean startup of client after tenant record appears.

Release note: None
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nkodali, @renatolabs, @srosenberg, @stevendanna, and @yuzefovich)


pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/decoder.go line 67 at r14 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: "just info column" seems out of date now.

Fixed.

@knz knz force-pushed the 20230702-tenant-check-record branch from d9e57d7 to 35ea6d5 Compare July 23, 2023 14:21
@knz
Copy link
Contributor Author

knz commented Jul 23, 2023

TFYR!

bors r=yuzefovich,stevendanna

@craig
Copy link
Contributor

craig bot commented Jul 23, 2023

Build succeeded:

@craig craig bot merged commit 164c8cb into cockroachdb:master Jul 23, 2023
@knz knz deleted the 20230702-tenant-check-record branch July 23, 2023 15:33
@healthy-pod
Copy link
Contributor

healthy-pod commented Jul 27, 2023

After this PR was merged, the multitenant-upgrade started hanging due to:

net/http/server.go:3228 ⋮ [-] 528  ‹http: TLS handshake error from redacted_ip:redacted_port: remote error: tls: bad certificate›

@knz
Copy link
Contributor Author

knz commented Jul 28, 2023

followup: #107824

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-multitenancy Related to multi-tenancy
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants