Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spanconfig/store: support applying a batch of updates #73150

Merged
merged 1 commit into from
Nov 30, 2021

Conversation

irfansharif
Copy link
Contributor

We first introduced spanconfig.StoreWriter in #70287. Here we extend the
interface to accept a batch of updates instead of just one.

type StoreWriter interface {
    Apply(ctx context.Context, dryrun bool, updates ...Update) (
        deleted []roachpb.Span, added []roachpb.SpanConfigEntry,
    )
}

The implementation is subtle -- we're not processing one update at a
time. The semantics we're after is applying a batch of updates
atomically on a consistent snapshot of the underlying store. This comes
up in the upcoming spanconfig.Reconciler (#71994) -- there, following a
zone config/descriptor change, we want to update KV state in a single
request/RTT instead of an RTT per descendent table. The intended
usage is better captured in the aforementioned PR; let's now talk
about what this entails for the datastructure.

To apply a single update, we want to find all overlapping spans and
clear out just the intersections. If the update is adding a new span
config, we'll also want to add the corresponding store entry after. We
do this by deleting all overlapping spans in their entirety and
re-adding the non-overlapping segments. Pseudo-code:

for entry in store.overlapping(update.span):
  union, intersection = union(update.span, entry), intersection(update.span, entry)
  pre  = span{union.start_key, intersection.start_key}
  post = span{intersection.end_key, union.end_key}

  delete {span=entry.span, conf=entry.conf}
  if entry.contains(update.span.start_key):
    # First entry overlapping with update.
    add {span=pre, conf=entry.conf} if non-empty
  if entry.contains(update.span.end_key):
    # Last entry overlapping with update.
    add {span=post, conf=entry.conf} if non-empty

if adding:
  add {span=update.span, conf=update.conf} # add ourselves

When extending to a set of updates, things are more involved. Let's
assume that the updates are non-overlapping and sorted by start key. As
before, we want to delete overlapping entries in their entirety and
re-add the non-overlapping segments. With multiple updates, it's
possible that a segment being re-added will overlap another update. If
processing one update at a time in sorted order, we want to only re-add
the gap between the consecutive updates.

keyspace         a  b  c  d  e  f  g  h  i  j
existing state      [--------X--------)
updates          [--A--)           [--B--)

When processing [a,c):A, after deleting [b,h):X, it would be incorrect
to re-add [c,h):X since we're also looking to apply [g,i):B. Instead of
re-adding the trailing segment right away, we carry it forward and
process it when iterating over the second, possibly overlapping update.
In our example, when iterating over [g,i):B we can subtract the overlap
from [c,h):X and only re-add [c,g):X.

It's also possible for the segment to extend past the second update. In
the example below, when processing [d,f):B and having [b,h):X carried
over, we want to re-add [c,d):X and carry forward [f,h):X to the update
after (i.e. [g,i):C)).

keyspace         a  b  c  d  e  f  g  h  i  j
existing state      [--------X--------)
updates          [--A--)  [--B--)  [--C--)

One final note: we're iterating through the updates without actually
applying any mutations. Going back to our first example, when processing
[g,i):B, retrieving the set of overlapping spans would (again) retrieve
[b,h):X -- an entry we've already encountered when processing [a,c):A.
Re-adding non-overlapping segments naively would re-add [b,g):X -- an
entry that overlaps with our last update [a,c):A. When retrieving
overlapping entries, we need to exclude any that overlap with the
segment that was carried over. Pseudo-code:

carry-over = <empty>
for update in updates:
  carried-over, carry-over = carry-over, <empty>
  if update.overlap(carried-over):
      # Fill in the gap between consecutive updates.
      add {span=span{carried-over.start_key, update.start_key}, conf=carried-over.conf}
      # Consider the trailing span after update; carry it forward if non-empty.
      carry-over = {span=span{update.end_key, carried-over.end_key}, conf=carried-over.conf}
  else:
      add {span=carried-over.span, conf=carried-over.conf} if non-empty

  for entry in store.overlapping(update.span):
     if entry.overlap(processed):
          continue # already processed

      union, intersection = union(update.span, entry), intersection(update.span, entry)
      pre  = span{union.start_key, intersection.start_key}
      post = span{intersection.end_key, union.end_key}

      delete {span=entry.span, conf=entry.conf}
      if entry.contains(update.span.start_key):
          # First entry overlapping with update.
          add {span=pre, conf=entry.conf} if non-empty
      if entry.contains(update.span.end_key):
          # Last entry overlapping with update.
          carry-over = {span=post, conf=entry.conf}

   if adding:
      add {span=update.span, conf=update.conf} # add ourselves

add {span=carry-over.span, conf=carry-over.conf} if non-empty

We've extended the randomized testing suite to generate batches of
updates at a time. We've also added a few illustrated datadriven tests.

Release note: None

@irfansharif irfansharif requested a review from a team as a code owner November 25, 2021 01:01
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty cool change! Let's extend the testing a bit more to include deletes in addition to sets in batches, like I commented in-line, but other than that this feels close. 💯

Reviewed 15 of 17 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @irfansharif, and @nvanbenschoten)


pkg/spanconfig/spanconfig.go, line 195 at r1 (raw file):

// StoreWriter is the write-only portion of the Store interface.
type StoreWriter interface {
	// Apply applies the given update[1]. It also returns the existing spans that

Let's update this text to talk about multiple updates. It's also worth calling out on the interface that the set of updates must be non-overlapping.


pkg/spanconfig/spanconfigkvsubscriber/kvsubscriber.go, line 338 at r1 (raw file):

			s.mu.Lock()
			for _, ev := range events {
				s.mu.internal.Apply(ctx, false /* dryrun */, ev.(*bufferEvent).Update)

Add a TODO here to push all updated down into the Store instead of applying them one by one so that we can get rid of the locking here (from the conversation we had a couple of weeks ago).


pkg/spanconfig/spanconfigstore/store.go, line 139 at r1 (raw file):

	deleted, added, err := s.applyInternal(dryrun, updates...)
	if err != nil {
		log.Fatalf(ctx, "%v", err)

Any reason to fatal here instead of bubbling up the error to the caller?


pkg/spanconfig/spanconfigstore/store_test.go, line 60 at r1 (raw file):

// 		apply
// 		set [c,e):C
// 		delete [c,e)

This shouldn't be valid because the spans are overlapping, right? Let's update this comment with "real" inputs and outputs please.


pkg/spanconfig/spanconfigstore/store_test.go, line 216 at r1 (raw file):

				updates[i] = getRandomUpdate()
			}
			sort.Slice(updates, func(i, j int) bool {

Instead of sorting here and then doing a linear scan to figure out overlaps, you can compare each span to figure out overlaps so you don't pass sorted input to Apply.


pkg/spanconfig/spanconfigstore/testdata/batched/fractured, line 1 at r1 (raw file):

# Test semantics of batched updates that partially overlap with what's already

Should this be in the testdata/single folder instead? You're only using batched updates for the test setup.


pkg/spanconfig/spanconfigstore/testdata/batched/overlapping, line 1 at r1 (raw file):

# Test semantics of batched updates that partially overlap with what's already

This is a pretty cool test! We should extend it to include some deletes in the batched updates as well (maybe in a separate file, for readability).


pkg/spanconfig/spanconfigstore/testdata/batched/straddle, line 2 at r1 (raw file):

# Test semantics of batched updates that partially overlap with what's already
# present.

How're you viewing "fracturd" vs. "straddle"? Looks like they're setting up things differently, but I'm confused by the terminology.


pkg/spanconfig/spanconfigstore/testdata/batched/straddle, line 32 at r1 (raw file):

# keys    a  b  c  d  e  f  g  h  i  j
# state      [--A--)  [--B--)
# set           [---X-------)

Let's add another test where we set [c,h]: X to make this span extend beyond on the right.


pkg/spanconfig/spanconfigstore/testdata/batched/straddle, line 45 at r1 (raw file):

# keys    a  b  c  d  e  f  g  h  i  j
# state      [--A--)  [--B--)
# set        [---X-------)

Similar to above, let's add a test where we set [a,f]: X to make this span extend beyond on the left.

@irfansharif irfansharif force-pushed the 211123.batch-sc-store branch from 6bdecb2 to d3bd68b Compare November 30, 2021 12:43
Copy link
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, and @nvanbenschoten)


pkg/spanconfig/spanconfig.go, line 195 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Let's update this text to talk about multiple updates. It's also worth calling out on the interface that the set of updates must be non-overlapping.

Done.


pkg/spanconfig/spanconfigkvsubscriber/kvsubscriber.go, line 338 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Add a TODO here to push all updated down into the Store instead of applying them one by one so that we can get rid of the locking here (from the conversation we had a couple of weeks ago).

Done.


pkg/spanconfig/spanconfigstore/store.go, line 139 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Any reason to fatal here instead of bubbling up the error to the caller?

applyInternal is the bubble-up-err version of this method that we're using in tests. Any errors here are data structure bugs, there's nothing the caller can do -- what's stored has either been corrupted or is being used incorrectly. Fatal-ing feels like the right thing to do.


pkg/spanconfig/spanconfigstore/store_test.go, line 60 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

This shouldn't be valid because the spans are overlapping, right? Let's update this comment with "real" inputs and outputs please.

Done.


pkg/spanconfig/spanconfigstore/store_test.go, line 216 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Instead of sorting here and then doing a linear scan to figure out overlaps, you can compare each span to figure out overlaps so you don't pass sorted input to Apply.

Added a Shuffle instead.


pkg/spanconfig/spanconfigstore/testdata/batched/overlapping, line 1 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

This is a pretty cool test! We should extend it to include some deletes in the batched updates as well (maybe in a separate file, for readability).

Done (added here for contrast with non-deletes).


pkg/spanconfig/spanconfigstore/testdata/batched/straddle, line 2 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

How're you viewing "fracturd" vs. "straddle"? Looks like they're setting up things differently, but I'm confused by the terminology.

Renamed to "encompass". I was distinguishing between updates straddling two spans and updates that encompassed multiple spans wholly. It's a flimsy distinction, but presented as edge cases we needed to consider.


pkg/spanconfig/spanconfigstore/testdata/batched/straddle, line 32 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Let's add another test where we set [c,h]: X to make this span extend beyond on the right.

Done.


pkg/spanconfig/spanconfigstore/testdata/batched/straddle, line 45 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Similar to above, let's add a test where we set [a,f]: X to make this span extend beyond on the left.

Done.


pkg/spanconfig/spanconfigstore/testdata/batched/fractured, line 1 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Should this be in the testdata/single folder instead? You're only using batched updates for the test setup.

Added some more batched ops instead.

Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 5 of 6 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @nvanbenschoten)


pkg/spanconfig/spanconfig.go, line 262 at r2 (raw file):

	// [1]: Unless dryrun is true. We'll still generate the same {deleted,added}
	//      lists.
	// [2]: We could alternatively expose a GetAllOverlapping() API to make

I was re-reading your comment about "provided updates aren't no-ops" again and that feels like an unfortunate implementation detail leaking to the interface. It also doesn't feel great to include the same span in both the added and deleted list either. This is just musings for the future, but maybe we should consider a GetAllOverlapping API at some point.


pkg/spanconfig/spanconfigstore/store.go, line 139 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

applyInternal is the bubble-up-err version of this method that we're using in tests. Any errors here are data structure bugs, there's nothing the caller can do -- what's stored has either been corrupted or is being used incorrectly. Fatal-ing feels like the right thing to do.

Yeah, but you would also fatal if the caller tries to Apply overlapping spans. No strong feelings though.


pkg/spanconfig/spanconfigstore/testdata/batched/straddle, line 2 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Renamed to "encompass". I was distinguishing between updates straddling two spans and updates that encompassed multiple spans wholly. It's a flimsy distinction, but presented as edge cases we needed to consider.

The distinction makes sense. Let's update the comment at the top of the file to be a bit more specific about this?

We first introduced spanconfig.StoreWriter in cockroachdb#70287. Here we extend the
interface to accept a batch of updates instead of just one.

    type StoreWriter interface {
        Apply(ctx context.Context, dryrun bool, updates ...Update) (
	    deleted []roachpb.Span, added []roachpb.SpanConfigEntry,
        )
    }

The implementation is subtle -- we're not processing one update at a
time. The semantics we're after is applying a batch of updates
atomically on a consistent snapshot of the underlying store. This comes
up in the upcoming spanconfig.Reconciler (cockroachdb#71994) -- there, following a
zone config/descriptor change, we want to update KV state in a single
request/RTT instead of an RTT per descendent table. The intended
usage is better captured in the aforementioned PR; let's now talk
about what this entails for the datastructure.

To apply a single update, we want to find all overlapping spans and
clear out just the intersections. If the update is adding a new span
config, we'll also want to add the corresponding store entry after. We
do this by deleting all overlapping spans in their entirety and
re-adding the non-overlapping segments. Pseudo-code:

  for entry in store.overlapping(update.span):
      union, intersection = union(update.span, entry), intersection(update.span, entry)
      pre  = span{union.start_key, intersection.start_key}
      post = span{intersection.end_key, union.end_key}

      delete {span=entry.span, conf=entry.conf}
      if entry.contains(update.span.start_key):
          # First entry overlapping with update.
          add {span=pre, conf=entry.conf} if non-empty
      if entry.contains(update.span.end_key):
          # Last entry overlapping with update.
          add {span=post, conf=entry.conf} if non-empty

  if adding:
      add {span=update.span, conf=update.conf} # add ourselves

When extending to a set of updates, things are more involved. Let's
assume that the updates are non-overlapping and sorted by start key. As
before, we want to delete overlapping entries in their entirety and
re-add the non-overlapping segments. With multiple updates, it's
possible that a segment being re-added will overlap another update.  If
processing one update at a time in sorted order, we want to only re-add
the gap between the consecutive updates.

    keyspace         a  b  c  d  e  f  g  h  i  j
    existing state      [--------X--------)
    updates          [--A--)           [--B--)

When processing [a,c):A, after deleting [b,h):X, it would be incorrect
to re-add [c,h):X since we're also looking to apply [g,i):B. Instead of
re-adding the trailing segment right away, we carry it forward and
process it when iterating over the second, possibly overlapping update.
In our example, when iterating over [g,i):B we can subtract the overlap
from [c,h):X and only re-add [c,g):X.

It's also possible for the segment to extend past the second update. In
the example below, when processing [d,f):B and having [b,h):X carried
over, we want to re-add [c,d):X and carry forward [f,h):X to the update
after (i.e. [g,i):C)).

    keyspace         a  b  c  d  e  f  g  h  i  j
    existing state      [--------X--------)
    updates          [--A--)  [--B--)  [--C--)

One final note: we're iterating through the updates without actually
applying any mutations. Going back to our first example, when processing
[g,i):B, retrieving the set of overlapping spans would (again) retrieve
[b,h):X -- an entry we've already encountered when processing [a,c):A.
Re-adding non-overlapping segments naively would re-add [b,g):X -- an
entry that overlaps with our last update [a,c):A. When retrieving
overlapping entries, we need to exclude any that overlap with the
segment that was carried over. Pseudo-code:

  carry-over = <empty>
  for update in updates:
      carried-over, carry-over = carry-over, <empty>
      if update.overlap(carried-over):
          # Fill in the gap between consecutive updates.
          add {span=span{carried-over.start_key, update.start_key}, conf=carried-over.conf}
          # Consider the trailing span after update; carry it forward if non-empty.
          carry-over = {span=span{update.end_key, carried-over.end_key}, conf=carried-over.conf}
      else:
          add {span=carried-over.span, conf=carried-over.conf} if non-empty

      for entry in store.overlapping(update.span):
         if entry.overlap(processed):
              continue # already processed

          union, intersection = union(update.span, entry), intersection(update.span, entry)
          pre  = span{union.start_key, intersection.start_key}
          post = span{intersection.end_key, union.end_key}

          delete {span=entry.span, conf=entry.conf}
          if entry.contains(update.span.start_key):
              # First entry overlapping with update.
              add {span=pre, conf=entry.conf} if non-empty
          if entry.contains(update.span.end_key):
              # Last entry overlapping with update.
              carry-over = {span=post, conf=entry.conf}

       if adding:
          add {span=update.span, conf=update.conf} # add ourselves

  add {span=carry-over.span, conf=carry-over.conf} if non-empty

We've extended the randomized testing suite to generate batches of
updates at a time. We've also added a few illustrated datadriven tests.

Release note: None
@irfansharif irfansharif force-pushed the 211123.batch-sc-store branch from d3bd68b to ed14f74 Compare November 30, 2021 15:54
Copy link
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @arulajmani, and @nvanbenschoten)


pkg/spanconfig/spanconfig.go, line 262 at r2 (raw file):
Removed the no-op clause.

maybe we should consider a GetAllOverlapping API

There's that exact suggestion as a TODO below. But I don't think it ends up looking that bad even without it (#71994); I've come to prefer it actually. One peculiarity is that this read-only interface would intuitively sit on the StoreReader portion of the interface, but with no parallels in the sys-cfg span implementation. Feels awkward to include it in the StoreWriter? I dunno, bikeshed territory.

It also doesn't feel great to include the same span in both the added and deleted list either.

I'm not sure what you mean. The "added" list also includes fragments that were "re-added", because the updates only partially overlapped with existing spans -- the remainder need to be re-added. We actually need this property to interface with the KVAccessor.


pkg/spanconfig/spanconfigstore/store.go, line 139 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Yeah, but you would also fatal if the caller tries to Apply overlapping spans. No strong feelings though.

Yea, that would mean the caller's using it incorrectly, aka a bug. I'm not sure that error's "retryable" either, so I feel the fatal more appropriate.


pkg/spanconfig/spanconfigstore/testdata/batched/straddle, line 2 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

The distinction makes sense. Let's update the comment at the top of the file to be a bit more specific about this?

Done.

@craig
Copy link
Contributor

craig bot commented Nov 30, 2021

Build succeeded:

@craig craig bot merged commit 40e4538 into cockroachdb:master Nov 30, 2021
@irfansharif irfansharif deleted the 211123.batch-sc-store branch November 30, 2021 17:41
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Apr 14, 2022
Fixes cockroachdb#72389.
Fixes cockroachdb#66063 (gated behind a cluster setting).

This should drastically reduce the total number of ranges in the system,
especially when running with a large number of tables and/or tenants. To
understand what the new set of split points are, consider the following
test snippet:

    exec-sql tenant=11
    CREATE DATABASE db;
    CREATE TABLE db.t0();
    CREATE TABLE db.t1();
    CREATE TABLE db.t2();
    CREATE TABLE db.t3();
    CREATE TABLE db.t4();
    CREATE TABLE db.t5();
    CREATE TABLE db.t6();
    CREATE TABLE db.t7();
    CREATE TABLE db.t8();
    CREATE TABLE db.t9();
    ALTER TABLE db.t5 CONFIGURE ZONE using num_replicas = 42;
    ----

    # If installing a unique zone config for a table in the middle, we
    # should observe three splits (before, the table itself, and after).

    diff offset=48
    ----
    --- gossiped system config span (legacy)
    +++ span config infrastructure (current)
    ...
     /Tenant/10                                 database system (tenant)
     /Tenant/11                                 database system (tenant)
    +/Tenant/11/Table/106                       range default
    +/Tenant/11/Table/111                       num_replicas=42
    +/Tenant/11/Table/112                       range default

This PR introduces two cluster settings to selectively opt into this
optimization: spanconfig.{tenant,host}_coalesce_adjacent.enabled
(defaulting to true and false respectively). We also don't coalesce
system table ranges on the host tenant. We had a few implementation
choices here:

(a) Down in KV, augment the spanconfig.Store to coalesce the in-memory
    state when updating entries. On every update, we'd check if the span
    we're writing to has the same config as the preceding and/or
    succeeding one, and if so, write out a larger span instead. We
    previously prototyped a form of this in cockroachdb#68491.

    Pros:
    - reduced memory footprint of spanconfig.Store
    - read path (computing splits) stays fast -- just have to look up
      the right span from the backing interval tree
    Cons:
    - uses more persisted storage than necessary
    - difficult to disable the optimization dynamically (though still
      possible -- we'd effectively have to restart the KVSubscriber and
      populate in-memory span config state using a store that
      does/doesn't coalesce configs)
    - difficult to implement; our original prototype did not have cockroachdb#73150
      which is important for reducing reconciliation round trips

(b) Same as (a) but coalesce configs up in the spanconfig.Store
    maintained in reconciler itself.

    Pros:
    - reduced storage use (both persisted and in-memory)
    - read path (computing splits) stays fast -- just have to look up
      the right span from the backing interval tree
    Cons:
    - very difficult to disable the optimization dynamically (each
      tenant process would need to be involved)
    - most difficult to implement

(c) Same as (a) but through another API on the spanconfig.Store
    interface that accepts only a single update at a time and does not
    generate deltas (not something we need down in KV). Removes the
    implementation complexity.

(d) Keep the contents of `system.span_configurations` and the in-memory
    state of spanconfig.Stores as it is today, uncoalesced. When
    determining split points, iterate through adjacent configs within
    the provided key bounds and see whether we could ignore certain
    split keys.

    Pros:
    - easiest to implement
    - easy to disable the optimization dynamically, for ex. through a
      cluster setting
    Cons:
    - uses more storage (persisted and in-memory) than necessary
    - read path (computing splits) is more expensive if iterating
      through adjacent configs

----

This PR implements option (d). For a benchmark on how slow (d) is going
to be in practice with varying numbers of entries to be scanning over
(10k, 100k, 1m):

    $ dev bench pkg/spanconfig/spanconfigstore -f=BenchmarkStoreComputeSplitKey -v

    BenchmarkStoreComputeSplitKey
    BenchmarkStoreComputeSplitKey/num-entries=10000
    BenchmarkStoreComputeSplitKey/num-entries=10000-10               1166842 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=100000
    BenchmarkStoreComputeSplitKey/num-entries=100000-10             12273375 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=1000000
    BenchmarkStoreComputeSplitKey/num-entries=1000000-10           140591766 ns/op
    PASS

It's feasible that in the future we re-work this in favor of (c)
possibly.

Release note: None
Release justification: high benefit change to existing functionality
(affecting only multi-tenant clusters).
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Apr 27, 2022
Fixes cockroachdb#72389.
Fixes cockroachdb#66063 (gated behind a cluster setting).

This should drastically reduce the total number of ranges in the system,
especially when running with a large number of tables and/or tenants. To
understand what the new set of split points are, consider the following
test snippet:

    exec-sql tenant=11
    CREATE DATABASE db;
    CREATE TABLE db.t0();
    CREATE TABLE db.t1();
    CREATE TABLE db.t2();
    CREATE TABLE db.t3();
    CREATE TABLE db.t4();
    CREATE TABLE db.t5();
    CREATE TABLE db.t6();
    CREATE TABLE db.t7();
    CREATE TABLE db.t8();
    CREATE TABLE db.t9();
    ALTER TABLE db.t5 CONFIGURE ZONE using num_replicas = 42;
    ----

    # If installing a unique zone config for a table in the middle, we
    # should observe three splits (before, the table itself, and after).

    diff offset=48
    ----
    --- gossiped system config span (legacy)
    +++ span config infrastructure (current)
    ...
     /Tenant/10                                 database system (tenant)
     /Tenant/11                                 database system (tenant)
    +/Tenant/11/Table/106                       range default
    +/Tenant/11/Table/111                       num_replicas=42
    +/Tenant/11/Table/112                       range default

This PR introduces two cluster settings to selectively opt into this
optimization: spanconfig.{tenant,host}_coalesce_adjacent.enabled
(defaulting to true and false respectively). We also don't coalesce
system table ranges on the host tenant. We had a few implementation
choices here:

(a) Down in KV, augment the spanconfig.Store to coalesce the in-memory
    state when updating entries. On every update, we'd check if the span
    we're writing to has the same config as the preceding and/or
    succeeding one, and if so, write out a larger span instead. We
    previously prototyped a form of this in cockroachdb#68491.

    Pros:
    - reduced memory footprint of spanconfig.Store
    - read path (computing splits) stays fast -- just have to look up
      the right span from the backing interval tree
    Cons:
    - uses more persisted storage than necessary
    - difficult to disable the optimization dynamically (though still
      possible -- we'd effectively have to restart the KVSubscriber and
      populate in-memory span config state using a store that
      does/doesn't coalesce configs)
    - difficult to implement; our original prototype did not have cockroachdb#73150
      which is important for reducing reconciliation round trips

(b) Same as (a) but coalesce configs up in the spanconfig.Store
    maintained in reconciler itself.

    Pros:
    - reduced storage use (both persisted and in-memory)
    - read path (computing splits) stays fast -- just have to look up
      the right span from the backing interval tree
    Cons:
    - very difficult to disable the optimization dynamically (each
      tenant process would need to be involved)
    - most difficult to implement

(c) Same as (a) but through another API on the spanconfig.Store
    interface that accepts only a single update at a time and does not
    generate deltas (not something we need down in KV). Removes the
    implementation complexity.

(d) Keep the contents of `system.span_configurations` and the in-memory
    state of spanconfig.Stores as it is today, uncoalesced. When
    determining split points, iterate through adjacent configs within
    the provided key bounds and see whether we could ignore certain
    split keys.

    Pros:
    - easiest to implement
    - easy to disable the optimization dynamically, for ex. through a
      cluster setting
    Cons:
    - uses more storage (persisted and in-memory) than necessary
    - read path (computing splits) is more expensive if iterating
      through adjacent configs

----

This PR implements option (d). For a benchmark on how slow (d) is going
to be in practice with varying numbers of entries to be scanning over
(10k, 100k, 1m):

    $ dev bench pkg/spanconfig/spanconfigstore -f=BenchmarkStoreComputeSplitKey -v

    BenchmarkStoreComputeSplitKey
    BenchmarkStoreComputeSplitKey/num-entries=10000
    BenchmarkStoreComputeSplitKey/num-entries=10000-10               1166842 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=100000
    BenchmarkStoreComputeSplitKey/num-entries=100000-10             12273375 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=1000000
    BenchmarkStoreComputeSplitKey/num-entries=1000000-10           140591766 ns/op
    PASS

It's feasible that in the future we re-work this in favor of (c)
possibly.

Release note: None
Release justification: high benefit change to existing functionality
(affecting only multi-tenant clusters).
craig bot pushed a commit that referenced this pull request Apr 29, 2022
77975: sql,cli: implement `\password` cli command r=ZhouXing19 a=ZhouXing19

This commit is to add support for the `\password` that securely changes the
password for a user. 

fixes #48543

Release note (cli change): add support for `\password` cli command that enables
secure alteration of the password for a user. The given password will always be pre-hashed 
with the password hash method obtained via the session variable `password-encryption`,
i.e. `scram-sha-256` as the default hashing algorithm.

79700: spanconfig,kv: merge adjacent ranges with identical configs  r=irfansharif a=irfansharif

**spanconfig,kv: merge adjacent ranges with identical configs**

Fixes #72389.
Fixes #66063 (gated behind a cluster setting).

This should drastically reduce the total number of ranges in the system,
especially when running with a large number of tables and/or tenants. To
understand what the new set of split points are, consider the following
test snippet:

    exec-sql tenant=11
    CREATE DATABASE db;
    CREATE TABLE db.t0();
    CREATE TABLE db.t1();
    CREATE TABLE db.t2();
    CREATE TABLE db.t3();
    CREATE TABLE db.t4();
    CREATE TABLE db.t5();
    CREATE TABLE db.t6();
    CREATE TABLE db.t7();
    CREATE TABLE db.t8();
    CREATE TABLE db.t9();
    ALTER TABLE db.t5 CONFIGURE ZONE using num_replicas = 42;
    ----

    # If installing a unique zone config for a table in the middle, we
    # should observe three splits (before, the table itself, and after).

    diff offset=48
    ----
    --- gossiped system config span (legacy)
    +++ span config infrastructure (current)
    ...
     /Tenant/10                                 database system (tenant)
     /Tenant/11                                 database system (tenant)
    +/Tenant/11/Table/106                       range default
    +/Tenant/11/Table/111                       num_replicas=42
    +/Tenant/11/Table/112                       range default

This PR introduces two cluster settings to selectively opt into this
optimization: spanconfig.{tenant,host}_coalesce_adjacent.enabled
(defaulting to true and false respectively). We also don't coalesce
system table ranges on the host tenant. We had a few implementation
choices here:

(a) Down in KV, augment the spanconfig.Store to coalesce the in-memory
    state when updating entries. On every update, we'd check if the span
    we're writing to has the same config as the preceding and/or
    succeeding one, and if so, write out a larger span instead. We
    previously prototyped a form of this in #68491.

    Pros:
    - reduced memory footprint of spanconfig.Store
    - read path (computing splits) stays fast -- just have to look up
      the right span from the backing interval tree
    Cons:
    - uses more persisted storage than necessary
    - difficult to disable the optimization dynamically (though still
      possible -- we'd effectively have to restart the KVSubscriber and
      populate in-memory span config state using a store that
      does/doesn't coalesce configs)
    - difficult to implement; our original prototype did not have #73150
      which is important for reducing reconciliation round trips

(b) Same as (a) but coalesce configs up in the spanconfig.Store
    maintained in reconciler itself.

    Pros:
    - reduced storage use (both persisted and in-memory)
    - read path (computing splits) stays fast -- just have to look up
      the right span from the backing interval tree
    Cons:
    - very difficult to disable the optimization dynamically (each
      tenant process would need to be involved)
    - most difficult to implement

(c) Same as (a) but through another API on the spanconfig.Store
    interface that accepts only a single update at a time and does not
    generate deltas (not something we need down in KV). Removes the
    implementation complexity.

(d) Keep the contents of `system.span_configurations` and the in-memory
    state of spanconfig.Stores as it is today, uncoalesced. When
    determining split points, iterate through adjacent configs within
    the provided key bounds and see whether we could ignore certain
    split keys.

    Pros:
    - easiest to implement
    - easy to disable the optimization dynamically, for ex. through a
      cluster setting
    Cons:
    - uses more storage (persisted and in-memory) than necessary
    - read path (computing splits) is more expensive if iterating
      through adjacent configs

This PR implements option (d). For a benchmark on how slow (d) is going
to be in practice with varying numbers of entries to be scanning over
(10k, 100k, 1m):
```
    $ dev bench pkg/spanconfig/spanconfigstore -f=BenchmarkStoreComputeSplitKey -v
    BenchmarkStoreComputeSplitKey
    BenchmarkStoreComputeSplitKey/num-entries=10000
    BenchmarkStoreComputeSplitKey/num-entries=10000-10               1166842 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=100000
    BenchmarkStoreComputeSplitKey/num-entries=100000-10             12273375 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=1000000
    BenchmarkStoreComputeSplitKey/num-entries=1000000-10           140591766 ns/op
    PASS
```
It's feasible that in the future we re-work this in favor of (c).

---

**spanconfig/store: use templatized btree impl instead**

```
   $ dev bench pkg/spanconfig/spanconfigstore -f=BenchmarkStoreComputeSplitKey -v
    BenchmarkStoreComputeSplitKey
    BenchmarkStoreComputeSplitKey/num-entries=10000
    BenchmarkStoreComputeSplitKey/num-entries=10000-10                431093 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=100000
    BenchmarkStoreComputeSplitKey/num-entries=100000-10              4308200 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=1000000
    BenchmarkStoreComputeSplitKey/num-entries=1000000-10            43827373 ns/op
    PASS

    $ benchstat old.txt new.txt # from previous commit
    name                                         old time/op  new time/op  delta
    StoreComputeSplitKey/num-entries=10000-10    1.17ms ± 0%  0.43ms ± 0%   ~     (p=1.000 n=1+1)
    StoreComputeSplitKey/num-entries=100000-10   12.4ms ± 0%   4.3ms ± 0%   ~     (p=1.000 n=1+1)
    StoreComputeSplitKey/num-entries=1000000-10   136ms ± 0%    44ms ± 0%   ~     (p=1.000 n=1+1)
```

---

**spanconfig/store: intern span configs**

- Avoid the expensive proto.Equal() when computing split keys;
- Reduce memory overhead of the data structure
```
    $ dev bench pkg/spanconfig/spanconfigstore -f=BenchmarkStoreComputeSplitKey -v
    BenchmarkStoreComputeSplitKey
    BenchmarkStoreComputeSplitKey/num-entries=10000
    BenchmarkStoreComputeSplitKey/num-entries=10000-10                 90323 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=100000
    BenchmarkStoreComputeSplitKey/num-entries=100000-10               915936 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=1000000
    BenchmarkStoreComputeSplitKey/num-entries=1000000-10             9575781 ns/op

    $ benchstat old.txt new.txt # from previous commit
    name                                         old time/op  new time/op  delta
    StoreComputeSplitKey/num-entries=10000-10     431µs ± 0%    90µs ± 0%   ~     (p=1.000 n=1+1)
    StoreComputeSplitKey/num-entries=100000-10   4.31ms ± 0%  0.92ms ± 0%   ~     (p=1.000 n=1+1)
    StoreComputeSplitKey/num-entries=1000000-10  43.8ms ± 0%   9.6ms ± 0%   ~     (p=1.000 n=1+1)
```

Release note: None
Release justification: high benefit change to existing functionality
(affecting only multi-tenant clusters).

Co-authored-by: Jane Xing <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Apr 29, 2022
Fixes #72389.
Fixes #66063 (gated behind a cluster setting).

This should drastically reduce the total number of ranges in the system,
especially when running with a large number of tables and/or tenants. To
understand what the new set of split points are, consider the following
test snippet:

    exec-sql tenant=11
    CREATE DATABASE db;
    CREATE TABLE db.t0();
    CREATE TABLE db.t1();
    CREATE TABLE db.t2();
    CREATE TABLE db.t3();
    CREATE TABLE db.t4();
    CREATE TABLE db.t5();
    CREATE TABLE db.t6();
    CREATE TABLE db.t7();
    CREATE TABLE db.t8();
    CREATE TABLE db.t9();
    ALTER TABLE db.t5 CONFIGURE ZONE using num_replicas = 42;
    ----

    # If installing a unique zone config for a table in the middle, we
    # should observe three splits (before, the table itself, and after).

    diff offset=48
    ----
    --- gossiped system config span (legacy)
    +++ span config infrastructure (current)
    ...
     /Tenant/10                                 database system (tenant)
     /Tenant/11                                 database system (tenant)
    +/Tenant/11/Table/106                       range default
    +/Tenant/11/Table/111                       num_replicas=42
    +/Tenant/11/Table/112                       range default

This PR introduces two cluster settings to selectively opt into this
optimization: spanconfig.{tenant,host}_coalesce_adjacent.enabled
(defaulting to true and false respectively). We also don't coalesce
system table ranges on the host tenant. We had a few implementation
choices here:

(a) Down in KV, augment the spanconfig.Store to coalesce the in-memory
    state when updating entries. On every update, we'd check if the span
    we're writing to has the same config as the preceding and/or
    succeeding one, and if so, write out a larger span instead. We
    previously prototyped a form of this in #68491.

    Pros:
    - reduced memory footprint of spanconfig.Store
    - read path (computing splits) stays fast -- just have to look up
      the right span from the backing interval tree
    Cons:
    - uses more persisted storage than necessary
    - difficult to disable the optimization dynamically (though still
      possible -- we'd effectively have to restart the KVSubscriber and
      populate in-memory span config state using a store that
      does/doesn't coalesce configs)
    - difficult to implement; our original prototype did not have #73150
      which is important for reducing reconciliation round trips

(b) Same as (a) but coalesce configs up in the spanconfig.Store
    maintained in reconciler itself.

    Pros:
    - reduced storage use (both persisted and in-memory)
    - read path (computing splits) stays fast -- just have to look up
      the right span from the backing interval tree
    Cons:
    - very difficult to disable the optimization dynamically (each
      tenant process would need to be involved)
    - most difficult to implement

(c) Same as (a) but through another API on the spanconfig.Store
    interface that accepts only a single update at a time and does not
    generate deltas (not something we need down in KV). Removes the
    implementation complexity.

(d) Keep the contents of `system.span_configurations` and the in-memory
    state of spanconfig.Stores as it is today, uncoalesced. When
    determining split points, iterate through adjacent configs within
    the provided key bounds and see whether we could ignore certain
    split keys.

    Pros:
    - easiest to implement
    - easy to disable the optimization dynamically, for ex. through a
      cluster setting
    Cons:
    - uses more storage (persisted and in-memory) than necessary
    - read path (computing splits) is more expensive if iterating
      through adjacent configs

----

This PR implements option (d). For a benchmark on how slow (d) is going
to be in practice with varying numbers of entries to be scanning over
(10k, 100k, 1m):

    $ dev bench pkg/spanconfig/spanconfigstore -f=BenchmarkStoreComputeSplitKey -v

    BenchmarkStoreComputeSplitKey
    BenchmarkStoreComputeSplitKey/num-entries=10000
    BenchmarkStoreComputeSplitKey/num-entries=10000-10               1166842 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=100000
    BenchmarkStoreComputeSplitKey/num-entries=100000-10             12273375 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=1000000
    BenchmarkStoreComputeSplitKey/num-entries=1000000-10           140591766 ns/op
    PASS

It's feasible that in the future we re-work this in favor of (c)
possibly.

Release note: None
Release justification: high benefit change to existing functionality
(affecting only multi-tenant clusters).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants