Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: storage: Support scattering replicas in addition to leases #26438

Closed
wants to merge 3 commits into from

Conversation

a-robinson
Copy link
Contributor

Fixes #23358

Release note: TBD

This currently isn't a very satisfying experience. It doesn't scatter things as widely as I'd like, with a lot of replicas for a table still winding up on the same handful of nodes that they were predominantly on to begin with. It helps, as documented on #26059, especially if you run it multiple times and on all of your tables rather than just one, but needs to be better. I think that putting everything through the allocator logic, while necessary for respecting zone constraints, is hurting the variety of stores on which things end up.

Sending out as a WIP because I'm going to focus on improving the actual rebalancing logic as the first-order goal, with this being a less important addition to follow up on later. It still needs an option added to the SQL grammar so that it isn't used every time SCATTER is run.

@a-robinson a-robinson requested review from a team June 5, 2018 19:30
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@bdarnell
Copy link
Contributor

bdarnell commented Jun 6, 2018

Review status: 0 of 7 files reviewed at latest revision, all discussions resolved, some commit checks failed.


pkg/sql/scatter.go, line 139 at r2 (raw file):

		RequestHeader:   roachpb.RequestHeader{Key: n.run.span.Key, EndKey: n.run.span.EndKey},
		RandomizeLeases: true,
		// TODO: Require an additional option for this.

Why? I'd think that normally when you issue a scatter command you'd want both.


Comments from Reviewable

@a-robinson
Copy link
Contributor Author

Review status: 0 of 7 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/sql/scatter.go, line 139 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Why? I'd think that normally when you issue a scatter command you'd want both.

Primarily because moving so many replicas is a lot more resource-intensive than just moving leases, and if anyone has made SCATTER a part of their workflow they could be in for quite a surprise when they run it after upgrading to 2.1.

Perhaps that's overly protective, though. I'm certainly open to leaving it like this.


Comments from Reviewable

@bdarnell
Copy link
Contributor

bdarnell commented Jun 6, 2018

Review status: 0 of 7 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/sql/scatter.go, line 139 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Primarily because moving so many replicas is a lot more resource-intensive than just moving leases, and if anyone has made SCATTER a part of their workflow they could be in for quite a surprise when they run it after upgrading to 2.1.

Perhaps that's overly protective, though. I'm certainly open to leaving it like this.

Yeah, it's more expensive, but I think that's what people generally want when they're calling this. We could have a separate lease-only option if there's concern about this, but I'm not sure if anyone would use it (check with Robert).


Comments from Reviewable

This avoids a flood of snapshots all at once and allows later
rebalancing decisions to be based on updated store info from the earlier
decisions, rather than all the decisions happening at about the same
time with the same information, which will make the same few nodes look
most desirable.

Release note: None
@petermattis
Copy link
Collaborator

Review status: 0 of 7 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/sql/scatter.go, line 139 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yeah, it's more expensive, but I think that's what people generally want when they're calling this. We could have a separate lease-only option if there's concern about this, but I'm not sure if anyone would use it (check with Robert).

I'm ok with not adding an option for this. Prior to this PR ALTER TABLE ... SCATTER would move replicas if there was an imbalance (it looped until the replicate queue decided there was nothing to do). Yes, forcing replica rebalancing is a change in behavior, but I don't think it is drastic. One thing I'm interested in is how this affects stuff like kv --splits=1000. Does the forced rebalancing significantly slow down that operation? That might be an argument for having an ALTER TABLE ... SCATTER LEASES.


pkg/storage/replica_command.go, line 2121 at r3 (raw file):

		}
		numReplicas := int(zone.NumReplicas)
		replicasToAdd := rand.Intn(liveStores) + 1 - numReplicas

Rather than liveStores, don't you want to know how many candidate stores there are given the zone config constraints?


pkg/storage/replica_command.go, line 2126 at r3 (raw file):

		} else if replicasToAdd > numReplicas {
			replicasToAdd = numReplicas
		}

So you're adding up to N more replicas and then letting replicate queue chose which ones to remove. What is preventing replicate queue from removing the newly added replicas? Does this approach work well in practice? I don't have a better suggestion. The code could you a comment describing what it is doing.


pkg/storage/replica_command.go, line 2130 at r3 (raw file):

		var replicasAdded int
		for re := retry.StartWithCtx(ctx, retryOpts); re.Next() && replicasAdded < replicasToAdd; {
			requeue, err := rq.processOneChange(ctx, r, sysCfg, func() bool { return false } /* canTransferLease */, false /* dryRun */, true /* disableStatsBasedRebalancing */, true /* forceAddReplica */)

That's a lot of boolean arguments. I think you should change processOneChange to take an options struct. I see you've already added replicateQueueOptions, but it appears to be unused.


Comments from Reviewable

a-robinson added a commit to a-robinson/cockroach that referenced this pull request Jun 11, 2018
As outlined in recent comments on cockroachdb#26059, we need to bring back some
form of stats-based rebalancing in order to perform well on TPC-C
without manual partitioning and replica placement.

This commit contains a prototype that demonstrates the effectiveness of
changing our approach to making rebalancing decisions from making them
in the replicate queue, which operates on arbitrarily ordered replicas
of the ranges on a store, to making them at a higher-level. This
prototype makes them at a cluster level by running the logic on only one
node, but my real proposal is to make them at the store level.

This change in abstraction reflects what a human would do if asked to
even out the load on a cluster given perfect information about
everything happening in the cluster:

1. First, determine which stores have the most load on them (or overfull
   -- but for the prototype I only considered the one dimension that
   affects TPC-C the most)
2. Decide whether the most loaded stores are so overloaded that action
   needs to be taken.
3. Examine the hottest replicas on the store (maybe not the absolute
   hottest in practice, since moving that one could disrupt user traffic,
   but in the prototype this seems to work fine) and attempt to move them
   to under-utilized stores.  If this can be done simply by transferring
   leases to under-utilized stores, then do so. If moving leases isn't
   enough, then also rebalance replicas from the hottest store to
   under-utilized stores.
4. Repeat periodically to handle changes in load or cluster membership.

In a real versino of this code, the plan is roughly:
1. Each store will independently run their own control loop like this
   that is only responsible for moving leases/replicas off itself, not off
   other stores. This avoids needing a centralized coordinator, and will
   avoid the need to use the raft debug endpoint as long as we start
   gossiping QPS per store info, since the store already has details about
   the replicas on itself.
2. The existing replicate queue will stop making decisions motivated by
   balance. It will switch to only making decisions based on
   constraints/diversity/lease preferences, which is still needed since
   the new store-level logic will only check for store-level balance,
   not that all replicas' constraints are properly met.
3. The new code will have to avoid violating constraints/diversity/lease
   preferences.
4. The new code should consider range count, disk fullness, and maybe
   writes per second as well.
5. In order to avoid making decisions based on bad data, I'd like to
   extend lease transfers to pass along QPS data to the new leaseholder
   and preemptive snapshots to pass along WPS data to the new replica.
   This may not be strictly necessary, as shown by the success of this
   prototype, but should make for more reliable decision making.

I tested this out on TPC-C 5k on 15 nodes and am able to consistently
get 94% efficiency, which is the max I've seen using a build of the
workload generator that erroneously includes the ramp-up period in its
final stats. The first run with this code only got 85% because it took a
couple minutes to make all the lease transfers it wanted, but then all
subsequent runs got the peak efficiency while making negligibly few
lease transfers.

Note that I didn't even have to implement replica rebalancing to get
these results, which oddly contradicts my previous claims. However, I
believe that's because I did the initial split/scatter using a binary
containing cockroachdb#26438, so the replicas were already better scattered than by
default. I ran TPC-C on that build without these changes a couple times,
though, and didn't get better than 65% efficiency, so the scatter wasn't
the cause of the good results here.

Touches cockroachdb#26059, cockroachdb#17979

Release note: None
a-robinson added a commit to a-robinson/cockroach that referenced this pull request Jun 11, 2018
As outlined in recent comments on cockroachdb#26059, we need to bring back some
form of stats-based rebalancing in order to perform well on TPC-C
without manual partitioning and replica placement.

This commit contains a prototype that demonstrates the effectiveness of
changing our approach to making rebalancing decisions from making them
in the replicate queue, which operates on arbitrarily ordered replicas
of the ranges on a store, to making them at a higher-level. This
prototype makes them at a cluster level by running the logic on only one
node, but my real proposal is to make them at the store level.

This change in abstraction reflects what a human would do if asked to
even out the load on a cluster given perfect information about
everything happening in the cluster:

1. First, determine which stores have the most load on them (or overfull
   -- but for the prototype I only considered the one dimension that
   affects TPC-C the most)
2. Decide whether the most loaded stores are so overloaded that action
   needs to be taken.
3. Examine the hottest replicas on the store (maybe not the absolute
   hottest in practice, since moving that one could disrupt user traffic,
   but in the prototype this seems to work fine) and attempt to move them
   to under-utilized stores.  If this can be done simply by transferring
   leases to under-utilized stores, then do so. If moving leases isn't
   enough, then also rebalance replicas from the hottest store to
   under-utilized stores.
4. Repeat periodically to handle changes in load or cluster membership.

In a real versino of this code, the plan is roughly:
1. Each store will independently run their own control loop like this
   that is only responsible for moving leases/replicas off itself, not off
   other stores. This avoids needing a centralized coordinator, and will
   avoid the need to use the raft debug endpoint as long as we start
   gossiping QPS per store info, since the store already has details about
   the replicas on itself.
2. The existing replicate queue will stop making decisions motivated by
   balance. It will switch to only making decisions based on
   constraints/diversity/lease preferences, which is still needed since
   the new store-level logic will only check for store-level balance,
   not that all replicas' constraints are properly met.
3. The new code will have to avoid violating constraints/diversity/lease
   preferences.
4. The new code should consider range count, disk fullness, and maybe
   writes per second as well.
5. In order to avoid making decisions based on bad data, I'd like to
   extend lease transfers to pass along QPS data to the new leaseholder
   and preemptive snapshots to pass along WPS data to the new replica.
   This may not be strictly necessary, as shown by the success of this
   prototype, but should make for more reliable decision making.

I tested this out on TPC-C 5k on 15 nodes and am able to consistently
get 94% efficiency, which is the max I've seen using a build of the
workload generator that erroneously includes the ramp-up period in its
final stats. The first run with this code only got 85% because it took a
couple minutes to make all the lease transfers it wanted, but then all
subsequent runs got the peak efficiency while making negligibly few
lease transfers.

Note that I didn't even have to implement replica rebalancing to get
these results, which oddly contradicts my previous claims. However, I
believe that's because I did the initial split/scatter using a binary
containing cockroachdb#26438, so the replicas were already better scattered than by
default. I ran TPC-C on that build without these changes a couple times,
though, and didn't get better than 65% efficiency, so the scatter wasn't
the cause of the good results here.

Touches cockroachdb#26059, cockroachdb#17979

Release note: None
a-robinson added a commit to a-robinson/cockroach that referenced this pull request Jun 14, 2018
As outlined in recent comments on cockroachdb#26059, we need to bring back some
form of stats-based rebalancing in order to perform well on TPC-C
without manual partitioning and replica placement.

This commit contains a prototype that demonstrates the effectiveness of
changing our approach to making rebalancing decisions from making them
in the replicate queue, which operates on arbitrarily ordered replicas
of the ranges on a store, to making them at a higher-level. This
prototype makes them at a cluster level by running the logic on only one
node, but my real proposal is to make them at the store level.

This change in abstraction reflects what a human would do if asked to
even out the load on a cluster given perfect information about
everything happening in the cluster:

1. First, determine which stores have the most load on them (or overfull
   -- but for the prototype I only considered the one dimension that
   affects TPC-C the most)
2. Decide whether the most loaded stores are so overloaded that action
   needs to be taken.
3. Examine the hottest replicas on the store (maybe not the absolute
   hottest in practice, since moving that one could disrupt user traffic,
   but in the prototype this seems to work fine) and attempt to move them
   to under-utilized stores.  If this can be done simply by transferring
   leases to under-utilized stores, then do so. If moving leases isn't
   enough, then also rebalance replicas from the hottest store to
   under-utilized stores.
4. Repeat periodically to handle changes in load or cluster membership.

In a real versino of this code, the plan is roughly:
1. Each store will independently run their own control loop like this
   that is only responsible for moving leases/replicas off itself, not off
   other stores. This avoids needing a centralized coordinator, and will
   avoid the need to use the raft debug endpoint as long as we start
   gossiping QPS per store info, since the store already has details about
   the replicas on itself.
2. The existing replicate queue will stop making decisions motivated by
   balance. It will switch to only making decisions based on
   constraints/diversity/lease preferences, which is still needed since
   the new store-level logic will only check for store-level balance,
   not that all replicas' constraints are properly met.
3. The new code will have to avoid violating constraints/diversity/lease
   preferences.
4. The new code should consider range count, disk fullness, and maybe
   writes per second as well.
5. In order to avoid making decisions based on bad data, I'd like to
   extend lease transfers to pass along QPS data to the new leaseholder
   and preemptive snapshots to pass along WPS data to the new replica.
   This may not be strictly necessary, as shown by the success of this
   prototype, but should make for more reliable decision making.

I tested this out on TPC-C 5k on 15 nodes and am able to consistently
get 94% efficiency, which is the max I've seen using a build of the
workload generator that erroneously includes the ramp-up period in its
final stats. The first run with this code only got 85% because it took a
couple minutes to make all the lease transfers it wanted, but then all
subsequent runs got the peak efficiency while making negligibly few
lease transfers.

Note that I didn't even have to implement replica rebalancing to get
these results, which oddly contradicts my previous claims. However, I
believe that's because I did the initial split/scatter using a binary
containing cockroachdb#26438, so the replicas were already better scattered than by
default. I ran TPC-C on that build without these changes a couple times,
though, and didn't get better than 65% efficiency, so the scatter wasn't
the cause of the good results here.

Touches cockroachdb#26059, cockroachdb#17979

Release note: None
a-robinson added a commit to a-robinson/cockroach that referenced this pull request Jun 14, 2018
With this update, TPC-C 10k on 30 went from overloaded to running at
peak efficiency over the course of about 4 hours (the manual
partitioning approach takes many hours to move all the replicas as well,
for a point of comparison). This is without having to run the replica
scatter from cockroachdb#26438.

Doing a 5 minute run to get a result that doesn't include all the
rebalancing time shows:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  290.9s   124799.1  97.0%    548.6    486.5    872.4   1140.9   2281.7  10200.5

I think it may have a small bug in it still, since at one point early on
one of the replicas from the warehouse table on the node doing the
relocating thought that it had 16-17k QPS, which wasn't true by any
other metric in the system. Restarting the node fixed it though.
I'm not too concerned about the bug, since I assume I just made a code
mistake, not that anything about the approach fundamentally leads to a
random SQL table replica gets 10s of thousands of QPS.

Range 1 is also back to getting a ton of QPS (~3k) even though I raised
the range cache size from 1M to 50M. Looking at slow query traces shows
a lot of range lookups, way more than I'd expect given that ranges
weren't moving around at the time of the traces.

Release note: None
@a-robinson
Copy link
Contributor Author

I was using this while testing out load-based lease rebalancing and ran into a panic:

I180802 06:28:10.925661 3983220 storage/store.go:2551  [replicaGC,n1,s1,r7476/2:/Table/57/1/2{89/757…-90/115…}] removing replica
I180802 06:28:10.928514 3327671 util/mon/bytes_usage.go:592  [n1,client=10.142.0.36:46658,user=root] txn: bytes usage increases to 1.7 MiB (+1781760)
I180802 06:28:10.930257 3983220 storage/replica.go:842  [replicaGC,n1,s1,r7476/2:/Table/57/1/2{89/757…-90/115…}] removed 103972 (103965+7) keys in 4ms [clear=2ms commit=2ms]
E180802 06:28:10.931386 1768951 sql/conn_executor.go:657  [n1,client=127.0.0.1:48520,user=root] a SQL panic has occurred while executing "ALTER TABLE tpcc.public.stock SCATTER": context canceled
I180802 06:28:10.934749 3327671 util/mon/bytes_usage.go:592  [n1,client=10.142.0.36:46658,user=root] txn: bytes usage increases to 3.2 MiB (+3389440)
E180802 06:28:10.936822 1768951 util/log/crash_reporting.go:203  [n1,client=127.0.0.1:48520,user=root] a panic has occurred!
E180802 06:28:10.951230 1768951 util/log/crash_reporting.go:477  [n1,client=127.0.0.1:48520,user=root] Reported as error a1f4e80dd4e04ca699587d56f4eeb48d
panic: context canceled [recovered]
        panic: panic while executing 1 statements: ALTER TABLE _._._ SCATTER; caused by context canceled

goroutine 1768951 [running]:
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).closeWrapper(0xc436288000, 0x2d41880, 0xc42ce7ce80, 0x278aa00, 0xc42a1002d0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:671 +0x36f
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn.func1(0xc436288000, 0x2d41880, 0xc42ce7ce80)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:413 +0x61
panic(0x278aa00, 0xc42a1002d0)
        /usr/local/go/src/runtime/panic.go:502 +0x229
github.com/cockroachdb/cockroach/pkg/kv.(*RangeIterator).Desc(0xc431f08640, 0x2d41940)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/range_iter.go:70 +0x6a
github.com/cockroachdb/cockroach/pkg/sql.(*scatterNode).Next(0xc42c1891a0, 0x2d41940, 0xc42a1ab680, 0xc4362884c0, 0xc436288430, 0xc421d74c01, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/scatter.go:164 +0xb4
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).forEachRow(0xc436288000, 0x2d41940, 0xc42a1ab680, 0xc4362884c0, 0xc436288430, 0x2d335c0, 0xc42c1891a0, 0xc429492b28, 0x19c42c1, 0x2d41940)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:945 +0xb0
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithLocalEngine(0xc436288000, 0x2d41940, 0xc42a1ab680, 0xc436288430, 0x3, 0x7fd4bf799350, 0xc428e28b40, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:887 +0x286
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine(0xc436288000, 0x2d41940, 0xc42a1ab680, 0x2d441c0, 0xc42ce7cf40, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:802 +0xa8f
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState(0xc436288000, 0x2d41940, 0xc42a1ab680, 0x2d441c0, 0xc42ce7cf40, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:401 +0xb1e
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(0xc436288000, 0x2d41940, 0xc42a1ab680, 0x2d441c0, 0xc42ce7cf40, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:95 +0x358
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc436288000, 0x2d41880, 0xc42ce7ce80, 0xc4272c1220, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1078 +0x2055
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0xc42058cea0, 0x2d41880, 0xc42ce7ce80, 0x0, 0x0, 0xc462b11050, 0x4, 0xc462b1103d, 0xd, 0x2d22e40, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:415 +0x1bb
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl.func3(0xc42058cea0, 0x2d41880, 0xc42ce7ce80, 0xc450fa8a80, 0x5400, 0x15000, 0xc4208a4600, 0xc4272c1220, 0xc4272c11f0, 0xc43019a830, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:267 +0x122
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:266 +0xf27

a-robinson added a commit to a-robinson/cockroach that referenced this pull request Aug 3, 2018
As outlined in recent comments on cockroachdb#26059, we need to bring back some
form of stats-based rebalancing in order to perform well on TPC-C
without manual partitioning and replica placement.

This commit contains a prototype that demonstrates the effectiveness of
changing our approach to making rebalancing decisions from making them
in the replicate queue, which operates on arbitrarily ordered replicas
of the ranges on a store, to making them at a higher-level. This
prototype makes them at a cluster level by running the logic on only one
node, but my real proposal is to make them at the store level.

This change in abstraction reflects what a human would do if asked to
even out the load on a cluster given perfect information about
everything happening in the cluster:

1. First, determine which stores have the most load on them (or overfull
   -- but for the prototype I only considered the one dimension that
   affects TPC-C the most)
2. Decide whether the most loaded stores are so overloaded that action
   needs to be taken.
3. Examine the hottest replicas on the store (maybe not the absolute
   hottest in practice, since moving that one could disrupt user traffic,
   but in the prototype this seems to work fine) and attempt to move them
   to under-utilized stores.  If this can be done simply by transferring
   leases to under-utilized stores, then do so. If moving leases isn't
   enough, then also rebalance replicas from the hottest store to
   under-utilized stores.
4. Repeat periodically to handle changes in load or cluster membership.

In a real versino of this code, the plan is roughly:
1. Each store will independently run their own control loop like this
   that is only responsible for moving leases/replicas off itself, not off
   other stores. This avoids needing a centralized coordinator, and will
   avoid the need to use the raft debug endpoint as long as we start
   gossiping QPS per store info, since the store already has details about
   the replicas on itself.
2. The existing replicate queue will stop making decisions motivated by
   balance. It will switch to only making decisions based on
   constraints/diversity/lease preferences, which is still needed since
   the new store-level logic will only check for store-level balance,
   not that all replicas' constraints are properly met.
3. The new code will have to avoid violating constraints/diversity/lease
   preferences.
4. The new code should consider range count, disk fullness, and maybe
   writes per second as well.
5. In order to avoid making decisions based on bad data, I'd like to
   extend lease transfers to pass along QPS data to the new leaseholder
   and preemptive snapshots to pass along WPS data to the new replica.
   This may not be strictly necessary, as shown by the success of this
   prototype, but should make for more reliable decision making.

I tested this out on TPC-C 5k on 15 nodes and am able to consistently
get 94% efficiency, which is the max I've seen using a build of the
workload generator that erroneously includes the ramp-up period in its
final stats. The first run with this code only got 85% because it took a
couple minutes to make all the lease transfers it wanted, but then all
subsequent runs got the peak efficiency while making negligibly few
lease transfers.

Note that I didn't even have to implement replica rebalancing to get
these results, which oddly contradicts my previous claims. However, I
believe that's because I did the initial split/scatter using a binary
containing cockroachdb#26438, so the replicas were already better scattered than by
default. I ran TPC-C on that build without these changes a couple times,
though, and didn't get better than 65% efficiency, so the scatter wasn't
the cause of the good results here.

Touches cockroachdb#26059, cockroachdb#17979

Release note: None

[prototype] storage: Extend new allocator to also move range replicas

With this update, TPC-C 10k on 30 went from overloaded to running at
peak efficiency over the course of about 4 hours (the manual
partitioning approach takes many hours to move all the replicas as well,
for a point of comparison). This is without having to run the replica
scatter from cockroachdb#26438.

Doing a 5 minute run to get a result that doesn't include all the
rebalancing time shows:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  290.9s   124799.1  97.0%    548.6    486.5    872.4   1140.9   2281.7  10200.5

I think it may have a small bug in it still, since at one point early on
one of the replicas from the warehouse table on the node doing the
relocating thought that it had 16-17k QPS, which wasn't true by any
other metric in the system. Restarting the node fixed it though.
I'm not too concerned about the bug, since I assume I just made a code
mistake, not that anything about the approach fundamentally leads to a
random SQL table replica gets 10s of thousands of QPS.

Range 1 is also back to getting a ton of QPS (~3k) even though I raised
the range cache size from 1M to 50M. Looking at slow query traces shows
a lot of range lookups, way more than I'd expect given that ranges
weren't moving around at the time of the traces.

Release note: None

Release note: None
@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label Jun 19, 2019
@bobvawter
Copy link
Contributor

bobvawter commented Mar 10, 2020

CLA assistant check
All committers have signed the CLA.

@a-robinson a-robinson closed this Apr 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
X-noremind Bots won't notify about PRs with X-noremind
Projects
None yet
Development

Successfully merging this pull request may close these issues.

storage: SCATTER should have an option to randomize replicas
6 participants