Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: diagnose performance issues with tpc-c when not using partitioning #26059

Closed
a-robinson opened this issue May 24, 2018 · 19 comments
Closed
Assignees
Labels
A-kv-distribution Relating to rebalancing and leasing. C-performance Perf of queries or internals. Solution not expected to change functional behavior.
Milestone

Comments

@a-robinson
Copy link
Contributor

No description provided.

@a-robinson a-robinson added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-distribution Relating to rebalancing and leasing. labels May 24, 2018
@a-robinson a-robinson added this to the 2.1 milestone May 24, 2018
@a-robinson a-robinson self-assigned this May 24, 2018
@a-robinson
Copy link
Contributor Author

a-robinson commented May 30, 2018

After some initial investigation today, there's one problem so big that it pretty much obscures whatever problems may exist beneath it -- bad replica balance. For the most part, the replicas for any given table mostly fall on the same 3 nodes. The worst offender here is the tpcc.warehouse table, which gets the most load of any table as far as I'm aware (please correct that if I'm wrong). Almost all of the replicas for its 1000 ranges ended up on the same 3 nodes, meaning that the best leaseholder rebalancing in the world couldn't make for good performance. Those 3 nodes are indeed the 3 most loaded along dimensions such as number of BatchRequests processed, CPU usage, and keys written per second, with 1 of them in particular really standing out in goroutines (20k vs 2k for the next highest).

This could be fixed in a number of ways, such as:

  • Load-based rebalancing
  • Trying to diversify the nodes a table's ranges are on when rebalancing
  • Implementing a version of ALTER TABLE ... SCATTER that scatters replicas in addition to leases

The last one should get the quickest results and we've talked about wanting it anyway, so I'll work on that tomorrow and see where it leaves things.


I do have a couple questions about oddities I've noticed, though:

  1. r1 has a lot of qps. Way more qps than any other range (460 qps to it vs 50 to the next hottest range in my latest full measurement, and it appears to be growing over time; it's now over 1000 qps). That's way higher than the cluster's rate of NotLeaseholderErrors. Judging by the fact that its keys written per second is basically, 0, they're all reads. According to the UI, there is no rebalancing activity going on, neither replicas nor leases. I increased the cluster setting for the number of range cache entries from 1M to 10M and it's had no effect. I enabled slow query tracing to try and see which requests were hitting it, and I'm seeing it repeatedly involved in QueryTxn attempts to resolve intents. It looks pretty fishy to me, but I haven't checked against the code yet. Here's an example trace: https://gist.github.com/a-robinson/17ebf561bf450b40e7e89edd967962ca. Is this expected?
  2. The tpcc.warehouse ranges on n11 (the one with the massive number of goroutines and most qps) are getting significantly more qps than those whose leases are on the other two nodes (https://gist.github.com/a-robinson/0aebc9aea3fb042255d3572e52ce63ac). Under normal operation, I'd expect each warehouse to get the same number of qps. Is that a correct understanding of tpc-c? And given that n11 seems to be the worst bottleneck, I'd expect each range on it to have fewer qps, not more, due to the slowdown. Any idea what might be going on here?

Finally, for future reference (cc @piyush-singh), things that would have been nice to have for the sake of debugging the performance issue:

  • Breakdown of which ranges are on each node, i.e. a breakdown of how many are from each table
  • Breakdown of where the replicas for each table are located, i.e. their distribution across the nodes
  • Breakdown of where the leaseholders for each table are located, i.e. their distribution across the nodes
  • List of the ranges receiving the most traffic (QPS and/or keys written per second), where their leaseholders are, where their replicas are, which tables they're from

I ended up manually constructing most of this data with the raft debug endpoint + a few python scripts as well as SHOW TESTING_RANGES + some bash.

@a-robinson
Copy link
Contributor Author

And to be clear, the point of the first paragraph above is that leaseholder rebalancing can't fix the problem by itself if all the warehouse range replicas are on the same 3 nodes.

@petermattis
Copy link
Collaborator

r1 has a lot of qps. Way more qps than any other range (460 qps to it vs 50 to the next hottest range in my latest full measurement, and it appears to be growing over time; it's now over 1000 qps). That's way higher than the cluster's rate of NotLeaseholderErrors. Judging by the fact that it's keys written per second is basically, 0, they're all reads. According to the UI, there is no rebalancing activity going on, neither replicas nor leases. I increased the cluster setting for the number of range cache entries from 1M to 10M and it's had no effect. I enabled slow query tracing to try and see which requests were hitting it, and I'm seeing it repeatedly involved in QueryTxn attempts to resolve intents. It looks pretty fishy to me, but I haven't checked against the code yet. Here's an example trace: https://gist.github.com/a-robinson/17ebf561bf450b40e7e89edd967962ca. Is this expected?

i'd expect r1 to be mostly idle in the steady state. @nvanbenschoten, @tschottdorf any theories?

The tpcc.warehouse ranges on n11 (the one with the massive number of goroutines and most qps) are getting significantly more qps than those whose leases are on the other two nodes (https://gist.github.com/a-robinson/0aebc9aea3fb042255d3572e52ce63ac). Under normal operation, I'd expect each warehouse to get the same number of qps. Is that a correct understanding of tpc-c? And given that n11 seems to be the worst bottleneck, I'd expect each range on it to have fewer qps, not more, due to the slowdown. Any idea what might be going on here?

This is not expected. Each warehouse should receive approximately the same qps. No idea what is going on, but it is worth pulling on this thread. It might be indicative of a bug in the load generator.

Finally, for future reference (cc @piyush-singh), things that would have been nice to have for the sake of debugging the performance issue:

Heh, I came up with a similar list when debugging tpcc performance issues. I believe there are issues filed for many of these. @piyush-singh please double check.

@a-robinson
Copy link
Contributor Author

This is not expected. Each warehouse should receive approximately the same qps. No idea what is going on, but it is worth pulling on this thread. It might be indicative of a bug in the load generator.

I can't confirm it right now since I tore the cluster down when I was done last night, but I think it could conceivably be due to errors from that node. I recall there being a fairly large (albeit spiky) amount of "Replica Errors" in the UI's graph of RPC errors. It's possible that those were mostly coming from n11's replicas, and that retries were then bumping up the count of batches sent to n11's replicas.

@bdarnell
Copy link
Contributor

Implementing a version of ALTER TABLE ... SCATTER that scatters replicas in addition to ranges

What does "scatters replicas in addition to ranges" mean?

r1 has a lot of qps.

That's surprising. This is probably due to overly-broad cache invalidation when the DistSender gets a NotLeaseHolder error (or a few other things). It sounds like we might be invalidating both meta1 and meta2 when we should only be invalidating meta2.

@piyush-singh
Copy link

We will be showing location of replicas in a debug page shortly (#24855), and @vilterp and I have discussed using this layout to similarly show leaseholder counts.

@petermattis mentioned similar requests for information about ranges in #23379 (nodes to which they were replicated, size, load). I'll make a note to think through how best to show ranges in the UI for our last milestone since we'll be making some layout changes anyways.

Thanks for the write up and detailed requests @a-robinson

@nvanbenschoten
Copy link
Member

r1 has a lot of qps. Way more qps than any other range (460 qps to it vs 50 to the next hottest range in my latest full measurement, and it appears to be growing over time; it's now over 1000 qps). That's way higher than the cluster's rate of NotLeaseholderErrors. Judging by the fact that it's keys written per second is basically, 0, they're all reads. According to the UI, there is no rebalancing activity going on, neither replicas nor leases. I increased the cluster setting for the number of range cache entries from 1M to 10M and it's had no effect. I enabled slow query tracing to try and see which requests were hitting it, and I'm seeing it repeatedly involved in QueryTxn attempts to resolve intents. It looks pretty fishy to me, but I haven't checked against the code yet. Here's an example trace: https://gist.github.com/a-robinson/17ebf561bf450b40e7e89edd967962ca. Is this expected?

The trace is pretty suspicious for this issue. As you mentioned, we see a large number of QueryTxn requests in the trace. A lot of these something like:

event:[n11,s11,r55510/2:/Table/55/1/9{00-10}] no pusher found; backing off
 30575.736ms    390.389ms                                operation:dist sender
 30575.767ms      0.030ms                                event:[s11,r55510/2:/Table/55/1/9{00-10},n11] querying next range at /Min
 30575.806ms      0.040ms                                event:[s11,r55510/2:/Table/55/1/9{00-10},n11] r1: sending batch 1 QueryTxn to (n20,s19):8
 30575.827ms      0.021ms                                event:[s11,r55510/2:/Table/55/1/9{00-10},n11] sending request to alex-tpcc-0002:26257
 30575.848ms      0.021ms                                    operation:/cockroach.roachpb.Internal/Batch component:gRPC sb:1 span.kind:client txnid:779a7ccb-5b5d-4a8c-8c02-33a5937f9510
 30576.756ms      0.908ms                                        operation:/cockroach.roachpb.Internal/Batch component:gRPC sb:1 span.kind:server txnid:779a7ccb-5b5d-4a8c-8c02-33a5937f9510
 30576.776ms      0.019ms                                        event:[n20] 1 QueryTxn
 30576.781ms      0.005ms                                        event:[n20,s19] executing 1 requests
 30626.903ms     50.122ms                                        event:[n20,s19,r1/8:/{Min-System/}] read-only path
 30626.914ms      0.011ms                                        event:[n20,s19,r1/8:/{Min-System/}] command queue
 30626.918ms      0.004ms                                        event:[n20,s19,r1/8:/{Min-System/}] waiting for read lock
 30626.970ms      0.052ms                                        event:[n20,s19,r1/8:/{Min-System/}] read completed

Note that they're addressed to /Min. This trace is coming from

updatedPusher, waitingTxns, pErr = q.queryTxnStatus(
ctx, pusher.TxnMeta, true, waitingTxns, q.store.Clock().Now(),
)
and I suspect that we're querying a transaction that does not have a record yet (i.e. pusher.TxnMeta.Key == nil). For what it's worth, @spencerkimball has been dealing with issues like this lately with some of his contentionQueue changes.

things that would have been nice to have for the sake of debugging the performance issue:

@vilterp's replica matrix (#24855) would address at least the first two of these needs. Might be worth trying it out here.

@a-robinson
Copy link
Contributor Author

What does "scatters replicas in addition to ranges" mean?

Sorry, I guess I can't type late at night. I meant "scatters replicas in addition to leases".

@spencerkimball
Copy link
Member

Yes, we would be seeing spurious QueryTxn requests to r1 in the current code because the txn wait queue isn't checking whether there's a key available for the pusher before querying it for dependencies. The lack of a key addresses the QueryTxn request to r1. This is a one line fix as part of my current PR, but I'll break it out and send it for review.

craig bot pushed a commit that referenced this issue May 30, 2018
26238: storage: avoid querying pusher where transaction record can't exist r=spencerkimball a=spencerkimball

This change checks whether a txn has a non-nil key before querying it
if it's waiting on an extant transaction which owns a conflicting intent.
Previously, the code would query the pusher's txn, even if the key was
nil, which would just send a spurious `QueryTxn` request to the first
range in the keyspace.

See #26059

Release note: None

Co-authored-by: Spencer Kimball <[email protected]>
craig bot pushed a commit that referenced this issue May 31, 2018
26245: backport-2.0: storage: avoid querying pusher where transaction record can't exist r=tschottdorf a=tschottdorf

Backport 1/1 commits from #26238.

/cc @cockroachdb/release

---

This change checks whether a txn has a non-nil key before querying it
if it's waiting on an extant transaction which owns a conflicting intent.
Previously, the code would query the pusher's txn, even if the key was
nil, which would just send a spurious `QueryTxn` request to the first
range in the keyspace.

See #26059

Release note: None


Co-authored-by: Spencer Kimball <[email protected]>
@a-robinson
Copy link
Contributor Author

a-robinson commented May 31, 2018

Before scattering replicas, running tpc-c 5k on 15 nodes:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  300.3s    29544.7  45.9%  26271.4  12348.0  85899.3 103079.2 103079.2 103079.2

After scattering replicas (and leases), running tpc-c 5k on 15 nodes:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  300.4s    61047.7  94.9%   1106.1    117.4    369.1   9663.7  19327.4  28991.0

So if our goal is to have a usable workaround, scattering replicas basically gets us there. And if we want to remove the manual aspect of it, it looks like it'll be more impactful to work on stats-based replica rebalancing than lease rebalancing.

@petermattis
Copy link
Collaborator

@a-robinson Nice improvement. The p99 and pMax latency looks higher than what I remember for tpcc-5k. Would be interesting to see how this compares to tpcc-5k with partitioning.

@a-robinson
Copy link
Contributor Author

Running with --ramp=30s to ramp up the traffic more smoothly seems to help a lot with the tail latencies:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  300.4s    60544.4  94.2%    143.7    134.2    218.1    260.0    453.0   4831.8

Of course, the sample sizes here aren't exactly large, so maybe the difference is just noise.

And to follow up on my questions from the other day:

  1. r1 isn't by far the hottest range anymore. Its qps hasn't dropped to 0, either, though. The slow query log hasn't been nearly as illuminating this time, just showing a handful of normal range-lookup events. It's weird it hasn't dropped closer to 0, but it's not causing any problems anymore.
  2. All the warehouse ranges are now down around 2 qps, with no outliers. So the outliers the other day probably were caused by error retries.

The hottest ranges in the cluster at this point are all from the tpcc.new_order table, followed by the tpcc.item table. Top 50 ranges in case you're curious, with the rangeIDs in the mid-hundreds being from new_order and the 30-thousands from item:

idx rangeID    QPS                nodeIDs     leaseholder node
  1:   496      217.292003       [5, 6, 4]      6
  2:   793      216.760419      [5, 10, 6]      5
  3:   653      216.523500      [5, 10, 6]      10
  4:   448      216.106647      [5, 10, 6]      10
  5:   557      215.943809      [15, 3, 1]      15
  6:   458      215.132572      [15, 12, 1]     15
  7:   533      193.794032      [5, 15, 4]      5
  8:   730      188.153795       [5, 6, 3]      6
  9:   293      159.987734      [10, 6, 9]      9
 10:   413      154.168710      [5, 15, 4]      5
 11:   613      152.213182      [15, 6, 1]      15
 12:   267      151.527005       [5, 6, 4]      6
 13:   797      151.262274       [5, 6, 3]      3
 14:   611      150.982270       [6, 4, 3]      6
 15:   457      150.963159       [5, 6, 3]      5
 16:   724      150.941294       [5, 3, 1]      3
 17:   615      150.589744      [5, 12, 4]      12
 18: 33644      123.715932       [8, 6, 7]      8
 19: 33197      112.783797       [8, 6, 7]      6
 20:   388      112.059562      [5, 10, 4]      4
 21:   540       97.112390       [5, 6, 4]      4
 22: 34061       94.627812       [8, 6, 7]      7
 23: 32384       92.206044       [8, 6, 7]      8
 24: 32110       86.193180       [8, 6, 7]      6
 25: 32675       82.701674       [8, 6, 7]      6
 26: 34390       73.451197       [8, 6, 7]      6
 27: 34878       70.419347       [8, 6, 7]      7
 28:   266       69.683754      [5, 15, 4]      15
 29: 31457       67.308302       [8, 6, 7]      6
 30: 33275       67.023898       [8, 6, 7]      7
 31:   976       65.729066      [5, 10, 6]      5
 32: 36672       63.460209      [5, 15, 4]      15
 33: 34217       60.512212       [8, 6, 7]      6
 34:   493       59.469083      [5, 6, 12]      6
 35: 34104       54.001622       [8, 6, 7]      7
 36: 32033       53.894249       [8, 6, 7]      6
 37: 31546       52.105440       [8, 6, 7]      6
 38: 32199       52.043978       [8, 6, 7]      7
 39: 34701       50.376507       [8, 6, 7]      8
 40: 32436       49.500585       [8, 6, 7]      8
 41: 32335       47.352396       [8, 6, 7]      6
 42: 32210       46.446430       [8, 6, 7]      8
 43: 34580       45.592824       [8, 6, 7]      8
 44: 32663       44.654265       [8, 6, 7]      8
 45: 33395       44.197098       [8, 6, 7]      8
 46: 33678       44.139657       [8, 6, 7]      7
 47: 31786       43.393267       [8, 6, 7]      8
 48:     1       42.440460      [2, 14, 12]     2
 49: 34153       41.976303       [8, 6, 7]      7
 50: 33196       40.899344       [8, 6, 7]      8

@vilterp
Copy link
Contributor

vilterp commented May 31, 2018

Hey @a-robinson, we can discuss this in a separate issue, but how did you generate the hot ranges table above? Would love to get it into an endpoint so we can show it in the UI.

@petermattis
Copy link
Collaborator

The new_order table has warehouse ID as the first component of its primary key. I'm surprised to see such a delta in QPS across those new_order ranges. And I'm guessing this is only a fraction of the new_order ranges.

@a-robinson
Copy link
Contributor Author

Hey @a-robinson, we can discuss this in a separate issue, but how did you generate the hot ranges table above? Would love to get it into an endpoint so we can show it in the UI.

hottest_ranges.py from https://gist.github.com/a-robinson/54fbaa6628ae9f1ad9c6185ecd28edb9 run on the output of the /_status/raft endpoint.

Disclaimer: the qps numbers are averages over the last 30 minutes, not instantaneous measurements. Also, that raft debug endpoint is very slow (>1 minute) on a large cluster.

And I'm guessing this is only a fraction of the new_order ranges.

No, they're all represented there except for one outlier that at the very end of the table that's responsible for fewer warehouses. Although that's ignoring the bunch of ranges in the table that are responsible for effectively none of the keyspace (e.g. /1764/3/2642-/1764/3/2642/0), which makes it look like we're choosing split points incorrectly somewhere (reminiscent of #24896):

root@localhost:26257/defaultdb> show testing_ranges from table tpcc.new_order;
+-----------------+-----------------+----------+-------------+--------------+
|    Start Key    |     End Key     | Range ID |  Replicas   | Lease Holder |
+-----------------+-----------------+----------+-------------+--------------+
| NULL            | /0              |      192 | {3,4,12}    |           12 |
| /0              | /0/1/2101/0     |    36721 | {3,4,12}    |           12 |
| /0/1/2101/0     | /138/9/2506     |      266 | {4,5,15}    |           15 |
| /138/9/2506     | /272/10/2564    |    36672 | {4,5,15}    |           15 |
| /272/10/2564    | /272/10/2564/0  |      337 | {4,5,15}    |           15 |
| /272/10/2564/0  | /521/6/2127     |      413 | {4,5,15}    |           15 |
| /521/6/2127     | /521/6/2127/0   |      460 | {5,10,15}   |           15 |
| /521/6/2127/0   | /770/1/2590     |      533 | {4,5,15}    |           15 |
| /770/1/2590     | /770/1/2590/0   |      198 | {3,6,15}    |           15 |
| /770/1/2590/0   | /1000           |      293 | {6,9,10}    |            6 |
| /1000           | /1018/7/2153    |    36711 | {6,9,10}    |            6 |
| /1018/7/2153    | /1018/7/2153/0  |      412 | {4,9,14}    |            9 |
| /1018/7/2153/0  | /1267/2/2616    |      458 | {1,12,15}   |            1 |
| /1267/2/2616    | /1267/2/2616/0  |      552 | {1,4,15}    |            1 |
| /1267/2/2616/0  | /1515/8/2179    |      557 | {1,3,15}    |            1 |
| /1515/8/2179    | /1515/8/2179/0  |      215 | {4,5,12}    |           12 |
| /1515/8/2179/0  | /1764/3/2642    |      267 | {2,4,6}     |            6 |
| /1764/3/2642    | /1764/3/2642/0  |      339 | {4,5,11}    |           11 |
| /1764/3/2642/0  | /1948/8/2638    |      388 | {4,5,10}    |            4 |
| /1948/8/2638    | /1948/8/2638/0  |      492 | {3,6,14}    |           14 |
| /1948/8/2638/0  | /2000           |      611 | {3,4,6}     |            3 |
| /2000           | /2197/4/2201    |    36697 | {3,4,6}     |            3 |
| /2197/4/2201    | /2197/4/2201/0  |      262 | {2,4,6,15}  |            2 |
| /2197/4/2201/0  | /2445/9/2664    |      457 | {5,6,13}    |            6 |
| /2445/9/2664    | /2445/9/2664/0  |      536 | {3,6,12}    |           12 |
| /2445/9/2664/0  | /2605/5/2267    |      540 | {2,6,8,15}  |            8 |
| /2605/5/2267    | /2605/5/2267/0  |      721 | {3,4,6}     |            3 |
| /2605/5/2267/0  | /2853/10/2730   |      730 | {3,5,6}     |            3 |
| /2853/10/2730   | /2853/10/2730/0 |      331 | {4,6,15}    |           15 |
| /2853/10/2730/0 | /2921/10/2590   |      493 | {5,6,12}    |           12 |
| /2921/10/2590   | /2921/10/2590/0 |      612 | {1,2,3}     |            1 |
| /2921/10/2590/0 | /3000           |      613 | {1,6,15}    |            1 |
| /3000           | /3170/6/2153    |    36685 | {1,6,15}    |            1 |
| /3170/6/2153    | /3170/6/2153/0  |      614 | {1,6,15}    |            1 |
| /3170/6/2153/0  | /3419/1/2616    |      615 | {4,5,12}    |           12 |
| /3419/1/2616    | /3419/1/2616/0  |      384 | {4,5,12}    |            4 |
| /3419/1/2616/0  | /3667/7/2179    |      496 | {4,5,6}     |            6 |
| /3667/7/2179    | /3667/7/2179/0  |      551 | {3,6,7}     |            3 |
| /3667/7/2179/0  | /3916/2/2642    |      724 | {1,3,5}     |            5 |
| /3916/2/2642    | /3916/2/2642/0  |      794 | {3,5,15}    |           15 |
| /3916/2/2642/0  | /4000           |      797 | {3,5,6}     |            3 |
| /4000           | /4164/8/2205    |    36692 | {3,5,6}     |            3 |
| /4164/8/2205    | /4164/8/2205/0  |      417 | {4,6,11,13} |            6 |
| /4164/8/2205/0  | /4413/3/2668    |      448 | {5,6,10}    |           10 |
| /4413/3/2668    | /4413/3/2668/0  |      535 | {3,5,6}     |            3 |
| /4413/3/2668/0  | /4661/9/2231    |      653 | {5,6,10}    |           10 |
| /4661/9/2231    | /4661/9/2231/0  |      726 | {4,5,13}    |           13 |
| /4661/9/2231/0  | /4910/4/2694    |      793 | {5,9,10}    |            5 |
| /4910/4/2694    | /4910/4/2694/0  |      845 | {5,10,13}   |           10 |
| /4910/4/2694/0  | /4999/10/3001   |      976 | {4,5,10}    |           10 |
| /4999/10/3001   | /5000           |     1091 | {4,6,11,13} |            6 |
| /5000           | NULL            |    36731 | {4,6,13}    |            4 |
+-----------------+-----------------+----------+-------------+--------------+
(52 rows)

@petermattis
Copy link
Collaborator

No, they're all represented there except for one outlier that at the very end of the table that's responsible for fewer warehouses. Although that's ignoring the bunch of ranges in the table that are responsible for effectively none of the keyspace (e.g. /1764/3/2642-/1764/3/2642/0), which makes it look like we're choosing split points incorrectly somewhere (reminiscent of #24896):

We don't manually split the new_order table. I'm not sure why not. Perhaps there are enough ranges already to not require splitting. The strange nearly empty ranges might indicate badness (or a bug) in the automatic splitting heuristics.

@a-robinson
Copy link
Contributor Author

Using partitioning gets basically the same results as with replica scattering:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  300.4s    60837.6  94.6%     84.7     71.3    130.0    201.3    352.3   1006.6

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  300.4s    60587.1  94.2%     88.9     79.7    130.0    184.5    369.1    906.0

The strange nearly empty ranges might indicate badness (or a bug) in the automatic splitting heuristics.

It looks to me like the tpc-c 5k fixture I'm RESTOREing has the bad splits already in it. It's as if the cluster that created the backup had partitioned the table and was running a version before #24896. Either that, or the RESTORE process itself creates the bad splits somehow. The bad splits are consistently there as soon as the data has been restored.

@petermattis
Copy link
Collaborator

It looks to me like the tpc-c 5k fixture I'm RESTOREing has the bad splits already in it. It's as if the cluster that created the backup had partitioned the table and was running a version before #24896. Either that, or the RESTORE process itself creates the bad splits somehow. The bad splits are consistently there as soon as the data has been restored.

If this is reproducible, can you file a separate issue about this? Seems like something for the bulkio folks to investigate.

a-robinson added a commit to a-robinson/cockroach that referenced this issue Jun 11, 2018
As outlined in recent comments on cockroachdb#26059, we need to bring back some
form of stats-based rebalancing in order to perform well on TPC-C
without manual partitioning and replica placement.

This commit contains a prototype that demonstrates the effectiveness of
changing our approach to making rebalancing decisions from making them
in the replicate queue, which operates on arbitrarily ordered replicas
of the ranges on a store, to making them at a higher-level. This
prototype makes them at a cluster level by running the logic on only one
node, but my real proposal is to make them at the store level.

This change in abstraction reflects what a human would do if asked to
even out the load on a cluster given perfect information about
everything happening in the cluster:

1. First, determine which stores have the most load on them (or overfull
   -- but for the prototype I only considered the one dimension that
   affects TPC-C the most)
2. Decide whether the most loaded stores are so overloaded that action
   needs to be taken.
3. Examine the hottest replicas on the store (maybe not the absolute
   hottest in practice, since moving that one could disrupt user traffic,
   but in the prototype this seems to work fine) and attempt to move them
   to under-utilized stores.  If this can be done simply by transferring
   leases to under-utilized stores, then do so. If moving leases isn't
   enough, then also rebalance replicas from the hottest store to
   under-utilized stores.
4. Repeat periodically to handle changes in load or cluster membership.

In a real versino of this code, the plan is roughly:
1. Each store will independently run their own control loop like this
   that is only responsible for moving leases/replicas off itself, not off
   other stores. This avoids needing a centralized coordinator, and will
   avoid the need to use the raft debug endpoint as long as we start
   gossiping QPS per store info, since the store already has details about
   the replicas on itself.
2. The existing replicate queue will stop making decisions motivated by
   balance. It will switch to only making decisions based on
   constraints/diversity/lease preferences, which is still needed since
   the new store-level logic will only check for store-level balance,
   not that all replicas' constraints are properly met.
3. The new code will have to avoid violating constraints/diversity/lease
   preferences.
4. The new code should consider range count, disk fullness, and maybe
   writes per second as well.
5. In order to avoid making decisions based on bad data, I'd like to
   extend lease transfers to pass along QPS data to the new leaseholder
   and preemptive snapshots to pass along WPS data to the new replica.
   This may not be strictly necessary, as shown by the success of this
   prototype, but should make for more reliable decision making.

I tested this out on TPC-C 5k on 15 nodes and am able to consistently
get 94% efficiency, which is the max I've seen using a build of the
workload generator that erroneously includes the ramp-up period in its
final stats. The first run with this code only got 85% because it took a
couple minutes to make all the lease transfers it wanted, but then all
subsequent runs got the peak efficiency while making negligibly few
lease transfers.

Note that I didn't even have to implement replica rebalancing to get
these results, which oddly contradicts my previous claims. However, I
believe that's because I did the initial split/scatter using a binary
containing cockroachdb#26438, so the replicas were already better scattered than by
default. I ran TPC-C on that build without these changes a couple times,
though, and didn't get better than 65% efficiency, so the scatter wasn't
the cause of the good results here.

Touches cockroachdb#26059, cockroachdb#17979

Release note: None
a-robinson added a commit to a-robinson/cockroach that referenced this issue Jun 11, 2018
As outlined in recent comments on cockroachdb#26059, we need to bring back some
form of stats-based rebalancing in order to perform well on TPC-C
without manual partitioning and replica placement.

This commit contains a prototype that demonstrates the effectiveness of
changing our approach to making rebalancing decisions from making them
in the replicate queue, which operates on arbitrarily ordered replicas
of the ranges on a store, to making them at a higher-level. This
prototype makes them at a cluster level by running the logic on only one
node, but my real proposal is to make them at the store level.

This change in abstraction reflects what a human would do if asked to
even out the load on a cluster given perfect information about
everything happening in the cluster:

1. First, determine which stores have the most load on them (or overfull
   -- but for the prototype I only considered the one dimension that
   affects TPC-C the most)
2. Decide whether the most loaded stores are so overloaded that action
   needs to be taken.
3. Examine the hottest replicas on the store (maybe not the absolute
   hottest in practice, since moving that one could disrupt user traffic,
   but in the prototype this seems to work fine) and attempt to move them
   to under-utilized stores.  If this can be done simply by transferring
   leases to under-utilized stores, then do so. If moving leases isn't
   enough, then also rebalance replicas from the hottest store to
   under-utilized stores.
4. Repeat periodically to handle changes in load or cluster membership.

In a real versino of this code, the plan is roughly:
1. Each store will independently run their own control loop like this
   that is only responsible for moving leases/replicas off itself, not off
   other stores. This avoids needing a centralized coordinator, and will
   avoid the need to use the raft debug endpoint as long as we start
   gossiping QPS per store info, since the store already has details about
   the replicas on itself.
2. The existing replicate queue will stop making decisions motivated by
   balance. It will switch to only making decisions based on
   constraints/diversity/lease preferences, which is still needed since
   the new store-level logic will only check for store-level balance,
   not that all replicas' constraints are properly met.
3. The new code will have to avoid violating constraints/diversity/lease
   preferences.
4. The new code should consider range count, disk fullness, and maybe
   writes per second as well.
5. In order to avoid making decisions based on bad data, I'd like to
   extend lease transfers to pass along QPS data to the new leaseholder
   and preemptive snapshots to pass along WPS data to the new replica.
   This may not be strictly necessary, as shown by the success of this
   prototype, but should make for more reliable decision making.

I tested this out on TPC-C 5k on 15 nodes and am able to consistently
get 94% efficiency, which is the max I've seen using a build of the
workload generator that erroneously includes the ramp-up period in its
final stats. The first run with this code only got 85% because it took a
couple minutes to make all the lease transfers it wanted, but then all
subsequent runs got the peak efficiency while making negligibly few
lease transfers.

Note that I didn't even have to implement replica rebalancing to get
these results, which oddly contradicts my previous claims. However, I
believe that's because I did the initial split/scatter using a binary
containing cockroachdb#26438, so the replicas were already better scattered than by
default. I ran TPC-C on that build without these changes a couple times,
though, and didn't get better than 65% efficiency, so the scatter wasn't
the cause of the good results here.

Touches cockroachdb#26059, cockroachdb#17979

Release note: None
a-robinson added a commit to a-robinson/cockroach that referenced this issue Jun 14, 2018
As outlined in recent comments on cockroachdb#26059, we need to bring back some
form of stats-based rebalancing in order to perform well on TPC-C
without manual partitioning and replica placement.

This commit contains a prototype that demonstrates the effectiveness of
changing our approach to making rebalancing decisions from making them
in the replicate queue, which operates on arbitrarily ordered replicas
of the ranges on a store, to making them at a higher-level. This
prototype makes them at a cluster level by running the logic on only one
node, but my real proposal is to make them at the store level.

This change in abstraction reflects what a human would do if asked to
even out the load on a cluster given perfect information about
everything happening in the cluster:

1. First, determine which stores have the most load on them (or overfull
   -- but for the prototype I only considered the one dimension that
   affects TPC-C the most)
2. Decide whether the most loaded stores are so overloaded that action
   needs to be taken.
3. Examine the hottest replicas on the store (maybe not the absolute
   hottest in practice, since moving that one could disrupt user traffic,
   but in the prototype this seems to work fine) and attempt to move them
   to under-utilized stores.  If this can be done simply by transferring
   leases to under-utilized stores, then do so. If moving leases isn't
   enough, then also rebalance replicas from the hottest store to
   under-utilized stores.
4. Repeat periodically to handle changes in load or cluster membership.

In a real versino of this code, the plan is roughly:
1. Each store will independently run their own control loop like this
   that is only responsible for moving leases/replicas off itself, not off
   other stores. This avoids needing a centralized coordinator, and will
   avoid the need to use the raft debug endpoint as long as we start
   gossiping QPS per store info, since the store already has details about
   the replicas on itself.
2. The existing replicate queue will stop making decisions motivated by
   balance. It will switch to only making decisions based on
   constraints/diversity/lease preferences, which is still needed since
   the new store-level logic will only check for store-level balance,
   not that all replicas' constraints are properly met.
3. The new code will have to avoid violating constraints/diversity/lease
   preferences.
4. The new code should consider range count, disk fullness, and maybe
   writes per second as well.
5. In order to avoid making decisions based on bad data, I'd like to
   extend lease transfers to pass along QPS data to the new leaseholder
   and preemptive snapshots to pass along WPS data to the new replica.
   This may not be strictly necessary, as shown by the success of this
   prototype, but should make for more reliable decision making.

I tested this out on TPC-C 5k on 15 nodes and am able to consistently
get 94% efficiency, which is the max I've seen using a build of the
workload generator that erroneously includes the ramp-up period in its
final stats. The first run with this code only got 85% because it took a
couple minutes to make all the lease transfers it wanted, but then all
subsequent runs got the peak efficiency while making negligibly few
lease transfers.

Note that I didn't even have to implement replica rebalancing to get
these results, which oddly contradicts my previous claims. However, I
believe that's because I did the initial split/scatter using a binary
containing cockroachdb#26438, so the replicas were already better scattered than by
default. I ran TPC-C on that build without these changes a couple times,
though, and didn't get better than 65% efficiency, so the scatter wasn't
the cause of the good results here.

Touches cockroachdb#26059, cockroachdb#17979

Release note: None
a-robinson added a commit to a-robinson/cockroach that referenced this issue Aug 3, 2018
As outlined in recent comments on cockroachdb#26059, we need to bring back some
form of stats-based rebalancing in order to perform well on TPC-C
without manual partitioning and replica placement.

This commit contains a prototype that demonstrates the effectiveness of
changing our approach to making rebalancing decisions from making them
in the replicate queue, which operates on arbitrarily ordered replicas
of the ranges on a store, to making them at a higher-level. This
prototype makes them at a cluster level by running the logic on only one
node, but my real proposal is to make them at the store level.

This change in abstraction reflects what a human would do if asked to
even out the load on a cluster given perfect information about
everything happening in the cluster:

1. First, determine which stores have the most load on them (or overfull
   -- but for the prototype I only considered the one dimension that
   affects TPC-C the most)
2. Decide whether the most loaded stores are so overloaded that action
   needs to be taken.
3. Examine the hottest replicas on the store (maybe not the absolute
   hottest in practice, since moving that one could disrupt user traffic,
   but in the prototype this seems to work fine) and attempt to move them
   to under-utilized stores.  If this can be done simply by transferring
   leases to under-utilized stores, then do so. If moving leases isn't
   enough, then also rebalance replicas from the hottest store to
   under-utilized stores.
4. Repeat periodically to handle changes in load or cluster membership.

In a real versino of this code, the plan is roughly:
1. Each store will independently run their own control loop like this
   that is only responsible for moving leases/replicas off itself, not off
   other stores. This avoids needing a centralized coordinator, and will
   avoid the need to use the raft debug endpoint as long as we start
   gossiping QPS per store info, since the store already has details about
   the replicas on itself.
2. The existing replicate queue will stop making decisions motivated by
   balance. It will switch to only making decisions based on
   constraints/diversity/lease preferences, which is still needed since
   the new store-level logic will only check for store-level balance,
   not that all replicas' constraints are properly met.
3. The new code will have to avoid violating constraints/diversity/lease
   preferences.
4. The new code should consider range count, disk fullness, and maybe
   writes per second as well.
5. In order to avoid making decisions based on bad data, I'd like to
   extend lease transfers to pass along QPS data to the new leaseholder
   and preemptive snapshots to pass along WPS data to the new replica.
   This may not be strictly necessary, as shown by the success of this
   prototype, but should make for more reliable decision making.

I tested this out on TPC-C 5k on 15 nodes and am able to consistently
get 94% efficiency, which is the max I've seen using a build of the
workload generator that erroneously includes the ramp-up period in its
final stats. The first run with this code only got 85% because it took a
couple minutes to make all the lease transfers it wanted, but then all
subsequent runs got the peak efficiency while making negligibly few
lease transfers.

Note that I didn't even have to implement replica rebalancing to get
these results, which oddly contradicts my previous claims. However, I
believe that's because I did the initial split/scatter using a binary
containing cockroachdb#26438, so the replicas were already better scattered than by
default. I ran TPC-C on that build without these changes a couple times,
though, and didn't get better than 65% efficiency, so the scatter wasn't
the cause of the good results here.

Touches cockroachdb#26059, cockroachdb#17979

Release note: None

[prototype] storage: Extend new allocator to also move range replicas

With this update, TPC-C 10k on 30 went from overloaded to running at
peak efficiency over the course of about 4 hours (the manual
partitioning approach takes many hours to move all the replicas as well,
for a point of comparison). This is without having to run the replica
scatter from cockroachdb#26438.

Doing a 5 minute run to get a result that doesn't include all the
rebalancing time shows:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  290.9s   124799.1  97.0%    548.6    486.5    872.4   1140.9   2281.7  10200.5

I think it may have a small bug in it still, since at one point early on
one of the replicas from the warehouse table on the node doing the
relocating thought that it had 16-17k QPS, which wasn't true by any
other metric in the system. Restarting the node fixed it though.
I'm not too concerned about the bug, since I assume I just made a code
mistake, not that anything about the approach fundamentally leads to a
random SQL table replica gets 10s of thousands of QPS.

Range 1 is also back to getting a ton of QPS (~3k) even though I raised
the range cache size from 1M to 50M. Looking at slow query traces shows
a lot of range lookups, way more than I'd expect given that ranges
weren't moving around at the time of the traces.

Release note: None

Release note: None
@a-robinson
Copy link
Contributor Author

Consider this investigated. Resulted in #28340 and #28340 after the prototype in #26608.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-distribution Relating to rebalancing and leasing. C-performance Perf of queries or internals. Solution not expected to change functional behavior.
Projects
None yet
Development

No branches or pull requests

7 participants