schema: latency spike with new index backfill on tpc-c 10k (with partitioning) #36925

Closed
awoods187 opened this issue Apr 17, 2019 · 7 comments
Labels: A-schema-changes, C-bug, no-issue-activity, S-2, T-disaster-recovery, X-stale

Comments

awoods187 (Contributor) commented Apr 17, 2019

Describe the problem

I ran tpc-c 10k (with partitioning set) and passed with 97% efficiency:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  900.0s   124949.7  97.2%    268.7    251.7    520.1    604.0    838.9   3623.9
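
(For reference, the efc column is tpmC as a fraction of the TPC-C theoretical maximum of 12.86 tpmC per warehouse: 124949.7 / (10000 * 12.86) ≈ 97.2%.)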

Then I ran a schema change:
CREATE UNIQUE INDEX ON tpcc.order (o_entry_d, o_w_id, o_d_id, o_carrier_id, o_id);

I immediately saw a latency spike + QPS drop:
image

I tried running again after the schema change finished:
image

To Reproduce

set -euxo pipefail
export CLUSTER=andy-10k
# Provision a 16-node AWS cluster (15 CRDB nodes + 1 workload node).
roachprod create $CLUSTER -n 16 --clouds=aws --aws-machine-type-ssd=c5d.4xlarge
roachprod run $CLUSTER --  "sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-15 cockroach c65b71a27e4d0941bf9427b5dec1ff7f096bba7b
roachprod stage $CLUSTER:16 workload
roachprod start $CLUSTER:1-15 --racks=5
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=10000 --db=tpcc"
# Split and scatter the partitioned ranges before the measured run (duration=1s, so essentially no load).
roachprod run $CLUSTER:16 "./workload run tpcc --warehouses=10000 --partitions=5 --duration=1s --split --scatter {pgurl:1}"
roachprod sql $CLUSTER:1
# Temporarily raise the snapshot rebalance rate so rebalancing finishes faster, then drop it back down.
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate='64MiB'
# Wait for the replication queue to drain to 0.
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate='8MiB'
roachprod run $CLUSTER:16 "./workload run tpcc --ramp=10m --warehouses=10000 --active-warehouses=10000 --partitions=5 --duration=15m --split --scatter {pgurl:1-15}"

Environment:
v19.1.0-rc.3-11-gc65b71a

Debug.zip
https://drive.google.com/open?id=17ZYGOYU7Dpo7Yta6KhHTAf0nUFyRZSgg

Extract from the latency spike until now:
extract.txt

Epic CRDB-8816

Jira issue: CRDB-4469

awoods187 added the C-bug and S-2 labels on Apr 17, 2019
awoods187 (Contributor, Author) commented:
I dropped the index and set the cluster back up to run tpc-c. I confirmed that it passed again. Then, I ran tpc-c and killed one node:
image
We see a momentary spike of latency but then it comes gracefully back down.
Here's a six hour view to compare the behavior:
image

cc @bdarnell what do you think about this? The cluster is definitely worse with a schema change as opposed to with a dead node.

bdarnell (Contributor) commented:
So the schema change was running from about 20:20 to 21:10, and then from 21:30 to 22:00 the schema change was no longer running and you were just seeing the impact of the new index? This seems to show that the schema change isn't at fault, but the new schema is. You're adding a new unpartitioned, time-ordered index (o_entry_d is a timestamp) to a partitioned table. This will destroy all the parallelism that the partitioning is trying to create, so it's unsurprising that there's a high cost.
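
One way to see that concentration directly would be to look at which ranges and lease holders back the new index; a sketch, assuming the 19.1-era EXPERIMENTAL_RANGES statement and a hypothetical auto-generated index name (check SHOW INDEXES FROM tpcc."order" for the real one; order is quoted since it is a reserved word):

-- Index name is hypothetical; use the one reported by SHOW INDEXES.
SHOW EXPERIMENTAL_RANGES FROM INDEX tpcc."order"@order_o_entry_d_o_w_id_o_d_id_o_carrier_id_o_id_key;

If only a handful of ranges (and lease holders) serve the whole index, that is consistent with every new order landing on the hot tail of a timestamp-leading, unpartitioned index.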

awoods187 (Contributor, Author) commented Apr 18, 2019

Ahh, the old partitioning-the-index UX surprise. I want to run an experiment where the schema change is done with no load, and then run the load to see what happens. But the index is hung for some reason.
image
goroutines.txt
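
To tell whether the backfill is stuck or just slow, the jobs table can be queried; a sketch, assuming the 19.1-era SHOW JOBS columns (the exact column set varies by version):

SELECT job_id, status, running_status, fraction_completed
  FROM [SHOW JOBS]
 WHERE job_type = 'SCHEMA CHANGE'
 ORDER BY created DESC
 LIMIT 5;

A fraction_completed that stops advancing across several samples would point at a genuinely hung backfill rather than a slow one.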

Next up, if the index completes, I'll partition the index and run.

But I think the schema change would have to be run first and then partitioned after it completes, so it would have this large initial latency spike regardless of whether partitioning solves the sustained latency problem.
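
The two-step sequence described above would look roughly like the following; a sketch in which the index name o_reorder_idx is made up and the partition bounds mirror the table's existing five-way split. The partition columns must be a prefix of the index columns, which is why the key is reordered to lead with o_w_id rather than o_entry_d:

CREATE UNIQUE INDEX o_reorder_idx ON tpcc."order" (o_w_id, o_d_id, o_carrier_id, o_id);
ALTER INDEX tpcc."order"@o_reorder_idx PARTITION BY RANGE (o_w_id) (
    PARTITION p4_0 VALUES FROM (0) TO (2000),
    PARTITION p4_1 VALUES FROM (2000) TO (4000),
    PARTITION p4_2 VALUES FROM (4000) TO (6000),
    PARTITION p4_3 VALUES FROM (6000) TO (8000),
    PARTITION p4_4 VALUES FROM (8000) TO (10000)
);

Either way, the backfill for the new index runs before the partitions exist, so the initial spike would be expected regardless.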

awoods187 (Contributor, Author) commented Apr 23, 2019

Alright, I set up a new cluster to test this out with a different schema change:

CREATE UNIQUE INDEX ON tpcc.order (o_w_id ASC, o_d_id ASC, o_carrier_id ASC, o_id ASC)
    PARTITION BY RANGE (o_w_id) (
        PARTITION p4_0 VALUES FROM (0) TO (2000),
        PARTITION p4_1 VALUES FROM (2000) TO (4000),
        PARTITION p4_2 VALUES FROM (4000) TO (6000),
        PARTITION p4_3 VALUES FROM (6000) TO (8000),
        PARTITION p4_4 VALUES FROM (8000) TO (10000));

Immediately, the latency spiked to 10+ s and stayed there while QPS dropped by about a third. Interestingly, the latency on some nodes remained at ~200 ms.
image
During this same time period, CPU went down during the initial schema change, followed by an increase on some nodes with a rather large spread.
image

The QPS stayed low and choppy until I initiated a cancel of the schema change job a little after 00:10. The cancel and rollback job ran for 16+ minutes, completing a little before 00:30.
image
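
For reference, a cancel along these lines would do it; a sketch, using the fact that CANCEL JOBS accepts a subquery so the job id does not have to be copied by hand:

CANCEL JOBS (
    SELECT job_id FROM [SHOW JOBS]
    WHERE job_type = 'SCHEMA CHANGE' AND status = 'running'
);

The rollback of the partially built index then has to run to completion, which matches the 16+ minutes observed before QPS recovered.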

Similarly, the CPU pattern continued until I canceled the schema change. Once I did, the CPU spread became much tighter until the rollback completed:
image
image

awoods187 (Contributor, Author) commented:
I killed a node so that we could compare the behavior of the schema change to the cluster with a dead node:
image
image

awoods187 (Contributor, Author) commented:
Debug.zip https://drive.google.com/open?id=1Ql9bhbQ-PrXKOvp9JMauBOoLXtJn1pao
Merged logs for the schema change duration from debug.zip
extract.txt.zip

github-actions commented:
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
