
raft: MaxSizePerMsg should not be used to limit CommittedEntry pagination #31511

Closed
nvanbenschoten opened this issue Oct 16, 2018 · 1 comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-performance Perf of queries or internals. Solution not expected to change functional behavior.

Comments

nvanbenschoten commented Oct 16, 2018

See #31330 (comment) for the origin of this discussion.

We should add a new configuration option to etcd/raft that splits the overloaded roles of MaxSizePerMsg. MaxSizePerMsg can keep its original role, and a new MaxCommittedSizePerReady config should be introduced. We can then set this new configuration dramatically higher than MaxSizePerMsg; 32-64MB would be a good place to start.

This change has already been shown to speed up bulk insertion throughput on a small number of Ranges by as much as 20%.
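
For concreteness, here is a minimal sketch of what the split could look like on etcd/raft's Config. The import path and values are illustrative, and MaxCommittedSizePerReady is the proposed field, not something etcd/raft has today:

```
package example

import "go.etcd.io/etcd/raft"

// newRaftConfig sketches the proposed split: MaxSizePerMsg keeps bounding
// outgoing append messages, while the new MaxCommittedSizePerReady bounds
// the committed entries handed back in a single Ready.
func newRaftConfig(id uint64, storage raft.Storage) *raft.Config {
	return &raft.Config{
		ID:            id,
		ElectionTick:  10,
		HeartbeatTick: 1,
		Storage:       storage,
		// Existing knob: caps the byte size of each outgoing append message.
		MaxSizePerMsg:   16 << 10, // 16 KB
		MaxInflightMsgs: 256,
		// Proposed knob: caps committed-entry bytes per Ready, set far
		// higher than MaxSizePerMsg.
		MaxCommittedSizePerReady: 64 << 20, // 64 MB
	}
}
```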

@nvanbenschoten nvanbenschoten added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels Oct 16, 2018

ajwerner commented Nov 14, 2018

With the config separated out and the new MaxCommittedSizePerReady set to 64MB, I see an increase in throughput of ~13% over six 5-minute runs per build of the somewhat pathological workload below.

```
./workload run kv '{pgurl:1-3}' --init --min-block-bytes 8193 --max-block-bytes 16385 --read-percent=0
```

```
name       old ops/s  new ops/s  delta
Cockroach   181 ± 3%   205 ± 1%  +13.29%  (p=0.002 n=6+6)
```

I'll submit a PR to etcd/raft now, and once that lands, update the config in storage/store.go.
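
Roughly, the storage/store.go side would look something like the sketch below; the constant name and wiring here are hypothetical stand-ins, not the actual CockroachDB code:

```
package storage

import "go.etcd.io/etcd/raft"

// Hypothetical default; the real constant lives in CockroachDB's storage
// configuration and may be named differently.
const defaultRaftMaxCommittedSizePerReady = 64 << 20 // 64 MB

// configureRaft decouples the two limits when building a replica's
// raft.Config: the message-size cap stays small, while committed-entry
// batching per Ready grows.
func configureRaft(cfg *raft.Config) {
	cfg.MaxSizePerMsg = 16 << 10 // keeps its original, message-size role
	cfg.MaxCommittedSizePerReady = defaultRaftMaxCommittedSizePerReady
}
```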

ajwerner added a commit to ajwerner/cockroach that referenced this issue Nov 15, 2018
Before this change, the size of committed log entries which a replica could
apply at a time was bound by the same configuration that limits the total
size of log entries sent in a single message (MaxSizePerMsg), which is
generally kilobytes. This limit had an impact on the throughput of writes
to a replica, particularly when writing large amounts of data. A new raft
configuration option, MaxCommittedSizePerReady, was added to etcd/raft in
etcd-io/etcd#10258, which allows these two size parameters to be decoupled.
This change adopts the new configuration and sets it to a default of 64MB.

On the workload below, which is set up so that the old configuration always
returns exactly one entry per Ready, we see a massive win in both throughput
and latency.

```
./workload run kv {pgurl:1-3} \
    --init --splits=10 \
    --duration 60s \
    --read-percent=${READ_PERCENT} \
    --min-block-bytes=8193 --max-block-bytes=16385 \
    --concurrency=1024
```

```
name       old ops/s  new ops/s  delta
KV0         483 ± 3%  2025 ± 3%  +319.32%  (p=0.002 n=6+6)
```

Before:
```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   60.0s        0          29570          492.8   1981.2   2281.7   5100.3   5637.1   6442.5  write
   60.0s        0          28405          473.4   2074.8   2281.7   5637.1   6710.9   7516.2  write
   60.0s        0          28615          476.9   2074.3   2550.1   5905.6   6442.5   8321.5  write
   60.0s        0          28718          478.6   2055.4   2550.1   5100.3   6442.5   7516.2  write
   60.0s        0          28567          476.1   2079.8   2684.4   4831.8   5368.7   6442.5  write
   60.0s        0          29981          499.7   1975.7   1811.9   5368.7   6174.0   6979.3  write
```

After:
```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   60.0s        0         119652         1994.0    510.9    486.5   1006.6   1409.3   4295.0  write
   60.0s        0         125321         2088.4    488.5    469.8    906.0   1275.1   4563.4  write
   60.0s        0         119644         1993.9    505.2    469.8   1006.6   1610.6   5637.1  write
   60.0s        0         119027         1983.6    511.4    469.8   1073.7   1946.2   4295.0  write
   60.0s        0         121723         2028.5    500.6    469.8   1040.2   1677.7   4160.7  write
   60.0s        0         123697         2061.4    494.1    469.8   1006.6   1610.6   4295.0  write
```

Fixes cockroachdb#31511

Release note: None
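
The "exactly one entry per Ready" behavior comes from how etcd/raft truncates a fetched batch of entries to a byte budget: it always keeps at least the first entry so that a single oversized entry cannot block progress. A simplified sketch of that truncation, modeled on etcd/raft's internal limitSize helper (the Entry type here is a stand-in for raftpb.Entry):

```
package example

// Entry stands in for raftpb.Entry; only the payload size matters here.
type Entry struct{ Data []byte }

// limitSize truncates ents to at most maxSize total bytes, but always
// keeps the first entry. With 8-16 KB blocks and a budget in the low
// kilobytes, the first entry alone exceeds the budget, so every Ready
// carries exactly one committed entry.
func limitSize(ents []Entry, maxSize uint64) []Entry {
	if len(ents) == 0 {
		return ents
	}
	size := uint64(len(ents[0].Data))
	for i := 1; i < len(ents); i++ {
		size += uint64(len(ents[i].Data))
		if size > maxSize {
			return ents[:i]
		}
	}
	return ents
}
```

Raising the committed-entry budget to 64MB lets each Ready batch thousands of these entries instead of one, which is where the throughput win comes from.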
ajwerner added a commit to ajwerner/cockroach that referenced this issue Nov 15, 2018
craig bot pushed a commit that referenced this issue Nov 15, 2018
32387: storage: adopt new raft MaxCommittedSizePerReady config parameter r=ajwerner a=ajwerner

craig bot closed this as completed in #32387 on Nov 15, 2018
ajwerner added a commit to ajwerner/cockroach that referenced this issue Dec 18, 2018