
raft: MaxSizePerMsg should not be used to limit CommittedEntry pagination #31511

Closed
nvanbenschoten opened this issue Oct 16, 2018 · 1 comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-performance Perf of queries or internals. Solution not expected to change functional behavior.

Comments

nvanbenschoten commented Oct 16, 2018

See #31330 (comment) for the origin of this discussion.

We should add a new configuration option to etcd/raft that splits the overloaded roles of MaxSizePerMsg. MaxSizePerMsg can keep its original role, and a new MaxCommittedSizePerReady config should be introduced. We can then set this new configuration dramatically higher than MaxSizePerMsg; 32-64MB would be a good place to start.

This change has already been shown to speed up bulk insertion throughput on a small number of Ranges by as much as 20%.
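
For concreteness, here is a minimal sketch of what the split could look like on etcd/raft's Config. The import path and values are illustrative, and MaxCommittedSizePerReady is the proposed field, not something etcd/raft has today:

```
package example

import "go.etcd.io/etcd/raft"

// newRaftConfig sketches the proposed split: MaxSizePerMsg keeps bounding
// outgoing append messages, while the new MaxCommittedSizePerReady bounds
// the committed entries handed back in a single Ready.
func newRaftConfig(id uint64, storage raft.Storage) *raft.Config {
	return &raft.Config{
		ID:            id,
		ElectionTick:  10,
		HeartbeatTick: 1,
		Storage:       storage,
		// Existing knob: caps the byte size of each outgoing append message.
		MaxSizePerMsg:   16 << 10, // 16 KB
		MaxInflightMsgs: 256,
		// Proposed knob: caps committed-entry bytes per Ready, set far
		// higher than MaxSizePerMsg.
		MaxCommittedSizePerReady: 64 << 20, // 64 MB
	}
}
```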

@nvanbenschoten nvanbenschoten added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels Oct 16, 2018

ajwerner commented Nov 14, 2018

With the config separated out and the new MaxCommittedSizePerReady set to 64MB, I see an increase in throughput of ~13% over six 5-minute runs per build of the somewhat pathological workload below.

```
./workload run kv '{pgurl:1-3}' --init --min-block-bytes 8193 --max-block-bytes 16385 --read-percent=0
```

```
name       old ops/s  new ops/s  delta
Cockroach   181 ± 3%   205 ± 1%  +13.29%  (p=0.002 n=6+6)
```

I'll submit a PR to etcd/raft now, and once that lands, update the config in storage/store.go.
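
Roughly, the storage/store.go side would look something like the sketch below; the constant name and wiring here are hypothetical stand-ins, not the actual CockroachDB code:

```
package storage

import "go.etcd.io/etcd/raft"

// Hypothetical default; the real constant lives in CockroachDB's storage
// configuration and may be named differently.
const defaultRaftMaxCommittedSizePerReady = 64 << 20 // 64 MB

// configureRaft decouples the two limits when building a replica's
// raft.Config: the message-size cap stays small, while committed-entry
// batching per Ready grows.
func configureRaft(cfg *raft.Config) {
	cfg.MaxSizePerMsg = 16 << 10 // keeps its original, message-size role
	cfg.MaxCommittedSizePerReady = defaultRaftMaxCommittedSizePerReady
}
```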

ajwerner added a commit to ajwerner/cockroach that referenced this issue Nov 15, 2018
Before this change, the size of committed log entries which a replica could
apply at a time was bound by the same configuration that limits the total
size of log entries sent in a single message (MaxSizePerMsg), which is
generally kilobytes. This limit had an impact on the throughput of writes
to a replica, particularly when writing large amounts of data. A new raft
configuration option, MaxCommittedSizePerReady, was added to etcd/raft in
etcd-io/etcd#10258, which allows these two size parameters to be decoupled.
This change adopts the new configuration and sets it to a default of 64MB.

On the workload below, which is set up so that the old configuration always
returns exactly one entry per Ready, we see a massive win in both throughput
and latency.

```
./workload run kv {pgurl:1-3} \
    --init --splits=10 \
    --duration 60s \
    --read-percent=${READ_PERCENT} \
    --min-block-bytes=8193 --max-block-bytes=16385 \
    --concurrency=1024
```

```
name       old ops/s  new ops/s  delta
KV0         483 ± 3%  2025 ± 3%  +319.32%  (p=0.002 n=6+6)
```

Before:
```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   60.0s        0          29570          492.8   1981.2   2281.7   5100.3   5637.1   6442.5  write
   60.0s        0          28405          473.4   2074.8   2281.7   5637.1   6710.9   7516.2  write
   60.0s        0          28615          476.9   2074.3   2550.1   5905.6   6442.5   8321.5  write
   60.0s        0          28718          478.6   2055.4   2550.1   5100.3   6442.5   7516.2  write
   60.0s        0          28567          476.1   2079.8   2684.4   4831.8   5368.7   6442.5  write
   60.0s        0          29981          499.7   1975.7   1811.9   5368.7   6174.0   6979.3  write
```

After:
```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   60.0s        0         119652         1994.0    510.9    486.5   1006.6   1409.3   4295.0  write
   60.0s        0         125321         2088.4    488.5    469.8    906.0   1275.1   4563.4  write
   60.0s        0         119644         1993.9    505.2    469.8   1006.6   1610.6   5637.1  write
   60.0s        0         119027         1983.6    511.4    469.8   1073.7   1946.2   4295.0  write
   60.0s        0         121723         2028.5    500.6    469.8   1040.2   1677.7   4160.7  write
   60.0s        0         123697         2061.4    494.1    469.8   1006.6   1610.6   4295.0  write
```

Fixes cockroachdb#31511

Release note: None
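
The "exactly one entry per Ready" behavior comes from how etcd/raft truncates a fetched batch of entries to a byte budget: it always keeps at least the first entry so that a single oversized entry cannot block progress. A simplified sketch of that truncation, modeled on etcd/raft's internal limitSize helper (the Entry type here is a stand-in for raftpb.Entry):

```
package example

// Entry stands in for raftpb.Entry; only the payload size matters here.
type Entry struct{ Data []byte }

// limitSize truncates ents to at most maxSize total bytes, but always
// keeps the first entry. With 8-16 KB blocks and a budget in the low
// kilobytes, the first entry alone exceeds the budget, so every Ready
// carries exactly one committed entry.
func limitSize(ents []Entry, maxSize uint64) []Entry {
	if len(ents) == 0 {
		return ents
	}
	size := uint64(len(ents[0].Data))
	for i := 1; i < len(ents); i++ {
		size += uint64(len(ents[i].Data))
		if size > maxSize {
			return ents[:i]
		}
	}
	return ents
}
```

Raising the committed-entry budget to 64MB lets each Ready batch thousands of these entries instead of one, which is where the throughput win comes from.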
ajwerner added a commit to ajwerner/cockroach that referenced this issue Nov 15, 2018
craig bot pushed a commit that referenced this issue Nov 15, 2018
32387: storage: adopt new raft MaxCommittedSizePerReady config parameter r=ajwerner a=ajwerner

craig bot closed this as completed in #32387 on Nov 15, 2018
ajwerner added a commit to ajwerner/cockroach that referenced this issue Dec 18, 2018