Fix the potential data loss for clusters with only one member #14394

Closed
wants to merge 1 commit into from

Conversation

ahrtr
Member

@ahrtr ahrtr commented Aug 27, 2022

Fix #14370

For a cluster with only one member, raft always sends identical
unstable entries and committed entries to etcdserver, and etcd
responds to the client once it finishes (actually only partially) the
apply workflow.

When the client receives the response, it doesn't mean etcd has already
successfully saved the data to both BoltDB and the WAL, because:

  1. etcd commits the boltDB transaction periodically instead of on each request;
  2. etcd saves WAL entries in parallel with applying the committed entries.

Accordingly, data may be lost if etcd crashes immediately after responding
to the client and before boltDB and the WAL have successfully saved the
data to disk.
Note that this issue can only happen for clusters with only one member.

For clusters with multiple members, it isn't an issue, because etcd will
not commit & apply the data before it has been replicated to a majority of members.
When the client receives the response, it means the data must have been applied,
which in turn means the data must have been committed.
Note: for clusters with multiple members, raft will never send identical
unstable entries and committed entries to etcdserver.
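
To make the ordering problem concrete, here is a minimal sketch of the window described above and one way to close it: hold the client response until the WAL write has completed. It is a toy model rather than etcd's code. The `entry` type, the `eventLog` helper, and `process` are all hypothetical names, and the "WAL" is just an in-memory event list.

```go
package main

import (
	"fmt"
	"sync"
)

// entry stands in for a raft log entry (hypothetical, not raftpb.Entry).
type entry struct {
	Index uint64
	Data  string
}

// eventLog records the order in which the simulated steps happen.
type eventLog struct {
	mu     sync.Mutex
	events []string
}

func (l *eventLog) add(ev string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.events = append(l.events, ev)
}

// process simulates handling one batch: the WAL write runs in its own
// goroutine while the caller applies the entries. With waitWALSync set, the
// client response is held until the WAL goroutine reports completion.
func process(ents []entry, waitWALSync bool) []string {
	log := &eventLog{}
	walSynced := make(chan struct{})

	go func() {
		defer close(walSynced)
		log.add(fmt.Sprintf("wal-synced:%d", len(ents)))
	}()

	log.add(fmt.Sprintf("applied:%d", len(ents)))
	if waitWALSync {
		<-walSynced // durability first, then acknowledge
	}
	log.add("responded")

	<-walSynced // let the writer finish before returning the log
	return log.events
}

func main() {
	events := process([]entry{{Index: 1, Data: "put k v"}}, true)
	fmt.Println(events)
}
```

With `waitWALSync` set to `true`, `responded` is always the last event; with `false`, the response can be recorded before the WAL write, which is the crash window this PR targets.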

Signed-off-by: Benjamin Wang [email protected]

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

@codecov-commenter

codecov-commenter commented Aug 27, 2022

Codecov Report

Merging #14394 (3243706) into main (f56e0d0) will increase coverage by 0.04%.
The diff coverage is 95.55%.

@@            Coverage Diff             @@
##             main   #14394      +/-   ##
==========================================
+ Coverage   75.34%   75.38%   +0.04%     
==========================================
  Files         457      457              
  Lines       37185    37208      +23     
==========================================
+ Hits        28016    28049      +33     
+ Misses       7405     7394      -11     
- Partials     1764     1765       +1     
| Flag | Coverage Δ |
|------|------------|
| all | `75.38% <95.55%> (+0.04%)` ⬆️ |


| Impacted Files | Coverage Δ |
|----------------|------------|
| server/etcdserver/server.go | `85.59% <91.66%> (+0.18%)` ⬆️ |
| server/etcdserver/raft.go | `89.41% <100.00%> (+0.78%)` ⬆️ |
| client/v3/leasing/util.go | `91.66% <0.00%> (-6.67%)` ⬇️ |
| client/v3/leasing/cache.go | `87.77% <0.00%> (-3.89%)` ⬇️ |
| client/pkg/v3/testutil/recorder.go | `76.27% <0.00%> (-3.39%)` ⬇️ |
| pkg/traceutil/trace.go | `96.15% <0.00%> (-1.93%)` ⬇️ |
| server/etcdserver/api/rafthttp/msgappv2_codec.go | `69.56% <0.00%> (-1.74%)` ⬇️ |
| client/v3/leasing/kv.go | `89.70% <0.00%> (-1.67%)` ⬇️ |
| server/etcdserver/api/v3rpc/interceptor.go | `76.56% <0.00%> (-1.05%)` ⬇️ |
| server/etcdserver/corrupt.go | `88.77% <0.00%> (-0.67%)` ⬇️ |

... and 12 more


@ahrtr
Member Author

ahrtr commented Aug 27, 2022

cc @ptabor @serathius @spzala This might be an important fix, please take a look, thx.

Note that clusters with only one member aren't recommended for production use with the existing official releases, including 3.5.[0-4] and 3.4.x, because they may lose data when etcd crashes while under high load. cc @dims @liggitt

server/etcdserver/raft.go Outdated
// It further means the data must have been committed.
// Note: for clusters with multiple members, the raft will never send identical
// unstable entries and committed entries to etcdserver.
func shouldWaitWALSync(unstableEntries []raftpb.Entry, committedEntries []raftpb.Entry) bool {
Contributor

Is there a way to have a single code path that is safe regardless of whether we're in multi-server or single server mode?

Member Author

I do not get your point.

For a multi-member cluster, there is no need to wait for the WAL sync, and this function will always return false.
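
As an illustration of the check being discussed, here is a minimal sketch of a predicate that returns true only when the unstable and committed batches are identical. The function body is not shown in this thread, so the comparison below is an assumption based on the PR description. `Entry` is a local stand-in for `raftpb.Entry`, and `shouldWaitWALSyncSketch` is a hypothetical name chosen to avoid implying it is the PR's implementation.

```go
package main

import "fmt"

// Entry mirrors only the two fields this sketch needs; the type used in the
// PR is raftpb.Entry.
type Entry struct {
	Term  uint64
	Index uint64
}

// shouldWaitWALSyncSketch reports whether the unstable and committed batches
// are the same entries, which per the PR description only happens in a
// single-member cluster. Assumption: "identical" means equal length and
// matching Term/Index at every position.
func shouldWaitWALSyncSketch(unstable, committed []Entry) bool {
	if len(unstable) == 0 || len(unstable) != len(committed) {
		return false
	}
	for i := range unstable {
		if unstable[i] != committed[i] {
			return false
		}
	}
	return true
}

func main() {
	batch := []Entry{{Term: 2, Index: 10}, {Term: 2, Index: 11}}

	// Single-member shape: the same entries are both unstable and committed.
	fmt.Println(shouldWaitWALSyncSketch(batch, batch)) // true

	// Multi-member shape: committed entries lag behind the unstable ones.
	older := []Entry{{Term: 2, Index: 8}, {Term: 2, Index: 9}}
	fmt.Println(shouldWaitWALSyncSketch(batch, older)) // false
}
```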

Member Author

@ahrtr ahrtr Aug 27, 2022

I had two solutions in mind before submitting this PR.

The first solution is to enhance the existing raft protocol. For clusters with only one member, the existing raft workflow commits each log entry immediately when it receives the proposal, because the only confirmation it needs is its own. Accordingly, it sends identical unstable logs and committed logs to etcdserver. The solution is to send a message to etcdserver and wait for the confirmation, regardless of whether it's single-server or multi-server. The upside of this solution is that it looks elegant. The downside is that it has some impact on performance, and it also requires updating the stable raft package. This might be what you mean by a single code path.

The second solution is what this PR delivers. The upside is that it has little performance impact, and no impact on multi-server clusters at all. The downside is that it complicates the apply workflow, but that should be acceptable.

For now, I went with the second solution.

Member

The fix proposed by @ahrtr makes sense to me.

@ahrtr ahrtr force-pushed the one_member_data_loss branch 3 times, most recently from 61dca0c to db01837 Compare August 28, 2022 06:36
@ahrtr ahrtr force-pushed the one_member_data_loss branch from db01837 to 2b2bb3e Compare August 28, 2022 21:49
@dims
Contributor

dims commented Aug 28, 2022

cc @chaochn47 @geetasg

@ahrtr
Member Author

ahrtr commented Aug 29, 2022

Performance comparison

Linux server configuration

MemTotal:       16423532 kB
16 CPUs, each an Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz

Commands:

./etcd  --quota-backend-bytes=4300000000  
./benchmark txn-put --endpoints="http://127.0.0.1:2379" --clients=200 --conns=200 --key-space-size=4000000000 --key-size=128 --val-size=10240  --total=200000 --rate=40000

Result on one-server cluster

Note that I ran the benchmark multiple times and got stable results.

Result on main

Summary:
  Total:	56.1029 secs.
  Slowest:	0.1439 secs.
  Fastest:	0.0021 secs.
  Average:	0.0559 secs.
  Stddev:	0.0289 secs.
  Requests/sec:	3564.8803

Response time histogram:
  0.0021 [1]	|
  0.0163 [20555]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0304 [14212]	|∎∎∎∎∎∎∎∎∎
  0.0446 [38302]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0588 [56887]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0730 [12060]	|∎∎∎∎∎∎∎∎
  0.0872 [16407]	|∎∎∎∎∎∎∎∎∎∎∎
  0.1013 [25805]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1155 [11993]	|∎∎∎∎∎∎∎∎
  0.1297 [3452]	|∎∎
  0.1439 [326]	|

Latency distribution:
  10% in 0.0161 secs.
  25% in 0.0361 secs.
  50% in 0.0506 secs.
  75% in 0.0785 secs.
  90% in 0.0981 secs.
  95% in 0.1087 secs.
  99% in 0.1203 secs.
  99.9% in 0.1360 secs.

Result on branch one_member_data_loss

Summary:
  Total:	59.1221 secs.
  Slowest:	0.1435 secs.
  Fastest:	0.0128 secs.
  Average:	0.0590 secs.
  Stddev:	0.0213 secs.
  Requests/sec:	3382.8273

Response time histogram:
  0.0128 [1]	|
  0.0259 [1029]	|
  0.0390 [36291]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0520 [64848]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0651 [25016]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0782 [26937]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0912 [25053]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1043 [16544]	|∎∎∎∎∎∎∎∎∎∎
  0.1174 [3387]	|∎∎
  0.1305 [764]	|
  0.1435 [130]	|

Latency distribution:
  10% in 0.0361 secs.
  25% in 0.0417 secs.
  50% in 0.0516 secs.
  75% in 0.0754 secs.
  90% in 0.0920 secs.
  95% in 0.0965 secs.
  99% in 0.1116 secs.
  99.9% in 0.1267 secs.

Summary

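Overall, throughput on the branch is about 5.38% lower when measured against the branch's own rate ((3564 - 3382) / 3382), or about 5.1% lower relative to main ((3564 - 3382) / 3564).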

@ahrtr
Member Author

ahrtr commented Aug 29, 2022

Result on three-server cluster

Commands:

$ goreman start  

$ ./benchmark txn-put --endpoints="http://127.0.0.1:2379" --clients=200 --conns=200 --key-space-size=1000000000 --key-size=128 --val-size=10240  --total=100000 --rate=40000

Result on main

Summary:
  Total:	58.5007 secs.
  Slowest:	0.2334 secs.
  Fastest:	0.0135 secs.
  Average:	0.1168 secs.
  Stddev:	0.0349 secs.
  Requests/sec:	1709.3817

Response time histogram:
  0.0135 [1]	|
  0.0355 [166]	|
  0.0575 [2216]	|∎∎∎
  0.0795 [16128]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1015 [19326]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1235 [14502]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1455 [25602]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1674 [15279]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1894 [5343]	|∎∎∎∎∎∎∎∎
  0.2114 [1140]	|∎
  0.2334 [297]	|

Latency distribution:
  10% in 0.0719 secs.
  25% in 0.0857 secs.
  50% in 0.1208 secs.
  75% in 0.1420 secs.
  90% in 0.1606 secs.
  95% in 0.1727 secs.
  99% in 0.1920 secs.
  99.9% in 0.2294 secs.

Result on branch one_member_data_loss

Summary:
  Total:	57.8285 secs.
  Slowest:	0.2478 secs.
  Fastest:	0.0114 secs.
  Average:	0.1155 secs.
  Stddev:	0.0338 secs.
  Requests/sec:	1729.2500

Response time histogram:
  0.0114 [1]	|
  0.0350 [48]	|
  0.0587 [2551]	|∎∎∎
  0.0823 [20336]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1060 [16035]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1296 [21643]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1532 [26055]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1769 [11141]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.2005 [1703]	|∎∎
  0.2241 [465]	|
  0.2478 [22]	|

Latency distribution:
  10% in 0.0714 secs.
  25% in 0.0839 secs.
  50% in 0.1199 secs.
  75% in 0.1410 secs.
  90% in 0.1594 secs.
  95% in 0.1664 secs.
  99% in 0.1883 secs.
  99.9% in 0.2166 secs.

Summary

Overall the performance results are effectively the same (1709.38 vs. 1729.25 requests/sec, about 1.2% apart).

@ahrtr
Member Author

ahrtr commented Aug 29, 2022

Points:

  1. There is no performance impact on multi-server clusters.
  2. There is a slight performance degradation (about 5.38%) for single-server clusters. Correctness takes precedence over performance, so I think the PR is acceptable and should be cherry-picked to 3.5, and probably 3.4.

@ahrtr ahrtr added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Aug 29, 2022
@serathius
Member

Makes sense. It looks like the performance regression is mostly visible at the 10th percentile of the latency distribution. I would expect that the much lower 10th-percentile latency on main benefited from the lack of durability. I think it's reasonable to trade latency for durability for those requests.

I support backporting this change as is to v3.4 and v3.5.
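
The percentile observation can be checked against the single-server latency distributions posted earlier in this thread. The sketch below hard-codes those reported values and prints the relative change per percentile; the `row` type and `deltaPct` helper are names made up for illustration.

```go
package main

import "fmt"

// row holds one line of the reported latency distribution, in seconds.
type row struct {
	pct       string
	mainSec   float64
	branchSec float64
}

// rows copies the single-server "Latency distribution" values from the
// benchmark comments (main vs. branch one_member_data_loss).
var rows = []row{
	{"10%", 0.0161, 0.0361},
	{"25%", 0.0361, 0.0417},
	{"50%", 0.0506, 0.0516},
	{"75%", 0.0785, 0.0754},
	{"90%", 0.0981, 0.0920},
	{"95%", 0.1087, 0.0965},
	{"99%", 0.1203, 0.1116},
	{"99.9%", 0.1360, 0.1267},
}

// deltaPct returns the branch latency change relative to main, in percent.
// Positive values mean the branch is slower at that percentile.
func deltaPct(r row) float64 {
	return (r.branchSec - r.mainSec) / r.mainSec * 100
}

func main() {
	for _, r := range rows {
		fmt.Printf("%-6s main=%.4fs branch=%.4fs change=%+.1f%%\n",
			r.pct, r.mainSec, r.branchSec, deltaPct(r))
	}
}
```

In that data the 10th percentile more than doubles (0.0161 s to 0.0361 s), the median barely moves, and the 75th percentile and above are slightly lower on the branch.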

Member

@serathius serathius left a comment

LGTM, however let's wait for more maintainers to have a look

@ahrtr ahrtr force-pushed the one_member_data_loss branch from 2b2bb3e to 3243706 Compare August 29, 2022 07:51
@ahrtr
Member Author

ahrtr commented Aug 29, 2022

LGTM, however let's wait for more maintainers to have a look

Thanks @serathius for the quick review. @ptabor and @spzala, please take a look, thx.

@ahrtr
Member Author

ahrtr commented Aug 29, 2022

Makes sense, looks like the performance regression is mostly visible in 10 percentile of latency distribution. I would expect that much lower 10%ile benefited from lack of durability. I think it's reasonable to trade latency for durability for those requests.

I support backporting this change as it is for v3.4 and v3.5

Ack. One more point: the faster the disk I/O, the smaller the performance degradation. So when the disk I/O is fast enough, the degradation should be even smaller.

@serathius
Member

I will work with the K8s Scalability folks to validate this change for K8s.

Member

@spzala spzala left a comment

@ahrtr thanks for the great work and benchmark results! We need to add an entry to changelog for this, but that can be done separately. Also, the backport approach sounds good, thanks @serathius

@ahrtr
Member Author

ahrtr commented Aug 30, 2022

@ahrtr thanks for the great work and benchmark results! We need to add an entry to changelog for this, but that can be done separately. Also, the backport approach sounds good, thanks @serathius

Thanks @spzala. We need to decide whether to merge this one or #14400. Either way, I will update the changelog in a separate PR, after backporting.

@ahrtr ahrtr mentioned this pull request Aug 31, 2022
@ahrtr
Member Author

ahrtr commented Sep 5, 2022

Closing this PR because we eventually merged #14400.

@ahrtr ahrtr closed this Sep 5, 2022
tbg added a commit to tbg/etcd that referenced this pull request Sep 19, 2022
I ran this PR against its main merge-base twice (on my 2021 Mac M1 pro),
and in both cases this PR was slightly faster, using the benchmark
invocation from [^1].

2819.6 vs 2808.4
2873.1 vs 2835

Full output below.

----

Script:

```
killall etcd
rm -rf default.etcd
scripts/build.sh
nohup ./bin/etcd  --quota-backend-bytes=4300000000 &
sleep 10
f=bench-$(git log -1 --pretty=%s | sed -E 's/[^A-Za-z0-9]+/_/g').txt
go run ./tools/benchmark txn-put --endpoints="http://127.0.0.1:2379" --clients=200 --conns=200 --key-space-size=4000000000 --key-size=128 --val-size=10240  --total=200000 --rate=40000 | tee "${f}"
```

PR:

```
Summary:
  Total:	70.9320 secs.
  Slowest:	0.3003 secs.
  Fastest:	0.0044 secs.
  Average:	0.0707 secs.
  Stddev:	0.0437 secs.
  Requests/sec:	2819.6030 (second run: 2873.0935)

Response time histogram:
  0.0044 [1]	|
  0.0340 [2877]	|
  0.0636 [119485]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0932 [17436]	|∎∎∎∎∎
  0.1228 [27364]	|∎∎∎∎∎∎∎∎∎
  0.1524 [20349]	|∎∎∎∎∎∎
  0.1820 [10214]	|∎∎∎
  0.2116 [1248]	|
  0.2412 [564]	|
  0.2707 [318]	|
  0.3003 [144]	|

Latency distribution:
  10% in 0.0368 secs.
  25% in 0.0381 secs.
  50% in 0.0416 secs.
  75% in 0.0998 secs.
  90% in 0.1375 secs.
  95% in 0.1571 secs.
  99% in 0.1850 secs.
  99.9% in 0.2650 secs.
```

main:

```
Summary:
  Total:	71.2152 secs.
  Slowest:	0.6926 secs.
  Fastest:	0.0040 secs.
  Average:	0.0710 secs.
  Stddev:	0.0461 secs.
  Requests/sec:	2808.3903 (second run: 2834.98)

Response time histogram:
  0.0040 [1]	|
  0.0728 [125816]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1417 [59127]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.2105 [13476]	|∎∎∎∎
  0.2794 [1125]	|
  0.3483 [137]	|
  0.4171 [93]	|
  0.4860 [193]	|
  0.5549 [4]	|
  0.6237 [16]	|
  0.6926 [12]	|

Latency distribution:
  10% in 0.0367 secs.
  25% in 0.0379 secs.
  50% in 0.0417 secs.
  75% in 0.0993 secs.
  90% in 0.1367 secs.
  95% in 0.1567 secs.
  99% in 0.1957 secs.
  99.9% in 0.4361 secs.
```

[^1]: etcd-io#14394 (comment)

Signed-off-by: Tobias Grieger <[email protected]>
tbg added a commit to tbg/etcd that referenced this pull request Sep 20, 2022
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Development

Successfully merging this pull request may close these issues.

Durability API guarantee broken in single node cluster
7 participants