Calculate hash during compaction #14049
Conversation
Codecov Report
@@ Coverage Diff @@
## main #14049 +/- ##
==========================================
- Coverage 75.20% 75.05% -0.16%
==========================================
Files 451 453 +2
Lines 36791 36873 +82
==========================================
+ Hits 27669 27675 +6
- Misses 7392 7446 +54
- Partials 1730 1752 +22
After a quick review it looks like there should be no performance loss. I will do a deeper review once the PR is marked as ready for review.
Force-pushed from cfbb49f to a46abfc
-keys, _ := tx.UnsafeRange(schema.Key, last, end, int64(batchNum))
+keys, values := tx.UnsafeRange(schema.Key, last, end, int64(batchNum))
for _, key := range keys {
	rev = bytesToRev(key)
Please note that end is defined as:
binary.BigEndian.PutUint64(end, uint64(compactMainRev+1))
I'm afraid this logic would miss all entries created after the most recent compaction... but maybe I'm missing something...
compactMainRev is the revision that is being compacted. This is the goal: calculate the hash at the same time the compaction is being done, so the number of times each entry is touched does not increase. We don't care about entries created after the most recent compaction.
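To make the discussion concrete, here is a minimal, self-contained sketch (not the PR's actual code; revToBytes and the crc32 usage are illustrative stand-ins for the mvcc helpers) of why end = compactMainRev+1 bounds the scan: keys in the key bucket are big-endian-encoded revisions, so the range [start, end) covers everything up to the compaction target and nothing newer, and the same pass over keys and values can feed a running hash.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// revToBytes encodes a (main, sub) revision pair the way the mvcc key
// bucket is keyed: 8-byte big-endian main revision, '_', 8-byte
// big-endian sub revision. The name is an illustrative stand-in.
func revToBytes(main, sub int64) []byte {
	b := make([]byte, 17)
	binary.BigEndian.PutUint64(b[0:8], uint64(main))
	b[8] = '_'
	binary.BigEndian.PutUint64(b[9:17], uint64(sub))
	return b
}

func main() {
	compactMainRev := int64(42)

	// The scan end key only needs the main revision; end = compactMainRev+1
	// makes the half-open range [start, end) cover everything up to and
	// including the revision being compacted, and nothing newer.
	end := make([]byte, 8)
	binary.BigEndian.PutUint64(end, uint64(compactMainRev+1))
	fmt.Printf("range end key: %x\n", end)

	// Because the compaction batch ranges over both keys and values, the
	// same pass can feed a running hash (crc32/Castagnoli as a stand-in).
	h := crc32.New(crc32.MakeTable(crc32.Castagnoli))
	keys := [][]byte{revToBytes(41, 0), revToBytes(42, 0)} // example data
	values := [][]byte{[]byte("v41"), []byte("v42")}
	for i := range keys {
		h.Write(keys[i])
		h.Write(values[i])
	}
	fmt.Printf("hash of revisions up to %d: %08x\n", compactMainRev, h.Sum32())
}
```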
server/storage/mvcc/hash.go (outdated):
@@ -22,7 +22,7 @@ import (
	"go.etcd.io/etcd/server/v3/storage/schema"
)

-func unsafeHashByRev(tx backend.ReadTx, lower, upper revision, keep map[revision]struct{}) (uint32, error) {
+func unsafeHashByRev(tx backend.ReadTx, lower, upper int64, keep map[revision]struct{}) (uint32, error) {
I wish revision was strongly typed:
type Revision int64
such that it's hard to mix up a plain int64 and a Revision.
Probably for another refactoring.
That's a good point. Here I wanted to stop using the revision type to ensure that only the main revision is set. The revision type allows setting a sub revision, which could lead to hard-to-predict side effects.
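For illustration only, a tiny sketch of the strongly typed alternative mentioned above (hypothetical; this PR deliberately uses plain int64 instead): a defined type forces explicit conversions, which makes it harder to pass an arbitrary int64 where a main revision is expected.

```go
package main

import "fmt"

// Revision is a hypothetical defined type for a main revision; it is not
// part of this PR. A plain int64 does not convert to it implicitly, so
// call sites must be explicit about what they pass.
type Revision int64

func hashByRev(lower, upper Revision) {
	fmt.Printf("hashing range (%d, %d]\n", lower, upper)
}

func main() {
	var compactMainRev int64 = 42

	// hashByRev(0, compactMainRev)        // would not compile: int64 is not Revision
	hashByRev(0, Revision(compactMainRev)) // explicit conversion required
}
```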
I will do a deep review once it's rebased to resolve the conflict.
Done
Force-pushed from f995f8b to 8f6924c
Added the changes needed to expose the compact hash in the HashByRev function and implemented integration tests.
)
	s.hashMu.Lock()
	defer s.hashMu.Unlock()
	s.hashes = append(s.hashes, hash)
What if a member gets restarted? Will all hashes get lost? Should we persist the hashes into the db?
A fully correct solution would persist hashes into WAL entries, however we cannot backport that change. You are right that not persisting hashes means restarts are disruptive, but I prefer to have this flaw than to develop a temporary solution.
Not every mechanism in etcd works well with restarts. Consistency of the KV store definitely needs to be maintained independently of restarts, but secondary mechanisms like leases will not work if etcd is frequently restarted. It's up to the admin to ensure that etcd is not restarted too frequently so leases work well. I think it's reasonable to assume that etcd doesn't restart for non-critical features to work.
For the v3.5 backport I would treat correctness checking as a secondary mechanism that requires etcd not to restart, and for v3.6 I would look into making the mechanism fully independent of restarts.
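A rough sketch of the in-memory approach the hunk above hints at, with hypothetical type and method names: hashes are kept behind a mutex in a bounded slice, so they are cheap to store and query but are simply lost on restart, which is exactly the trade-off being discussed.

```go
package main

import (
	"fmt"
	"sync"
)

// KeyValueHash pairs a hash with the revision range it covers.
// Field names are illustrative.
type KeyValueHash struct {
	Hash            uint32
	Revision        int64
	CompactRevision int64
}

// hashStore keeps a bounded, in-memory history of compaction hashes.
// Nothing is persisted, so the history is lost on restart.
type hashStore struct {
	mu     sync.Mutex
	hashes []KeyValueHash
	limit  int
}

func (s *hashStore) Store(h KeyValueHash) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.hashes = append(s.hashes, h)
	if len(s.hashes) > s.limit {
		// Drop the oldest entries to keep the history bounded.
		s.hashes = s.hashes[len(s.hashes)-s.limit:]
	}
}

func (s *hashStore) HashByRev(rev int64) (KeyValueHash, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, h := range s.hashes {
		if h.Revision == rev {
			return h, true
		}
	}
	return KeyValueHash{}, false
}

func main() {
	s := &hashStore{limit: 10}
	s.Store(KeyValueHash{Hash: 0xdeadbeef, Revision: 42, CompactRevision: 40})
	if h, ok := s.HashByRev(42); ok {
		fmt.Printf("rev 42 hash: %08x\n", h.Hash)
	}
}
```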
Will think about this and get back later.

> A fully correct solution would persist hashes into WAL entries

Not sure what you mean? Do you mean you want to depend on WAL replay on startup to recover the hashes?
I mean that we would introduce a new WAL entry with the hash. The leader would periodically calculate its hash and send it as a raft proposal. When applying it, followers would verify whether their hash matches the one in raft.
That would be a mid-term solution before implementing a merkle root #13839
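A very rough sketch of the leader side of that idea, with entirely hypothetical types and method names (nothing here is the PR's or raft's actual API): the leader periodically computes a hash of its applied state and proposes it as an ordinary raft entry.

```go
// Package hashcheck is a hypothetical sketch, not etcd code.
package hashcheck

import (
	"context"
	"encoding/json"
	"time"
)

// HashProposal is a hypothetical payload for a new kind of WAL entry:
// the leader's hash plus the revision it was computed at.
type HashProposal struct {
	Revision int64  `json:"revision"`
	Hash     uint32 `json:"hash"`
}

// proposer abstracts "submit data through raft"; the real signature in
// etcd differs, this is only for illustration.
type proposer interface {
	Propose(ctx context.Context, data []byte) error
}

// proposeHashLoop periodically computes a hash of the applied state and
// proposes it. Because the proposal goes through consensus, every member
// sees it at the same position in the log relative to KV writes.
func proposeHashLoop(ctx context.Context, p proposer, hashAt func() HashProposal, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			hp := hashAt()
			data, err := json.Marshal(hp)
			if err != nil {
				continue
			}
			_ = p.Propose(ctx, data) // error handling omitted in this sketch
		}
	}
}
```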
It can work, but it looks like it doesn't make much sense. Raft is just a network transport channel in this case instead of a consensus protocol. The leader just transports the hash to the followers. Why not use the existing peer communication mechanism?
But it might be useful in the solution that implements the merkle root?
The idea to use raft to deliver the hash is based on the fact that there is no other way to guarantee member synchronization. The main flaw of the current system is drift between members, which makes it unreliable to query about a particular revision. The leader can never check the hash on very slow followers, as they will not yet have the requested revision. This effort works around that by keeping some history of hashes, allowing the leader to query about them more reliably.
The main reason to verify the hash via raft is that it would be in the WAL log. That allows the hash check to be part of the sequential WAL entry apply logic, which means it will be ordered relative to KV changes. Benefits:
- hash check execution is guaranteed to succeed (not influenced by drift)
- the hash check is always on the latest revision
The only downside is that calculating the hash is costly, so it will not make sense to run it frequently (only once every couple of seconds/minutes, depending on the cluster backup setup). For immediate hash verification on every entry we would need a merkle root, allowing automatic, point-in-time recovery from data corruption.
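A follower-side sketch of the ordering argument above, again with hypothetical types: because the hash entry goes through the same sequential apply path as KV writes, the check runs exactly when the store is at the revision the leader hashed.

```go
// Package applysketch is a hypothetical sketch, not etcd code.
package applysketch

import "fmt"

// hashEntry mirrors the HashProposal idea from the leader-side sketch:
// the leader's hash plus the revision it was computed at.
type hashEntry struct {
	Revision int64
	Hash     uint32
}

// kvEntry stands in for an ordinary write going through the log.
type kvEntry struct {
	Key, Value []byte
}

// store abstracts the follower's applied state.
type store interface {
	ApplyPut(key, value []byte)
	HashByRev(rev int64) (uint32, error)
}

// applyOne shows the point being made: both entry kinds pass through the
// same sequential apply path, so a hash check always runs when the store
// is at exactly the revision the leader hashed, regardless of drift.
func applyOne(s store, entry interface{}) error {
	switch e := entry.(type) {
	case kvEntry:
		s.ApplyPut(e.Key, e.Value)
	case hashEntry:
		local, err := s.HashByRev(e.Revision)
		if err != nil {
			return err
		}
		if local != e.Hash {
			return fmt.Errorf("hash mismatch at rev %d: leader=%08x local=%08x", e.Revision, e.Hash, local)
		}
	}
	return nil
}
```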
> For immediate hash verification on every entry we would need a merkle root, allowing automatic, point-in-time recovery from data corruption.

The historical data has already been compacted, so how can you perform point-in-time recovery from data corruption? It seems that we can only recover the data from a backup of the db files. I assume we need to deliver a separate tool or new commands in etcdctl or etcdutl to do it in the future. Please let me know if I missed anything.
It seems that we need to get this sorted out before merging this PR.
Force-pushed from c2ae790 to e321717
As the corruption code is super delicate and I didn't want to take any chances, I added a set of unit tests to cover all branches of the code. All other changes are just rebasing on the refactor needed for the unit tests.
Commits (all Signed-off-by: Marek Siarkowicz <[email protected]>):
- Get 100% coverage on InitialCheck and PeriodicCheck functions to avoid any mistakes.
- Makes it easier to test hash match between scheduleCompaction and HashByRev.
- … Defrag
Plan to backport once the whole solution is merged to the main branch.
Implements low-cost hash calculation during compaction as proposed in #14039.
This PR was deliberately split into smaller commits to avoid any mistakes in the code. I recommend reviewing them one by one.
To make sure that the hash calculated by hashKV and scheduleCompaction is the same, I needed to change the API so that the hash is always accompanied by the revision range it was calculated on. This is a change from the previous API, which always returned the latest revision (WAT?).
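Roughly, the return type implied by that change looks like the sketch below (field names are illustrative, not necessarily the exact struct in this PR): the hash travels together with the revision range it was computed over, so callers can no longer silently assume it was taken at the latest revision.

```go
// Hypothetical sketch of the API shape, not the PR's exact code.
package mvccsketch

// KeyValueHash bundles a hash with the revision range it was computed
// over; field names are illustrative.
type KeyValueHash struct {
	Hash            uint32
	Revision        int64 // upper bound of the hashed range
	CompactRevision int64 // lower bound, i.e. the last compaction
}

// HashByRev returns the hash together with its range instead of a bare
// uint32, so the caller always knows which revisions the hash covers.
func HashByRev(rev int64) (KeyValueHash, error) {
	// ... compute the hash over (CompactRevision, rev] ...
	return KeyValueHash{Revision: rev}, nil
}
```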
I also introduced intensive tests for the hash function in the first commit to make sure the result of the hashing function does not change.
This PR is big even though it's meant to be backported to v3.5. This is by design, as I don't think we could make a quick hacky change without making a mistake. I included some refactors that should make the code easier to understand.
cc @ahrtr @ptabor