Calculate hash during compaction #14049
Conversation
Codecov Report
@@ Coverage Diff @@
## main #14049 +/- ##
==========================================
- Coverage 75.20% 75.05% -0.16%
==========================================
Files 451 453 +2
Lines 36791 36873 +82
==========================================
+ Hits 27669 27675 +6
- Misses 7392 7446 +54
- Partials 1730 1752 +22
After a quick review it looks like there should be no performance loss. I will do a deeper review once the PR is marked as ready for review.
Force-pushed from cfbb49f to a46abfc
-keys, _ := tx.UnsafeRange(schema.Key, last, end, int64(batchNum))
+keys, values := tx.UnsafeRange(schema.Key, last, end, int64(batchNum))
for _, key := range keys {
	rev = bytesToRev(key)
Please note that end is defined as:
binary.BigEndian.PutUint64(end, uint64(compactMainRev+1))
I'm afraid this logic would miss all entries created after the most recent compaction... but maybe I'm missing something...
compactMainRev is the revision that is being compacted. This is the goal: calculate the hash at the same time the compaction is being done, so the number of times each entry is touched does not increase. We don't care about entries created after the most recent compaction.
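To make the discussion concrete, here is a minimal, self-contained sketch (not the PR's actual code; revToBytes and the crc32 usage are illustrative stand-ins for the mvcc helpers) of why end = compactMainRev+1 bounds the scan: keys in the key bucket are big-endian-encoded revisions, so the range [start, end) covers everything up to the compaction target and nothing newer, and the same pass over keys and values can feed a running hash.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// revToBytes encodes a (main, sub) revision pair the way the mvcc key
// bucket is keyed: 8-byte big-endian main revision, '_', 8-byte
// big-endian sub revision. The name is an illustrative stand-in.
func revToBytes(main, sub int64) []byte {
	b := make([]byte, 17)
	binary.BigEndian.PutUint64(b[0:8], uint64(main))
	b[8] = '_'
	binary.BigEndian.PutUint64(b[9:17], uint64(sub))
	return b
}

func main() {
	compactMainRev := int64(42)

	// The scan end key only needs the main revision; end = compactMainRev+1
	// makes the half-open range [start, end) cover everything up to and
	// including the revision being compacted, and nothing newer.
	end := make([]byte, 8)
	binary.BigEndian.PutUint64(end, uint64(compactMainRev+1))
	fmt.Printf("range end key: %x\n", end)

	// Because the compaction batch ranges over both keys and values, the
	// same pass can feed a running hash (crc32/Castagnoli as a stand-in).
	h := crc32.New(crc32.MakeTable(crc32.Castagnoli))
	keys := [][]byte{revToBytes(41, 0), revToBytes(42, 0)} // example data
	values := [][]byte{[]byte("v41"), []byte("v42")}
	for i := range keys {
		h.Write(keys[i])
		h.Write(values[i])
	}
	fmt.Printf("hash of revisions up to %d: %08x\n", compactMainRev, h.Sum32())
}
```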
server/storage/mvcc/hash.go (outdated):
@@ -22,7 +22,7 @@ import (
	"go.etcd.io/etcd/server/v3/storage/schema"
)

-func unsafeHashByRev(tx backend.ReadTx, lower, upper revision, keep map[revision]struct{}) (uint32, error) {
+func unsafeHashByRev(tx backend.ReadTx, lower, upper int64, keep map[revision]struct{}) (uint32, error) {
I wish revision was strongly typed:
type Revision int64
such that it's hard to mix up a plain int64 and a Revision.
Probably for another refactoring.
That's a good point. Here I wanted to stop using the revision type to ensure that only the main revision is set. The revision type allows setting a sub revision, which could lead to hard-to-predict side effects.
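For illustration only, a tiny sketch of the strongly typed alternative mentioned above (hypothetical; this PR deliberately uses plain int64 instead): a defined type forces explicit conversions, which makes it harder to pass an arbitrary int64 where a main revision is expected.

```go
package main

import "fmt"

// Revision is a hypothetical defined type for a main revision; it is not
// part of this PR. A plain int64 does not convert to it implicitly, so
// call sites must be explicit about what they pass.
type Revision int64

func hashByRev(lower, upper Revision) {
	fmt.Printf("hashing range (%d, %d]\n", lower, upper)
}

func main() {
	var compactMainRev int64 = 42

	// hashByRev(0, compactMainRev)        // would not compile: int64 is not Revision
	hashByRev(0, Revision(compactMainRev)) // explicit conversion required
}
```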
I will do a deep review once it's rebased to resolve the conflict.
Done
Force-pushed from f995f8b to 8f6924c
Added the changes needed to expose the compact hash in the HashByRev function and implemented integration tests.
)
	s.hashMu.Lock()
	defer s.hashMu.Unlock()
	s.hashes = append(s.hashes, hash)
What if a member gets restarted? Will all hashes get lost? Should we persist the hashes into the db?
A fully correct solution would persist hashes into WAL entries, however we cannot backport that change. You are right that not persisting hashes means restarts are disruptive, but I prefer to have this flaw than to develop a temporary solution.
Not every mechanism in etcd works well with restarts. Consistency of the KV store definitely needs to be maintained independently of restarts, but secondary mechanisms like leases will not work if etcd is frequently restarted. It's up to the admin to ensure that etcd is not restarted too frequently so leases work well. I think it's reasonable to assume that etcd doesn't restart for non-critical features to work.
For the v3.5 backport I would treat correctness checking as a secondary mechanism that requires etcd not to restart, and for v3.6 I would look into making the mechanism fully independent of restarts.
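A rough sketch of the in-memory approach the hunk above hints at, with hypothetical type and method names: hashes are kept behind a mutex in a bounded slice, so they are cheap to store and query but are simply lost on restart, which is exactly the trade-off being discussed.

```go
package main

import (
	"fmt"
	"sync"
)

// KeyValueHash pairs a hash with the revision range it covers.
// Field names are illustrative.
type KeyValueHash struct {
	Hash            uint32
	Revision        int64
	CompactRevision int64
}

// hashStore keeps a bounded, in-memory history of compaction hashes.
// Nothing is persisted, so the history is lost on restart.
type hashStore struct {
	mu     sync.Mutex
	hashes []KeyValueHash
	limit  int
}

func (s *hashStore) Store(h KeyValueHash) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.hashes = append(s.hashes, h)
	if len(s.hashes) > s.limit {
		// Drop the oldest entries to keep the history bounded.
		s.hashes = s.hashes[len(s.hashes)-s.limit:]
	}
}

func (s *hashStore) HashByRev(rev int64) (KeyValueHash, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, h := range s.hashes {
		if h.Revision == rev {
			return h, true
		}
	}
	return KeyValueHash{}, false
}

func main() {
	s := &hashStore{limit: 10}
	s.Store(KeyValueHash{Hash: 0xdeadbeef, Revision: 42, CompactRevision: 40})
	if h, ok := s.HashByRev(42); ok {
		fmt.Printf("rev 42 hash: %08x\n", h.Hash)
	}
}
```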
Will think about this and get back later.

> A fully correct solution would persist hashes into WAL entries

Not sure what you mean? Do you mean you want to depend on WAL replay on startup to recover the hashes?
I mean that we would introduce a new WAL entry with the hash. The leader would periodically calculate its hash and send it as a raft proposal. When applying it, followers would verify whether their hash matches the one in raft.
That would be a mid-term solution before implementing a merkle root #13839
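A very rough sketch of the leader side of that idea, with entirely hypothetical types and method names (nothing here is the PR's or raft's actual API): the leader periodically computes a hash of its applied state and proposes it as an ordinary raft entry.

```go
// Package hashcheck is a hypothetical sketch, not etcd code.
package hashcheck

import (
	"context"
	"encoding/json"
	"time"
)

// HashProposal is a hypothetical payload for a new kind of WAL entry:
// the leader's hash plus the revision it was computed at.
type HashProposal struct {
	Revision int64  `json:"revision"`
	Hash     uint32 `json:"hash"`
}

// proposer abstracts "submit data through raft"; the real signature in
// etcd differs, this is only for illustration.
type proposer interface {
	Propose(ctx context.Context, data []byte) error
}

// proposeHashLoop periodically computes a hash of the applied state and
// proposes it. Because the proposal goes through consensus, every member
// sees it at the same position in the log relative to KV writes.
func proposeHashLoop(ctx context.Context, p proposer, hashAt func() HashProposal, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			hp := hashAt()
			data, err := json.Marshal(hp)
			if err != nil {
				continue
			}
			_ = p.Propose(ctx, data) // error handling omitted in this sketch
		}
	}
}
```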
It can work, but it looks like it doesn't make much sense. Raft is just a network transport channel in this case instead of a consensus protocol. The leader just transports the hash to the followers. Why not use the existing peer communication mechanism?
But it might be useful in the solution that implements the merkle root?
The idea to use raft to deliver the hash is based on the fact that there is no other way to guarantee member synchronization. The main flaw of the current system is drift between members, which makes it unreliable to query about a particular revision. The leader can never check the hash on very slow followers, as they will not yet have the requested revision. This effort works around that by keeping some history of hashes, allowing the leader to query about them more reliably.
The main reason to verify the hash via raft is that it would be in the WAL log. That allows the hash check to be part of the sequential WAL entry apply logic, which means it will be ordered relative to KV changes. Benefits:
- hash check execution is guaranteed to succeed (not influenced by drift)
- the hash check is always on the latest revision
The only downside is that calculating the hash is costly, so it will not make sense to run it frequently (only once every couple of seconds/minutes, depending on the cluster backup setup). For immediate hash verification on every entry we would need a merkle root, allowing automatic, point-in-time recovery from data corruption.
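A follower-side sketch of the ordering argument above, again with hypothetical types: because the hash entry goes through the same sequential apply path as KV writes, the check runs exactly when the store is at the revision the leader hashed.

```go
// Package applysketch is a hypothetical sketch, not etcd code.
package applysketch

import "fmt"

// hashEntry mirrors the HashProposal idea from the leader-side sketch:
// the leader's hash plus the revision it was computed at.
type hashEntry struct {
	Revision int64
	Hash     uint32
}

// kvEntry stands in for an ordinary write going through the log.
type kvEntry struct {
	Key, Value []byte
}

// store abstracts the follower's applied state.
type store interface {
	ApplyPut(key, value []byte)
	HashByRev(rev int64) (uint32, error)
}

// applyOne shows the point being made: both entry kinds pass through the
// same sequential apply path, so a hash check always runs when the store
// is at exactly the revision the leader hashed, regardless of drift.
func applyOne(s store, entry interface{}) error {
	switch e := entry.(type) {
	case kvEntry:
		s.ApplyPut(e.Key, e.Value)
	case hashEntry:
		local, err := s.HashByRev(e.Revision)
		if err != nil {
			return err
		}
		if local != e.Hash {
			return fmt.Errorf("hash mismatch at rev %d: leader=%08x local=%08x", e.Revision, e.Hash, local)
		}
	}
	return nil
}
```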
> For immediate hash verification on every entry we would need a merkle root, allowing automatic, point-in-time recovery from data corruption.

The historical data has already been compacted, so how can you perform point-in-time recovery from data corruption? It seems that we can only recover the data from a backup of the db files. I assume we need to deliver a separate tool or new commands in etcdctl or etcdutl to do it in the future. Please let me know if I missed anything.
It seems that we need to get this sorted out before merging this PR.
Force-pushed from c2ae790 to e321717
As the corruption code is super delicate and I didn't want to take any chances, I added a set of unit tests to cover all branches of the code. All other changes are just rebasing on the refactor needed for the unit tests.
Commits (all Signed-off-by: Marek Siarkowicz <[email protected]>):
- Get 100% coverage on InitialCheck and PeriodicCheck functions to avoid any mistakes.
- Makes it easier to test hash match between scheduleCompaction and HashByRev.
- … Defrag
Plan to backport once the whole solution is merged to the main branch.
Implements low-cost hash calculation during compaction as proposed in #14039.
This PR was deliberately split into smaller commits to avoid any mistakes in the code. I recommend reviewing them one by one.
To make sure that the hash calculated by hashKV and scheduleCompaction is the same, I needed to change the API so that the hash is always accompanied by the revision range it was calculated on. This is a change from the previous API, which always returned the latest revision (WAT?).
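Roughly, the return type implied by that change looks like the sketch below (field names are illustrative, not necessarily the exact struct in this PR): the hash travels together with the revision range it was computed over, so callers can no longer silently assume it was taken at the latest revision.

```go
// Hypothetical sketch of the API shape, not the PR's exact code.
package mvccsketch

// KeyValueHash bundles a hash with the revision range it was computed
// over; field names are illustrative.
type KeyValueHash struct {
	Hash            uint32
	Revision        int64 // upper bound of the hashed range
	CompactRevision int64 // lower bound, i.e. the last compaction
}

// HashByRev returns the hash together with its range instead of a bare
// uint32, so the caller always knows which revisions the hash covers.
func HashByRev(rev int64) (KeyValueHash, error) {
	// ... compute the hash over (CompactRevision, rev] ...
	return KeyValueHash{Revision: rev}, nil
}
```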
I also introduced intensive tests for the hash function in the first commit to make sure the result of the hashing function does not change.
This PR is big even though it's meant to be backported to v3.5. This is by design, as I don't think we could make a quick hacky change without making a mistake. I included some refactors that should make the code easier to understand.
cc @ahrtr @ptabor