Incorrect hash when resuming scheduled compaction after etcd restarts #15919
Comments
Thanks for raising this issue. It seems that the member automatically recovered from the "corrupted" status after you disarmed the CORRUPT alarm? I think we can intentionally inject a failpoint during the compaction operation to try to reproduce this issue. |
I'm not sure whether the resolution was directly triggered by disarming the CORRUPT alarm, or whether it unblocked something (ex: now that the cluster was in a healthy state again, one of the automatic compaction requests from kube apiserver could occur and update the value stored for the latest compacted hash.) But whatever happened after I disarmed it happened automatically from my perspective. The cluster went back to serving write requests just fine. |
I think I've found the bug. When the compaction is resumed after recovery, the hasher does not hash every key and value; it starts from where it was interrupted (since the older values have already been deleted), and so produces a different hash value. The result of the compaction is the same, but the hash values diverge, because compaction keeps only the latest version while the hasher covers all of the history. Please assign this to me @ahrtr; I will provide tests to reproduce this issue and fix it. Relevant code: etcd/server/storage/mvcc/kvstore_compaction.go Lines 44 to 61 in 7cc98e6
Here the stale value of a key is deleted in the process of hashing (line 57). If this is interrupted, the hasher can't get it anymore. |
Good catch! It seems that the most feasible way is to just ignore the hash if the previous compaction was somehow interrupted, e.g. when the first revision != prevCompactRev. |
@ahrtr There's a minor issue about it: during the very first compaction, the prevCompactRev = -1. From the official documentation of etcd:
So, is it guaranteed that the first revision that CAN BE COMPACTED is 2, and the revision is always continuous? It agrees with my observation. If so, then when prevCompactRev = -1 and revision > 2, we can conclude that the hash is incomplete. Is that correct? If it is not the case, my idea about this is to do a query and persist the first revision in the store when scheduling the compaction. When really starting to do the compaction, we check if the first revision in the store is still the same as the one persisted. If they are the same, then we can believe that the hash is complete; otherwise we drop the hash value. This approach does not depend on how etcd deals with revisions, as long as it increases monotonically. |
…-io#15919. If there is a gap between prevCompactRev and the smallest revision, then the compaction is continued from the middle. In this case, we cannot get the correct compaction hash, thus we drop it. Signed-off-by: caojiamingalan <[email protected]>
YES, it's.
I think we just need to add a new field BTW, I propose to rename
|
@CaojiamingAlan Thanks for looking into the issue. The issue is corrected to resumed compaction reporting an incorrect hash for an incorrect revision range (compactRev, revision). I don't think we should use the hash calculated when compaction is resumed midway. Is there any problem with implementing just that? |
@serathius . I see you disagree with my comment above, but you gave no detailed reasons (which I don't think is good). Actually I have two comments, and I don't know which one you disagree with.
|
I think that, for a short-term fix, the data corruption detection implementation is complicated enough. I would prefer to just fix the standing issue and refocus the effort on #13839.
The newly proposed names are more confusing than before. For calculating hash during compaction there is no |
If you read the comment history, most of the comments discuss fixing the issue.
Don't agree with this. For the latest hash value, the names |
I think this is incorrect if we consider only the CompactionCheck. If one store is actually corrupted and drops some of the revisions, it cannot be detected: it is likely that the FirstRealRevisions are different for the corrupted store and the correct store, so the check is skipped. However, we can probably detect this through PeriodicalCheck. The question is whether false negatives are acceptable for any single corruption check. If they are, then this is a valid approach. |
It's a valid point. Then the simplest way is to compare ScheduledCompactKeyName and FinishedCompactKeyName before performing each compaction operation. If they don't match, the previous compaction somehow did not finish, and we should skip calculating the hash in that case. |
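The check proposed above can be sketched as a toy state machine. This is an illustration under stated assumptions, not etcd's implementation: `toyKV`, its `scheduled` and `finished` fields (stand-ins for the values stored under ScheduledCompactKeyName and FinishedCompactKeyName), and the `crash` flag are all hypothetical.

```go
package main

import "fmt"

// toyKV models only the two persisted compaction markers and the hash store.
type toyKV struct {
	scheduled int64            // stand-in for the value under ScheduledCompactKeyName
	finished  int64            // stand-in for the value under FinishedCompactKeyName
	hashes    map[int64]uint32 // compaction revision -> stored hash
}

func newToyKV() *toyKV { return &toyKV{hashes: map[int64]uint32{}} }

// compact records the scheduled revision, "runs" the compaction, and stores
// the hash only if the previous compaction had completed. crash simulates the
// process dying after scheduling but before finishing.
func (kv *toyKV) compact(rev int64, hash uint32, crash bool) {
	// Compare the markers before overwriting the scheduled one.
	prevCompleted := kv.scheduled == kv.finished
	kv.scheduled = rev
	if crash {
		return // finished marker is never updated
	}
	kv.finished = rev
	if prevCompleted {
		kv.hashes[rev] = hash
	}
	// Otherwise this run resumed an interrupted compaction: drop the hash.
}

func main() {
	kv := newToyKV()
	kv.compact(5, 0xAAAA, true)   // interrupted mid-compaction
	kv.compact(5, 0xBBBB, false)  // resumed after restart: hash is skipped
	kv.compact(10, 0xCCCC, false) // next compaction: hash is stored

	_, has5 := kv.hashes[5]
	_, has10 := kv.hashes[10]
	fmt.Println("hash stored for rev 5:", has5)
	fmt.Println("hash stored for rev 10:", has10)
}
```

The trade-off in this sketch is that a resumed compaction contributes no hash at all for its revision, so that single check is skipped rather than raising a false alarm.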
before writing hash to hashstore. If they do not match, then it means this compaction is interrupted and its hash value is invalid. In such cases, we won't write the hash values to the hashstore, and avoids the incorrect corruption alarm. See etcd-io#15919. Also fix some typos and reorder the functions to improve readability. Signed-off-by: caojiamingalan <[email protected]>
…tKeyName before writing hash to release-3.5. Fix etcd-io#15919. Check ScheduledCompactKeyName and FinishedCompactKeyName before writing hash to hashstore. If they do not match, then it means this compaction has once been interrupted and its hash value is invalid. In such cases, we won't write the hash values to the hashstore, and avoids the incorrect corruption alarm. Signed-off-by: caojiamingalan <[email protected]>
Should this be backported to v3.4? |
What happened?
I tried out the --experimental-compact-hash-check-enabled flag on my etcd cluster today, and I think I might have encountered a race condition when a server starts up/resumes a scheduled compaction that causes the hash value to be incorrect.
About my setup: I have a 5-member cluster, running as k8s pods. They already have the --experimental-initial-corrupt-check flag enabled.
I was adding the --experimental-compact-hash-check-enabled flag (and also upgrading from 3.5.6 to 3.5.9) on a 5-member cluster. For 3 of the 4 followers, I had already stopped the etcd pod, updated their container URLs to 3.5.9, added the new hash check flag, and then restarted the etcd pod. I then stopped the 4th follower (which I'll call FOO), edited its pod config, and brought it back up. I then stopped the leader, and did the same update/restart. This triggered a leader election, and one of the former followers (which I'll call BAR) became the new leader.
The new leader then performed the periodic compact hash check, as configured. But the strange thing is that, even though all of my pods were up and running (and had passed the initial corruption check), the very first periodic check failed!
Due to the liveness probe problem mentioned in #13340, I had to deal with my pods getting restarted by kube due to the alarm... But when I disarmed the CORRUPT alarm so my etcd pods could stay up and running, they all again passed the initial corrupt data check, and logged the same hashes for the following "storing new hash" logs. So that's why I suspect there was very unlucky timing and a race condition that caused FOO to calculate the wrong hash value and use that when BAR asked for its hash during that periodic compaction hash check.
What did you expect to happen?
I expected the periodic compaction hash check to succeed and for my cluster to be healthy because:
How can we reproduce it (as minimally and precisely as possible)?
I'm not sure how to reproduce it. Maybe create a multi-member cluster with enough data so that compaction takes a non-trivial amount of time, trigger a compaction, and then stop one of the members right in the middle of the compaction so that compaction fails and needs to resume compacting when it comes back up?
Anything else we need to know?
I saw #15548, but that mentions the cluster ID changing, and in this scenario, where I'm just restarting etcd with different flags, I don't think the cluster ID would change.
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output