Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect hash when resuming scheduled compaction after etcd restarts #15919

Closed
kennydo opened this issue May 18, 2023 · 13 comments · Fixed by #15985
Closed

Incorrect hash when resuming scheduled compaction after etcd restarts #15919

kennydo opened this issue May 18, 2023 · 13 comments · Fixed by #15985

Comments

@kennydo
Copy link

kennydo commented May 18, 2023

What happened?

I tried out using the --experimental-compact-hash-check-enabled flag on my etcd cluster today, and I think I might have encountered a race condition when a server starts up/resumes a scheduled compaction that causes the hash value to be incorrect.

About my setup: I have a 5 member cluster, running as k8s pods. They already have the --experimental-initial-corrupt-check flag enabled.

I was adding the --experimental-compact-hash-check-enabled flag (and also upgrading from 3.5.6 to 3.5.9) on a 5-member node. For 3 of the 4 etcd members, I had already stopped the etcd pod, updated their container URLs to 3.5.9, added the new hash check flag, and then restarted the etcd pod. I then stopped the 4th follower (which I'll call FOO), edited its pod config, and brought it back up. I then stopped the leader, and did the same update/restart. This triggered a leader election, and one of the former followers (which I'll call BAR) became the new leader.

The new leader then performed the periodic compact hash check, as configured. But the strange thing is that, even though all of my pods were up and running (and had passed the initial corruption check), the very first periodic check failed!

Due to the liveness probe problem mentioned in #13340, I had to deal with my pods getting restarted by kube due to the alarm... But when I disarmed the CORRUPT alarm so my etcd pods could stay up and running, they all again passed the initial corrupt data check, and logged the same hashes for the following "storing new hash" logs. So that's why I suspect there was very unlucky timing and a race condition that caused FOO to calculate the wrong hash value and use that when BAR asked for its hash during that periodic compaction hash check.

What did you expect to happen?

I expected the periodic compaction hash check to succeed and for my cluster be healthy because:

  1. the initial data check passed on all members of the cluster
  2. the hash/compact-revision that FOO logged with the "storing new hash" line both before and after the problematic hash matched the hashes in the rest of the cluster

How can we reproduce it (as minimally and precisely as possible)?

I'm not sure how to reproduce it. Maybe create a multi-member cluster with enough data so that compaction takes a non-trivial amount of time, trigger a compaction, and then stop one of the members right in the middle of the compaction so that compaction fails and needs to resume compacting when it comes back up?

Anything else we need to know?

I saw #15548, but that mentions the cluster ID changing, and in this scenario where just I'm restarting etcd with different flags, I don't think the cluster ID would change.

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.9
Git SHA: bdbbde998
Go Version: go1.19.9
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.9
API version: 3.5

Etcd configuration (command line flags or environment variables)

(from a k8s pod manifest)
  - command:
    - etcd
    - --advertise-client-urls=https://XXX:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --experimental-initial-corrupt-check=true
    - --experimental-compact-hash-check-enabled=true
    - --initial-advertise-peer-urls=https://XXX:2380
    - --initial-cluster-token=XXX
    - --initial-cluster=XXX
    - --initial-cluster-state=existing
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://XXX:2379
    - --listen-metrics-urls=http://127.0.0.1:2381,http://XXX:2381
    - --listen-peer-urls=https://XXX:2380
    - --name=XXX
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --auto-compaction-mode=periodic
    - --auto-compaction-retention=1h
    - --quota-backend-bytes=15032385536
    image: registry.k8s.io/etcd:3.5.9-0

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# Lost due to pods being restarted, but you can see the endpoints and IDs below

# This endpoint status table was captured while the cluster was in the alarmed state
$ etcdctl --endpoints $endpoints endpoint status -w table
+--------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+----------------+
|      ENDPOINT      |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX |     ERRORS     |
+--------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+----------------+
|  XXX:2379 | 367806663aa1996c |   3.5.9 |  4.9 GB |     false |      false |        30 |  718184797 |          718184797 | alarm:CORRUPT  |
| XXX:2379 | 6069c22b85ee10b0 |   3.5.9 |  4.8 GB |      true |      false |        30 |  718184798 |          718184798 | alarm:CORRUPT  |
| XXX:2379 | 917d2067aeb00315 |   3.5.9 |  4.8 GB |     false |      false |        30 |  718184802 |          718184802 | alarm:CORRUPT  |
| XXX:2379 | b2529a8db451f246 |   3.5.9 |  4.8 GB |     false |      false |        30 |  718184804 |          718184804 | alarm:CORRUPT  |
|  XXX:2379 | e56f12f0d98830bb |   3.5.9 |  4.9 GB |     false |      false |        30 |  718184805 |          718184805 | alarm:CORRUPT  |
+--------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+----------------+

Relevant log output

(not literal logs, but reconstructed from my investigation)

* 15:34:55 - all nodes log "storing new hash". revision=702700874 compact-revision=702656236 hash=483395635
* ... I update the other etcd nodes one by one
* 15:39:54 - FOO  starts shutting down
* 15:39:54 - every node but FOO logs "storing new hash" of hash=2533962327 for compact-revision=702700874 at revision=702739864.
* 15:39:54- FOO logs "Failed compaction" (because the etcd pod was terminating)
* 15:40:24 - Updated etcd pod on  FOO starts running
* 15:40:24 - FOO logs "resume scheduled compaction"
* 15:40:35 - FOO logs "storing new hash" of hash=2915350535 for compact-revision=702700874 at revision=702739864. (But every other node for this compact-revision had hash=2533962327)!
* 15:42:06 - BAR became leader
* 15:42:18 - BAR started compact hash check
* 15:42:18 -  BAR (leader) says "failed compaction hash check". For revision 702739864, leader had 2533962327, but follower FOO had hash 2915350535. BAR logged "alarm raised"
@ahrtr
Copy link
Member

ahrtr commented May 24, 2023

Thanks for raising this issue. It seems that the member automatically recovered from the "corrupted" status after you disarmed the CORRUPT alarm?

I think we can intentionally inject a failpoint during the compaction operation to try to reproduce this issue.

@kennydo
Copy link
Author

kennydo commented May 24, 2023

I'm not sure whether the resolution was directly triggered by disarming the CORRUPT alarm, or whether it unblocked something (ex: now that the cluster was in a healthy state again, one of the automatic compaction requests from kube apiserver could occur and update the value stored for the latest compacted hash.) But whatever happened after I disarmed it happened automatically from my perspective. The cluster went back to serving write requests just fine.

@CaojiamingAlan
Copy link
Contributor

CaojiamingAlan commented May 26, 2023

Seems that I've found the bug. It seems that when resuming the the compaction after recovery, the hasher does not fully hash every key and value, but starts from where it is interrupted(since the older values has been deleted), thus giving a different hash value.

The result of compaction won't be different, but the hash values diverge, as compaction only keeps the latest version, but the hasher uses all of the history.

Please assign this to me @ahrtr , I will provide tests to reproduce this issue and fix it.

Relevant code:

h := newKVHasher(prevCompactRev, compactMainRev, keep)
last := make([]byte, 8+1+8)
for {
var rev revision
start := time.Now()
tx := s.b.BatchTx()
tx.LockOutsideApply()
keys, values := tx.UnsafeRange(schema.Key, last, end, int64(batchNum))
for i := range keys {
rev = bytesToRev(keys[i])
if _, ok := keep[rev]; !ok {
tx.UnsafeDelete(schema.Key, keys[i])
keyCompactions++
}
h.WriteKeyValue(keys[i], values[i])
}

Here the stale value of a key is deleted in the process of hashing (line 57). If this is interrupted, the hasher can't get it anymore.

@ahrtr
Copy link
Member

ahrtr commented May 26, 2023

starts from where it is interrupted(since the older values has been deleted), thus giving a different hash value.

Good catch! It seems that the most feasible way is to just ignore the hash if the previous compaction was somehow interrupted, e.g the first revision != prevCompactRev.

@CaojiamingAlan
Copy link
Contributor

e.g the first revision != prevCompactRev.

@ahrtr There's a minor issue about it: during the very first compaction, the prevCompactRev = -1.

From the official documentation of etcd:

When the store is created, the initial revision is 1.

So, is it guaranteed that the first revision that CAN BE COMPACTED is 2, and the revision is always continuous? It agrees with my observation. If so, then when prevCompactRev = -1 and revision > 2, we can conclude that the hash is incomplete. Is that correct?

If it is not the case, my idea about this is to do a query and persist the first revision in the store when scheduling the compaction. When really starting to do the compaction, we check if the first revision in the store is still the same as the one persisted. If they are the same, then we can believe that the hash is complete; otherwise we drop the hash value. This approach does not depend on how etcd deals with revisions, as long as it increases monotonically.

CaojiamingAlan added a commit to CaojiamingAlan/etcd that referenced this issue May 29, 2023
…-io#15919.

If there is a gap between prevCompactRev and the smallest revision, then the compaction is continued from the middle. In this case, we cannot get the correct compaction hash, thus we drop it.

Signed-off-by: caojiamingalan <[email protected]>
@ahrtr
Copy link
Member

ahrtr commented May 30, 2023

and the revision is always continuous?

YES, it's.

my idea about this is to do a query and persist the first revision in the store when scheduling the compaction.

I think we just need to add a new field FirstRealRevision to record the real first compact revision. When a leader compares its local hash with any peer's hash, it should check both FirstRealRevision1 and CompactRevision at corrupt.go#L302. If any of them doesn't match, then we should skip the hash check. WDYT? @serathius

BTW, I propose to rename CompactRevision to PreviousCompactRevision or FromCompactRevision , and Revision to CurrentCompactRevision or ToCompactRevision as well, see below.

type KeyValueHash struct {
	Hash                    uint32
	FirstRealRevision       int64 // New added field to record the real first compacted revision
	PreviousCompactRevision int64 // Renamed from `CompactRevision`
	CurrentCompactRevision  int64 // Renamed from `Revision`
}

@serathius
Copy link
Member

@CaojiamingAlan Thanks for looking into the issue. The issue is corrected to resumed compaction reporting incorrect hash for incorrect revision range (compactRev, revision). I don't think we should use the hash calculated whn compaction is resumed mid way. Is there any problem with implementing just that?

@ahrtr
Copy link
Member

ahrtr commented May 30, 2023

@serathius . I see you disagree my comment above, but no detailed reasons (which I don't think is good). Actually I have two comments, I don't know which one you don't agree.

  1. Add a FirstRealRevision.
  2. Rename CompactRevision to FromCompactRevision, and Revision to ToCompactRevision. Actually each hash value is based on revision range (previousCompactionRev, currentCompactionRev). Renaming them can improve the readability. Of course, we can do this in a separate PR.

@serathius
Copy link
Member

I think for a short term fix for detecting data corruption implementation is complicated enough. Would prefer to just fix the standing issue and refocus the effort to #13839.

Renaming them can improve the readability

New proposed name are more confusing then before. For calculating hash during compaction there is no from and to. They are just CompactRevision and Revision from the time when compaction happen. Adding from and to would be misleading.

@ahrtr
Copy link
Member

ahrtr commented May 30, 2023

Would prefer to just fix the standing issue

If you read the history comments, most of them are discussing on fixing the issue.

They are just CompactRevision and Revision from the time when compaction happen. Adding from and to would be misleading.

Don't agree with this. For the latest hash value, the names CompactRevision and Revison make some sense, but not for historical hash values because you save up to 10 hashes. The only blocker for the renaming is you reuse the struct KeyValueHash for both CompactHashCheck and PeriodicCheck, because the renaming make sense for CompactHashCheck, but not for PeriodicCheck. I will not insist on renaming due to the blocker.

@CaojiamingAlan
Copy link
Contributor

I think we just need to add a new field FirstRealRevision to record the real first compact revision. When a leader compares its local hash with any peer's hash, it should check both FirstRealRevision1 and CompactRevision at corrupt.go#L302.

I think this is incorrect if we consider only the CompactionCheck. If one store is actually corrupted and drops some of the revisions, it cannot be detected: it is likely that the FirstRealRevisions are different for the corrupted store and the correct store, and the check is skipped. However, probably we can know this through PeriodicalCheck. I think the problem here is, is it allowed to have false negatives for each single corruption check. If it is, then this is a valid approach.

@ahrtr
Copy link
Member

ahrtr commented May 30, 2023

If one store is actually corrupted and drops some of the revisions, it cannot be detected: it is likely that the FirstRealRevisions are different for the corrupted store and the correct store, and the check is skipped.

It's a valid point. Then the simplest way is to compare ScheduledCompactKeyName and FinishedCompactKeyName before performing each compaction operation, if they don't match, then it means previous compaction somehow did not finish; accordingly we should skip calculating hash in such case.

CaojiamingAlan added a commit to CaojiamingAlan/etcd that referenced this issue May 31, 2023
before writing hash to hashstore. If they do not match, then it means this compaction is interrupted and its hash value is invalid. In such cases, we won't write the hash values to the hashstore, and avoids the incorrect corruption alarm. See etcd-io#15919.

Also fix some typos and reorder the functions to improve readability.

Signed-off-by: caojiamingalan <[email protected]>
CaojiamingAlan added a commit to CaojiamingAlan/etcd that referenced this issue May 31, 2023
before writing hash to hashstore. If they do not match, then it means this compaction is interrupted and its hash value is invalid. In such cases, we won't write the hash values to the hashstore, and avoids the incorrect corruption alarm. See etcd-io#15919.

Also fix some typos and reorder the functions to improve readability.

Signed-off-by: caojiamingalan <[email protected]>
CaojiamingAlan added a commit to CaojiamingAlan/etcd that referenced this issue Jun 12, 2023
…tKeyName before writing hash to release-3.5.

Fix etcd-io#15919.
Check ScheduledCompactKeyName and FinishedCompactKeyName
before writing hash to hashstore.
If they do not match, then it means this compaction has once been interrupted and its hash value is invalid. In such cases, we won't write the hash values to the hashstore, and avoids the incorrect corruption alarm.

Signed-off-by: caojiamingalan <[email protected]>
CaojiamingAlan added a commit to CaojiamingAlan/etcd that referenced this issue Jul 15, 2023
…tKeyName before writing hash to release-3.5.

Fix etcd-io#15919.
Check ScheduledCompactKeyName and FinishedCompactKeyName
before writing hash to hashstore.
If they do not match, then it means this compaction has once been interrupted and its hash value is invalid. In such cases, we won't write the hash values to the hashstore, and avoids the incorrect corruption alarm.

Signed-off-by: caojiamingalan <[email protected]>
@serathius
Copy link
Member

Should this be backported to v3.4?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging a pull request may close this issue.

4 participants