
mvcc/backend: Fix corruption bug in defrag #11613

Merged
merged 1 commit into etcd-io:master from jpbetz:fix-defrag-orphan-file on Feb 13, 2020

Conversation

jpbetz
Contributor

@jpbetz jpbetz commented Feb 12, 2020

If etcd is terminated during a defragmentation operation, the db.tmp file that it creates can be orphaned. If this happens, the next defragmentation operation will open the orphaned db.tmp instead of creating a new, empty db.tmp and starting with a fresh slate, as it should.

Once the defragmentation operation opens db.tmp, it traverses all key-values in the main db file and writes them to db.tmp. Any key-values already in db.tmp that are not overwritten by this copy remain there, corrupting the boltdb keyspace. When the defragmentation operation completes successfully, db.tmp replaces db via a file move, and the main db file is then corrupt.
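
For illustration, here is a simplified sketch of that copy loop (this is not etcd's actual defragdb, which batches its commits; the names here are hypothetical). Note that it only ever writes keys that exist in the source db, so stale entries already sitting in db.tmp are never removed:

import bolt "go.etcd.io/bbolt"

// copyAll is a simplified stand-in for the defrag copy: it writes every
// bucket and key from src into tmp. Keys already present in tmp are
// overwritten when they collide, but are otherwise never deleted, which
// is why an orphaned, non-empty db.tmp corrupts the result.
func copyAll(src, tmp *bolt.DB) error {
	return tmp.Update(func(tmptx *bolt.Tx) error {
		return src.View(func(tx *bolt.Tx) error {
			return tx.ForEach(func(name []byte, b *bolt.Bucket) error {
				tmpb, err := tmptx.CreateBucketIfNotExists(name)
				if err != nil {
					return err
				}
				return b.ForEach(func(k, v []byte) error {
					return tmpb.Put(k, v)
				})
			})
		})
	})
}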

Impact:

  • The etcd keyspace can be corrupted; see #11613 (comment).
  • A deleted user or role can reappear (on one etcd member but not the others).
  • An etcd member can try to connect to a member that was previously removed and is no longer considered part of the cluster.
  • An expired lease can reappear. (For this to happen, the member that the defrag happens on must also become the leader.)

See #11613 (comment) and #11613 (comment) for examples of how to reproduce the issue.

There is a narrow window, between when the bolt db transaction that populates db.tmp is committed and when db.tmp is moved to replace the db file, during which etcd must be terminated to trigger this.

The fix is simple: ensure the temporary file used for defragmentation is always a new, empty file.
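
As a rough sketch of that idea (illustrative only, not necessarily the exact merged change; the helper name is hypothetical):

import (
	"os"

	bolt "go.etcd.io/bbolt"
)

// openFreshTempDB deletes any orphaned db.tmp left behind by an
// interrupted defrag before opening it, so the temp file always starts
// new and empty.
func openFreshTempDB(dbPath string) (*bolt.DB, error) {
	tdbp := dbPath + ".tmp"
	if err := os.Remove(tdbp); err != nil && !os.IsNotExist(err) {
		return nil, err
	}
	return bolt.Open(tdbp, 0600, nil)
}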

cc @wenjiaswe @jingyih @YoyinZyc @gyuho

@codecov-io

codecov-io commented Feb 12, 2020

Codecov Report

Merging #11613 into master will decrease coverage by 0.55%.
The diff coverage is 50%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #11613      +/-   ##
==========================================
- Coverage   66.56%      66%   -0.56%     
==========================================
  Files         403      403              
  Lines       36630    37165     +535     
==========================================
+ Hits        24381    24529     +148     
- Misses      10768    11144     +376     
- Partials     1481     1492      +11
Impacted Files Coverage Δ
etcdserver/api/snap/snapshotter.go 66.93% <30%> (-3.24%) ⬇️
mvcc/backend/backend.go 80.95% <70%> (-0.26%) ⬇️
auth/options.go 37.5% <0%> (-55%) ⬇️
pkg/transport/timeout_conn.go 80% <0%> (-20%) ⬇️
auth/store.go 58.02% <0%> (-17.38%) ⬇️
client/client.go 73.52% <0%> (-10.46%) ⬇️
clientv3/leasing/util.go 91.66% <0%> (-6.67%) ⬇️
clientv3/namespace/watch.go 87.87% <0%> (-6.07%) ⬇️
etcdserver/api/v3rpc/watch.go 82.45% <0%> (-2.46%) ⬇️
etcdserver/v3_server.go 72.92% <0%> (-1.75%) ⬇️
... and 35 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 61f2794...213f7f7.

mvcc/backend/backend.go: review thread (outdated, resolved)
@gyuho
Contributor

gyuho commented Feb 12, 2020

Good catch!

If we panic right before the "rename", these things can actually happen.

Can we also add failpoints? Something like:

        }
+
+       // gofail: var defragBeforeRename struct{}
        err = os.Rename(tdbp, dbp)
        if err != nil {
                if b.lg != nil {

Thanks!

/cc @xiang90
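
With gofail, that annotation becomes an injectable failpoint. As a rough sketch (assuming the package has been rewritten with the gofail tool and that the failpoint keeps the name defragBeforeRename, using go.etcd.io/gofail/runtime), a test could force a crash right before the rename:

import gofail "go.etcd.io/gofail/runtime"

// Inside a test: make defrag panic just before os.Rename, then restore
// normal behavior when the test finishes.
if err := gofail.Enable("defragBeforeRename", `panic("defragBeforeRename")`); err != nil {
	t.Fatal(err)
}
defer gofail.Disable("defragBeforeRename")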

Contributor

@jingyih jingyih left a comment

Thanks for catching this!

This might also be able to corrupt the keyspace, but I haven't been able to find any ways to make that happen (in boltdb, the etcd keyspace is keyed by <revision, key> pairs, and any data older than the "last compacted" revision is ignored, so the data accidentally included from a previous defrag shouldn't matter).

I agree for the key bucket. Data accidentally included from a previous defrag's orphaned file will be deleted in the next compaction.

For other buckets, such as member and lease, could it cause damage?

@wenjiaswe
Contributor

wenjiaswe commented Feb 12, 2020

Thanks Joe! Good catch. Let's make sure we backport this to 3.2, 3.3, and 3.4, and add a changelog note.

@jpbetz
Contributor Author

jpbetz commented Feb 12, 2020

I agree for the key bucket. Data accidentally included from a previous defrag's orphaned file will be deleted in the next compaction.

Yeah, sounds like you have the same understanding as I do. If this were to somehow modify the keyspace, the only way I can imagine it happening would be: a lease that was previously expired reappears due to this issue and, when it expires again, somehow deletes keys it wasn't supposed to, modifying the keyspace. Hopefully this is not possible?

For other buckets, such as member and lease, could it cause damage?

It can. I've been able to reproduce it on my local machine. It's pretty easy; the steps are basically:

# put a panic before os.Rename() in defrag() in mvcc/backend/backend.go
goreman -f Procfile start
# create any leases, members, users... that you'd like to experiment with, e.g.:
etcdctl user add example:pass
etcdctl defrag
# etcd on 2379 will panic and orphan db.tmp
# ctrl-c to stop goreman
# remove the panic in defrag()
goreman -f Procfile start
# delete any leases, members, users... that you're experimenting with
etcdctl defrag
etcdctl user list
# example will appear in results, but just for 2379

Contributor

@gyuho gyuho left a comment

lgtm, /cc @wenjiaswe @jingyih

Let's backport this

thx!

Contributor

@jingyih jingyih left a comment

LGTM. Thanks!

@xiang90
Contributor

xiang90 commented Feb 12, 2020

lgtm

@jingyih
Contributor

jingyih commented Feb 12, 2020

If this were to somehow modify the keyspace, the only way I can imagine it happening would be: a lease that was previously expired reappears due to this issue and, when it expires again, somehow deletes keys it wasn't supposed to, modifying the keyspace. Hopefully this is not possible?

Good catch. This sounds possible but unlikely to me.

  1. The lessor keeps a heap of leases in memory, from which it periodically pops expired leases and deletes them in bbolt (see the sketch below).
  2. The above happens only if the node is the leader.
  3. When a node restarts, the lessor recovers its leases from bbolt.
  4. A restarted node is unlikely to become the leader of its cluster.
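
As a rough sketch of that flow (hypothetical types and names, not etcd's actual lessor code), showing where a resurrected lease would delete live keys if it expired again on the leader:

import "time"

// lease is a simplified stand-in for etcd's lease type.
type lease struct {
	id     int64
	expiry time.Time
	keys   []string // keys attached to this lease
}

// revokeExpired sketches steps 1-2: only the leader scans the in-memory
// heap and deletes the keys of expired leases from bbolt. A lease
// resurrected by this bug re-enters the heap on restart (step 3), and if
// the node then becomes leader (step 4), the lease's keys are deleted
// again, even ones that were legitimately re-created since.
func revokeExpired(leases []*lease, isLeader bool, deleteKey func(string)) {
	if !isLeader {
		return
	}
	now := time.Now()
	for _, l := range leases {
		if l.expiry.Before(now) {
			for _, k := range l.keys {
				deleteKey(k)
			}
		}
	}
}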

@jpbetz
Contributor Author

jpbetz commented Feb 12, 2020

OK. This can corrupt the keyspace. Here's how to reproduce:

# put a panic before os.Rename() in defrag() in mvcc/backend/backend.go to simulate etcd being terminated
goreman -f Procfile start # start a 3 member cluster

etcdctl put a 1
etcdctl defrag

# etcd on 2379 will panic and orphan db.tmp
# ctrl-c to stop goreman and the other 2 etcd members (they could be left running too, if you want)
# remove the panic in defrag()
goreman -f Procfile start # to start etcd member again

etcdctl del a
etcdctl put b 2
etcdctl get b -w json # to get the latest revision
etcdctl compact <latest-revision>
etcdctl defrag

# ctrl-c to stop the instances again; this forces the in-memory index to be rebuilt
goreman -f Procfile start # to start etcd member again

etcdctl get a
a
1

Thanks to @lavalamp for suggesting this approach.

@jpbetz jpbetz changed the title from "mvcc/backend: Delete orphaned db.tmp files before defrag" to "mvcc/backend: Delete orphaned db.tmp files before defrag to prevent corruption" on Feb 12, 2020
mvcc/backend/backend.go: two review threads (outdated, resolved)
@jpbetz jpbetz force-pushed the fix-defrag-orphan-file branch 3 times, most recently from f970561 to 9cb6fd5, on February 13, 2020 06:33
@jpbetz jpbetz changed the title from "mvcc/backend: Delete orphaned db.tmp files before defrag to prevent corruption" to "mvcc/backend: Force defrag always start with an empty temp file to prevent corruption" on Feb 13, 2020
@jpbetz jpbetz changed the title from "mvcc/backend: Force defrag always start with an empty temp file to prevent corruption" to "mvcc/backend: Force defrag to always start with an empty temp file to prevent corruption" on Feb 13, 2020
@jpbetz jpbetz changed the title from "mvcc/backend: Force defrag to always start with an empty temp file to prevent corruption" to "mvcc/backend: Fix corruption bug in defrag" on Feb 13, 2020
@jingyih
Contributor

jingyih commented Feb 13, 2020

Great work Joe!

So in your example, the deleted key a is brought back into the persistent B+ tree by the orphaned db.tmp file, and it is further brought back into the in-memory index tree after a server restart.

@jpbetz
Contributor Author

jpbetz commented Feb 13, 2020

So in your example, the deleted key a is brought back into the persistent B+ tree by the orphaned db.tmp file, and it is further brought back into the in-memory index tree after a server restart.

Yes. And note that once a bad db.tmp is written, it's just a matter of time before a busy, long-lived etcd cluster is defragmented again and, after that, eventually restarted. So for this bug, the only really unlikely event is that etcd is terminated abnormally after db.tmp is written to but before it is renamed to db; everything else is quite probable.

@jpbetz jpbetz merged commit 8bf7b2b into etcd-io:master Feb 13, 2020
wenjiaswe added a commit that referenced this pull request on Feb 13, 2020: Automated cherry pick of #11613 to release-3.3 (…-origin-release-3.3)
wenjiaswe added a commit that referenced this pull request on Feb 13, 2020: Automated cherry pick of #11613 to release-3.2 (…-origin-release-3.2)
jpbetz added a commit that referenced this pull request on Feb 13, 2020: Automated cherry pick of #11613 to release-3.4 (…-origin-release-3.4)
jpbetz added a commit that referenced this pull request on Feb 13, 2020: changelog: Add #11613 backport to 3.2, 3.3 and 3.4 changelogs
@andyliuliming
Contributor

@gyuho @wenjiaswe may I ask when a new etcd version containing this fix will be released?

@wenjiaswe
Contributor

I would like to have that release too! Also cc @hexfusion

@YoyinZyc
Contributor

I would like a new 3.4 release including this backport. Thanks.

@socketpair

So, what versions contain this fix?

@wenjiaswe
Contributor

wenjiaswe commented Oct 16, 2020

@socketpair It's in 3.2.29+, 3.3.19+ and 3.4.4+
