
One of the three nodes of the etcd cluster has outdated data #8214

Closed
abel-von opened this issue Jul 6, 2017 · 36 comments

@abel-von

abel-von commented Jul 6, 2017

Bug reporting

Hi, I have encountered a problem in my testing cluster. We deployed a Kubernetes cluster, and when we run the "get pods" command on two different nodes, it displays different results.

node1:

host-205-18-1-109:/var/paas/run/etcd # kubectl get pods --all-namespaces
NAMESPACE   NAME                                        READY     STATUS              RESTARTS   AGE
om          alm-almreceiver-3591156403-01dnt            0/1       CrashLoopBackOff    6          8d
om          alm-almreceiver-3591156403-cplpm            0/1       RunContainerError   5          8d
om          alm-forword-3473857616-46ghr                1/1       Running             0          8d
om          alm-forword-3473857616-b1zbn                1/1       Running             0          8d
om          als-config-2677484553-8rvg7                 0/1       RunContainerError   5          8d
om          als-config-2677484553-mq26v                 0/1       RunContainerError   5          8d
om          ams-access-1802240969-c46jz                 1/1       Running             0          8d
om          ams-access-1802240969-xc2t3                 1/1       Running             1          8d
om          ams-calc-1535546111-4p8c4                   0/1       CrashLoopBackOff    10         8d
om          ams-calc-1535546111-9m2c2                   0/1       RunContainerError   6          8d
om          ams-metric-2999871289-t1fhh                 0/1       RunContainerError   6          8d
om          ams-metric-2999871289-zbl5w                 0/1       RunContainerError   6          8d
om          ams-website-3385970016-w3mmv                1/1       Running             0          8d
om          ams-website-3385970016-whwmh                1/1       Running             0          8d
om          aos-apiserver-2725269985-5vhnc              1/1       Running             0          8d
om          aos-apiserver-2725269985-brl93              1/1       NodeLost            2          8d
om          aos-composer-2328970979-0kqhx               1/1       NodeLost            2          8d
om          aos-composer-2328970979-6rr1n               1/1       Running             0          8d
om          aos-dslparser-4285564324-9bhkf              1/1       Running             0          8d
om          aos-dslparser-4285564324-fnz0q              1/1       NodeLost            2          8d
om          aos-workflowengine-3966680679-2b4h8         1/1       Running             0          8d
om          aos-workflowengine-3966680679-pf8pr         1/1       NodeLost            2          8d
om          ats-core-3359653393-18npj                   0/1       CrashLoopBackOff    3          8d
om          ats-core-3359653393-vxx49                   0/1       CrashLoopBackOff    3          8d
om          audit-service-3566800368-0sf2w              1/1       Running             0          8d
om          audit-service-3566800368-wv25q              1/1       NodeLost            2          8d
om          audit-website-2460833792-11xk3              1/1       NodeLost            2          8d
om          audit-website-2460833792-8bvdw              1/1       Running             0          8d
om          cam-registry-2954984153-053vg               1/1       Running             1          8d
om          cam-registry-2954984153-2krx1               1/1       NodeLost            2          8d
om          cam-tiller-4260099647-2rlpx                 1/1       Running             0          8d
om          cam-tiller-4260099647-n5k1f                 1/1       NodeLost            2          8d
om          cron-3849778790-mbvc5                       1/1       Running             0          8d
om          cron-3849778790-v5xb7                       1/1       NodeLost            2          8d
om          cse-ber-ljsrl                               1/1       Running             6          8d
om          cse-ber-xs3k5                               1/1       NodeLost            2          8d
om          cse-bus-8vvd8                               1/1       Running             0          8d
om          cse-bus-bs6p5                               1/1       Running             0          8d

node2:

host-205-18-1-206:/home/cspexpert # /var/paas/kubernetes/kubectl get pods --all-namespaces                                        
NAMESPACE   NAME                                        READY     STATUS             RESTARTS   AGE
om          alm-almreceiver-3591156403-01dnt            1/1       Running            10         8d
om          alm-almreceiver-3591156403-cplpm            1/1       Running            9          8d
om          alm-forword-3473857616-46ghr                1/1       Running            0          8d
om          alm-forword-3473857616-b1zbn                1/1       Running            0          8d
om          als-config-2677484553-8rvg7                 1/1       Running            1640       8d
om          als-config-2677484553-mq26v                 1/1       Running            1660       8d
om          ams-access-1802240969-c46jz                 1/1       Running            0          8d
om          ams-access-1802240969-xc2t3                 1/1       Running            1          8d
om          ams-calc-1535546111-4p8c4                   1/1       Running            45         8d
om          ams-calc-1535546111-9m2c2                   1/1       Running            44         8d
om          ams-metric-2999871289-t1fhh                 1/1       Running            10         8d
om          ams-metric-2999871289-zbl5w                 1/1       Running            10         8d
om          ams-website-3385970016-w3mmv                1/1       Running            0          8d
om          ams-website-3385970016-whwmh                1/1       Running            0          8d
om          aos-apiserver-2725269985-5vhnc              1/1       Running            0          8d
om          aos-apiserver-2725269985-brl93              1/1       Running            5          8d
om          aos-composer-2328970979-0kqhx               1/1       Running            6          8d
om          aos-composer-2328970979-6rr1n               1/1       Running            0          8d
om          aos-dslparser-4285564324-9bhkf              1/1       Running            0          8d
om          aos-dslparser-4285564324-fnz0q              1/1       Running            5          8d
om          aos-workflowengine-3966680679-2b4h8         1/1       Running            0          8d
om          aos-workflowengine-3966680679-pf8pr         1/1       Running            5          8d
om          ats-core-3359653393-18npj                   1/1       Running            32         8d
om          ats-core-3359653393-vxx49                   1/1       Running            30         8d
om          audit-service-3566800368-0sf2w              1/1       Running            0          8d
om          audit-service-3566800368-wv25q              1/1       Running            5          8d
om          audit-website-2460833792-11xk3              1/1       Running            5          8d
om          audit-website-2460833792-8bvdw              1/1       Running            0          8d
om          cam-registry-2954984153-053vg               1/1       Running            1          8d
om          cam-registry-2954984153-2krx1               1/1       Running            5          8d
om          cam-tiller-4260099647-2rlpx                 1/1       Running            0          8d
om          cam-tiller-4260099647-n5k1f                 1/1       Running            5          8d
om          cron-3849778790-mbvc5                       1/1       Running            0          8d
om          cron-3849778790-v5xb7                       1/1       Running            5          8d
om          cse-ber-ljsrl                               1/1       Running            6          8d
om          cse-ber-xs3k5                               1/1       Running            5          8d
om          cse-bus-8vvd8                               1/1       Running            0          8d
om          cse-bus-bs6p5                               1/1       Running            0          8d

As we can see, the first result shows many pods in error states, but in the second result they are all running. We checked through the docker command that the pods are actually running.

Then I checked the connections between kube-apiserver and etcd and found that, on node2, one of the etcd servers (205.18.1.181) is not connected to kube-apiserver, while on node1 (which displays the wrong result), all three etcd nodes are connected.

We checked the endpoint status of etcd; the result is below:

https://205.18.1.181:4001, db672ae3963ad927, 3.2.0+git, 807 MB, false, 258, 5905723
https://205.18.1.109:4001, d1703382376a5277, 3.2.0+git, 18 MB, false, 258, 5905723
https://205.18.1.206:4001, 1700753d3116a680, 3.2.0+git, 18 MB, true, 258, 5905723

As we can see, on 205.18.1.181 the data is 807 MB, while the other two nodes have 18 MB. I then ran "snapshot status" to check the db file:

on 205.18.1.181:

etcdctl snapshot status db --write-out=table                                                                                           
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 4a173aa0 |  1211536 |    1159021 |     810 MB |
+----------+----------+------------+------------+

on 205.18.1.109:

etcdctl snapshot status db --write-out=table                                                                       
+---------+----------+------------+------------+
|  HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+---------+----------+------------+------------+
| 364d4b2 |  1836531 |       1834 |      18 MB |
+---------+----------+------------+------------+

We checked one key on both nodes with get /xxxx --consistency="s", and found that the data on 205.18.1.181 is wrong and outdated.
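
For reference, that per-member check can also be scripted; below is a minimal sketch using the Go clientv3 API, with placeholder endpoints and key and TLS configuration omitted (not the exact commands we ran). A serialized read is answered from the addressed member's local store, while a linearizable read goes through quorum, so comparing the two per endpoint makes the divergence visible.

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Placeholder endpoints and key; TLS configuration omitted for brevity.
	endpoints := []string{"205.18.1.181:4001", "205.18.1.109:4001", "205.18.1.206:4001"}
	key := "/xxxx"

	for _, ep := range endpoints {
		cli, err := clientv3.New(clientv3.Config{
			Endpoints:   []string{ep},
			DialTimeout: 5 * time.Second,
		})
		if err != nil {
			fmt.Println(ep, "dial error:", err)
			continue
		}
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		// Serialized read (--consistency=s): served from this member's local store, may be stale.
		sresp, serr := cli.Get(ctx, key, clientv3.WithSerializable())
		// Linearizable read (--consistency=l, the default): goes through quorum.
		lresp, lerr := cli.Get(ctx, key)
		cancel()
		cli.Close()
		if serr != nil || lerr != nil {
			fmt.Println(ep, "read error:", serr, lerr)
			continue
		}
		fmt.Printf("%s serialized rev=%d linearizable rev=%d\n",
			ep, sresp.Header.Revision, lresp.Header.Revision)
	}
}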

I think some bug in boltdb or elsewhere is producing this problem, but I can't figure out exactly where it is. How can I dig into this and find the root cause?

By the way, the etcd version is:

[paas@host-205-18-1-181 tmp]$ etcd --version
etcd Version: 3.2.0+git
Git SHA: ec53528
Go Version: go1.7.4
Go OS/Arch: linux/amd64
@abel-von
Author

abel-von commented Jul 6, 2017

After rebooting etcd on 205.18.1.181, the get pods result is now correct on both nodes, but the etcd data size on 205.18.1.181 is still much bigger than on the other two nodes.

@abel-von
Author

abel-von commented Jul 6, 2017

Sorry, I can't share all the logs because of security policies at my company. I checked the logs and found that they are almost identical on 205.18.1.181 and the other nodes.

logs on 205.18.1.181:

2017-07-06 14:33:51.275020 I | etcdserver: start to snapshot (applied: 6000060, lastsnap: 5900059)
2017-07-06 14:33:51.281880 I | etcdserver: saved snapshot at index 6000060
2017-07-06 14:33:51.282118 I | etcdserver: compacted raft log at 5995060
2017-07-06 14:34:12.836358 I | pkg/fileutil: purged file /var/paas/etcd/data/member/snap/0000000000000102-000000000053ec97.snap successfully
2017-07-06 14:35:32.085964 I | wal: segmented wal file /var/paas/etcd/data/member/wal/000000000000004e-00000000005b90da.wal is created
2017-07-06 14:35:44.772899 I | pkg/fileutil: purged file /var/paas/etcd/data/member/wal/0000000000000049-000000000055a629.wal successfully
2017-07-06 14:36:22.484599 W | etcdserver: apply entries took too long [1.742258579s for 1 entries]
2017-07-06 14:36:22.484634 W | etcdserver: avoid queries with large range/delete range!
2017-07-06 14:42:49.594521 W | etcdserver: apply entries took too long [1.46363076s for 1 entries]
2017-07-06 14:42:49.594573 W | etcdserver: avoid queries with large range/delete range!
2017-07-06 14:50:13.741853 W | etcdserver: apply entries took too long [2.051838973s for 1 entries]
2017-07-06 14:50:13.785349 W | etcdserver: avoid queries with large range/delete range!
2017-07-06 14:56:03.527963 W | etcdserver: apply entries took too long [1.837329846s for 1 entries]
2017-07-06 14:56:03.528006 W | etcdserver: avoid queries with large range/delete range!
2017-07-06 15:03:24.532307 W | etcdserver: apply entries took too long [1.913507705s for 1 entries]
2017-07-06 15:03:24.532507 W | etcdserver: avoid queries with large range/delete range!

logs on 205.18.1.109:

2017-07-06 14:31:57.724657 W | etcdserver: apply entries took too long [105.894718ms for 1 entries]
2017-07-06 14:31:57.724722 W | etcdserver: avoid queries with large range/delete range!
2017-07-06 14:32:17.002163 W | etcdserver: apply entries took too long [107.543353ms for 1 entries]
2017-07-06 14:32:17.002207 W | etcdserver: avoid queries with large range/delete range!
2017-07-06 14:33:51.270602 I | etcdserver: start to snapshot (applied: 6000060, lastsnap: 5900059)
2017-07-06 14:33:51.283054 I | etcdserver: saved snapshot at index 6000060
2017-07-06 14:33:51.283288 I | etcdserver: compacted raft log at 5995060
2017-07-06 14:34:11.585157 I | mvcc: store.index: compact 1855678
2017-07-06 14:34:11.587721 I | mvcc: finished scheduled compaction at 1855678 (took 1.728646ms)
2017-07-06 14:34:19.081115 I | pkg/fileutil: purged file /var/paas/etcd/data/member/snap/0000000000000102-000000000053ec97.snap successfully
2017-07-06 14:39:11.595931 I | mvcc: store.index: compact 1856170
2017-07-06 14:39:11.597772 I | mvcc: finished scheduled compaction at 1856170 (took 1.350499ms)
2017-07-06 14:44:11.604574 I | mvcc: store.index: compact 1856773
2017-07-06 14:44:11.607903 I | mvcc: finished scheduled compaction at 1856773 (took 2.700669ms)
2017-07-06 14:49:11.839561 I | mvcc: store.index: compact 1857377
2017-07-06 14:49:11.841687 I | mvcc: finished scheduled compaction at 1857377 (took 1.404751ms)
2017-07-06 14:54:11.850168 I | mvcc: store.index: compact 1857975
2017-07-06 14:54:11.852647 I | mvcc: finished scheduled compaction at 1857975 (took 1.899693ms)
2017-07-06 14:59:11.859767 I | mvcc: store.index: compact 1858573
2017-07-06 14:59:11.861873 I | mvcc: finished scheduled compaction at 1858573 (took 1.621251ms)
2017-07-06 15:04:11.879358 I | mvcc: store.index: compact 1859177
2017-07-06 15:04:11.882343 I | mvcc: finished scheduled compaction at 1859177 (took 2.462387ms)
2017-07-06 15:09:11.894468 I | mvcc: store.index: compact 1859777
2017-07-06 15:09:11.896299 I | mvcc: finished scheduled compaction at 1859777 (took 1.377486ms)

@xiang90
Contributor

xiang90 commented Jul 6, 2017

@abel-von do you take snapshots regularly?

@abel-von
Author

abel-von commented Jul 7, 2017

@xiang90 thanks for the reply. We didn't take any snapshots. The cluster ran normally for days; then we restarted 205.18.1.181 and 205.18.1.109, two of the three etcd nodes. After restarting, the problem appeared. Could the snapshot save operation cause the db inconsistency?

@heyitsanthony
Contributor

@abel-von can this be reproduced with 3.2.2 or 3.1.9? I can't seem to find a commit that matches ec53528...

@xiang90
Contributor

xiang90 commented Jul 12, 2017

@abel-von kindly ping.

@abel-von
Author

abel-von commented Jul 12, 2017

Sorry for the confusion caused by the git SHA; we use etcd as a submodule of an external project, so the git commit is not etcd's but the outer project's.
I think the commit id is 9b9baa968f317fc2d5870e6aa97185df126bb640, which is not a milestone version.

@heyitsanthony
Contributor

@abel-von is there any way to easily reproduce this on 3.2.3 or 3.1.10?

@abel-von
Author

@heyitsanthony we changed our etcd version to 3.1.9, and now the issue reproduces more frequently.
We are using etcd as the backend of kube-apiserver. We found that kube-apiserver asks etcd to do a compaction every 5 minutes, but something weird happens: one of the three nodes suddenly stops doing the compaction, and that is the node with the much larger data size.

We used etcd-dump-db to dump the db files and found that the larger db retains many revisions of each key, while the smaller db files keep only one revision per key.
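
For context, the 5-minute compaction the apiserver requests is essentially a client call against a recently observed revision. A rough sketch of its shape with the Go clientv3 API (hypothetical key name, not kube-apiserver's actual code; assumes import "context" and "github.com/coreos/etcd/clientv3"):

// compactToCurrent issues the kind of compaction kube-apiserver's compactor
// sends periodically: read the current revision, then compact up to it.
func compactToCurrent(ctx context.Context, cli *clientv3.Client) error {
	// Any quorum read returns the cluster-wide current revision in its header.
	resp, err := cli.Get(ctx, "compaction-probe-key") // hypothetical key
	if err != nil {
		return err
	}
	rev := resp.Header.Revision
	// The compaction is replicated through raft like any other write; a member
	// whose local revision is still below rev fails the apply with an mvcc
	// "required revision is a future revision" error.
	_, err = cli.Compact(ctx, rev)
	return err
}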

On one of the smaller-data nodes, the compaction is continuous:

2017-07-17 01:03:05.745159 I | mvcc: finished scheduled compaction at 704629 (took 2.816189ms)
2017-07-17 01:08:05.755716 I | mvcc: finished scheduled compaction at 705869 (took 3.201933ms)
2017-07-17 01:13:05.776978 I | mvcc: finished scheduled compaction at 707112 (took 3.033131ms)
2017-07-17 01:18:05.793572 I | mvcc: finished scheduled compaction at 708341 (took 9.157498ms)
2017-07-17 01:23:05.817686 I | mvcc: finished scheduled compaction at 709577 (took 2.739097ms)
2017-07-17 01:28:05.829043 I | mvcc: finished scheduled compaction at 710809 (took 4.29966ms)
2017-07-17 01:33:05.837473 I | mvcc: finished scheduled compaction at 712051 (took 2.970656ms)
2017-07-17 01:38:05.849978 I | mvcc: finished scheduled compaction at 713284 (took 2.917362ms)
2017-07-17 01:43:05.865074 I | mvcc: finished scheduled compaction at 714519 (took 2.749105ms)
2017-07-17 01:48:05.889391 I | mvcc: finished scheduled compaction at 715769 (took 5.606061ms)
2017-07-17 01:53:05.893834 I | mvcc: finished scheduled compaction at 716992 (took 2.747918ms)
2017-07-17 01:58:05.906794 I | mvcc: finished scheduled compaction at 718233 (took 2.902599ms)
2017-07-17 02:03:05.918917 I | mvcc: finished scheduled compaction at 719494 (took 2.738225ms)
2017-07-17 02:08:05.987081 I | mvcc: finished scheduled compaction at 720729 (took 2.924437ms)
2017-07-17 02:13:05.960468 I | mvcc: finished scheduled compaction at 721969 (took 3.349982ms)
2017-07-17 02:18:05.968872 I | mvcc: finished scheduled compaction at 723198 (took 2.791497ms)
2017-07-17 02:23:05.978155 I | mvcc: finished scheduled compaction at 724446 (took 2.861506ms)

On the larger-data node, the compaction suddenly stopped at 2017-07-17 01:38:05:

2017-07-17 01:08:05.759264 I | mvcc: finished scheduled compaction at 705869 (took 3.938301ms)
2017-07-17 01:13:05.786355 I | mvcc: finished scheduled compaction at 707112 (took 4.409887ms)
2017-07-17 01:18:05.807977 I | mvcc: finished scheduled compaction at 708341 (took 3.087963ms)
2017-07-17 01:23:05.832083 I | mvcc: finished scheduled compaction at 709577 (took 2.88093ms)
2017-07-17 01:28:05.844962 I | mvcc: finished scheduled compaction at 710809 (took 3.010959ms)
2017-07-17 01:33:05.838528 I | mvcc: finished scheduled compaction at 712051 (took 3.402203ms)
2017-07-17 01:38:05.851801 I | mvcc: finished scheduled compaction at 713284 (took 3.151377ms)

We didn't find any error log on this node during the period when it stopped compacting; there are only some warnings showing that the I/O delay is high.

2017-07-17 01:26:46.403773 W | etcdserver: apply entries took too long [120.056221ms for 1 entries]
2017-07-17 01:26:46.403814 W | etcdserver: avoid queries with large range/delete range!
2017-07-17 01:28:05.841085 I | mvcc: store.index: compact 710809
2017-07-17 01:28:05.844962 I | mvcc: finished scheduled compaction at 710809 (took 3.010959ms)
2017-07-17 01:28:10.990740 W | etcdserver: apply entries took too long [184.784236ms for 1 entries]
2017-07-17 01:28:10.990781 W | etcdserver: avoid queries with large range/delete range!
2017-07-17 01:31:32.620661 W | etcdserver: apply entries took too long [148.679896ms for 1 entries]
2017-07-17 01:31:32.620687 W | etcdserver: avoid queries with large range/delete range!
2017-07-17 01:33:05.834143 I | mvcc: store.index: compact 712051
2017-07-17 01:33:05.838528 I | mvcc: finished scheduled compaction at 712051 (took 3.402203ms)
2017-07-17 01:36:57.596061 W | etcdserver: apply entries took too long [188.445365ms for 1 entries]
2017-07-17 01:36:57.596106 W | etcdserver: avoid queries with large range/delete range!
2017-07-17 01:38:05.847637 I | mvcc: store.index: compact 713284
2017-07-17 01:38:05.851801 I | mvcc: finished scheduled compaction at 713284 (took 3.151377ms)
2017-07-17 01:38:50.073201 I | etcdserver: start to snapshot (applied: 1800193, lastsnap: 1790192)
2017-07-17 01:38:50.123828 I | etcdserver: saved snapshot at index 1800193
2017-07-17 01:38:50.124042 I | etcdserver: compacted raft log at 1795193
2017-07-17 01:39:02.621138 I | pkg/fileutil: purged file /var/paas/etcd/data/member/snap/0000000000001433-00000000001ab4ac.snap successfully
2017-07-17 01:49:47.024269 W | etcdserver: apply entries took too long [133.807871ms for 1 entries]
2017-07-17 01:49:47.024298 W | etcdserver: avoid queries with large range/delete range!
2017-07-17 01:54:32.086371 I | etcdserver: start to snapshot (applied: 1810194, lastsnap: 1800193)
2017-07-17 01:54:32.145961 I | etcdserver: saved snapshot at index 1810194
2017-07-17 01:54:32.146224 I | etcdserver: compacted raft log at 1805194
2017-07-17 01:54:32.631207 I | pkg/fileutil: purged file /var/paas/etcd/data/member/snap/0000000000001433-00000000001adbbd.snap successfully
2017-07-17 02:04:24.214397 W | etcdserver: apply entries took too long [163.956377ms for 1 entries]
2017-07-17 02:04:24.214438 W | etcdserver: avoid queries with large range/delete range!
2017-07-17 02:06:57.986956 W | wal: sync duration of 1.083674392s, expected less than 1s
2017-07-17 02:10:16.006372 I | etcdserver: start to snapshot (applied: 1820195, lastsnap: 1810194)
2017-07-17 02:10:16.054768 I | etcdserver: saved snapshot at index 1820195
2017-07-17 02:10:16.055072 I | etcdserver: compacted raft log at 1815195
2017-07-17 02:10:32.642091 I | pkg/fileutil: purged file /var/paas/etcd/data/member/snap/0000000000001433-00000000001b02ce.snap successfully
2017-07-17 02:10:54.531128 W | etcdserver: apply entries took too long [204.17913ms for 1 entries]

@abel-von
Author

We checked the snapshot logs of the nodes and found a small difference between the larger-data node and the smaller-data nodes.

On a smaller-data node:

2017-07-19 06:23:42.791337 I | etcdserver: start to snapshot (applied: 4126960, lastsnap: 4116959)
2017-07-19 06:31:18.224505 I | etcdserver: start to snapshot (applied: 4136961, lastsnap: 4126960)
2017-07-19 06:38:52.406597 I | etcdserver: start to snapshot (applied: 4146962, lastsnap: 4136961)
2017-07-19 06:46:27.908548 I | etcdserver: start to snapshot (applied: 4156963, lastsnap: 4146962)
2017-07-19 06:54:04.166864 I | etcdserver: start to snapshot (applied: 4166964, lastsnap: 4156963)
2017-07-19 07:01:39.354258 I | etcdserver: start to snapshot (applied: 4176965, lastsnap: 4166964)
2017-07-19 07:09:13.906549 I | etcdserver: start to snapshot (applied: 4186966, lastsnap: 4176965)
2017-07-19 07:16:48.424667 I | etcdserver: start to snapshot (applied: 4196967, lastsnap: 4186966)
2017-07-19 07:24:24.091817 I | etcdserver: start to snapshot (applied: 4206968, lastsnap: 4196967)
2017-07-19 07:31:58.908751 I | etcdserver: start to snapshot (applied: 4216969, lastsnap: 4206968)
2017-07-19 07:39:32.684813 I | etcdserver: start to snapshot (applied: 4226970, lastsnap: 4216969)
2017-07-19 07:47:08.233998 I | etcdserver: start to snapshot (applied: 4236971, lastsnap: 4226970)
2017-07-19 07:54:42.451334 I | etcdserver: start to snapshot (applied: 4246972, lastsnap: 4236971)
2017-07-19 08:02:16.546734 I | etcdserver: start to snapshot (applied: 4256973, lastsnap: 4246972)
2017-07-19 08:09:49.906316 I | etcdserver: start to snapshot (applied: 4266974, lastsnap: 4256973)
2017-07-19 08:17:25.005495 I | etcdserver: start to snapshot (applied: 4276975, lastsnap: 4266974)
2017-07-19 08:24:58.112552 I | etcdserver: start to snapshot (applied: 4286976, lastsnap: 4276975)
2017-07-19 08:32:31.332365 I | etcdserver: start to snapshot (applied: 4296977, lastsnap: 4286976)
2017-07-19 08:40:13.543569 I | etcdserver: start to snapshot (applied: 4306978, lastsnap: 4296977)
2017-07-19 08:47:46.753345 I | etcdserver: start to snapshot (applied: 4316979, lastsnap: 4306978)
2017-07-19 08:55:18.661920 I | etcdserver: start to snapshot (applied: 4326980, lastsnap: 4316979)
2017-07-19 09:02:49.116010 I | etcdserver: start to snapshot (applied: 4336981, lastsnap: 4326980)
2017-07-19 09:10:21.354174 I | etcdserver: start to snapshot (applied: 4346982, lastsnap: 4336981)
2017-07-19 09:17:52.579132 I | etcdserver: start to snapshot (applied: 4356983, lastsnap: 4346982)
2017-07-19 09:25:24.363432 I | etcdserver: start to snapshot (applied: 4366984, lastsnap: 4356983)
2017-07-19 09:32:54.985520 I | etcdserver: start to snapshot (applied: 4376985, lastsnap: 4366984)
2017-07-19 09:40:25.765002 I | etcdserver: start to snapshot (applied: 4386986, lastsnap: 4376985)
2017-07-19 09:47:58.089840 I | etcdserver: start to snapshot (applied: 4396987, lastsnap: 4386986)

On the larger-data node:

2017-07-19 07:04:15.916262 I | etcdserver: start to snapshot (applied: 4180436, lastsnap: 4170435)
2017-07-19 07:11:50.920271 I | etcdserver: start to snapshot (applied: 4190437, lastsnap: 4180436)
2017-07-19 07:19:25.395998 I | etcdserver: start to snapshot (applied: 4200438, lastsnap: 4190437)
2017-07-19 07:27:01.734404 I | etcdserver: start to snapshot (applied: 4210439, lastsnap: 4200438)
2017-07-19 07:34:36.033081 I | etcdserver: start to snapshot (applied: 4220440, lastsnap: 4210439)
2017-07-19 07:42:10.966994 I | etcdserver: start to snapshot (applied: 4230441, lastsnap: 4220440)
2017-07-19 07:49:45.509161 I | etcdserver: start to snapshot (applied: 4240442, lastsnap: 4230441)
2017-07-19 07:57:21.390336 I | etcdserver: start to snapshot (applied: 4250443, lastsnap: 4240442)
2017-07-19 08:04:54.183961 I | etcdserver: start to snapshot (applied: 4260444, lastsnap: 4250443)
2017-07-19 08:12:27.733936 I | etcdserver: start to snapshot (applied: 4270445, lastsnap: 4260444)
2017-07-19 08:20:02.153146 I | etcdserver: start to snapshot (applied: 4280446, lastsnap: 4270445)
2017-07-19 08:27:36.695936 I | etcdserver: start to snapshot (applied: 4290447, lastsnap: 4280446)
2017-07-19 08:35:07.531769 I | etcdserver: start to snapshot (applied: 4300448, lastsnap: 4290447)
2017-07-19 08:42:51.796759 I | etcdserver: start to snapshot (applied: 4310449, lastsnap: 4300448)
2017-07-19 08:50:22.368311 I | etcdserver: start to snapshot (applied: 4320450, lastsnap: 4310449)
2017-07-19 08:57:54.783186 I | etcdserver: start to snapshot (applied: 4330451, lastsnap: 4320450)
2017-07-19 09:05:26.943601 I | etcdserver: start to snapshot (applied: 4340452, lastsnap: 4330451)
2017-07-19 09:12:57.727162 I | etcdserver: start to snapshot (applied: 4350453, lastsnap: 4340452)
2017-07-19 09:20:29.896829 I | etcdserver: start to snapshot (applied: 4360454, lastsnap: 4350453)
2017-07-19 09:28:00.316796 I | etcdserver: start to snapshot (applied: 4370455, lastsnap: 4360454)
2017-07-19 09:35:31.473455 I | etcdserver: start to snapshot (applied: 4380456, lastsnap: 4370455)
2017-07-19 09:43:03.866372 I | etcdserver: start to snapshot (applied: 4390457, lastsnap: 4380456)
2017-07-19 09:50:35.546458 I | etcdserver: start to snapshot (applied: 4400458, lastsnap: 4390457)

I think the applied log index can still be considered to be advancing at the same pace. As I understand it, a compaction is also a command in a log entry, the same as a put or a deleteRange, so if the log entries are applied at the same pace, the compaction should also be applied to the kvstore at the same pace.

@abel-von
Author

abel-von commented Jul 19, 2017

We used the command etcdctl endpoint status --write-out=json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]*' to check the revision of each node, and the output is below:

1469425
1289989
1469425

The second node is the one with the larger db file, and its revision is much smaller than the other two. Does this mean that a lot of log entries were not applied to the kvstore?

The plain output of endpoint status is:

https://xxxx:4001, 5515cd325aac0173, 3.1.9, 182 MB, false, 5172, 4434252
https://xxxx:4001, b143a63e2d946966, 3.1.9, 2.1 GB, true, 5172, 4434254
https://xxxx:4001, e6bcda184371c529, 3.1.9, 182 MB, false, 5172, 4434255

As we can see, the second member's db is 2.1 GB, which is larger than the default max db size, and now the cluster cannot serve any update request.
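
For reference, the same per-member numbers (revision, db size, raft index) can be pulled programmatically; a minimal sketch with the Go clientv3 maintenance API, using placeholder endpoints and with TLS configuration omitted:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Placeholder endpoints; TLS configuration omitted for brevity.
	endpoints := []string{"205.18.1.181:4001", "205.18.1.109:4001", "205.18.1.206:4001"}
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		// Status is answered locally by the addressed member, so the header
		// revision reflects that member's own view, which is what diverges here.
		st, err := cli.Status(ctx, ep)
		cancel()
		if err != nil {
			fmt.Println(ep, "status error:", err)
			continue
		}
		fmt.Printf("%s rev=%d dbSize=%d raftIndex=%d leader=%x\n",
			ep, st.Header.Revision, st.DbSize, st.RaftIndex, st.Leader)
	}
}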

@abel-von
Author

abel-von commented Jul 19, 2017

This seems to be an issue in how a log entry is applied to the kvstore.
So I checked the apply code, and I have a question about ( https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L1273 ): it seems that etcdserver always increments the applied index (appliedi) no matter whether the apply succeeds or fails. Will the data become inconsistent if a log entry is not successfully applied to the kv? Shouldn't it retry until the apply succeeds before applying the next log entry?

@heyitsanthony
Contributor

We checked one key on both nodes with get /xxxx --consistency="s", and found that the data on 205.18.1.181 is wrong and outdated.

That's a serialized read; it's not expected to be up to date since it doesn't go through quorum, but it should match the historical data. If --consistency=l is giving inconsistent data, that's a problem.

The second node is the one with the larger db file, and its revision is much smaller than the other two. Does this mean that a lot of log entries were not applied to the kvstore?

The node is behind, but it shouldn't be skipping any entries.

As we can see, the second member's db is 2.1 GB, which is larger than the default max db size, and now the cluster cannot serve any update request.

The larger node needs to be defragmented; it's raising an alarm on the cluster because it's exceeding its space quota. After defragmenting the member, the alarm can be disarmed with etcdctl alarm disarm.
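
A minimal sketch of that recovery path with the Go clientv3 maintenance API (hypothetical helper; etcdctl defrag and etcdctl alarm disarm do the same from the command line; assumes import "context" and "github.com/coreos/etcd/clientv3"):

// recoverSpace defragments one bloated member and then clears any raised
// alarms (NOSPACE here). Compaction must already have discarded old revisions
// for the defragmentation to actually shrink the file.
func recoverSpace(ctx context.Context, cli *clientv3.Client, endpoint string) error {
	// Defragment rewrites the member's backend db, reclaiming space freed by compaction.
	if _, err := cli.Defragment(ctx, endpoint); err != nil {
		return err
	}
	// List and disarm all active alarms so the cluster accepts writes again.
	resp, err := cli.AlarmList(ctx)
	if err != nil {
		return err
	}
	for _, a := range resp.Alarms {
		if _, err := cli.AlarmDisarm(ctx, &clientv3.AlarmMember{MemberID: a.MemberID, Alarm: a.Alarm}); err != nil {
			return err
		}
	}
	return nil
}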

it seems that etcdserver always increments the applied index (appliedi) no matter whether the apply succeeds or fails

If the apply fails then etcd should panic since the entry has been committed to raft; there's no turning back if there's an error in the apply path.

@xiang90
Contributor

xiang90 commented Jul 19, 2017

The node is behind, but it shouldn't be skipping any entries.

We are not sure it actually fell behind. k8s uses txns with conditions a lot. If the state is indeed inconsistent on that node, the revision might also stop increasing because of txn failures.

@abel-von
Author

@heyitsanthony, @xiang90, thanks for the replies. We also checked with --consistency=l and found that one node of the cluster returns data inconsistent with the other two. We can defragment the db, but the problem is that the node has a very large db file and retains many versions of each key, so we have to run a compaction manually before defragmenting.

And k8s automatically requests db compactions; it's just that one etcd node didn't carry them out.

I added a log statement in the function applyEntryNormal so that if the apply fails, an error is logged.
( https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L1354 )

        if ar == nil {
                return
        }
        if ar.err != nil {
                plog.Errorf("Error applying entry %v, error: %v", raftReq, ar.err)
        }
        if ar.err != ErrNoSpace || len(s.alarmStore.Get(pb.AlarmType_NOSPACE)) > 0 {
                s.w.Trigger(id, ar)
                return
        }

We got logs like the following:

2017-07-19 18:50:09.220063 E | etcdserver: Error applying entry {ID:12175021278856673156  0 <nil> <nil> <nil> <nil> <nil> revision:184894  <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil>}, error: mvcc: required revision is a future revision
2017-07-19 18:55:09.243377 E | etcdserver: Error applying entry {ID:12175021278856675351  0 <nil> <nil> <nil> <nil> <nil> revision:185381  <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil>}, error: mvcc: required revision is a future revision
2017-07-19 19:00:09.248930 E | etcdserver: Error applying entry {ID:12175021278856677507  0 <nil> <nil> <nil> <nil> <nil> revision:185860  <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil>}, error: mvcc: required revision is a future revision
2017-07-19 19:05:09.255870 E | etcdserver: Error applying entry {ID:12175021278856679813  0 <nil> <nil> <nil> <nil> <nil> revision:186339  <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil>}, error: mvcc: required revision is a future revision
2017-07-19 19:10:09.263268 E | etcdserver: Error applying entry {ID:12175021278856682125  0 <nil> <nil> <nil> <nil> <nil> revision:186819  <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil>}, error: mvcc: required revision is a future revision
2017-07-19 19:15:09.271818 E | etcdserver: Error applying entry {ID:12175021278856684440  0 <nil> <nil> <nil> <nil> <nil> revision:187299  <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil>}, error: mvcc: required revision is a future revision
2017-07-19 19:20:09.382805 E | etcdserver: Error applying entry {ID:12175021278856686711  0 <nil> <nil> <nil> <nil> <nil> revision:187778  <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil>}, error: mvcc: required revision is a future revision
2017-07-19 19:25:09.339451 E | etcdserver: Error applying entry {ID:12175021278856688889  0 <nil> <nil> <nil> <nil> <nil> revision:188257  <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil>}, error: mvcc: required revision is a future revision

As we can see, the compaction failed because the revision in kube-apiserver's compaction request is larger than the current revision on this node. That's why the compaction failed.

I think we still have to figure out why the revision on one node is much smaller than on the others. As in my earlier comment, the applied index on the three nodes advances at almost the same pace, so I don't think the node with the larger file has fallen behind.

I will add more logs to see why the revision on one node is far smaller than on the others.

@heyitsanthony
Contributor

@abel-von OK, it's definitely inconsistent. Without the wal + db I'm not sure if there's much we can do to find the root cause. Removing the faulty member via etcdctl member remove and adding a new one should clear the inconsistency.

Was the member only rebooted with 3.1.9 or was it a freshly initialized member? There was a period during the 3.2 development cycle where there could be inconsistencies but it was caught before 3.2.0. Also, there's no commit matching 9b9baa968f317fc2d5870e6aa97185df126bb640.

@abel-von
Author

@heyitsanthony, yes, we are now using 3.1.9. The commit 9b9baa968f317fc2d5870e6aa97185df126bb640 was in the first cluster where we hit this problem; after that, we changed our version to 3.1.9.

Could you please send me the commit ids that fixed the inconsistency problem?

@abel-von
Author

abel-von commented Jul 20, 2017

@xiang90 I have checked the kube-apiserver code, and as you said, every update in k8s is a CAS operation using etcd v3's Txn: it compares the revision of a key and updates the key only if the compare returns true.
(https://github.com/kubernetes/kubernetes/blob/v1.5.0/pkg/storage/etcd3/store.go#L270)

I am wondering: if one txn fails on a node, then all txns after it will fail on that node because the revision is no longer consistent, and the more txns fail, the more inconsistent the revision becomes. That makes the system very unstable, doesn't it?

Will a node that falls behind a little fail a txn? And if it fails once, does it fail forever?
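
For readers unfamiliar with the pattern, the guarded update described above looks roughly like this with the Go clientv3 API (hypothetical helper and names, not kube-apiserver's actual code; assumes import "context" and "github.com/coreos/etcd/clientv3"):

// casUpdate writes newVal to key only if the key's ModRevision still equals
// modRev, mirroring the compare-and-swap kube-apiserver issues.
func casUpdate(ctx context.Context, cli *clientv3.Client, key, newVal string, modRev int64) (bool, error) {
	resp, err := cli.Txn(ctx).
		// The compare is evaluated against the store state of the applying member...
		If(clientv3.Compare(clientv3.ModRevision(key), "=", modRev)).
		// ...and only if it holds is the put executed.
		Then(clientv3.OpPut(key, newVal)).
		Else(clientv3.OpGet(key)).
		Commit()
	if err != nil {
		return false, err
	}
	// Succeeded is false when the key changed underneath the caller, or, as in
	// this issue, when a member's revision state has diverged.
	return resp.Succeeded, nil
}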

@abel-von
Author

@heyitsanthony, thanks, but I don't want to just repair the failed cluster. This problem is now showing up in a lot of our testing clusters; I have to find the root cause and fix it.

@heyitsanthony
Contributor

@abel-von the member inconsistency can be reproduced with a fresh cluster running 3.1.9?

@abel-von
Author

Yes, it can be reproduced. I think that if one of the nodes handles log entries at a slightly different pace from the leader, and you keep CAS-updating particular keys (comparing ModRevision) at a short interval, like the 10 seconds k8s uses to update its node status, then the issue can be reproduced.
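
A rough repro driver along those lines, reusing the hypothetical casUpdate helper sketched in my earlier comment (assumes import "context", "fmt", "time" and "github.com/coreos/etcd/clientv3"; a sketch, not our actual test code):

// reproLoop keeps CAS-updating one key every 10 seconds, like the node-status
// update pattern described above.
func reproLoop(ctx context.Context, cli *clientv3.Client, key string) error {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for i := 0; ; i++ {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
		// Read the key's current ModRevision, then attempt the guarded update.
		get, err := cli.Get(ctx, key)
		if err != nil {
			return err
		}
		var modRev int64 // 0 only matches if the key does not exist yet
		if len(get.Kvs) > 0 {
			modRev = get.Kvs[0].ModRevision
		}
		ok, err := casUpdate(ctx, cli, key, fmt.Sprintf("status-%d", i), modRev)
		if err != nil {
			return err
		}
		if !ok {
			fmt.Println("CAS failed at iteration", i)
		}
	}
}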

@gyuho
Contributor

gyuho commented Jul 20, 2017

@abel-von 9b9baa968f317fc2d5870e6aa97185df126bb640 does not exist. Do you run an official etcd release?

@abel-von
Author

abel-von commented Jul 21, 2017

@gyuho please disregard the commit id 9b9baa968f317fc2d5870e6aa97185df126bb640; that was from the first time we hit this issue. After that we changed the etcd version to 3.1.9, which is the official release, and the problem is still there with 3.1.9.

@abel-von
Author

abel-von commented Jul 24, 2017

@gyuho @heyitsanthony @xiang90 I'm confused by this Txn function in apply.go. It first does the comparison and then begins a transaction to update the backend data. I thought the compare and the swap should be atomic and thus processed in ONE transaction, but here the comparison is outside the transaction. Could there be a race condition?

func (a *applierV3backend) Txn(rt *pb.TxnRequest) (*pb.TxnResponse, error) {
	ok := true
	for _, c := range rt.Compare {
		if _, ok = a.applyCompare(c); !ok {
			break
		}
	}

	var reqs []*pb.RequestOp
	if ok {
		reqs = rt.Success
	} else {
		reqs = rt.Failure
	}

	if err := a.checkRequestLeases(reqs); err != nil {
		return nil, err
	}
	if err := a.checkRequestRange(reqs); err != nil {
		return nil, err
	}

	// When executing the operations of txn, we need to hold the txn lock.
	// So the reader will not see any intermediate results.
	txnID := a.s.KV().TxnBegin()

	resps := make([]*pb.ResponseOp, len(reqs))
	for i := range reqs {
		resps[i] = a.applyUnion(txnID, reqs[i])
	}

	err := a.s.KV().TxnEnd(txnID)
	if err != nil {
		panic(fmt.Sprint("unexpected error when closing txn", txnID))
	}

	txnResp := &pb.TxnResponse{}
	txnResp.Header = &pb.ResponseHeader{}
	txnResp.Header.Revision = a.s.KV().Rev()
	txnResp.Responses = resps
	txnResp.Succeeded = ok
	return txnResp, nil
}

@heyitsanthony
Contributor

@abel-von that's the bug fixed by d173b09; I could reproduce it using ad22aaa cherry-picked to the 3.1 branch. I'm not sure how it would cause a backend inconsistency (in this case there's a race causing inconsistency for reads) since all writes are serialized in the raft path. Possibly a race with compaction processing.

@abel-von
Author

abel-von commented Jul 26, 2017

@xiang90 @heyitsanthony @gyuho We have found the root cause of this issue. The etcd cluster serves not only kube-apiserver but also other components, and those other components use the v2 API.

The v2 and v3 data in etcd are separate from each other; there is a modifiedIndex in v2 and a ModRevision in v3, and in our cluster the v2 index grows faster than the v3 revision because the other components update their data more frequently than kube-apiserver does.
In our etcd start script, we did a careless data migration from v2 to v3, and this migration happened on only one of the three nodes. During the migration, the modifiedIndex from v2 became the ModRevision in v3, and since the v2 modifiedIndex was larger than the v3 ModRevision, it overwrote the ModRevision on that one node. That is what caused the inconsistency.

Because this migration happens obscurely, and after the first migration kube-apiserver updates its data more frequently than the other components, the issue is hard to reproduce in the same cluster.

I am wondering whether there is any mechanism to prevent such a careless operation from succeeding.

@heyitsanthony
Contributor

@abel-von what's the process for the migration? the member is taken offline, a migration script runs, the server is brought back online, and it rejoins the cluster?

@abel-von
Author

@heyitsanthony, yes. Since the backend data mismatch is not detected by raft, the member can rejoin the cluster successfully.

@heyitsanthony
Contributor

@abel-von this sort of thing will be caught by #7125

@xiang90
Contributor

xiang90 commented Jul 26, 2017

@heyitsanthony

Still, it is hard for us, the etcd maintainers, to tell how the corruption happened. It is probably worth the effort to think about how to detect user faults in a better way.

@heyitsanthony
Contributor

@xiang90 sure, but detecting corruption within a minute or so of bringing the member online already goes a long way for debugging this sort of misconfiguration. The next step would be checking the hash before joining as a raft peer, which would have the advantage of avoiding booting into an inconsistent state and serving garbage.
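
A related check that can be scripted from the outside, assuming a client version that exposes Maintenance.HashKV, placeholder endpoints, and import "context", "fmt" and "github.com/coreos/etcd/clientv3" (sketch only, not the mechanism proposed above):

// crossCheckHash asks each member for its KV hash and prints them so a
// diverged member stands out.
func crossCheckHash(ctx context.Context, cli *clientv3.Client, endpoints []string) error {
	for _, ep := range endpoints {
		// rev 0 hashes the member's keyspace at its latest revision.
		resp, err := cli.HashKV(ctx, ep, 0)
		if err != nil {
			return err
		}
		fmt.Printf("%s hash=%08x at rev=%d\n", ep, resp.Hash, resp.Header.Revision)
	}
	return nil
}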

@xiang90
Contributor

xiang90 commented Jul 26, 2017

@heyitsanthony

Can we create a new issue to capture the detecting data consistency thing? The title of this issue is terrifying. Cannot sleep for a while due to this "bug" :P.

@liggitt
Contributor

liggitt commented Jul 26, 2017

We're seeing a similar issue occur at some point after an etcd v2->v3 migration, and it does not correct itself after a restart of the etcd members. One member continues to return incorrect data for a particular query, despite all members claiming to be at the same raft index.

cc @smarterclayton

@xiang90
Contributor

xiang90 commented Jul 26, 2017

@liggitt That is a different issue. This one is a user error. I really need to close it since this is confusing.

xiang90 closed this as completed Jul 26, 2017
@liggitt
Contributor

liggitt commented Jul 26, 2017

I see, will add details to #8305
