Commit is out of range #8935
Comments
Is the old etcd stopped before the new one is started? What is the exact sequence of events? Also, can you provide the full log instead of just the 3 lines before the panic? |
Yes - it is stopped. The sequence is:
- all 3 instances of etcd are running
- one of them is stopped (by deleting the VM in which it is running)
- the VM is recreated (with the PersistentDisk where the etcd data lives attached)
- etcd is started
Also can you provide the full log instead of just the 3 lines before the panic?
Yeah, sorry:
- Logs from a few minutes on the initial VM (before it was deleted): https://gist.github.com/wojtek-t/22665ffaee649a9c263fea090766a1a7
- Logs from the recreated VM: https://gist.github.com/wojtek-t/f9b9f27fcfaf7207464d0c6b81f61f5e
|
Thanks. I am not sure if this is the same issue you linked, as I have not looked into the logs. We will give this a look soon.
|
How large is each request you send to etcd? Are you able to share the broken WAL file? Are you able to reproduce this problem every time you perform the sequence of operations you listed? |
No - it's not reliably reproducible. We've seen this twice in the last few days in a continuously running suite.
These are coming from our internal k8s tests, but the cluster is pretty small (3 nodes IIRC), so none of the requests should be very big. |
OK. Can you also try the latest 3.2 to see if you can reproduce it? Thanks. |
@wojtek-t I'll start an investigation of this shortly. |
And I will try running with 3.1, and potentially also 3.2, to see if either of them fixes it. |
FYI - we have changed our tests to run etcd 3.1.11. So far it looks promising, but we need a bit more time to validate it. |
We probably should move your test upstream. That will make sure etcd won't introduce a regression that breaks it. It is fine for the test case to be specific to the GCP environment. All our e2e functional tests run against GCP: http://dash.etcd.io/dashboard/db/functional-tests?orgId=1. @SaranBalaji90 from AWS will set up the test on the AWS environment. Azure might want to do the same thing. |
I would be surprised if etcd 3.1.11 fixed the root cause. I feel the root cause is somewhere in the raft pkg, and it is a pretty subtle one. We probably should still spend some time understanding why it happens under 3.0.10 and, if 3.1.x fixes it, which commit exactly fixed it. |
Definitely - Joe already said that he will be looking into that. |
Yes. As glad as I am to see the problem go away, it makes me uncomfortable not knowing why. We'll get to the bottom of it.
|
I encountered a similar case, but it was due to no space left on device, and the etcd holds both v2 and v3 data.
It was produced by the following steps. I run a three-node (HA) etcd cluster (node-1/node-2/node-3). Unfortunately, after a long time node-1 ran out of disk space, while the remaining two etcd members kept running. I then cleared some useless data to free disk space, but the free space still could not meet etcd's needs, because node-1 would receive a snapshot from the leader (its data was far behind the rest of the cluster). Then things got worse. The node-1 etcd logs show that after I freed some space it received the snapshot from the leader and tried to restore from it.
However, I found a code segment showing that writing the WAL and saving the snapshot are not a single atomic operation (roughly the ordering sketched below).
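A minimal sketch of that non-atomicity (the function names here are illustrative, not etcd's actual ones; it only demonstrates how one write can succeed while the other fails, e.g. on a full disk):

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative stand-ins for the two separate on-disk writes etcd performs
// when persisting raft state; these are NOT etcd's real function names.
func saveToWAL(commit uint64) error {
	// Succeeds: the WAL on disk now claims everything up to `commit` is committed.
	return nil
}

func saveSnapshot() error {
	// Fails, e.g. because the disk filled up while receiving the snapshot from the leader.
	return errors.New("no space left on device")
}

func main() {
	// The two writes are not atomic. If the first lands and the second does
	// not, the data directory becomes inconsistent: the WAL records a commit
	// index for entries/snapshot data that never made it to disk, and the
	// next restart can panic with "commit ... is out of range".
	const commit = 116658132
	if err := saveToWAL(commit); err != nil {
		fmt.Println("wal save failed:", err)
		return
	}
	if err := saveSnapshot(); err != nil {
		fmt.Println("snapshot save failed after the WAL was already updated:", err)
	}
}
```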
At this point I dumped node-1's WAL (see the etcd WAL dump logs).
It can be seen that the commitIndex had been set to 116658132, but, as the node-1 etcd logs show, saving the snapshot failed, so node-1's etcd panics when it is restarted.
Could this be a cause of this problem, or am I misunderstanding something? |
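For reference, the panic on restart comes from a consistency guard in the raft log: the restored WAL tells raft to commit up to an index that the surviving log does not contain. A minimal, self-contained sketch of that guard (simplified; not the exact etcd/raft code):

```go
package main

import "log"

// raftLog is a minimal sketch of the state raft keeps about its log; the real
// implementation in etcd's raft package has more fields and methods.
type raftLog struct {
	committed uint64 // highest index known to be committed
	lastIndex uint64 // highest index actually present in the log/WAL
}

// commitTo mirrors the guard behind the "commit ... is out of range" panic:
// being asked to commit past the end of the local log can only mean the log
// was corrupted, truncated, or lost.
func (l *raftLog) commitTo(tocommit uint64) {
	if l.committed < tocommit {
		if l.lastIndex < tocommit {
			log.Panicf("tocommit(%d) is out of range [lastIndex(%d)]; was the raft log corrupted, truncated, or lost?",
				tocommit, l.lastIndex)
		}
		l.committed = tocommit
	}
}

func main() {
	// A WAL whose recorded commit index points past the entries that actually
	// survived on disk reproduces the panic on restart.
	l := &raftLog{committed: 0, lastIndex: 116658100}
	l.commitTo(116658132) // panics: 116658132 > 116658100
}
```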
@xiang90 - Joe wanted to look into that. I can only say that the 3.1.11 release fixes this issue. |
I feel 3.1 probably includes something that mitigates the issue, not a fix. I am concerned that the root cause still exists and that you will eventually hit this bug again after a longer run. |
Hello! Does anyone know which commit solves or mitigates the issue in 3.1.11? Any suggestions? |
Fix is in bbolt |
Does anyone know how to fix this on 3.0.17 when this problem happens? Any workarounds? I tried removing the member and adding it again, but it does not work and shows the same error. |
Sorry - for an unknown reason the etcd data directory on the crashed node had not been removed, so it was still crashing. |
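For anyone else hitting this on 3.0.x: the remove/re-add workaround only helps if the crashed member's data directory is wiped before it rejoins. A hedged sketch of the member remove/add steps using the clientv3 API (the import path matches the etcd 3.0/3.1 era; endpoints, member name, and peer URL below are placeholders, not values from this issue):

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3" // newer releases use go.etcd.io/etcd/clientv3
)

func main() {
	// Connect to the healthy members only.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-0:2379", "https://etcd-1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// 1) Find the crashed member's ID from the member list.
	list, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	var crashedID uint64
	for _, m := range list.Members {
		if m.Name == "etcd-2" { // name of the crashed member (placeholder)
			crashedID = m.ID
		}
	}

	// 2) Remove it from the cluster.
	if _, err := cli.MemberRemove(ctx, crashedID); err != nil {
		log.Fatal(err)
	}

	// 3) Wipe the old data directory on that node (outside this program),
	//    then add it back; it will rejoin empty and sync from the leader.
	if _, err := cli.MemberAdd(ctx, []string{"https://etcd-2:2380"}); err != nil {
		log.Fatal(err)
	}
}
```

The etcdctl equivalents (member remove / member add) do the same thing; the key step is clearing the old data directory in between.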
We've observed a problem with clustered etcd: after recreating one of the instances, it enters a crash loop, panicking with:
What exactly happens is that we recreate the VM where etcd is running (without calling remove/add member), preserving all the addresses, the etcd data, etc. After recreation, etcd is started as the old member (similar to a restart, though with a bigger downtime).
We are running version 3.0.17.
I found #5664, which seems to be a similar (same?) problem. Has this been fixed? If so, in which release?
@xiang90 @gyuho @heyitsanthony @jpbetz