[BUG] ES 6.0 - Duplicate Alias after rollover timeout #28720
Comments
@shaharmor, thank you for reporting the issue. I am not sure that the crashing data node was the root cause. Are you using an index template for these indices?
There is no …. I initially created the first ….
Thank you for your feedback @shaharmor. Rollover consists of two separate steps: (1) create a new index and wait for that index to be ready; (2) update the alias to point to the new index. If the index template's aliases contain the rollover alias, the rollover alias will point to two indices between step 1 and step 2. Unfortunately, this is your case; you should remove the rollover alias from the index template.
@dnhatn So how come it worked until today? And how come it keeps working as we speak?
Also, correct me if I'm wrong, but your guide https://www.elastic.co/blog/managing-time-based-indices-efficiently specifically says to add the alias to the index template.
Unfortunately, we only saw this issue recently. We should update the blog.
I also think that this change (#28110) should be marked as a breaking change for ES 6.2.
@shaharmor, Sorry I missed your comment.
This works because the window between step 1 and step 2 is very small, and it's not a problem as long as there is no write operation in that window.
So basically we were lucky all this time 😁 |
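To make the fix discussed above concrete, here is a minimal sketch of the recommended setup: the index template does not declare the rollover alias, and the alias is attached only to the first index of the series. The template name, index pattern, settings, and first index name below are illustrative, not taken from this issue.

```sh
# Hypothetical template for the rolled-over indices. Note that it deliberately
# has no "aliases" entry for the rollover alias "customers"; keeping the alias
# out of the template avoids it pointing at two indices during a rollover.
curl -XPUT 'localhost:9200/_template/customers-raw' -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["customers-raw-*"],
  "settings": { "number_of_shards": 1 }
}'

# Attach the rollover alias once, to the first index of the series only.
curl -XPUT 'localhost:9200/customers-raw-000001' -H 'Content-Type: application/json' -d'
{
  "aliases": { "customers": {} }
}'
```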
Elasticsearch version (`bin/elasticsearch --version`): Version: 6.0.0, Build: 8f0685b/2017-11-10T18:41:22.859Z, JVM: 1.8.0_151

Plugins installed: [repository-azure]

JVM version (`java -version`):
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)

OS version (`uname -a` if on a Unix-like system): Linux xxx-001 4.11.0-1014-azure #14-Ubuntu SMP Tue Oct 17 12:10:56 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
We have a cron job that runs the `rollover` command on an alias named `customers` every hour, at 1 minute past the start of the hour. The job has been running flawlessly for the past few months.
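For reference, an hourly rollover call of the kind that cron job issues would look roughly like the request below; the conditions are illustrative, since the issue does not state which ones are actually used.

```sh
# Roll over whatever index the "customers" alias currently points to, if one
# of the conditions is met. The condition values here are assumptions.
curl -XPOST 'localhost:9200/customers/_rollover' -H 'Content-Type: application/json' -d'
{
  "conditions": {
    "max_age": "1h",
    "max_docs": 50000000
  }
}'
```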
We recently (1 hour ago) had a critical issue that caused Elasticsearch to become unwritable.
The cluster architecture is as follows:
3 master nodes
3 hot data nodes
3 cold data nodes
The alias points to an index on the hot data nodes only.
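The issue does not say how indices are pinned to the hot nodes; a common way to do it is shard allocation filtering, sketched below under the assumption of a node attribute named `box_type` (the attribute name and index name are illustrative, not from this issue).

```sh
# Hot data nodes would carry "node.attr.box_type: hot" in elasticsearch.yml.
# Requiring that attribute on the index keeps its shards on the hot nodes.
curl -XPUT 'localhost:9200/customers-raw-003282/_settings' -H 'Content-Type: application/json' -d'
{
  "index.routing.allocation.require.box_type": "hot"
}'
```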
What happened is that 3 seconds before the cron job started, one of the cold data nodes suddenly rebooted (not related to Elasticsearch).
Relevant log:
3 seconds later the cron job started and tried to roll over the index, as seen in these logs:
30 seconds later, with no other logs in between, the rollover timed out because of the crashed server:
Following that are some errors about timeouts and the removal of the crashed server from the pool:
What happened following that whole thing is that Elasticsearch created the new alias (as it should have, because we called the rollover command), but it did not delete the old alias, thus reaching a point where it had two aliases, both named `customers`, pointing to two different indices, as can be seen here:

`customers-raw-003281` is the old index that was supposed to be rolled over to `customers-raw-003282` by the cron job. Apparently it managed to roll it over but not delete the old alias. This caused the entire cluster to become unwritable through the `customers` alias, because Elasticsearch didn't know which index to write the data to (I guess). We had to manually delete the old alias reference so that the cluster would be writable again.
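The manual fix amounts to a single alias-removal action against the old index, roughly like the request below (the index and alias names are from this issue; the exact request used is not shown here).

```sh
# Remove the stale "customers" alias from the already-rolled-over index so the
# alias points at exactly one index again and writes can resume.
curl -XPOST 'localhost:9200/_aliases' -H 'Content-Type: application/json' -d'
{
  "actions": [
    { "remove": { "index": "customers-raw-003281", "alias": "customers" } }
  ]
}'
```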
There were no logs whatsoever about the problem, and we had to figure it out manually by ourselves.
Elasticsearch appeared to be operating normally (health status green), with nothing in the logs.
Logstash only logged a message that it had hit an issue it could retry and kept retrying forever, again with no specific log about the error.