Reindex API: improve robustness in case of error #22471
Comments
Hi @cnico, sorry to hear about your troubles with reindexing.
This is problematic because reindexing might target billions of documents, all of which might have errors. We need to report these errors so that you can take action, but accumulating billions of errors isn't practical. Instead, we bail out on the first error.
This can be done today by running the reindex job with
Hi @clintongormley, on the first point I disagree: a system that, when an error happens, simply leaves the task partly completed and partly not, in an unknown state, is of little use. Even if an index contains billions of documents, reindex could offer several error-handling strategies and let the user, who knows their data, choose what should happen in case of error.
I hope you will reconsider your point of view in order to improve robustness.
I agree with @cnico. It would be great to have a parameter like conflicts that allows me to ignore errors. My use case is that I want to reindex one bucket, but the API fails every time because of the error
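For context, the closest thing that exists today is the `conflicts` parameter, which only tells reindex to proceed past version conflicts; indexing failures such as mapping errors still abort the job, which is the gap being discussed here. A minimal sketch, assuming a local cluster on localhost:9200 and hypothetical index names:

```python
import requests

# Reindex "old-index" into "new-index", continuing past version conflicts.
# Note: "conflicts": "proceed" only covers version conflicts; a document
# that fails to index (e.g. a mapping error) still aborts the whole job.
body = {
    "conflicts": "proceed",
    "source": {"index": "old-index"},
    "dest": {"index": "new-index"},
}
resp = requests.post("http://localhost:9200/_reindex", json=body)
print(resp.json())  # the "failures" array explains why a job aborted
```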
I agree with @cnico as well here -- reindex is really cool, but a huge pain to use if just a single error happens. When reindexing billions of documents, even a single error forces you to start all over again (assuming the error is transient). It's a huge pain, and requires us to use external ETL to re-index.
Hi @clintongormley, I also agree with @cnico. The reindex API is really useful and works great until there is one error. We need to reindex billions of documents and it fails every time because of several bad documents.
What if all of your documents have errors? Where would we log billions of errors? The only thing we could do is count errors, and abort after a certain number of errors have occurred. @nik9000 what do you think?
This was actually in the first version of reindex. We decided the complexity wasn't worth it at the time, I believe, but we can still do it if we want. If we did this, what HTTP code would we return if there are errors but not enough to abort? Traditionally we've returned 200 in those cases. There is a 207 Multi-Status HTTP response, but I think it is pretty tied up with WebDAV, so it might be trouble. Not sure!
What do you think about a force option as a first step to just ignore all errors and do what is possible? |
A force option would require logging the errors instead of returning them. We could count them but that is it. I don't particularly like that idea.
I agree with you that this isn't a good option. I don't really know how the SDKs interact with the server, but for the REST API maybe HTTP chunked transfer would be an idea, although you would have to keep the connection open until the reindex is finished. Then you wouldn't need to store the logs on the server and could transfer them directly to the client.
I think 200 is OK. We do what the user asks, i.e. ignore errors, and so complete successfully.
Elasticsearch is, sadly, kind of built around request/response, and it'd be a huge change to make chunked-transfer-style things work. Relative to counting errors, that is a moonshot.
A workaround for the problem could be using
+1 for the fault-tolerant reindexing
Guys, I need this, +1 for @ZombieSmurf... it's such an easy solution. Temporary fix:
How do I retrieve the results of a reindexing when the initial API call timed out while waiting for completion? As long as the reindexing was running I could see the status at:
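One way to avoid losing the result when the synchronous call times out is to start the reindex as a background task and poll the Tasks API. A minimal sketch, assuming a local cluster on localhost:9200 and hypothetical index names:

```python
import requests

ES = "http://localhost:9200"

# Start the reindex as a background task: the call returns immediately
# with a task id instead of waiting for (and possibly timing out on)
# the full reindex.
body = {"source": {"index": "old-index"}, "dest": {"index": "new-index"}}
task_id = requests.post(
    f"{ES}/_reindex", params={"wait_for_completion": "false"}, json=body
).json()["task"]

# Poll the task; once "completed" is true, "response" holds the final
# result, including success/failure counts.
status = requests.get(f"{ES}/_tasks/{task_id}").json()
print(status.get("completed"), status.get("response"))
```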
A short update: we intend to reimplement the reindex API using the sequence numbers infrastructure added in 6.0. That infrastructure would allow the reindex job to pause and then continue from where it was, which in turn will allow us to stop on an error, report it to the user, let it be fixed, and then continue. We can then also consider allowing users to ignore errors and move on. We don't currently have anyone actively working on this refactoring, so it may take a while.
@bleskes - Any progress on this? We have version 6.2.3. When using Curator for a reindex, Curator seems to ignore documents which don't match the new mapping. For example, if a date is malformed, it'll just drop the document from the reindex with no error message.
@rahst12 I'm afraid my previous statement still holds:
@bleskes any updates on this? Would an experienced ES developer be able to mentor this implementation?
@thePanz sorry for the late response, I was out and catching up. We're always ready to guide external contributions. I have to warn you, though, that this will not be a simple one.
@bleskes - Is there any update on this? Or is there any way we can see logs while reindexing is in progress and stops at an error? It would at least be useful for identifying the document which caused the reindexing error.
@PraneetKhandelwal the cause of errors should be returned in the failures field of the response/reindexing result - see here.
Pinging @elastic/es-distributed
@bleskes could the reindex API put the failed docs into some DLQ index with a failure reason (with dynamic mapping disabled for the DLQ index) and continue indexing the rest of the docs?
@Alsheh that is one of the ideas we have discussed too. We are not actively working on it, but we are open to external contributions.
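For anyone who needs this today, here is a rough client-side sketch of the DLQ idea, not an existing reindex option: copy documents individually and route failures into a dead-letter index whose dynamic mapping is disabled. It assumes a recent Elasticsearch version, a local cluster on localhost:9200, and hypothetical index names:

```python
import requests

ES = "http://localhost:9200"
SRC, DEST, DLQ = "old-index", "new-index", "reindex-dlq"  # hypothetical names

# Dead-letter index with dynamic mapping disabled, so failed documents can be
# stored in _source together with the error without causing new mapping errors.
requests.put(f"{ES}/{DLQ}", json={"mappings": {"dynamic": False, "properties": {}}})

# For brevity this fetches only one page of documents; a real run would page
# through the whole index with the scroll or search_after APIs.
hits = requests.post(f"{ES}/{SRC}/_search", json={"size": 1000}).json()["hits"]["hits"]

for hit in hits:
    r = requests.put(f"{ES}/{DEST}/_doc/{hit['_id']}", json=hit["_source"])
    if r.status_code >= 400:
        # Keep the bad document and the failure reason, then carry on.
        requests.put(
            f"{ES}/{DLQ}/_doc/{hit['_id']}",
            json={"error": r.json(), "doc": hit["_source"]},
        )
```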
Any update on this, please? The Reindex API is great, but its value is severely limited by this stop-on-failure behaviour, at least for my use case.
Pinging this - still an issue in 2022. Seems like a dirty bash script will have to do.
So today I wanted to update some mappings on log data that I have accumulated (small index, couple million documents). Even for an index of this small size, reindexing to update mappings is severely painful.
Generally speaking, reindexing is often a manual job: it has many failure pathways, and at least in my case it came from a manual update to the mappings. For this reason, there should at least be some opt-in control over the bail behaviour, so that I can run with immediate bail when I want to check that I have configured the index settings correctly, but loosen or disable bailing and drop failing documents once I'm sure my mappings/configuration are what I want. Index reconfiguration can include designing for data to be dropped if it doesn't fit the new configuration.
Still a major pain point in 2023. |
Agreed. We have to set up a separate pipeline in Logstash to ingest specific indexes and have them go through the pipeline again, just so the failures get pushed to the DLQ instead of aborting the whole reindex.
Hello,
I did a migration from Elasticsearch 2.3.4 to 5.1.1 by following the migration guide.
The migration went perfectly well, and I updated my mapping to use the new keyword type instead of the old not-analyzed string one.
So I wanted to reindex all my indices, and I encountered the following two problems:
As a result, my reindex task suddenly stops, leaving many of my documents not reindexed because of the error on one document.
The improvement I recommend is to make the reindex processing more robust so that, when some documents fail, it continues normally with all the other documents in the index.
Related to this behavior, it would be great to add to the reindex API the possibility to get the result message of the reindex task once it has finished (including the number of successes and failures). Indeed, when run through the Kibana dev console, the JSON response is not displayed because the client times out for a large index.
In my case, I was not able to correct the data of my old indexes, so in the end I decided not to reindex them...
Regards,