
Reindex API : improve robustness in case of error #22471

Open
cnico opened this issue Jan 6, 2017 · 31 comments
Labels
:Distributed Indexing/Reindex Issues relating to reindex that are not caused by issues further down >enhancement stalled Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

@cnico

cnico commented Jan 6, 2017

Hello,

I did a migration from elasticsearch 2.3.4 to 5.1.1 by following the migration guide.
The migration went perfectly well, and I updated my mapping to use the new keyword type instead of the old not_analyzed string type.

So I wanted to reindex all my indices and encountered the following two problems:

  • some of my documents had IDs longer than 512 characters, and Elasticsearch 5.1.1 rejects these during reindexing.
  • some of my documents had fields whose name is an empty string: Elasticsearch 5.1.1 refuses to reindex such fields.

So my reindex task suddenly stops, leaving lots of my documents not reindexed because of the error on one document.

The improvement I recommend is to make the reindex processing more robust, so that in case of failure on some documents, it continues normally with all the other documents present in the index.

Linked to this behavior, it would be great to add to the reindex API the possibility of getting the result message of the reindex task once it is finished (including the number of successes and failures). Indeed, when run through the Kibana dev console, the JSON response is not displayed because the client times out on large indices.

In my case, I was not able to correct the data in my old indices, so in the end I decided not to reindex them...

Regards,

@clintongormley
Contributor

Hi @cnico

Sorry to hear about your troubles with reindexing.

The improvement I recommend is to make the reindex processing more robust, so that in case of failure on some documents, it continues normally with all the other documents present in the index.

This is problematic because reindexing might target billions of documents, all of which might have errors. We need to report these errors so that you can take action, but accumulating billions of errors isn't practical. Instead, we bail out on the first error.

Linked to this behavior, it would be great to add to the reindex API the possibility of getting the result message of the reindex task once it is finished (including the number of successes and failures). Indeed, when run through the Kibana dev console, the JSON response is not displayed because the client times out on large indices.

This can be done today by running the reindex job with ?wait_for_completion=false. You get back a task ID which can be passed to the task API to get the job status. The final status is stored in the .tasks index and will remain there until you delete it.
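Sketched concretely, in the style of the console snippet later in this thread (the index names and the `<node_id>:<task_id>` placeholder are illustrative; the exact response shape varies by version):

```
POST _reindex?wait_for_completion=false
{
  "source": { "index": "source-index" },
  "dest":   { "index": "dest-index" }
}

# returns { "task": "<node_id>:<task_id>" }; poll it with:
GET _tasks/<node_id>:<task_id>
```

Once the task completes, its stored result can still be retrieved from the .tasks index until you delete it.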

@cnico
Author

cnico commented Jan 12, 2017

Hi @clintongormley,

For the first point, I disagree with you, because I think it is useless to have a system that, if an error happens, simply leaves the task partly completed and partly uncompleted, i.e. in an unknown state.

Even if an index contains billions of documents, the reindex could offer several error-handling strategies, chosen by the user, who knows their data and what they prefer to happen in case of error.
The strategies could be:

  • simply stop at the first error, as today, and report which document caused the failure.
  • ignore all errors unconditionally
  • ignore errors but record the IDs of the documents that caused failures, for example in a dedicated index
  • compute the error rate (per minute, per index, per server, or per shard, to be determined) and stop if it exceeds a given threshold.

I hope you will reconsider your point of view in order to improve robustness.
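A minimal client-side sketch of these strategies (hypothetical: none of these options exist in the reindex API, and all names here are made up for illustration):

```python
def should_abort(failures, processed, strategy, max_rate=0.01):
    """Decide whether a reindex-style loop should stop.

    failures  -- number of failed documents so far
    processed -- total documents attempted so far
    strategy  -- one of "stop", "ignore", "rate"
    max_rate  -- tolerated failure ratio for the "rate" strategy
    """
    if strategy == "stop":        # current behaviour: bail on the first error
        return failures > 0
    if strategy == "ignore":      # skip bad documents unconditionally
        return False
    if strategy == "rate":        # abort only above a failure-rate threshold
        return processed > 0 and failures / processed > max_rate
    raise ValueError(f"unknown strategy: {strategy}")
```

The "log failing IDs to a dedicated index" variant would combine the "ignore" branch with a side channel that records each failure.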

@mathewmeconry

Hi @clintongormley

I agree with @cnico. It would be great to have a parameter like conflicts that allows me to ignore errors. My use case is that I want to reindex one bucket, but the API fails every time with the error "Can't get text on a START_OBJECT at 1:251".

@Cidan

Cidan commented May 1, 2017

I agree with @cnico as well here -- reindex is really cool, but a huge pain to use if just a single error happens. When reindexing billions of documents, even a single error causes you to start all over again (assuming the error is transient). It's a huge pain, and requires us to use external ETL to re-index.

@shimonste

Hi @clintongormley,

I also agree with @cnico. The reindex API is really useful and works great until there is one error; we need to index billions of documents, and it fails every time because of several bad documents.
I don't see why reindex doesn't simply log the error like any other index request.
I also hope that you reconsider your point of view.

@clintongormley
Contributor

we need to index billions of documents, and it fails every time because of several bad documents.
I don't see why reindex doesn't simply log the error like any other index request.

What if all of your documents have errors? Where would we log billions of errors?

The only thing we could do is to count errors, and to abort after a certain number of errors have occurred. @nik9000, what do you think?

@nik9000
Member

nik9000 commented May 19, 2017

The only thing we could do is to count errors, and to abort after a certain number of errors have occurred. @nik9000, what do you think?

This was actually in the first version of reindex that I wrote. We decided the complexity wasn't worth it at the time, I believe. But we can still do it if we want.

If we did this, what HTTP code would we return if we have errors, but not enough of them to abort? Traditionally we've returned 200 in those cases. There is a 207 Multi-Status HTTP response, but I think it is pretty tangled up with WebDAV, so maybe it is trouble. Not sure!

@mathewmeconry

What do you think about a force option as a first step, to just ignore all errors and do what is possible?

@nik9000
Member

nik9000 commented May 19, 2017

A force option would require logging the errors instead of returning them. We could count them but that is it. I don't particularly like that idea.

@mathewmeconry

I agree with you that this isn't a good option. I don't really know how the SDKs interact with the server, but for the REST API, HTTP chunked transfer might be an idea, although you would have to keep the connection open until the reindex is finished. Then you wouldn't need to store the logs on the server and could transfer them directly to the client.

@clintongormley
Contributor

I think 200 is OK. We do what the user asks, i.e. ignore errors, and so complete successfully.

@nik9000
Member

nik9000 commented May 19, 2017

http chunked transfer

Elasticsearch is, sadly, built around request/response, and it would be a huge change to make chunked-transfer-style things work. Relative to counting errors, that is a moonshot.

@ZombieSmurf

A workaround for the problem could be using "ignore_malformed": true for the fields with the bad data.

@mr-mos

mr-mos commented Oct 27, 2017

+1 for fault-tolerant reindexing.
@nik9000 I would suggest the following:

  • Default (as today): bail out when the first error occurs during reindexing (status 400)
  • "allowedFailures": an absolute number of tolerated failures, or a percentage of allowed failures (status stays 200 as long as the failure count is below the configured limit)
    In terms of failure logging: just return the first 10 error messages in the array...
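A sketch of how the proposed "allowedFailures" setting could behave (hypothetical; this is not a real Elasticsearch parameter, and the function name is made up):

```python
def reindex_status(failures, total, allowed_failures):
    """Return an HTTP-style status code for a finished reindex run.

    allowed_failures -- an absolute count (e.g. 10) or a percentage
                        string (e.g. "5%"), as proposed above.
    """
    if isinstance(allowed_failures, str) and allowed_failures.endswith("%"):
        limit = total * float(allowed_failures[:-1]) / 100.0
    else:
        limit = float(allowed_failures)
    return 200 if failures <= limit else 400
```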

@falcorocks

falcorocks commented Nov 9, 2017

Guys, I need this; +1 for @ZombieSmurf... it's such an easy solution.

Temporary fix:
Before reindexing, manually create the destination index like the following:

PUT dest-index
{
  "settings": {
    "index.mapping.ignore_malformed": true 
  }
}

@ludwigm

ludwigm commented Nov 21, 2017

How do you retrieve the results of a reindex when the initial API call timed out while waiting for completion? While the reindex was running I could see its status via GET _tasks?detailed=true&actions=*reindex, but once it finished I could not see anything anymore in the tasks API or in the .tasks index mentioned here. I was only able to tell that something was wrong because the document counts were wrong at the end. (ES 5 used here)
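One caveat, per the earlier advice in this thread: the final result is only persisted to the .tasks index when the reindex was started with ?wait_for_completion=false. If it was, something like the following sketch should find finished reindex results (field names are based on the 5.x .tasks document layout; verify against your cluster):

```
GET .tasks/_search
{
  "query": {
    "match": { "task.action": "indices:data/write/reindex" }
  }
}
```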

@lcawl lcawl added :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. and removed :Reindex API labels Feb 13, 2018
@bleskes
Contributor

bleskes commented Jul 19, 2018

A short update: we intend to reimplement the reindex API using the sequence numbers infrastructure added in 6.0. That infrastructure would allow the reindex job to pause and then continue from where it was. That in turn will allow us to stop on an error, report it to the user, allow it to be fixed, and continue. We can then also consider allowing users to ignore errors and move on. We don't currently have anyone actively working on this refactoring, so it may take a while.

@rahst12

rahst12 commented Nov 13, 2018

A short update: we intend to reimplement the reindex API using the sequence numbers infrastructure added in 6.0. That infrastructure would allow the reindex job to pause and then continue from where it was. That in turn will allow us to stop on an error, report it to the user, allow it to be fixed, and continue. We can then also consider allowing users to ignore errors and move on. We don't currently have anyone actively working on this refactoring, so it may take a while.

@bleskes - Any progress on this? We are on version 6.2.3. When using Curator for a reindex, Curator seems to ignore documents that don't match the new mapping file. For example, if a date is malformed, it will just drop the document from the reindex, with no error message.

@bleskes
Contributor

bleskes commented Nov 14, 2018

@rahst12 I'm afraid my previous statement still holds:

We don't currently have anyone actively working on this refactoring, so it may take a while.

@thePanz

thePanz commented Jan 3, 2019

@bleskes any updates on this? Would it be possible for an experienced ES developer to mentor that implementation?

@bleskes
Contributor

bleskes commented Jan 18, 2019

@thePanz sorry for the late response, I was out and catching up. We're always ready to guide external contributions. I have to warn you though that this will not be a simple one.

@PraneetKhandelwal

A short update: we intend to reimplement the reindex API using the sequence numbers infrastructure added in 6.0. That infrastructure would allow the reindex job to pause and then continue from where it was. That in turn will allow us to stop on an error, report it to the user, allow it to be fixed, and continue. We can then also consider allowing users to ignore errors and move on. We don't currently have anyone actively working on this refactoring, so it may take a while.

@bleskes - Is there any update on this? Or is there any way to see logs while reindexing is in progress and stops at an error? (That would at least be useful to identify the document that caused the error during reindexing.)

@bleskes
Contributor

bleskes commented Mar 7, 2019

@PraneetKhandelwal the cause of the errors should be returned in the failures field of the response / reindexing result - see here.
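An abridged sketch of what such a response can look like (values elided or illustrative; the exact shape depends on the version):

```
{
  "took": 1200,
  "total": 10000,
  "created": 9998,
  "failures": [
    {
      "index": "dest-index",
      "id": "...",
      "cause": {
        "type": "...",
        "reason": "..."
      },
      "status": 400
    }
  ]
}
```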

@henningandersen henningandersen added the :Distributed Indexing/Reindex Issues relating to reindex that are not caused by issues further down label Apr 12, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@henningandersen henningandersen removed the :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. label Apr 12, 2019
@rjernst rjernst added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label May 4, 2020
@Alsheh

Alsheh commented Apr 9, 2021

@bleskes can the reindex API put the failed docs into some DLQ index, together with a failure reason (with dynamic mapping disabled for the DLQ index), and continue indexing the rest of the docs?

@henningandersen
Contributor

@Alsheh that is one of the ideas we have discussed too. We are not actively working on this, but we are open to external contributions.

@mario-paniccia

Any update on this, please? The Reindex API is great, but its value is severely limited by this stop-on-failure behaviour. At least for my use case.

@Mikajel

Mikajel commented May 16, 2022

Pinging this - still an issue in 2022. Seems like a dirty bash script will have to do.

@WoodyWoodsta

So today I wanted to update some mappings on log data that I have accumulated (a small index, a couple of million documents). Even for an index of this small size, reindexing to update mappings is severely painful.

We need to report these errors so that you can take action, but accumulating billions of errors isn't practical. Instead, we bail out on the first error.

Generally speaking, reindexing is often a manual job - it has many failure pathways, and at least in my case it came from a manual update to the mappings. For this reason, there should at least be some opt-in control over the bail behaviour, so that I can run with immediate bail when I want to check that I have configured the index settings correctly, but can disable or loosen bailing and drop failing documents once I'm sure that my mappings/configuration are what I want. Index reconfiguration can include designing for data to be dropped if it doesn't fit the new configuration.

@GutZuFusss

Still a major pain point in 2023.

@Oddly

Oddly commented Dec 20, 2023

Agreed. We had to set up a separate pipeline in Logstash to ingest specific indices and have them go through the pipeline again, just so the failures get pushed to the DLQ instead of aborting the whole reindex.
An option to send the failed documents to another index (a DLQ) would be great in this case.
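The DLQ idea can be approximated client-side. A minimal sketch (hypothetical helper, not part of any Elasticsearch client; `try_index` stands in for whatever indexing call can reject a document, and in practice the dead letters would be written to a DLQ index with dynamic mapping disabled):

```python
def partition_for_dlq(docs, try_index):
    """Route rejected documents to a dead-letter list instead of aborting.

    try_index -- any callable that raises an exception on a bad document.
    """
    indexed, dead_letters = [], []
    for doc in docs:
        try:
            try_index(doc)
            indexed.append(doc)
        except Exception as err:  # real code would narrow the exception type
            dead_letters.append({"doc": doc, "reason": str(err)})
    return indexed, dead_letters
```

With this shape, the bulk of the documents is reindexed and each failure is preserved alongside its cause for later inspection.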
