
Reindex API : improve robustness in case of error #22471

Open
cnico opened this issue Jan 6, 2017 · 31 comments
Labels
:Distributed Indexing/Reindex Issues relating to reindex that are not caused by issues further down >enhancement stalled Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

@cnico

cnico commented Jan 6, 2017

Hello,

I did a migration from elasticsearch 2.3.4 to 5.1.1 by following the migration guide.
The migration went perfectly well, and I updated my mapping to use the new keyword type instead of the old not_analyzed string type.

So I wanted to reindex all my indices and encountered the following two problems:

  • some of my documents had IDs longer than 512 characters, and Elasticsearch 5.1.1 rejects these during reindexing.
  • some of my documents had fields whose name is an empty string: Elasticsearch 5.1.1 refuses to reindex such fields.

So my reindex task suddenly stops, leaving lots of my documents not reindexed because of the error on one document.

The improvement I recommend is to make the reindex processing more robust, so that in case of failure on some documents, it continues normally with all the other documents present in the index.

Linked to this behavior, it would be great to add to the reindex API the possibility of getting the result message of the reindex task once it is finished (including the number of successes and failures). Indeed, when run through the Kibana dev console, the JSON response is not displayed because the client times out on large indices.

In my case, I was not able to correct the data in my old indices, so in the end I decided not to reindex them...

Regards,

@clintongormley
Contributor

Hi @cnico

Sorry to hear about your troubles with reindexing.

The improvement I recommend is to make the reindex processing more robust, so that in case of failure on some documents, it continues normally with all the other documents present in the index.

This is problematic because reindexing might target billions of documents, all of which might have errors. We need to report these errors so that you can take action, but accumulating billions of errors isn't practical. Instead, we bail out on the first error.

Linked to this behavior, it would be great to add to the reindex API the possibility of getting the result message of the reindex task once it is finished (including the number of successes and failures). Indeed, when run through the Kibana dev console, the JSON response is not displayed because the client times out on large indices.

This can be done today by running the reindex job with ?wait_for_completion=false. You get back a task ID which can be passed to the task API to get the job status. The final status is stored in the .tasks index and will remain there until you delete it.
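Sketched concretely, in the style of the console snippet later in this thread (the index names and the `<node_id>:<task_id>` placeholder are illustrative; the exact response shape varies by version):

```
POST _reindex?wait_for_completion=false
{
  "source": { "index": "source-index" },
  "dest":   { "index": "dest-index" }
}

# returns { "task": "<node_id>:<task_id>" }; poll it with:
GET _tasks/<node_id>:<task_id>
```

Once the task completes, its stored result can still be retrieved from the .tasks index until you delete it.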

@cnico
Author

cnico commented Jan 12, 2017

Hi @clintongormley,

For the first point, I disagree with you, because I think it is useless to have a system that, if an error happens, simply leaves the task partly completed and partly uncompleted, i.e. in an unknown state.

Even if an index contains billions of documents, the reindex could offer several error-handling strategies, chosen by the user, who knows their data and what they prefer to happen in case of error.
The strategies could be:

  • simply stop at the first error, as today, and report which document caused the failure.
  • ignore all errors unconditionally
  • ignore errors but record the IDs of the documents that caused failures, for example in a dedicated index
  • compute the error rate (per minute, per index, per server, or per shard, to be determined) and stop if it exceeds a given threshold.

I hope you will reconsider your point of view in order to improve robustness.
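A minimal client-side sketch of these strategies (hypothetical: none of these options exist in the reindex API, and all names here are made up for illustration):

```python
def should_abort(failures, processed, strategy, max_rate=0.01):
    """Decide whether a reindex-style loop should stop.

    failures  -- number of failed documents so far
    processed -- total documents attempted so far
    strategy  -- one of "stop", "ignore", "rate"
    max_rate  -- tolerated failure ratio for the "rate" strategy
    """
    if strategy == "stop":        # current behaviour: bail on the first error
        return failures > 0
    if strategy == "ignore":      # skip bad documents unconditionally
        return False
    if strategy == "rate":        # abort only above a failure-rate threshold
        return processed > 0 and failures / processed > max_rate
    raise ValueError(f"unknown strategy: {strategy}")
```

The "log failing IDs to a dedicated index" variant would combine the "ignore" branch with a side channel that records each failure.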

@mathewmeconry

Hi @clintongormley

I agree with @cnico. It would be great to have a parameter like conflicts that allows me to ignore errors. My use case is that I want to reindex one bucket, but the API fails every time with the error "Can't get text on a START_OBJECT at 1:251".

@Cidan

Cidan commented May 1, 2017

I agree with @cnico as well here -- reindex is really cool, but a huge pain to use if just a single error happens. When reindexing billions of documents, even a single error causes you to start all over again (assuming the error is transient). It's a huge pain, and requires us to use external ETL to re-index.

@shimonste

Hi @clintongormley,

I also agree with @cnico. The reindex API is really useful and works great until there is one error; we need to index billions of documents, and it fails every time because of several bad documents.
I don't see why reindex doesn't simply log the error like any other index request.
I also hope that you reconsider your point of view.

@clintongormley
Contributor

we need to index billions of documents, and it fails every time because of several bad documents.
I don't see why reindex doesn't simply log the error like any other index request.

What if all of your documents have errors? Where would we log billions of errors?

The only thing we could do is to count errors, and to abort after a certain number of errors have occurred. @nik9000, what do you think?

@nik9000
Member

nik9000 commented May 19, 2017

The only thing we could do is to count errors, and to abort after a certain number of errors have occurred. @nik9000, what do you think?

This was actually in the first version of reindex that I wrote. We decided the complexity wasn't worth it at the time, I believe. But we can still do it if we want.

If we did this, what HTTP code would we return if we have errors, but not enough of them to abort? Traditionally we've returned 200 in those cases. There is a 207 Multi-Status HTTP response, but I think it is pretty tangled up with WebDAV, so maybe it is trouble. Not sure!

@mathewmeconry

What do you think about a force option as a first step, to just ignore all errors and do what is possible?

@nik9000
Member

nik9000 commented May 19, 2017

A force option would require logging the errors instead of returning them. We could count them but that is it. I don't particularly like that idea.

@mathewmeconry

I agree with you that this isn't a good option. I don't really know how the SDKs interact with the server, but for the REST API, HTTP chunked transfer might be an idea, although you would have to keep the connection open until the reindex is finished. Then you wouldn't need to store the logs on the server and could transfer them directly to the client.

@clintongormley
Contributor

I think 200 is OK. We do what the user asks, i.e. ignore errors, and so complete successfully.

@nik9000
Member

nik9000 commented May 19, 2017

http chunked transfer

Elasticsearch is, sadly, built around request/response, and it would be a huge change to make chunked-transfer-style things work. Relative to counting errors, that is a moonshot.

@ZombieSmurf

A workaround for the problem could be using "ignore_malformed": true for the fields with the bad data.

@mr-mos

mr-mos commented Oct 27, 2017

+1 for fault-tolerant reindexing.
@nik9000 I would suggest the following:

  • Default (as today): bail out when the first error occurs during reindexing (status 400)
  • "allowedFailures": an absolute number of tolerated failures, or a percentage of allowed failures (status stays 200 as long as the failure count is below the configured limit)
    In terms of failure logging: just return the first 10 error messages in the array...
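A sketch of how the proposed "allowedFailures" setting could behave (hypothetical; this is not a real Elasticsearch parameter, and the function name is made up):

```python
def reindex_status(failures, total, allowed_failures):
    """Return an HTTP-style status code for a finished reindex run.

    allowed_failures -- an absolute count (e.g. 10) or a percentage
                        string (e.g. "5%"), as proposed above.
    """
    if isinstance(allowed_failures, str) and allowed_failures.endswith("%"):
        limit = total * float(allowed_failures[:-1]) / 100.0
    else:
        limit = float(allowed_failures)
    return 200 if failures <= limit else 400
```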

@falcorocks

falcorocks commented Nov 9, 2017

Guys, I need this; +1 for @ZombieSmurf... it's such an easy solution.

Temporary fix:
Before reindexing, manually create the destination index like the following:

PUT dest-index
{
  "settings": {
    "index.mapping.ignore_malformed": true 
  }
}

@ludwigm

ludwigm commented Nov 21, 2017

How do you retrieve the results of a reindex when the initial API call timed out while waiting for completion? While the reindex was running I could see its status via GET _tasks?detailed=true&actions=*reindex, but once it finished I could not see anything anymore in the tasks API or in the .tasks index mentioned here. I was only able to tell that something was wrong because the document counts were wrong at the end. (ES 5 used here)
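One caveat, per the earlier advice in this thread: the final result is only persisted to the .tasks index when the reindex was started with ?wait_for_completion=false. If it was, something like the following sketch should find finished reindex results (field names are based on the 5.x .tasks document layout; verify against your cluster):

```
GET .tasks/_search
{
  "query": {
    "match": { "task.action": "indices:data/write/reindex" }
  }
}
```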

@lcawl lcawl added :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. and removed :Reindex API labels Feb 13, 2018
@bleskes
Contributor

bleskes commented Jul 19, 2018

A short update: we intend to reimplement the reindex API using the sequence numbers infrastructure added in 6.0. That infrastructure would allow the reindex job to pause and then continue from where it was. That in turn will allow us to stop on an error, report it to the user, allow it to be fixed, and continue. We can then also consider allowing users to ignore errors and move on. We don't currently have anyone actively working on this refactoring, so it may take a while.

@rahst12

rahst12 commented Nov 13, 2018

A short update: we intend to reimplement the reindex API using the sequence numbers infrastructure added in 6.0. That infrastructure would allow the reindex job to pause and then continue from where it was. That in turn will allow us to stop on an error, report it to the user, allow it to be fixed, and continue. We can then also consider allowing users to ignore errors and move on. We don't currently have anyone actively working on this refactoring, so it may take a while.

@bleskes - Any progress on this? We are on version 6.2.3. When using Curator for a reindex, Curator seems to ignore documents that don't match the new mapping file. For example, if a date is malformed, it will just drop the document from the reindex, with no error message.

@bleskes
Contributor

bleskes commented Nov 14, 2018

@rahst12 I'm afraid my previous statement still holds:

We don't currently have anyone actively working on this refactoring, so it may take a while.

@thePanz

thePanz commented Jan 3, 2019

@bleskes any updates on this? Would it be possible for an experienced ES developer to mentor that implementation?

@bleskes
Contributor

bleskes commented Jan 18, 2019

@thePanz sorry for the late response, I was out and catching up. We're always ready to guide external contributions. I have to warn you though that this will not be a simple one.

@PraneetKhandelwal

A short update: we intend to reimplement the reindex API using the sequence numbers infrastructure added in 6.0. That infrastructure would allow the reindex job to pause and then continue from where it was. That in turn will allow us to stop on an error, report it to the user, allow it to be fixed, and continue. We can then also consider allowing users to ignore errors and move on. We don't currently have anyone actively working on this refactoring, so it may take a while.

@bleskes - Is there any update on this? Or is there any way to see logs while reindexing is in progress and stops at an error? (That would at least be useful to identify the document that caused the error during reindexing.)

@bleskes
Contributor

bleskes commented Mar 7, 2019

@PraneetKhandelwal the cause of the errors should be returned in the failures field of the response / reindexing result - see here.
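An abridged sketch of what such a response can look like (values elided or illustrative; the exact shape depends on the version):

```
{
  "took": 1200,
  "total": 10000,
  "created": 9998,
  "failures": [
    {
      "index": "dest-index",
      "id": "...",
      "cause": {
        "type": "...",
        "reason": "..."
      },
      "status": 400
    }
  ]
}
```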

@henningandersen henningandersen added the :Distributed Indexing/Reindex Issues relating to reindex that are not caused by issues further down label Apr 12, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@henningandersen henningandersen removed the :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. label Apr 12, 2019
@rjernst rjernst added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label May 4, 2020
@Alsheh

Alsheh commented Apr 9, 2021

@bleskes can the reindex API put the failed docs into some DLQ index, together with a failure reason (with dynamic mapping disabled for the DLQ index), and continue indexing the rest of the docs?

@henningandersen
Contributor

@Alsheh that is one of the ideas we have discussed too. We are not actively working on this, but we are open to external contributions.

@mario-paniccia

Any update on this, please? The Reindex API is great, but its value is severely limited by this stop-on-failure behaviour. At least for my use case.

@Mikajel

Mikajel commented May 16, 2022

Pinging this - still an issue in 2022. Seems like a dirty bash script will have to do.

@WoodyWoodsta

So today I wanted to update some mappings on log data that I have accumulated (a small index, a couple of million documents). Even for an index of this small size, reindexing to update mappings is severely painful.

We need to report these errors so that you can take action, but accumulating billions of errors isn't practical. Instead, we bail out on the first error.

Generally speaking, reindexing is often a manual job - it has many failure pathways, and at least in my case it came from a manual update to the mappings. For this reason, there should at least be some opt-in control over the bail behaviour, so that I can run with immediate bail when I want to check that I have configured the index settings correctly, but can disable or loosen bailing and drop failing documents once I'm sure that my mappings/configuration are what I want. Index reconfiguration can include designing for data to be dropped if it doesn't fit the new configuration.

@GutZuFusss

Still a major pain point in 2023.

@Oddly

Oddly commented Dec 20, 2023

Agreed. We had to set up a separate pipeline in Logstash to ingest specific indices and have them go through the pipeline again, just so the failures get pushed to the DLQ instead of aborting the whole reindex.
An option to send the failed documents to another index (a DLQ) would be great in this case.
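The DLQ idea can be approximated client-side. A minimal sketch (hypothetical helper, not part of any Elasticsearch client; `try_index` stands in for whatever indexing call can reject a document, and in practice the dead letters would be written to a DLQ index with dynamic mapping disabled):

```python
def partition_for_dlq(docs, try_index):
    """Route rejected documents to a dead-letter list instead of aborting.

    try_index -- any callable that raises an exception on a bad document.
    """
    indexed, dead_letters = [], []
    for doc in docs:
        try:
            try_index(doc)
            indexed.append(doc)
        except Exception as err:  # real code would narrow the exception type
            dead_letters.append({"doc": doc, "reason": str(err)})
    return indexed, dead_letters
```

With this shape, the bulk of the documents is reindexed and each failure is preserved alongside its cause for later inspection.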
