
Filebeat discarding data when new index cannot be created (error 400 returned by ES) due to max shards allocated #70349

Open
eedugon opened this issue Mar 12, 2021 · 5 comments
Labels
:Data Management/Indices APIs, :Distributed Indexing/CRUD, Team:Data Management, Team:Distributed (Obsolete)

Comments

@eedugon (Contributor) commented Mar 12, 2021

Elasticsearch version (bin/elasticsearch --version): 7.11.x

Elasticsearch returns a 400 error response when indexing requests arrive while there is a temporary problem creating a new index:

(status=400): {"type":"illegal_argument_exception","reason":"Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [3141]/[3000] maximum shards open;"}

This can happen, for example, when new daily indices are created at 00:00:00 (confirmed), and it could potentially also happen with ILM rollover (not checked).
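For context, the cap in the error above is cluster.max_shards_per_node (default 1000) multiplied by the number of data nodes, e.g. a three-node cluster with the default setting gives the [3000] cap seen here. A minimal sketch to check where a cluster stands, assuming a cluster at localhost:9200 and the Python requests library:

```python
import requests

ES = "http://localhost:9200"  # assumed address; add auth/TLS as needed

health = requests.get(f"{ES}/_cluster/health").json()

# The effective cluster-wide cap is cluster.max_shards_per_node
# (default 1000) multiplied by the number of data nodes.
assumed_cap = 1000 * health["number_of_data_nodes"]
print(f'{health["active_shards"]} shards open, ~{assumed_cap} allowed by default')
```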

This makes the indexer (Filebeat in this case) discard all the data for the duration of the problem, because the general integration agreement with the Beats team is (I might be wrong about this; see the sketch after the list):

  • 429 response --> retry
  • any other 4xx response --> do not retry (it's a client error)
  • 5xx response --> retry

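A minimal sketch of that decision logic (illustrative Python; Filebeat's actual implementation is in Go and more involved):

```python
def should_retry(status: int) -> bool:
    """Retry policy as described above: 429 and any 5xx are retried;
    any other 4xx is treated as a permanent client error."""
    return status == 429 or 500 <= status <= 599
```

Under this policy the 400 above lands in the "do not retry" bucket, so the events are dropped even though the condition is transient.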
Since hitting the shard limit should be considered a temporary error to be resolved by the administrator (in this case our user had temporarily lost one data node, and migrating its replicas to the remaining nodes pushed the cluster over the limit), Elasticsearch should probably return an error code that prevents data loss on the client side (Filebeat in this case).
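As an immediate administrator-side workaround, the limit can be raised via the cluster settings API while the lost node is being restored (the value below is only an example; size it for your cluster):

```python
import requests

# Temporarily raise the per-node shard cap so indexing can resume.
resp = requests.put(
    "http://localhost:9200/_cluster/settings",
    json={"persistent": {"cluster.max_shards_per_node": 1500}},
)
resp.raise_for_status()
```

That only buys time, though; the point of this issue is that until the limit is dealt with, the response code causes clients to drop data rather than retry.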

A similar approach was taken when disk watermark levels are exceeded and shards are moved to read_only status: we return a response code that ensures the client keeps retrying until the issue is gone.

@eedugon added the >bug and needs:triage labels on Mar 12, 2021
@dnhatn added the :Distributed Indexing/CRUD label on Mar 15, 2021
@elasticmachine added the Team:Distributed (Obsolete) label on Mar 15, 2021
@elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed (Team:Distributed)

@dnhatn removed the >bug and needs:triage labels on Mar 15, 2021
@henningandersen (Contributor) commented

There are a couple of considerations here:

  1. Basics: is a 429/503 a good idea here?
  2. Should this apply to both bulk and create-index requests?
  3. How should rollovers that fail due to this be handled?

@henningandersen (Contributor) commented

We discussed this as a team; our conclusions and comments are:

  • Yes, we think a retryable error code is the right response to both bulk and create-index requests rather than a 400 (it is not a bad request). It is an overload situation that an admin or automation needs to deal with.
  • For rollover, it would seem appropriate to block indexing into the alias or data stream that failed to roll over due to running out of shards. Blocking the write index is less ideal, since that also prevents direct updates to the write index. The block should result in a retryable error code upon indexing (see the sketch after this list).
    • Notice that autoscaling will likely include scaling based on the number of shards in the future. Autoscaling should scale well in advance of hitting the limits. Even with autoscaling, though, the system can run out if limits are configured.
  • System indices should ideally be exempt from the shard limits.
  • We did not decide whether 429 or 503 is the better choice (neither is ideal).
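To illustrate the intended client behavior if a retryable code is adopted, a hypothetical client-side sketch (Python with requests; the function name and backoff parameters are illustrative only):

```python
import time
import requests

def index_with_retry(url: str, doc: dict, max_attempts: int = 8) -> requests.Response:
    """Back off and retry on 429/503 instead of dropping the event."""
    delay = 1.0
    for _ in range(max_attempts):
        resp = requests.post(url, json=doc)
        if resp.status_code not in (429, 503):
            return resp  # success, or a genuine client error worth surfacing
        time.sleep(delay)
        delay = min(delay * 2, 60.0)  # exponential backoff, capped
    return resp  # give up after max_attempts; the caller decides what to do
```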

We decided we want to involve the @elastic/es-core-features team too to get their input, so adding that label.

@dakrone (Member) commented Apr 8, 2021

Thanks for pinging us, Henning; we discussed this as a team as well, and we have a slight preference for a 503.

@DaveCTurner (Contributor) commented

We (the @elastic/es-distributed team) discussed this again today. The only remaining question was whether to use 429 or 503; we didn't have strong opinions either way, so @dakrone's previously mentioned slight preference for a 503 wins.
