
Filebeat discarding data when new index cannot be created (error 400 returned by ES) due to max shards allocated #70349

Open
eedugon opened this issue Mar 12, 2021 · 5 comments
Labels
:Data Management/Indices APIs, :Distributed Indexing/CRUD, Team:Data Management, Team:Distributed (Obsolete)

Comments

@eedugon (Contributor) commented Mar 12, 2021

Elasticsearch version (bin/elasticsearch --version): 7.11.x

Elasticsearch returns a 400 error response when indexing requests arrive while there is a temporary problem creating a new index:

(status=400): {"type":"illegal_argument_exception","reason":"Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [3141]/[3000] maximum shards open;"}

This can happen, for example, when new daily indices are created at 00:00:00 (confirmed), and it could potentially also happen with ILM rollover (not checked).
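For context, the cap in the error above is cluster.max_shards_per_node (default 1000) multiplied by the number of data nodes, e.g. a three-node cluster with the default setting gives the [3000] cap seen here. A minimal sketch to check where a cluster stands, assuming a cluster at localhost:9200 and the Python requests library:

```python
import requests

ES = "http://localhost:9200"  # assumed address; add auth/TLS as needed

health = requests.get(f"{ES}/_cluster/health").json()

# The effective cluster-wide cap is cluster.max_shards_per_node
# (default 1000) multiplied by the number of data nodes.
assumed_cap = 1000 * health["number_of_data_nodes"]
print(f'{health["active_shards"]} shards open, ~{assumed_cap} allowed by default')
```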

This makes the indexer (Filebeat in this case) discard all the data for the duration of the problem, because the general integration agreement with the Beats team is (I might be wrong about this; see the sketch after the list):

  • 429 response --> retry
  • any other 4xx response --> do not retry (it's a client error)
  • 5xx response --> retry

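A minimal sketch of that decision logic (illustrative Python; Filebeat's actual implementation is in Go and more involved):

```python
def should_retry(status: int) -> bool:
    """Retry policy as described above: 429 and any 5xx are retried;
    any other 4xx is treated as a permanent client error."""
    return status == 429 or 500 <= status <= 599
```

Under this policy the 400 above lands in the "do not retry" bucket, so the events are dropped even though the condition is transient.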
Since hitting the shard limit should be considered a temporary error to be resolved by the administrator (in this case our user had temporarily lost one data node, and migrating its replicas to the remaining nodes pushed the cluster over the limit), Elasticsearch should probably return an error code that prevents data loss on the client side (Filebeat in this case).
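As an immediate administrator-side workaround, the limit can be raised via the cluster settings API while the lost node is being restored (the value below is only an example; size it for your cluster):

```python
import requests

# Temporarily raise the per-node shard cap so indexing can resume.
resp = requests.put(
    "http://localhost:9200/_cluster/settings",
    json={"persistent": {"cluster.max_shards_per_node": 1500}},
)
resp.raise_for_status()
```

That only buys time, though; the point of this issue is that until the limit is dealt with, the response code causes clients to drop data rather than retry.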

A similar approach was taken when disk watermark levels are exceeded and shards are moved to read_only status: we return a response code that ensures the client keeps retrying until the issue is gone.

@eedugon added the >bug and needs:triage labels on Mar 12, 2021
@dnhatn added the :Distributed Indexing/CRUD label on Mar 15, 2021
@elasticmachine added the Team:Distributed (Obsolete) label on Mar 15, 2021
@elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed (Team:Distributed)

@dnhatn removed the >bug and needs:triage labels on Mar 15, 2021
@henningandersen (Contributor) commented

There are a couple of considerations here:

  1. Basics: is a 429/503 a good idea here?
  2. Should this apply to both bulk and create-index requests?
  3. How should rollovers that fail due to this be handled?

@henningandersen (Contributor) commented

We discussed this as a team; our conclusions and comments are:

  • Yes, we think a retryable error code is the right response to both bulk and create-index requests rather than a 400 (it is not a bad request). It is an overload situation that an admin or automation needs to deal with.
  • For rollover, it would seem appropriate to block indexing into the alias or data stream that failed to roll over due to running out of shards. Blocking the write index is less ideal, since that also prevents direct updates to the write index. The block should result in a retryable error code upon indexing (see the sketch after this list).
    • Notice that autoscaling will likely include scaling based on the number of shards in the future. Autoscaling should scale well in advance of hitting the limits. Even with autoscaling, though, the system can run out if limits are configured.
  • System indices should ideally be exempt from the shard limits.
  • We did not decide whether 429 or 503 is the better choice (neither is ideal).
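To illustrate the intended client behavior if a retryable code is adopted, a hypothetical client-side sketch (Python with requests; the function name and backoff parameters are illustrative only):

```python
import time
import requests

def index_with_retry(url: str, doc: dict, max_attempts: int = 8) -> requests.Response:
    """Back off and retry on 429/503 instead of dropping the event."""
    delay = 1.0
    for _ in range(max_attempts):
        resp = requests.post(url, json=doc)
        if resp.status_code not in (429, 503):
            return resp  # success, or a genuine client error worth surfacing
        time.sleep(delay)
        delay = min(delay * 2, 60.0)  # exponential backoff, capped
    return resp  # give up after max_attempts; the caller decides what to do
```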

We decided we want to involve the @elastic/es-core-features team too to get their input, so adding that label.

@dakrone (Member) commented Apr 8, 2021

Thanks for pinging us, Henning; we discussed this as a team as well, and we have a slight preference for a 503.

@DaveCTurner (Contributor) commented

We (the @elastic/es-distributed team) discussed this again today. The only remaining question was whether to use 429 or 503; we didn't have strong opinions either way, so @dakrone's previously mentioned slight preference for a 503 wins.
