
Fix Large Shard Count Scalability Issues #77466

Open
73 of 97 tasks
original-brownbear opened this issue Sep 9, 2021 · 3 comments
Labels
>bug · Meta · release highlight · :Data Management/ILM+SLM (Index and Snapshot lifecycle management) · :Distributed Indexing/Distributed (catch-all for the Distributed Area) · :Security/Authorization (Roles, Privileges, DLS/FLS, RBAC/ABAC) · Team:Data Management · Team:Distributed (Obsolete; replaced by Distributed Indexing/Coordination) · Team:Security

Comments


original-brownbear commented Sep 9, 2021

This meta issue tracks known issues with scaling clusters to large numbers of shards.

@original-brownbear original-brownbear added >bug Meta :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC :Data Management/Other labels Sep 9, 2021
@elasticmachine elasticmachine added Team:Security Meta label for security team Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. Team:Data Management Meta label for data/management team labels Sep 9, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticmachine
Collaborator

Pinging @elastic/es-security (Team:Security)

@elasticmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Sep 16, 2021
This change improves the parsing of LifecycleExecutionState from IndexMetadata custom data
by avoiding a containsKey(...) call and, when there is no custom data, returning a blank
LifecycleExecutionState instance.

Relates to elastic#77466
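The shape of that optimization can be sketched as follows. All class and field names here are illustrative stand-ins, not the actual Elasticsearch code: a single `get(...)` replaces the `containsKey(...)`-then-`get(...)` pair, and a shared blank instance is returned when there is no custom data at all.

```java
import java.util.Map;

// Illustrative sketch of the "one lookup, shared empty instance" pattern.
final class LifecycleState {
    static final LifecycleState EMPTY = new LifecycleState("");

    final String phase;

    private LifecycleState(String phase) {
        this.phase = phase;
    }

    // Before: customData.containsKey(KEY) followed by customData.get(KEY)
    // performed two hash lookups per call. After: one get(), and the shared
    // EMPTY instance when there is no custom data.
    static LifecycleState fromCustomData(Map<String, String> customData) {
        if (customData == null || customData.isEmpty()) {
            return EMPTY; // no allocation, no parsing
        }
        String phase = customData.get("phase");
        return phase == null ? EMPTY : new LifecycleState(phase);
    }
}
```

Returning the shared `EMPTY` avoids allocating a fresh object for every index without lifecycle state, which adds up when iterating tens of thousands of indices.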
martijnvg added a commit that referenced this issue Sep 16, 2021
This change improves the parsing of LifecycleExecutionState from IndexMetadata custom data
by avoiding a containsKey(...) call and, when there is no custom data, returning a blank
LifecycleExecutionState instance.

Relates to #77466
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Sep 16, 2021
This change improves the parsing of LifecycleExecutionState from IndexMetadata custom data
by avoiding a containsKey(...) call and, when there is no custom data, returning a blank
LifecycleExecutionState instance.

Relates to elastic#77466
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Sep 16, 2021
…hen running a policy.

Sometimes the parsing done by the `getCurrentStep()` method is unnecessary, because the
method calling `getCurrentStep()` has already parsed a `LifecycleExecutionState`
instance and can just provide that.

Relates to elastic#77466
elasticsearchmachine pushed a commit that referenced this issue Sep 16, 2021
This change improves the parsing of LifecycleExecutionState from IndexMetadata custom data
by avoiding a containsKey(...) call and, when there is no custom data, returning a blank
LifecycleExecutionState instance.

Relates to #77466
martijnvg added a commit that referenced this issue Sep 16, 2021
…hen running a policy. (#77863)

Sometimes the parsing done by the `getCurrentStep()` method is unnecessary, because the
method calling `getCurrentStep()` has already parsed a `LifecycleExecutionState`
instance and can just provide that.

Relates to #77466
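The "reuse the already-parsed state" idea can be sketched like this; all names below are hypothetical, not the real ILM API:

```java
import java.util.Map;

// Sketch of avoiding redundant parsing: callers that already hold a parsed
// state pass it down instead of letting the callee re-parse the raw map.
final class StepResolver {
    record ExecutionState(String stepName) {
        static ExecutionState parse(Map<String, String> raw) {
            return new ExecutionState(raw.getOrDefault("step", "init"));
        }
    }

    // Before: every call re-parsed the raw custom data.
    static String currentStep(Map<String, String> rawCustomData) {
        return currentStep(ExecutionState.parse(rawCustomData));
    }

    // After: an overload accepts the already-parsed state, so a caller that
    // parsed it once can reuse it across many calls.
    static String currentStep(ExecutionState state) {
        return state.stepName();
    }
}
```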
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Sep 16, 2021
…hen running a policy. (elastic#77863)

Sometimes the parsing done by the `getCurrentStep()` method is unnecessary, because the
method calling `getCurrentStep()` has already parsed a `LifecycleExecutionState`
instance and can just provide that.

Relates to elastic#77466
elasticsearchmachine pushed a commit that referenced this issue Sep 16, 2021
…hen running a policy. (#77863) (#77882)

Sometimes the parsing done by the `getCurrentStep()` method is unnecessary, because the
method calling `getCurrentStep()` has already parsed a `LifecycleExecutionState`
instance and can just provide that.

Relates to #77466
original-brownbear added a commit that referenced this issue Sep 29, 2021
Prevent duplicate ILM tasks from being enqueued to fix the most immediate issues around #78246. The ILM logic should be further improved though. I did not include `MoveToErrorStepUpdateTask` in this change yet as I wasn't entirely sure how valid/safe hashing/comparing arbitrary `Exception`s would be. That could be looked into in a follow-up as well.

Relates #77466 

Closes #78246
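The de-duplication idea can be sketched as follows, assuming tasks implement value-based equals/hashCode — which is exactly why tasks wrapping arbitrary `Exception`s were left out. Names are illustrative, not the actual ILM task classes:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: a task is only enqueued if an equal task is not already pending.
final class DedupTaskQueue {
    record StepTask(String index, String step) {}

    private final Set<StepTask> pending = ConcurrentHashMap.newKeySet();
    private final Queue<StepTask> queue = new ArrayDeque<>();

    /** @return true if enqueued, false if an equal task was already pending */
    synchronized boolean submit(StepTask task) {
        if (pending.add(task) == false) {
            return false; // identical task already waiting; skip it
        }
        queue.add(task);
        return true;
    }

    synchronized StepTask poll() {
        StepTask task = queue.poll();
        if (task != null) {
            pending.remove(task); // allow re-submission once it has run
        }
        return task;
    }
}
```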
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Sep 29, 2021
…ic#78390)

Prevent duplicate ILM tasks from being enqueued to fix the most immediate issues around elastic#78246. The ILM logic should be further improved though. I did not include `MoveToErrorStepUpdateTask` in this change yet as I wasn't entirely sure how valid/safe hashing/comparing arbitrary `Exception`s would be. That could be looked into in a follow-up as well.

Relates elastic#77466 

Closes elastic#78246
original-brownbear added a commit that referenced this issue Oct 5, 2022
…outing (#90556)

Saw these strings use up tens of MB on a large cluster's master node during snapshotting
and responding to indices stats requests.
Also, this speeds up node id comparisons in the snapshot shards service and snapshots
allocation decider.

relates #77466
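A minimal sketch of the string-deduplication idea (illustrative only, not the actual Elasticsearch implementation): once every copy of a node id maps to one canonical object, the heap cost of thousands of routing entries shrinks and the `a == b` fast path in string comparison succeeds.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of deduplicating repeated strings such as node ids in routing tables.
final class StringDeduplicator {
    private final Map<String, String> canonical = new ConcurrentHashMap<>();

    // Returns the canonical instance for s, registering it on first sight.
    String dedup(String s) {
        String existing = canonical.putIfAbsent(s, s);
        return existing == null ? s : existing;
    }
}
```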
original-brownbear added a commit that referenced this issue Oct 5, 2022
This can take O(10s) for tens of thousands of shards, so we have to fork it.
relates #77466
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 5, 2022
…ic#90651)

This can take O(10s) for tens of thousands of shards, so we have to fork it.
relates elastic#77466
elasticsearchmachine pushed a commit that referenced this issue Oct 5, 2022
… (#90664)

This can take O(10s) for tens of thousands of shards, so we have to fork it.
relates #77466
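The general shape of "fork expensive work off the transport thread" can be sketched as follows; the pool, names, and per-shard work are illustrative stand-ins:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: instead of computing an expensive per-shard result inline on the
// calling (transport) thread, the work is forked to a dedicated pool and the
// caller is notified asynchronously.
final class ForkedComputation {
    private final ExecutorService management = Executors.newFixedThreadPool(1);

    CompletableFuture<Long> computeShardStats(int shardCount) {
        // Before: this loop ran inline, stalling the transport thread for
        // O(10s) at tens of thousands of shards.
        return CompletableFuture.supplyAsync(() -> {
            long total = 0;
            for (int i = 0; i < shardCount; i++) {
                total += i; // stand-in for per-shard work
            }
            return total;
        }, management);
    }

    void shutdown() {
        management.shutdown();
    }
}
```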
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jan 6, 2023
BroadcastReplicationAction derivatives (`POST /<indices>/_refresh` and
`POST /<indices>/_flush`) are pretty inefficient when targeting high
shard counts due to how `TransportBroadcastReplicationAction` works:

- It computes the list of all target shards up-front on the calling
  (transport) thread.

- It eagerly sends one request for every target shard in a tight loop on
  the calling (transport) thread.

- It accumulates responses in a `CopyOnWriteArrayList` which takes
  quadratic work to populate, even though nothing reads this list until
  it's fully populated.

- It then mostly discards the accumulated responses, keeping only the
  total number of shards, the number of successful shards, and a list of
  any failures.

- Each failure is wrapped up in a `ReplicationResponse.ShardInfo.Failure`
  but then unwrapped at the end to be re-wrapped in a
  `DefaultShardOperationFailedException`.

This commit fixes all this:

- It avoids allocating a list of all target shards, instead iterating
  over the target indices and generating shard IDs on the fly.

- The computation of the list of shards, and the sending of the
  per-shard requests, now happens on the relevant threadpool (`REFRESH`
  or `FLUSH`) rather than a transport thread.

- The per-shard requests are now throttled, with a meaningful yet fairly
  generous concurrency limit of `#(data nodes) * 10`.

- Rather than accumulating the full responses for later processing we
  track the counts and failures directly.

- The failures are tracked in a regular `ArrayList`, avoiding the
  accidentally-quadratic complexity.

- The failures are tracked in their final form, skipping the
  unwrap-and-rewrap step at the end.

Relates elastic#77466
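The "accidentally quadratic" accumulator point can be made concrete with a small sketch: each `CopyOnWriteArrayList.add()` copies the whole backing array, so n appends copy O(n²) elements in total, whereas `ArrayList` appends in amortized constant time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Why CopyOnWriteArrayList is the wrong accumulator for n sequential adds.
final class AccumulatorCost {
    // add() number k copies the k-1 existing elements before appending.
    static long elementsCopied(int n) {
        long copied = 0;
        for (int k = 1; k <= n; k++) {
            copied += k - 1;
        }
        return copied; // n*(n-1)/2, i.e. quadratic in n
    }

    public static void main(String[] args) {
        List<Integer> slow = new CopyOnWriteArrayList<>();
        List<Integer> fast = new ArrayList<>();
        for (int i = 0; i < 1_000; i++) {
            slow.add(i); // copies the whole array every time
            fast.add(i); // amortized O(1)
        }
        System.out.println(elementsCopied(1_000)); // prints 499500
    }
}
```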
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jan 13, 2023
BroadcastReplicationAction derivatives (`POST /<indices>/_refresh` and
`POST /<indices>/_flush`) are pretty inefficient when targeting high
shard counts due to how `TransportBroadcastReplicationAction` works:

- It computes the list of all target shards up-front on the calling
  (transport) thread.

- It accumulates responses in a `CopyOnWriteArrayList` which takes
  quadratic work to populate, even though nothing reads this list until
  it's fully populated.

- It then mostly discards the accumulated responses, keeping only the
  total number of shards, the number of successful shards, and a list of
  any failures.

- Each failure is wrapped up in a
  `ReplicationResponse.ShardInfo.Failure` but then unwrapped at the end
  to be re-wrapped in a `DefaultShardOperationFailedException`.

This commit fixes all this:

- The computation of the list of shards, and the sending of the
  per-shard requests, now happens on the relevant threadpool (`REFRESH`
  or `FLUSH`) rather than a transport thread.

- The failures are tracked in a regular `ArrayList`, avoiding the
  accidentally-quadratic complexity.

- Rather than accumulating the full responses for later processing we
  track the counts and failures directly.

- The failures are tracked in their final form, skipping the
  unwrap-and-rewrap step at the end.

Relates elastic#77466
Relates elastic#92729
elasticsearchmachine pushed a commit that referenced this issue Jan 13, 2023
BroadcastReplicationAction derivatives (`POST /<indices>/_refresh` and
`POST /<indices>/_flush`) are pretty inefficient when targeting high
shard counts due to how `TransportBroadcastReplicationAction` works:

- It computes the list of all target shards up-front on the calling (transport) thread.

- It accumulates responses in a `CopyOnWriteArrayList` which takes quadratic work to populate, even though nothing reads this list until it's fully populated.

- It then mostly discards the accumulated responses, keeping only the total number of shards, the number of successful shards, and a list of any failures.

- Each failure is wrapped up in a `ReplicationResponse.ShardInfo.Failure` but then unwrapped at the end to be re-wrapped in a `DefaultShardOperationFailedException`.

This commit fixes all this:

- The computation of the list of shards, and the sending of the per-shard requests, now happens on the relevant threadpool (`REFRESH` or `FLUSH`) rather than a transport thread.

- The failures are tracked in a regular `ArrayList`, avoiding the accidentally-quadratic complexity.

- Rather than accumulating the full responses for later processing we track the counts and failures directly.

- The failures are tracked in their final form, skipping the unwrap-and-rewrap step at the end.

Relates #77466 Relates #92729
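The throttling described above can be sketched with a semaphore. This is illustrative only: the real `TransportBroadcastReplicationAction` uses Elasticsearch's asynchronous action infrastructure rather than a blocking acquire, but the `#(data nodes) * 10` limit is the one from the commit message.

```java
import java.util.concurrent.Semaphore;

// Sketch of capping in-flight per-shard requests at dataNodes * 10.
final class ThrottledBroadcaster {
    private final Semaphore permits;

    ThrottledBroadcaster(int dataNodes) {
        // "meaningful yet fairly generous" concurrency limit
        this.permits = new Semaphore(dataNodes * 10);
    }

    int availablePermits() {
        return permits.availablePermits();
    }

    void sendShardRequest(Runnable request) {
        try {
            permits.acquire(); // blocks once dataNodes * 10 are in flight
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        try {
            request.run(); // stand-in for the async per-shard replication call
        } finally {
            permits.release();
        }
    }
}
```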
@dakrone dakrone added :Data Management/ILM+SLM Index and Snapshot lifecycle management and removed :Data Management/Other labels Nov 16, 2023