
Increase remote recovery thread pool size #10750

Conversation

andrross
Member

The remote recovery thread pool performs blocking I/O when downloading files, so the previous sizing of "half the processor count, capped at 10" was definitely too small. This can be demonstrated by triggering recoveries on a node that is also performing segment replication: replication lag increases due to contention on that thread pool. Some amount of contention is inevitable, but the change here, which increases the size of the download thread pool and also limits any single recovery/replication to 25% of that pool's threads, does help.
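The sizing change and the 25% cap described above can be sketched roughly as follows. This is a hypothetical illustration, not OpenSearch's actual classes or the exact formula in the change; the class and method names here are invented, and the new pool size shown (twice the processor count) is only an example of "larger than before":

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; names and formulas are illustrative only.
public class RemoteRecoveryPoolSketch {

    // Old sizing: half the processor count, capped at 10.
    static int oldPoolSize(int processors) {
        return Math.min(10, Math.max(1, processors / 2));
    }

    // Larger sizing for a pool that does blocking I/O, e.g. twice the
    // processor count (an assumption, not the PR's exact formula).
    static int downloadPoolSize(int processors) {
        return Math.max(1, processors * 2);
    }

    // Cap any single recovery/replication at 25% of the pool's threads
    // via a semaphore, so one operation cannot starve the others.
    static Semaphore perOperationPermits(int poolSize) {
        return new Semaphore(Math.max(1, poolSize / 4));
    }

    public static void main(String[] args) throws InterruptedException {
        int poolSize = downloadPoolSize(Runtime.getRuntime().availableProcessors());
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        Semaphore permits = perOperationPermits(poolSize);

        // One simulated recovery submitting several blocking downloads:
        // acquire() blocks once this recovery holds 25% of the threads.
        for (int i = 0; i < 8; i++) {
            permits.acquire();
            pool.submit(() -> {
                try {
                    TimeUnit.MILLISECONDS.sleep(10); // stand-in for blocking download I/O
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    permits.release();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

The semaphore bounds concurrency per operation rather than per pool, which is why contention falls without needing to partition the pool itself.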

Long term, we can improve this even further by moving to fully asynchronous I/O, so that application threads are not blocked draining InputStreams.
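For a sense of what "fully async I/O" means here, the following minimal sketch uses the JDK's `AsynchronousFileChannel` to complete a future from a completion handler instead of parking a pool thread on a blocking read. It is an illustration of the general technique only, not the PR's plan or OpenSearch's repository APIs:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch: the read completes a future via a callback, so no
// application thread sits blocked waiting on the I/O.
public class AsyncReadSketch {

    static CompletableFuture<Integer> readAsync(Path path) throws IOException {
        AsynchronousFileChannel ch = AsynchronousFileChannel.open(path, StandardOpenOption.READ);
        ByteBuffer buf = ByteBuffer.allocate(8192);
        CompletableFuture<Integer> done = new CompletableFuture<>();
        ch.read(buf, 0, null, new CompletionHandler<Integer, Void>() {
            @Override public void completed(Integer bytesRead, Void attachment) {
                try { ch.close(); } catch (IOException ignored) { }
                done.complete(bytesRead);
            }
            @Override public void failed(Throwable t, Void attachment) {
                try { ch.close(); } catch (IOException ignored) { }
                done.completeExceptionally(t);
            }
        });
        return done;
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("segment", ".bin");
        Files.write(tmp, new byte[] { 1, 2, 3, 4 });
        int n = readAsync(tmp).join();
        System.out.println("read " + n + " bytes");
        Files.deleteIfExists(tmp);
    }
}
```

With this style, the thread pool only runs short completion callbacks, so its size no longer has to absorb the full duration of each download.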

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Contributor

github-actions bot commented Oct 19, 2023

Compatibility status:

Checks if related components are compatible with change 5499b8b

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/reporting.git]


@andrross andrross force-pushed the update-remote-recovery-thread-pool-size branch from 088dbf7 to a60da9b Compare October 19, 2023 18:07

@andrross andrross force-pushed the update-remote-recovery-thread-pool-size branch from a60da9b to 5155700 Compare October 19, 2023 18:57
Signed-off-by: Andrew Ross <[email protected]>
@andrross andrross force-pushed the update-remote-recovery-thread-pool-size branch from 5155700 to 5499b8b Compare October 19, 2023 19:26

@andrross andrross merged commit 1e28738 into opensearch-project:main Oct 20, 2023
14 checks passed
@andrross andrross deleted the update-remote-recovery-thread-pool-size branch October 20, 2023 17:17
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 20, 2023
(cherry picked from commit 1e28738)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
mch2 pushed a commit that referenced this pull request Oct 21, 2023
(cherry picked from commit 1e28738)

Signed-off-by: Andrew Ross <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
austintlee pushed a commit to austintlee/OpenSearch that referenced this pull request Oct 23, 2023
Signed-off-by: Andrew Ross <[email protected]>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
Signed-off-by: Andrew Ross <[email protected]>
Signed-off-by: Shivansh Arora <[email protected]>
Labels
backport 2.x Backport to 2.x branch skip-changelog
4 participants