
Increase remote recovery thread pool size #10750

Conversation

andrross
Member

The remote recovery thread pool performs blocking I/O when downloading files, so the previous sizing of "half the processor count, capped at 10" was definitely too small. This can be demonstrated by triggering recoveries on a node that is also performing segment replication: replication lag increases due to contention on that thread pool. Some amount of contention is inevitable, but the change here, which increases the size of the download thread pool and also limits any single recovery/replication to 25% of that pool's threads, does help.
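The sizing change and the 25% cap described above can be sketched roughly as follows. This is a hypothetical illustration, not OpenSearch's actual classes or the exact formula in the change; the class and method names here are invented, and the new pool size shown (twice the processor count) is only an example of "larger than before":

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; names and formulas are illustrative only.
public class RemoteRecoveryPoolSketch {

    // Old sizing: half the processor count, capped at 10.
    static int oldPoolSize(int processors) {
        return Math.min(10, Math.max(1, processors / 2));
    }

    // Larger sizing for a pool that does blocking I/O, e.g. twice the
    // processor count (an assumption, not the PR's exact formula).
    static int downloadPoolSize(int processors) {
        return Math.max(1, processors * 2);
    }

    // Cap any single recovery/replication at 25% of the pool's threads
    // via a semaphore, so one operation cannot starve the others.
    static Semaphore perOperationPermits(int poolSize) {
        return new Semaphore(Math.max(1, poolSize / 4));
    }

    public static void main(String[] args) throws InterruptedException {
        int poolSize = downloadPoolSize(Runtime.getRuntime().availableProcessors());
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        Semaphore permits = perOperationPermits(poolSize);

        // One simulated recovery submitting several blocking downloads:
        // acquire() blocks once this recovery holds 25% of the threads.
        for (int i = 0; i < 8; i++) {
            permits.acquire();
            pool.submit(() -> {
                try {
                    TimeUnit.MILLISECONDS.sleep(10); // stand-in for blocking download I/O
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    permits.release();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

The semaphore bounds concurrency per operation rather than per pool, which is why contention falls without needing to partition the pool itself.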

Long term, we can improve this even further by moving to fully asynchronous I/O, so that application threads are not blocked draining InputStreams.
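For a sense of what "fully async I/O" means here, the following minimal sketch uses the JDK's `AsynchronousFileChannel` to complete a future from a completion handler instead of parking a pool thread on a blocking read. It is an illustration of the general technique only, not the PR's plan or OpenSearch's repository APIs:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch: the read completes a future via a callback, so no
// application thread sits blocked waiting on the I/O.
public class AsyncReadSketch {

    static CompletableFuture<Integer> readAsync(Path path) throws IOException {
        AsynchronousFileChannel ch = AsynchronousFileChannel.open(path, StandardOpenOption.READ);
        ByteBuffer buf = ByteBuffer.allocate(8192);
        CompletableFuture<Integer> done = new CompletableFuture<>();
        ch.read(buf, 0, null, new CompletionHandler<Integer, Void>() {
            @Override public void completed(Integer bytesRead, Void attachment) {
                try { ch.close(); } catch (IOException ignored) { }
                done.complete(bytesRead);
            }
            @Override public void failed(Throwable t, Void attachment) {
                try { ch.close(); } catch (IOException ignored) { }
                done.completeExceptionally(t);
            }
        });
        return done;
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("segment", ".bin");
        Files.write(tmp, new byte[] { 1, 2, 3, 4 });
        int n = readAsync(tmp).join();
        System.out.println("read " + n + " bytes");
        Files.deleteIfExists(tmp);
    }
}
```

With this style, the thread pool only runs short completion callbacks, so its size no longer has to absorb the full duration of each download.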

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Contributor

github-actions bot commented Oct 19, 2023

Compatibility status:

Checks if related components are compatible with change 5499b8b

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/reporting.git]


@andrross andrross force-pushed the update-remote-recovery-thread-pool-size branch from 088dbf7 to a60da9b Compare October 19, 2023 18:07

@andrross andrross force-pushed the update-remote-recovery-thread-pool-size branch from a60da9b to 5155700 Compare October 19, 2023 18:57
Signed-off-by: Andrew Ross <[email protected]>
@andrross andrross force-pushed the update-remote-recovery-thread-pool-size branch from 5155700 to 5499b8b Compare October 19, 2023 19:26

@andrross andrross merged commit 1e28738 into opensearch-project:main Oct 20, 2023
14 checks passed
@andrross andrross deleted the update-remote-recovery-thread-pool-size branch October 20, 2023 17:17
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 20, 2023
(cherry picked from commit 1e28738)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
mch2 pushed a commit that referenced this pull request Oct 21, 2023
(cherry picked from commit 1e28738)

Signed-off-by: Andrew Ross <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
austintlee pushed a commit to austintlee/OpenSearch that referenced this pull request Oct 23, 2023
Signed-off-by: Andrew Ross <[email protected]>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
Signed-off-by: Andrew Ross <[email protected]>
Signed-off-by: Shivansh Arora <[email protected]>
Labels
backport 2.x Backport to 2.x branch skip-changelog
4 participants