[DOC] Add Segment Replication Backpressure to documentation #3695

Closed
1 of 4 tasks
Rishikesh1159 opened this issue Apr 6, 2023 · 2 comments · Fixed by #3461 or #3839
Assignees
Labels
Closed - Complete Issue: Work is done and associated PRs closed Sev3 Medium priority. Content that's missing, driven by dev, PM or the community. v2.7.0

Comments

@Rishikesh1159
Member

Rishikesh1159 commented Apr 6, 2023

What do you want to do?

  • Request a change to existing documentation
  • Add new documentation
  • Report a technical problem with the documentation
  • Other

Tell us about your request. Provide a summary of the request and all versions that are affected.

OpenSearch is currently working on a new segment replication feature. Segment replication was released as an experimental feature in OpenSearch 2.3. As part of this feature, we are introducing a new segment replication backpressure mechanism.

Implementation of this new segment replication backpressure mechanism is complete, and more details can be found here. This feature will be released in OpenSearch 2.7.

What other resources are available? Provide links to related issues, POCs, steps for testing, etc.

@Rishikesh1159
Member Author

Rishikesh1159 commented Apr 6, 2023

Overview:

  • Segment Replication Backpressure is a per-shard rejection mechanism that dynamically rejects indexing requests when replica shards in your cluster fall behind their primary shards.
  • The Segment Replication Backpressure mechanism starts rejecting indexing requests when:
    • More than half of the replication group is 'stale'. This threshold is defined by the MAX_ALLOWED_STALE_SHARDS setting.
    • A replica is stale if it is more than MAX_INDEXING_CHECKPOINTS checkpoints behind and its current replication lag is over MAX_REPLICATION_TIME_SETTING.
  • This mechanism also monitors replica shards that are stuck or lagging for a long time. When replica shards are stuck or lagging for more than double the MAX_REPLICATION_TIME_SETTING, we remove these lagging shards and replace them with new replica shards.
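
The rejection rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the OpenSearch implementation: the ReplicaState class and the function names are hypothetical, the thresholds are the defaults listed in this issue, the comparisons are assumed to be strict ("more than"), and the replication group is assumed to be passed in as a plain list of replicas.

```python
from dataclasses import dataclass
from typing import List

# Default values as described in this issue.
MAX_REPLICATION_TIME_SECONDS = 5 * 60   # MAX_REPLICATION_TIME_SETTING (5 minutes)
MAX_INDEXING_CHECKPOINTS = 4            # checkpoints a replica may fall behind
MAX_ALLOWED_STALE_SHARDS = 0.5          # fraction of the replication group


@dataclass
class ReplicaState:
    """Hypothetical view of one replica shard's replication progress."""
    checkpoints_behind: int
    current_lag_seconds: float


def is_stale(replica: ReplicaState) -> bool:
    # A replica is stale only when BOTH limits are exceeded.
    return (replica.checkpoints_behind > MAX_INDEXING_CHECKPOINTS
            and replica.current_lag_seconds > MAX_REPLICATION_TIME_SECONDS)


def should_reject_indexing(replicas: List[ReplicaState]) -> bool:
    # Reject once the stale fraction of the group exceeds the allowed limit.
    if not replicas:
        return False
    stale = sum(1 for r in replicas if is_stale(r))
    return stale / len(replicas) > MAX_ALLOWED_STALE_SHARDS


def should_replace(replica: ReplicaState) -> bool:
    # Replicas lagging for more than double the time limit get replaced.
    return replica.current_lag_seconds > 2 * MAX_REPLICATION_TIME_SECONDS
```

With these assumptions, a group where exactly half of the replicas are stale is not yet rejected; rejection starts only once the stale fraction is above 0.5.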

Settings:

All settings below are dynamic cluster settings, and users can enable or disable them using PUT _cluster/settings. SEGMENT_REPLICATION_INDEXING_PRESSURE_ENABLED must be set to true for the rest of the backpressure settings to take effect.

  • SEGMENT_REPLICATION_INDEXING_PRESSURE_ENABLED enables the segment replication backpressure mechanism. The default value is false (backpressure disabled).
  • MAX_REPLICATION_TIME_SETTING is the maximum time that a replica shard can take to copy from the primary. Once MAX_REPLICATION_TIME_SETTING is breached along with MAX_INDEXING_CHECKPOINTS, the segment replication backpressure mechanism is triggered. The default value is 5 minutes.
  • MAX_INDEXING_CHECKPOINTS is the maximum number of indexing checkpoints that a replica shard can fall behind when copying from the primary. Once MAX_INDEXING_CHECKPOINTS is breached along with MAX_REPLICATION_TIME_SETTING, the segment replication backpressure mechanism is triggered. The default value is 4 checkpoints.
  • MAX_ALLOWED_STALE_SHARDS is the maximum fraction of stale replica shards that can exist in a replication group. Once MAX_ALLOWED_STALE_SHARDS is breached, the segment replication backpressure mechanism is triggered. The default value is 0.5, which is 50% of a replication group.
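
As a rough sketch of how these settings could be applied, the Python snippet below builds a request body for PUT _cluster/settings and prints it. The names above are Java constants, and this issue does not list the setting keys themselves; the dotted keys used here (segrep.pressure.*) are an assumption about how the constants are registered and should be verified against your OpenSearch version. The helper name and the choice of the "persistent" scope are also only illustrative.

```python
import json

# ASSUMPTION: mapping from the constants named in this issue to setting keys.
# Verify these keys against the OpenSearch version you run.
SETTING_KEYS = {
    "SEGMENT_REPLICATION_INDEXING_PRESSURE_ENABLED": "segrep.pressure.enabled",
    "MAX_REPLICATION_TIME_SETTING": "segrep.pressure.time.limit",
    "MAX_INDEXING_CHECKPOINTS": "segrep.pressure.checkpoint.limit",
    "MAX_ALLOWED_STALE_SHARDS": "segrep.pressure.replica.stale.limit",
}


def build_backpressure_settings(enabled=True, time_limit="5m",
                                checkpoint_limit=4, stale_limit=0.5):
    """Hypothetical helper: build a body for PUT _cluster/settings."""
    return {
        "persistent": {
            SETTING_KEYS["SEGMENT_REPLICATION_INDEXING_PRESSURE_ENABLED"]: enabled,
            SETTING_KEYS["MAX_REPLICATION_TIME_SETTING"]: time_limit,
            SETTING_KEYS["MAX_INDEXING_CHECKPOINTS"]: checkpoint_limit,
            SETTING_KEYS["MAX_ALLOWED_STALE_SHARDS"]: stale_limit,
        }
    }


# Print the request so it can be pasted into a REST console.
print("PUT _cluster/settings")
print(json.dumps(build_backpressure_settings(), indent=2))
```

The values passed by default mirror the defaults listed above, except that backpressure is switched on, since the other three settings have no effect while it is disabled.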

API:

The GET _cat/segment_replication API fetches metrics related to segment replication backpressure:

shardId       target_node  target_host  checkpoints_behind  bytes_behind  current_lag  last_completed_lag  rejected_requests
[index-1][0]  runTask-1    127.0.0.1    0                   0b            0s           7ms                 0

→ The checkpoints_behind and current_lag columns correspond directly to the MAX_INDEXING_CHECKPOINTS and MAX_REPLICATION_TIME_SETTING settings.

→ Both metrics are taken into consideration when triggering the segment replication backpressure mechanism.
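
For monitoring, the tabular output above can be turned into records with a few lines of Python. This is a minimal sketch, assuming the response includes a header row (as in the sample, for example by requesting verbose output with the _cat `v` parameter) and that no cell contains whitespace. The function names are hypothetical, and the sample text is copied from the output shown in this issue.

```python
import re

SAMPLE = """\
shardId       target_node  target_host  checkpoints_behind  bytes_behind  current_lag  last_completed_lag  rejected_requests
[index-1][0]  runTask-1    127.0.0.1    0                   0b            0s           7ms                 0
"""

# Milliseconds per unit for values such as "0s" or "7ms".
_UNIT_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def parse_duration_ms(text):
    """Convert a duration such as '7ms' or '5m' to milliseconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m|h|d)", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return float(match.group(1)) * _UNIT_MS[match.group(2)]


def parse_cat_segment_replication(text):
    """Parse whitespace-separated _cat output into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        # Convert the two columns that feed the backpressure decision.
        row["checkpoints_behind"] = int(row["checkpoints_behind"])
        row["current_lag_ms"] = parse_duration_ms(row["current_lag"])
        rows.append(row)
    return rows


for row in parse_cat_segment_replication(SAMPLE):
    print(row["shardId"], row["checkpoints_behind"], row["current_lag_ms"])
```

The parsed checkpoints_behind and current_lag_ms values can then be compared against the configured limits to see how close a replica is to being considered stale.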

@Naarcha-AWS Naarcha-AWS added 1 - Backlog Issue: The issue is unassigned or assigned but not started v2.7.0 and removed untriaged labels Apr 6, 2023
@Naarcha-AWS Naarcha-AWS added this to the v2.7 milestone Apr 6, 2023
@Naarcha-AWS Naarcha-AWS self-assigned this Apr 6, 2023
@anasalkouz anasalkouz moved this from Todo to In Progress in Segment Replication Apr 13, 2023
@Naarcha-AWS
Collaborator

Passing this to @ariamarble since this likely has some overlap with the Segment Replication GA PR. #3461

@ariamarble ariamarble added 2 - In progress Issue/PR: The issue or PR is in progress. Sev3 Medium priority. Content that's missing, driven by dev, PM or the community. and removed 1 - Backlog Issue: The issue is unassigned or assigned but not started labels Apr 18, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Segment Replication Apr 18, 2023
@ariamarble ariamarble added Closed - Complete Issue: Work is done and associated PRs closed and removed 2 - In progress Issue/PR: The issue or PR is in progress. labels Apr 18, 2023