
POC: Translog replay - performance impact and time required #2494

Closed
sachinpkale opened this issue Mar 17, 2022 · 4 comments
Labels: enhancement (Enhancement or improvement to existing feature or request), Storage:Durability (Issues and PRs related to the durability framework)

Comments

@sachinpkale
Member

Goal

To understand the time required to replay a translog of a given size. Before finalizing the approach for the remote store (for more info, refer to the POC), we need to understand the size limit we can place on the remote translog.

Steps

  1. Start indexing data and continue until the translog reaches X MB in size.
    1. An automatic flush happens at 512 MB, so if X > 512 MB we need to raise the translog flush threshold accordingly.
  2. Run some search queries (doc count, aggregations, etc.) to capture a baseline over the indexed data. Let's call it Query Set 1.
  3. Stop the OpenSearch process on all the data nodes.
  4. Delete the segment files that are not yet part of a commit.
    1. This means the translog still holds the data for these segments.
  5. Start the OpenSearch process on all three nodes.
  6. Check the logs on the primary for translog recovery and the time taken.
  7. Run Query Set 1 and validate the results against the baseline.
  8. Repeat for different values of X (512 MB, 2 GB, 5 GB) and with different traffic patterns on the cluster.
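The flush-threshold tweak in step 1.1 and the size check during step 1 can be sketched with the standard OpenSearch settings and stats APIs (the index name `test-index` and the `localhost:9200` endpoint are assumptions for illustration):

```shell
# Step 1.1: raise the translog flush threshold (default 512mb) so the
# translog can grow to the target size X before an automatic flush.
# NOTE: index name "test-index" and host are illustrative assumptions.
curl -X PUT "localhost:9200/test-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.translog.flush_threshold_size": "2gb"}'

# Monitor the current translog size while indexing (step 1).
curl -s "localhost:9200/test-index/_stats/translog"
```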
@sachinpkale added the enhancement and untriaged labels on Mar 17, 2022
@saratvemulapalli
Member

@sachinpkale, are you working on this proposal?

@saratvemulapalli added the Storage:Durability label on Mar 18, 2022
@sachinpkale
Member Author

@saratvemulapalli Yes, I am working on this.

@sachinpkale
Member Author

Setup:

  • 3-node cluster, each node of type r5.2xlarge (EC2 instance)
  • No dedicated master nodes
  • Index created with number_of_shards=1 and number_of_replicas=2
  • Only indexing (write) traffic; also, no traffic until recovery was complete (Step 5)
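The index used in the experiment can be sketched with the standard index-creation API (the index name `test-index` and the host are illustrative assumptions; the settings match the setup above):

```shell
# Create the experiment index: 1 primary shard, 2 replicas.
# NOTE: index name "test-index" and host are illustrative assumptions.
curl -X PUT "localhost:9200/test-index" \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"number_of_shards": 1, "number_of_replicas": 2}}'
```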

Observations:

  • No data loss was observed after replaying data from the translog
  • Replay time required:
    • Translog of size 400 MB: 1 min 6 s (66 seconds)
    • Translog of size 2 GB: 7 min 7 s (427 seconds)
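A quick back-of-the-envelope check of the replay throughput implied by these two data points (assuming 2 GB = 2048 MB; the reported sizes are approximate) shows roughly 5-6 MB/s, i.e. replay time grows close to linearly with translog size:

```python
# Replay throughput implied by the two measurements above.
# Assumes 2 GB = 2048 MB; the reported sizes are approximate.
measurements = [(400, 66), (2048, 427)]  # (translog size in MB, replay time in s)

for size_mb, seconds in measurements:
    print(f"{size_mb} MB replayed in {seconds} s -> {size_mb / seconds:.1f} MB/s")
```

The throughput even drops slightly at the larger size (about 6.1 MB/s vs 4.8 MB/s), so linear scaling is an optimistic assumption.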

@sachinpkale
Member Author

Conclusion: Based on the experiment performed, translog replay is a time-consuming operation, and recovery time is directly proportional to translog size. We need to account for this while designing the remote translog, and may need to block further writes if the remote translog reaches X MB in size.
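One way the "block further writes" idea from the conclusion could look is a simple size check before accepting an operation. This is a hypothetical sketch, not actual OpenSearch code, and the limit value is illustrative:

```python
# Hypothetical sketch of gating writes on remote translog size
# (not actual OpenSearch code; the limit value is illustrative).
TRANSLOG_SIZE_LIMIT_BYTES = 512 * 1024 * 1024  # example limit: 512 MB

def can_accept_write(current_translog_bytes: int, incoming_bytes: int) -> bool:
    """Reject the write if it would push the translog past the limit."""
    return current_translog_bytes + incoming_bytes <= TRANSLOG_SIZE_LIMIT_BYTES

print(can_accept_write(100 * 1024 * 1024, 1024))       # write well under the limit
print(can_accept_write(TRANSLOG_SIZE_LIMIT_BYTES, 1))  # write that would exceed it
```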

Design proposal will have more details around it: #2700
