
POC: Translog replay - performance impact and time required #2494

Closed
sachinpkale opened this issue Mar 17, 2022 · 4 comments
Labels: enhancement (Enhancement or improvement to existing feature or request), Storage:Durability (Issues and PRs related to the durability framework)

Comments

@sachinpkale
Member

Goal

To understand the time required to replay a translog of a given size. Before finalizing the approach for the remote store (for more info, refer to the POC), we need to understand the size limit we can place on the remote translog.

Steps

  1. Start indexing data and continue until the translog reaches X MB in size.
    1. An automatic flush happens at 512 MB, so if X > 512 MB we need to raise the translog flush threshold accordingly.
  2. Run some search queries (doc count, aggregations, etc.) to capture a baseline over the indexed data. Let's call it Query Set 1.
  3. Stop the OpenSearch process on all the data nodes.
  4. Delete the segment files that are not yet part of a commit.
    1. This means the translog still holds the data for these segments.
  5. Start the OpenSearch process on all three nodes.
  6. Check the logs on the primary for translog recovery and the time taken.
  7. Run Query Set 1 and validate the results against the baseline.
  8. Repeat for different values of X (512 MB, 2 GB, 5 GB) and with different traffic patterns on the cluster.
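The flush-threshold tweak in step 1.1 and the size check during step 1 can be sketched with the standard OpenSearch settings and stats APIs (the index name `test-index` and the `localhost:9200` endpoint are assumptions for illustration):

```shell
# Step 1.1: raise the translog flush threshold (default 512mb) so the
# translog can grow to the target size X before an automatic flush.
# NOTE: index name "test-index" and host are illustrative assumptions.
curl -X PUT "localhost:9200/test-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.translog.flush_threshold_size": "2gb"}'

# Monitor the current translog size while indexing (step 1).
curl -s "localhost:9200/test-index/_stats/translog"
```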
@sachinpkale added the enhancement and untriaged labels on Mar 17, 2022
@saratvemulapalli
Member

@sachinpkale, are you working on this proposal?

@saratvemulapalli added the Storage:Durability label on Mar 18, 2022
@sachinpkale
Member Author

@saratvemulapalli Yes, I am working on this.

@sachinpkale
Member Author

Setup:

  • 3-node cluster, each node of type r5.2xlarge (EC2 instance)
  • No dedicated master nodes
  • Index created with number_of_shards=1 and number_of_replicas=2
  • Only indexing (write) traffic; also, no traffic until recovery was complete (Step 5)
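The index used in the experiment can be sketched with the standard index-creation API (the index name `test-index` and the host are illustrative assumptions; the settings match the setup above):

```shell
# Create the experiment index: 1 primary shard, 2 replicas.
# NOTE: index name "test-index" and host are illustrative assumptions.
curl -X PUT "localhost:9200/test-index" \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"number_of_shards": 1, "number_of_replicas": 2}}'
```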

Observations:

  • No data loss was observed after replaying data from the translog
  • Replay time required:
    • Translog of size 400 MB: 1 min 6 s (66 seconds)
    • Translog of size 2 GB: 7 min 7 s (427 seconds)
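A quick back-of-the-envelope check of the replay throughput implied by these two data points (assuming 2 GB = 2048 MB; the reported sizes are approximate) shows roughly 5-6 MB/s, i.e. replay time grows close to linearly with translog size:

```python
# Replay throughput implied by the two measurements above.
# Assumes 2 GB = 2048 MB; the reported sizes are approximate.
measurements = [(400, 66), (2048, 427)]  # (translog size in MB, replay time in s)

for size_mb, seconds in measurements:
    print(f"{size_mb} MB replayed in {seconds} s -> {size_mb / seconds:.1f} MB/s")
```

The throughput even drops slightly at the larger size (about 6.1 MB/s vs 4.8 MB/s), so linear scaling is an optimistic assumption.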

@sachinpkale
Member Author

Conclusion: Based on the experiment performed, translog replay is a time-consuming operation, and recovery time is directly proportional to translog size. We need to account for this while designing the remote translog, and may need to block further writes if the remote translog reaches X MB in size.
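One way the "block further writes" idea from the conclusion could look is a simple size check before accepting an operation. This is a hypothetical sketch, not actual OpenSearch code, and the limit value is illustrative:

```python
# Hypothetical sketch of gating writes on remote translog size
# (not actual OpenSearch code; the limit value is illustrative).
TRANSLOG_SIZE_LIMIT_BYTES = 512 * 1024 * 1024  # example limit: 512 MB

def can_accept_write(current_translog_bytes: int, incoming_bytes: int) -> bool:
    """Reject the write if it would push the translog past the limit."""
    return current_translog_bytes + incoming_bytes <= TRANSLOG_SIZE_LIMIT_BYTES

print(can_accept_write(100 * 1024 * 1024, 1024))       # write well under the limit
print(can_accept_write(TRANSLOG_SIZE_LIMIT_BYTES, 1))  # write that would exceed it
```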

Design proposal will have more details around it: #2700
