
[Segment Replication] Replicating *.liv files may cause performance issues #3929

Open
Tracked by #2194
hydrogen666 opened this issue Jul 16, 2022 · 2 comments
Labels: enhancement, Indexing:Replication

Comments

@hydrogen666

In the document replication scenario, *.liv files are only written to disk when a flush operation (a Lucene commit) is performed. But in segment replication, the *.liv file must be written on every refresh (by setting writeAllDeletes to true in the DirectoryReader#open method).
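For reference, a minimal sketch of the Lucene call this refers to. The class and helper names below are hypothetical; only `DirectoryReader#open` and its `writeAllDeletes` flag come from the paragraph above.

```java
import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;

// Hypothetical helper, not the actual OpenSearch code path. The third argument,
// writeAllDeletes, controls whether pending deletes are persisted to *.liv files
// when the near-real-time reader is opened at refresh time.
public final class RefreshReaderSketch {

    // Document replication: deletes can stay in memory until the next commit,
    // so a refresh does not have to write *.liv files.
    static DirectoryReader openForDocumentReplication(IndexWriter writer) throws IOException {
        return DirectoryReader.open(writer, /* applyAllDeletes */ true, /* writeAllDeletes */ false);
    }

    // Segment replication: replicas copy only what is on disk, so the primary
    // must persist the full live-docs bitmap on every refresh.
    static DirectoryReader openForSegmentReplication(IndexWriter writer) throws IOException {
        return DirectoryReader.open(writer, /* applyAllDeletes */ true, /* writeAllDeletes */ true);
    }
}
```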

This may not cause any problem in an append-only scenario (no deletes are ever issued against old segments). But in an update scenario, as soon as a delete operation hits an old segment, the primary shard's refresh will write the full live-docs bitmap to disk and replicate it to the replica shards.

*.liv files can be very large for merged segments: the bitmap uses roughly one bit per document, so the *.liv file for a segment with 16,000,000 docs takes up ~2 MB of disk space. Unlike segment data files, we cannot reuse the old *.liv file when a new one is generated; even if only one doc is deleted in the segment, we must replicate the full *.liv file. So in segrep, writing and replicating *.liv files may introduce more network and CPU load than document replication (from writing and loading *.liv files).

Several ways to fix this issue (both are sketched below):

  1. Write a diff rather than the full bitmap when a refresh is performed
  2. Compress the live-docs file with LZ4 or zstd
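A rough, hypothetical sketch of both ideas. This is not OpenSearch or Lucene code: `java.util.BitSet` stands in for Lucene's live-docs bitset, and the JDK `Deflater` stands in for LZ4/zstd.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public final class LiveDocsDiffSketch {

    // Option 1: ship only the doc IDs whose live/deleted state changed since
    // the last refresh, instead of the whole bitmap.
    static List<Integer> diff(BitSet previousLiveDocs, BitSet currentLiveDocs) {
        BitSet changed = (BitSet) previousLiveDocs.clone();
        changed.xor(currentLiveDocs); // bits remain set only where the state flipped
        List<Integer> changedDocIds = new ArrayList<>();
        for (int doc = changed.nextSetBit(0); doc >= 0; doc = changed.nextSetBit(doc + 1)) {
            changedDocIds.add(doc);
        }
        return changedDocIds;
    }

    // Option 2: compress the full bitmap before sending it over the wire.
    static byte[] compress(byte[] liveDocsBytes) throws IOException {
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (DeflaterOutputStream out =
                new DeflaterOutputStream(compressed, new Deflater(Deflater.BEST_SPEED))) {
            out.write(liveDocsBytes);
        }
        return compressed.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // A 16M-doc segment in which a single doc was deleted since the last refresh.
        int maxDoc = 16_000_000;
        BitSet before = new BitSet(maxDoc);
        before.set(0, maxDoc);            // all docs live -> ~2 MB bitmap
        BitSet after = (BitSet) before.clone();
        after.clear(42);                  // one delete flips one bit

        System.out.println("changed docs: " + diff(before, after));                   // [42]
        System.out.println("full bitmap:  " + after.toByteArray().length + " bytes"); // ~2,000,000
        System.out.println("compressed:   " + compress(after.toByteArray()).length + " bytes");
    }
}
```

Either approach trades CPU on the primary for less data on the wire; a diff would also need a fallback to the full bitmap when the replica has no previous state to apply it against.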
hydrogen666 added the enhancement and untriaged labels on Jul 16, 2022
@Jeevananthan-23

Adding some previously measured results on bandwidth over the wire:

Performance:
Early performance tests show improvements with segment replication enabled. This run using OpenSearch Benchmark showed a ~40-45% drop in CPU and memory usage, a 19% drop in p99 latency, and a 57% increase in p100 throughput.

Instance type: m5.xlarge
Cluster Details: 3 Nodes with 6 shards and 1 replica each.
Test Dataset: Stackoverflow for 3 test iterations with 2 warmup iterations.

IOPS:
Document Replication: (Read 852k + Write 71k) / 1hr = 256 IOPS
Segment Replication: (Read 145k + Write 1M) / 1 hr = 318 IOPS

Total Bandwidth used:
Document Replication: 527 Gb
Segment Replication: 929 Gb

@anasalkouz

@Poojita-Raj Any update on this?

Bukhtawar added the Indexing:Replication label on Jul 27, 2023