[Performance Idea] Optimize Indexing tail latencies by disabling checkPendingFlushUpdate flag on DocumentWriter #11628
Comments
Benchmark results
Dataset:
Server configuration:
Command:
Glossary
Baseline CPU profile:
Candidate CPU profile:
Candidate JVM:
Candidate CPU:
Baseline OSB Benchmark numbers: Metric,Task,Value,Unit
OSB Benchmark numbers for candidate: Metric,Task,Value,Unit
Although tail latency (p99) has decreased and there's a slight improvement in indexing throughput, other metrics like p90 and merge throttling time suffered. What are your thoughts on the numbers and the overall idea of disabling the flush param @mgodwan / @backslasht / @shwetathareja ? |
Thanks @CaptainDredge for the benchmark and the suggested tuning. I'm still going through the overall optimization and measurements. In the meantime, I have a few questions if you can help answer them:
|
Sure, I think I need to track a few more metrics during the benchmark run to address the above questions. Will try to add those and do 1-2 runs. |
A few questions on top of what @mgodwan asked:
|
Redid the benchmarks with additional metric tracking and got much more consistent results.
Benchmark results
Dataset:
Server configuration:
Command:
Glossary
CPU profile comparison [A -> candidate, B -> baseline]: refreshes are higher in the baseline due to the larger number of segments in the baseline.
Macro CPU:
Lock profiles:
Flush metrics:
The number of segments is lower in the candidate, which is expected since flushes occur less frequently. The index writer holds segments in memory a little longer, but the memory occupied is not too concerning (<= 1 GB).

Baseline OSB Benchmark numbers

Metric,Task,Value,Unit
Cumulative indexing time of primary shards,,160.47803333333334,min
Min cumulative indexing time across primary shards,,160.47803333333334,min
Median cumulative indexing time across primary shards,,160.47803333333334,min
Max cumulative indexing time across primary shards,,160.47803333333334,min
Cumulative indexing throttle time of primary shards,,0,min
Min cumulative indexing throttle time across primary shards,,0,min
Median cumulative indexing throttle time across primary shards,,0,min
Max cumulative indexing throttle time across primary shards,,0,min
Cumulative merge time of primary shards,,71.74595,min
Cumulative merge count of primary shards,,69,
Min cumulative merge time across primary shards,,71.74595,min
Median cumulative merge time across primary shards,,71.74595,min
Max cumulative merge time across primary shards,,71.74595,min
Cumulative merge throttle time of primary shards,,28.590616666666666,min
Min cumulative merge throttle time across primary shards,,28.590616666666666,min
Median cumulative merge throttle time across primary shards,,28.590616666666666,min
Max cumulative merge throttle time across primary shards,,28.590616666666666,min
Cumulative refresh time of primary shards,,5.5779,min
Cumulative refresh count of primary shards,,107,
Min cumulative refresh time across primary shards,,5.5779,min
Median cumulative refresh time across primary shards,,5.5779,min
Max cumulative refresh time across primary shards,,5.5779,min
Cumulative flush time of primary shards,,2.91075,min
Cumulative flush count of primary shards,,32,
Min cumulative flush time across primary shards,,2.91075,min
Median cumulative flush time across primary shards,,2.91075,min
Max cumulative flush time across primary shards,,2.91075,min
Total Young Gen GC time,,3.05,s
Total Young Gen GC count,,174,
Total Old Gen GC time,,0,s
Total Old Gen GC count,,0,
Store size,,23.8011878663674,GB
Translog size,,5.122274160385132e-08,GB
Heap used for segments,,0.0,MB
Heap used for doc values,,0.0,MB
Heap used for terms,,0.0,MB
Heap used for norms,,0.0,MB
Heap used for points,,0.0,MB
Heap used for stored fields,,0.0,MB
Segment count,,29,
Min Throughput,index,70413.46,docs/s
Mean Throughput,index,71458.73,docs/s
Median Throughput,index,71394.10,docs/s
Max Throughput,index,72696.09,docs/s
50th percentile latency,index,605.4159615626703,ms
90th percentile latency,index,657.1996043337641,ms
99th percentile latency,index,3482.827952473957,ms
99.9th percentile latency,index,7201.034707031424,ms
99.99th percentile latency,index,8073.713661718741,ms
100th percentile latency,index,8079.45068359375,ms
50th percentile service time,index,605.4159615626703,ms
90th percentile service time,index,657.1996043337641,ms
99th percentile service time,index,3482.827952473957,ms
99.9th percentile service time,index,7201.034707031424,ms
99.99th percentile service time,index,8073.713661718741,ms
100th percentile service time,index,8079.45068359375,ms
error rate,index,0.00,%
Min Throughput,wait-until-merges-finish,0.00,ops/s
Mean Throughput,wait-until-merges-finish,0.00,ops/s
Median Throughput,wait-until-merges-finish,0.00,ops/s
Max Throughput,wait-until-merges-finish,0.00,ops/s
100th percentile latency,wait-until-merges-finish,270079.65625,ms
100th percentile service time,wait-until-merges-finish,270079.65625,ms
error rate,wait-until-merges-finish,0.00,%

Metric,Task,Value,Unit
Cumulative indexing time of primary shards,,162.08225,min
Min cumulative indexing time across primary shards,,162.08225,min
Median cumulative indexing time across primary shards,,162.08225,min
Max cumulative indexing time across primary shards,,162.08225,min
Cumulative indexing throttle time of primary shards,,0,min
Min cumulative indexing throttle time across primary shards,,0,min
Median cumulative indexing throttle time across primary shards,,0,min
Max cumulative indexing throttle time across primary shards,,0,min
Cumulative merge time of primary shards,,66.11128333333333,min
Cumulative merge count of primary shards,,70,
Min cumulative merge time across primary shards,,66.11128333333333,min
Median cumulative merge time across primary shards,,66.11128333333333,min
Max cumulative merge time across primary shards,,66.11128333333333,min
Cumulative merge throttle time of primary shards,,24.2181,min
Min cumulative merge throttle time across primary shards,,24.2181,min
Median cumulative merge throttle time across primary shards,,24.2181,min
Max cumulative merge throttle time across primary shards,,24.2181,min
Cumulative refresh time of primary shards,,5.670816666666667,min
Cumulative refresh count of primary shards,,108,
Min cumulative refresh time across primary shards,,5.670816666666667,min
Median cumulative refresh time across primary shards,,5.670816666666667,min
Max cumulative refresh time across primary shards,,5.670816666666667,min
Cumulative flush time of primary shards,,2.7515666666666667,min
Cumulative flush count of primary shards,,33,
Min cumulative flush time across primary shards,,2.7515666666666667,min
Median cumulative flush time across primary shards,,2.7515666666666667,min
Max cumulative flush time across primary shards,,2.7515666666666667,min
Total Young Gen GC time,,3.295,s
Total Young Gen GC count,,209,
Total Old Gen GC time,,0,s
Total Old Gen GC count,,0,
Store size,,28.78564189001918,GB
Translog size,,5.122274160385132e-08,GB
Heap used for segments,,0.0,MB
Heap used for doc values,,0.0,MB
Heap used for terms,,0.0,MB
Heap used for norms,,0.0,MB
Heap used for points,,0.0,MB
Heap used for stored fields,,0.0,MB
Segment count,,34,
Min Throughput,index,69258.45,docs/s
Mean Throughput,index,70292.67,docs/s
Median Throughput,index,70238.87,docs/s
Max Throughput,index,71074.70,docs/s
50th percentile latency,index,612.6130574218281,ms
90th percentile latency,index,662.8810661411829,ms
99th percentile latency,index,3486.1739264787902,ms
99.9th percentile latency,index,7181.957399902394,ms
99.99th percentile latency,index,8293.0933571286,ms
100th percentile latency,index,8451.947265625,ms
50th percentile service time,index,612.6130574218281,ms
90th percentile service time,index,662.8810661411829,ms
99th percentile service time,index,3486.1739264787902,ms
99.9th percentile service time,index,7181.957399902394,ms
99.99th percentile service time,index,8293.0933571286,ms
100th percentile service time,index,8451.947265625,ms
error rate,index,0.00,%
Min Throughput,wait-until-merges-finish,0.00,ops/s
Mean Throughput,wait-until-merges-finish,0.00,ops/s
Median Throughput,wait-until-merges-finish,0.00,ops/s
Max Throughput,wait-until-merges-finish,0.00,ops/s
100th percentile latency,wait-until-merges-finish,219999.90625,ms
100th percentile service time,wait-until-merges-finish,219999.90625,ms
error rate,wait-until-merges-finish,0.00,%

Metric,Task,Value,Unit
Cumulative indexing time of primary shards,,162.34881666666666,min
Min cumulative indexing time across primary shards,,162.34881666666666,min
Median cumulative indexing time across primary shards,,162.34881666666666,min
Max cumulative indexing time across primary shards,,162.34881666666666,min
Cumulative indexing throttle time of primary shards,,0,min
Min cumulative indexing throttle time across primary shards,,0,min
Median cumulative indexing throttle time across primary shards,,0,min
Max cumulative indexing throttle time across primary shards,,0,min
Cumulative merge time of primary shards,,72.96123333333334,min
Cumulative merge count of primary shards,,67,
Min cumulative merge time across primary shards,,72.96123333333334,min
Median cumulative merge time across primary shards,,72.96123333333334,min
Max cumulative merge time across primary shards,,72.96123333333334,min
Cumulative merge throttle time of primary shards,,31.61315,min
Min cumulative merge throttle time across primary shards,,31.61315,min
Median cumulative merge throttle time across primary shards,,31.61315,min
Max cumulative merge throttle time across primary shards,,31.61315,min
Cumulative refresh time of primary shards,,5.835483333333333,min
Cumulative refresh count of primary shards,,108,
Min cumulative refresh time across primary shards,,5.835483333333333,min
Median cumulative refresh time across primary shards,,5.835483333333333,min
Max cumulative refresh time across primary shards,,5.835483333333333,min
Cumulative flush time of primary shards,,2.607966666666667,min
Cumulative flush count of primary shards,,31,
Min cumulative flush time across primary shards,,2.607966666666667,min
Median cumulative flush time across primary shards,,2.607966666666667,min
Max cumulative flush time across primary shards,,2.607966666666667,min
Total Young Gen GC time,,3.482,s
Total Young Gen GC count,,221,
Total Old Gen GC time,,0,s
Total Old Gen GC count,,0,
Store size,,28.772654306143522,GB
Translog size,,5.122274160385132e-08,GB
Heap used for segments,,0.0,MB
Heap used for doc values,,0.0,MB
Heap used for terms,,0.0,MB
Heap used for norms,,0.0,MB
Heap used for points,,0.0,MB
Heap used for stored fields,,0.0,MB
Segment count,,43,
Min Throughput,index,69013.34,docs/s
Mean Throughput,index,70167.70,docs/s
Median Throughput,index,70127.00,docs/s
Max Throughput,index,71273.93,docs/s
50th percentile latency,index,612.7077060643543,ms
90th percentile latency,index,663.1674384570314,ms
99th percentile latency,index,3537.240791015618,ms
99.9th percentile latency,index,6925.188898437524,ms
99.99th percentile latency,index,8229.463450390302,ms
100th percentile latency,index,8429.31640625,ms
50th percentile service time,index,612.7077060643543,ms
90th percentile service time,index,663.1674384570314,ms
99th percentile service time,index,3537.240791015618,ms
99.9th percentile service time,index,6925.188898437524,ms
99.99th percentile service time,index,8229.463450390302,ms
100th percentile service time,index,8429.31640625,ms
error rate,index,0.00,%
Min Throughput,wait-until-merges-finish,0.00,ops/s
Mean Throughput,wait-until-merges-finish,0.00,ops/s
Median Throughput,wait-until-merges-finish,0.00,ops/s
Max Throughput,wait-until-merges-finish,0.00,ops/s
100th percentile latency,wait-until-merges-finish,442901.125,ms
100th percentile service time,wait-until-merges-finish,442901.125,ms
error rate,wait-until-merges-finish,0.00,%

Candidate OSB Benchmark numbers
Metric,Task,Value,Unit

There's a 20% improvement in throughput, a 10% improvement in average latency, and an 80% improvement in tail latency (p99). What are your thoughts on the numbers and the overall idea of disabling the flush param now @mgodwan / @backslasht / @shwetathareja ?
Next steps: I'm currently benchmarking with a custom workload where we can test parallel indexing and search performance. I expect search performance to be a lot better in this case while indexing is going on. |
Thanks @CaptainDredge for the numbers. Can you highlight how
|
Thanks for the benchmarks @CaptainDredge, the numbers definitely look encouraging. It's a good surprise to see an overall throughput improvement as well. We should understand better whether the overall throughput improvement is coming from fewer segments (and in turn fewer merges) combined with lower latencies. A few more questions to understand the results better:
Also, on the trade-off: in instances where the disk is slow and flushes are impacted (and can pile up), what will happen? Currently, write threads help to catch up. At some point will we need to throttle the indexing? |
I believe IndexingMemoryController should kick in to throttle the writes in such a case.
OpenSearch/server/src/main/java/org/opensearch/indices/IndexingMemoryController.java (lines 434 to 436 in b4306b6)
That said, it will be helpful to see the benchmark behavior this change brings in when the resource under contention is disk IOPS/throughput. |
I benchmarked with a custom workload with a total uncompressed size of around 600 GB, distributed equally among 3 indices. It indexes 50% of the document corpus initially. Document ids are unique, so all index operations are append-only. Afterwards, ingestion occurs again with the remaining 50% of the document corpus, along with a couple of filter queries and aggregations run in parallel. I used a generous warmup period for ingestion (~30 mins) and a good number of warmup iterations (~100) for search queries. No explicit refresh interval was set.
Baseline: checkPendingFlushOnUpdate = true
This includes one optimization which I identified through lock profiling, where we short-circuited a condition so that write threads no longer get blocked just to check whether the size of the pending-flushes queue is > 0 (a generic sketch of this kind of fast path follows this comment). Opened an issue in Lucene for the same. This alone brought a 7% throughput increase and a 6-7% latency improvement. Hereafter the results contain this optimization as well.
Test procedure: Initial index append similar to nyc
Test procedure: Concurrent index append, i.e. ingesting 5% of the doc corpus in parallel with search
Test procedure: Filter queries running in parallel with ingestion
Test procedure: Aggregation queries running in parallel with ingestion
Average latency: left of separator -> candidate, right of separator -> baseline
Flush sizes (MB):
Flush time (ms):
The number of flushes completed over time is still the same in both baseline and candidate.
Notable conclusions from the experiment
Answering a few important questions:
|
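As a side note on the lock-profiling short circuit mentioned in the comment above (avoiding a blocking lock just to learn whether the pending-flush queue is non-empty): in spirit this is the lock-free-counter fast path sketched below. This is a generic illustration under that assumption, not the actual Lucene DocumentsWriter code or the patch proposed upstream.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Generic illustration: writer threads first read a lock-free counter and only
 * take the queue's lock when there is actually something pending to flush.
 */
final class PendingFlushQueueSketch {
    private final Queue<Object> pendingFlushes = new ArrayDeque<>();
    private final AtomicInteger pendingCount = new AtomicInteger();

    void addPending(Object flushTicket) {
        synchronized (pendingFlushes) {
            pendingFlushes.add(flushTicket);
            pendingCount.incrementAndGet();
        }
    }

    /** Fast path: indexing threads bail out without touching the lock when nothing is pending. */
    boolean helpFlushIfNeeded() {
        if (pendingCount.get() == 0) {
            return false;                 // short circuit: no lock taken
        }
        synchronized (pendingFlushes) {   // slow path only when work is likely
            Object ticket = pendingFlushes.poll();
            if (ticket == null) {
                return false;
            }
            pendingCount.decrementAndGet();
            // ... write the pending segment to disk here ...
            return true;
        }
    }
}
```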
Thanks @CaptainDredge for the detailed analysis. I see that we're seeing consistent improvement in indexing throughput and tail latencies (as expected, since write threads can now work on actual indexing instead of helping the refresh operation flush segments), but at the same time, p50 latency seems to have degraded for indexing (as you've pointed out). Do we understand the cause behind it now? It may help to add some trace logging in the active indexing path (i.e. queueing + processing within Lucene up to IndexingChain#processDocuments) to understand this further. Please update when you're able to validate the hypothesis you've suggested as well.
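For reference, a minimal sketch of the kind of stage-level trace logging suggested here; the wrapper class and stage names are hypothetical rather than existing OpenSearch hooks, only the Log4j2 API calls are real.

```java
import java.util.function.Supplier;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

/**
 * Illustrative timing wrapper for stages of the indexing path
 * (queueing, translog add, Lucene IndexingChain#processDocuments, ...).
 */
final class IndexingTraceTimer {
    private static final Logger logger = LogManager.getLogger(IndexingTraceTimer.class);

    static <T> T timed(String stage, Supplier<T> stageWork) {
        long start = System.nanoTime();
        try {
            return stageWork.get();
        } finally {
            if (logger.isTraceEnabled()) {
                long tookMicros = (System.nanoTime() - start) / 1_000;
                logger.trace("stage [{}] took [{}] micros", stage, tookMicros);
            }
        }
    }
}
```

A call site could wrap each stage (for example the indexIntoLucene step) so that per-stage timings can be compared between candidate and baseline runs.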
Is there any metric confirming this hypothesis? Did you track the segment sizes over time for your concurrent indexing-search workload?
Doesn't it also reduce the number of segments written? Will
Will this also cause an increase in the overall time it takes for documents to become visible for search? Is there a hybrid approach we can try, where we let the refresh thread take care of the largest segments while write threads still continue to help with smaller segments, so that the additional work write threads do is reduced through some setting (compared to 0 in the current experiments)? Also, since refreshes for shards of a single index on a node are serialized, if we plan to incorporate this change we should look into better scheduling of refreshes across shards as well (while ensuring fairness). In the case of multiple shards of the same index on a node, we may be able to reduce index-level lag by moving the concurrency to a higher level. |
I did a per-request trace analysis to find out where the p50 time differences between candidate and baseline come from. I divided the indexing code flow path into 3 mutually exclusive parts: 1) indexIntoLucene (the function which delegates indexing to the Lucene index writer), 2) the translog#add operation, and 3) markSeqNoAsProcessed. Initially the suspects were the translog add and markSeqNoAsProcessed operations, because a higher locking time was observed in the collected JFR profiles, but after adding log tracing on a per-request basis their contribution to the added p50 latency turned out to be minimal, although we still see them consuming a little more time in the candidate, i.e. the pendingFlushCheck-disabled case. Here are the per-request comparison graphs for these 3 parts.
shardBulkAction time distribution:
IndexIntoLucene time distribution:
Translog Add time distribution:
MarkSeqNo time distribution
From the above charts, it is mainly the Lucene writer that is taking more time (~10%) to complete a bulk indexing request, which explains the p50 latency differences. It's not currently feasible to add request-level tracing inside Lucene to explore further which operations are taking more time at the index writer level. A few other interesting points to note:
The above findings are for the custom workload we had, which does concurrent indexing and search. I experimented further with a lower refresh interval. This should be helpful in many cases, so I've raised a PR to expose this setting here (a rough sketch of how such a dynamic index setting can be declared is included further down in this comment). Addressing concerns around refresh lags:
I traced the async refresh action to find out whether we see added lag in refreshes in the candidate case, and I found it to be similar in both cases.
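Since the linked PR exposes this as a setting, here is a rough sketch of how a dynamic, index-scoped boolean setting is typically declared in OpenSearch. The setting key and class name below are placeholders; the actual name and default live in the PR.

```java
import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;

public final class PendingFlushSettingSketch {
    // Placeholder key: the real setting name/default are defined in the linked PR.
    public static final Setting<Boolean> CHECK_PENDING_FLUSH_UPDATE_SETTING = Setting.boolSetting(
        "index.check_pending_flush.enabled", // hypothetical key
        true,                                // default keeps today's Lucene behaviour
        Property.Dynamic,                    // updatable on a live index
        Property.IndexScope                  // per-index setting
    );

    private PendingFlushSettingSketch() {}
}
```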
cc: @mgodwan |
Is your feature request related to a problem? Please describe
OpenSearch triggers a full flush on Lucene based on the translog size, sometimes when a refresh is triggered, or when the flush API is called.
Sometimes indexing threads end up flushing segments to disk when a full flush has been triggered. This incurs a latency overhead and reduces throughput for the same number of clients.
Looking deeper, it appears that indexing threads check for pending flushes on update in order to help flush indexing buffers to disk. We can turn this off by disabling the checkPendingFlushUpdate parameter in the index writer config.
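For illustration, a minimal sketch of what turning this flag off looks like at the Lucene level. The directory path and analyzer are placeholders; in OpenSearch this would be set inside the engine when the IndexWriterConfig is built.

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class DisablePendingFlushHelpExample {
    public static IndexWriter openWriter() throws IOException {
        Directory dir = FSDirectory.open(Paths.get("/tmp/example-index")); // placeholder path
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // When false, indexing threads no longer check for (and help write out)
        // pending segment flushes on every update; flushing is left to the
        // refresh/flush machinery instead.
        config.setCheckPendingFlushUpdate(false);
        return new IndexWriter(dir, config);
    }
}
```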
Also, IndexingMemoryController reduces the number of indexing threads per shard to 1 in order to throttle indexing when memory usage breaches 1.5 times the indexing buffer. It starts limiting the indexing rate beginning with the shard that has the largest memory accounting.
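A simplified sketch of that throttling behaviour is shown below. It is illustrative only: the names and structure are hypothetical and do not mirror the actual IndexingMemoryController code.

```java
import java.util.Comparator;
import java.util.List;

/**
 * Simplified illustration: once total indexing-buffer usage exceeds 1.5x the
 * configured limit, shards are throttled starting with the one using the most
 * memory (throttling effectively limits a shard to a single indexing thread).
 */
class IndexingMemoryThrottleSketch {
    interface ShardMemory {
        long bytesUsed();
        void activateThrottling();
    }

    void maybeThrottle(List<ShardMemory> shards, long indexingBufferBytes, long totalBytesUsed) {
        if (totalBytesUsed <= indexingBufferBytes * 3 / 2) {
            return; // below 1.5x the indexing buffer: nothing to do
        }
        // Throttle the largest consumers first until usage drops back under the limit.
        shards.sort(Comparator.comparingLong(ShardMemory::bytesUsed).reversed());
        long remaining = totalBytesUsed;
        for (ShardMemory shard : shards) {
            if (remaining <= indexingBufferBytes) {
                break;
            }
            shard.activateThrottling();
            remaining -= shard.bytesUsed();
        }
    }
}
```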
Describe the solution you'd like
We should ideally keep checkPendingFlushUpdate disabled, but we do want the indexing threads to start helping with flushing somewhere between the point where flushing starts and the point where throttling starts, to restore the old behaviour when flushing is slow.
checkPendingFlushOnUpdate is part of LiveIndexWriterConfig, which means we can modify it on a live index writer instance. We can follow the existing procedure used for throttling in IndexingMemoryController: enable the flag starting with the shard that has the largest memory accounting, and disable it again once memory is below the threshold.
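A minimal sketch of the proposed toggle, assuming the controller can reach the shard's IndexWriter. The class and method names are hypothetical; only getConfig().setCheckPendingFlushUpdate(...) is the real Lucene API.

```java
import org.apache.lucene.index.IndexWriter;

/**
 * Illustrative controller hooks mirroring the existing activate/deactivate
 * throttling pattern, but toggling Lucene's checkPendingFlushOnUpdate flag instead.
 */
public class PendingFlushHelpToggle {

    /** Invoked for the shard with the largest indexing-buffer usage. */
    public void activateFlushHelp(IndexWriter writer) {
        // Restore the old behaviour: indexing threads help write pending
        // segment flushes, relieving memory pressure sooner.
        writer.getConfig().setCheckPendingFlushUpdate(true);
    }

    /** Invoked once the shard's memory usage drops back below the threshold. */
    public void deactivateFlushHelp(IndexWriter writer) {
        // Back to the optimized default proposed in this issue.
        writer.getConfig().setCheckPendingFlushUpdate(false);
    }
}
```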
Related component
Indexing:Performance
Describe alternatives you've considered
No response
Additional context
No response