
Consider enabling low-level search cancellation by default #26258

Closed
jpountz opened this issue Aug 17, 2017 · 9 comments · Fixed by #42291
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories

Comments

@jpountz
Contributor

jpountz commented Aug 17, 2017

Low-level search cancellation is currently opt-in, which probably means that nobody is turning it on. Now that we have reduced the overhead of low-level search cancellation (#25776), maybe we should investigate enabling it by default. We'd need to benchmark some cheap queries that match many documents to measure how much of an impact it still has.
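For context, "low-level" cancellation means checking the task's cancellation flag inside the per-document collection loop rather than only between search phases. A minimal Python sketch of that pattern (illustrative only, not the actual Elasticsearch/Lucene implementation; the names `collect_docs` and `TaskCancelledError` are hypothetical):

```python
import threading

class TaskCancelledError(Exception):
    """Raised when a search task is cancelled mid-collection."""

def collect_docs(doc_ids, cancelled: threading.Event, check_every: int = 1024):
    """Collect matching documents, polling a cancellation flag every
    `check_every` documents instead of on every single one."""
    collected = []
    for i, doc in enumerate(doc_ids):
        # The check interval trades per-document overhead
        # against how quickly a cancellation takes effect.
        if i % check_every == 0 and cancelled.is_set():
            raise TaskCancelledError(f"cancelled after {i} docs")
        collected.append(doc)
    return collected
```

Checking only every N documents is the standard way to keep the overhead low while still reacting to cancellation reasonably quickly.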

cc @danielmitterdorfer

@jpountz jpountz added the :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. label Aug 17, 2017
@dakrone
Member

dakrone commented Aug 17, 2017

I happen to have a testing environment already up and running (I was using it to test adaptive replica selection), so I'll run the PMC benchmarks with and without low-level cancellation and see what sort of difference it makes!

@dakrone
Member

dakrone commented Aug 17, 2017

Here are some benchmarks:

LLC == "Low Level Cancellation".

This was the PMC data set with the rate limit removed: 0 replicas, 5 primary shards, and 5 data nodes with 31GB RAM and 16 CPUs each, plus 1 client node with the same specs as the data nodes. All requests were sent to the client node.

Benchmarks without throughput limiting:

| Metric | Operation | Without LLC | With LLC | Unit | % Change |
| --- | --- | --- | --- | --- | --- |
| Indexing time | | 69.0543 | 67.1531 | min | -2.7531957 |
| Merge time | | 32.5057 | 33.1396 | min | 1.9501195 |
| Refresh time | | 3.0202 | 3.09842 | min | 2.5898947 |
| Flush time | | 13.7072 | 13.384 | min | -2.3578849 |
| Merge throttle time | | 23.6543 | 24.4959 | min | 3.5579155 |
| Total Young Gen GC | | 39.65 | 40.26 | s | 1.5384615 |
| Total Old Gen GC | | 0 | 0 | s | 0 |
| Heap used for segments | | 81.1707 | 81.3525 | MB | 0.22397244 |
| Heap used for doc values | | 0.0737076 | 0.0646324 | MB | -12.312435 |
| Heap used for terms | | 67.8418 | 68.0241 | MB | 0.26871339 |
| Heap used for norms | | 0.0424194 | 0.045166 | MB | 6.4748676 |
| Heap used for points | | 6.07154 | 6.07156 | MB | 0.000329406 |
| Heap used for stored fields | | 7.14123 | 7.14708 | MB | 0.081918661 |
| Segment count | | 144 | 153 | | 6.25 |
| Min Throughput | index-append | 1138.78 | 1150.53 | docs/s | 1.0318060 |
| Median Throughput | index-append | 1300.01 | 1263.07 | docs/s | -2.8415166 |
| Max Throughput | index-append | 1391.57 | 1368.03 | docs/s | -1.6916145 |
| 50th percentile latency | index-append | 1866.75 | 1469.37 | ms | -21.287264 |
| 90th percentile latency | index-append | 8509.47 | 7025.3 | ms | -17.441392 |
| 99th percentile latency | index-append | 14910.7 | 19914.9 | ms | 33.561134 |
| 100th percentile latency | index-append | 19114.5 | 21885.8 | ms | 14.498417 |
| 50th percentile service time | index-append | 1866.75 | 1469.37 | ms | -21.287264 |
| 90th percentile service time | index-append | 8509.47 | 7025.3 | ms | -17.441392 |
| 99th percentile service time | index-append | 14910.7 | 19914.9 | ms | 33.561134 |
| 100th percentile service time | index-append | 19114.5 | 21885.8 | ms | 14.498417 |
| error rate | index-append | 0 | 0 | % | 0 |
| Min Throughput | force-merge | 0.251753 | 0.222001 | ops/s | -11.817933 |
| Median Throughput | force-merge | 0.251753 | 0.222001 | ops/s | -11.817933 |
| Max Throughput | force-merge | 0.251753 | 0.222001 | ops/s | -11.817933 |
| 100th percentile latency | force-merge | 3972.12 | 4504.46 | ms | 13.401911 |
| 100th percentile service time | force-merge | 3972.12 | 4504.46 | ms | 13.401911 |
| error rate | force-merge | 0 | 0 | % | 0 |
| Min Throughput | default | 1198.23 | 582.877 | ops/s | -51.355166 |
| Median Throughput | default | 1547.12 | 1329.73 | ops/s | -14.051269 |
| Max Throughput | default | 1602.74 | 1408.59 | ops/s | -12.113630 |
| 50th percentile latency | default | 51.9799 | 56.801 | ms | 9.2749313 |
| 90th percentile latency | default | 95.3086 | 105.505 | ms | 10.698300 |
| 99th percentile latency | default | 146.749 | 154.775 | ms | 5.4692025 |
| 99.9th percentile latency | default | 203.909 | 179.453 | ms | -11.993585 |
| 99.99th percentile latency | default | 217.73 | 199.428 | ms | -8.4058237 |
| 100th percentile latency | default | 218.723 | 202.156 | ms | -7.5744206 |
| 50th percentile service time | default | 51.9799 | 56.801 | ms | 9.2749313 |
| 90th percentile service time | default | 95.3086 | 105.505 | ms | 10.698300 |
| 99th percentile service time | default | 146.749 | 154.775 | ms | 5.4692025 |
| 99.9th percentile service time | default | 203.909 | 179.453 | ms | -11.993585 |
| 99.99th percentile service time | default | 217.73 | 199.428 | ms | -8.4058237 |
| 100th percentile service time | default | 218.723 | 202.156 | ms | -7.5744206 |
| error rate | default | 0 | 0 | % | 0 |
| Min Throughput | term | 953.598 | 1042.09 | ops/s | 9.2798013 |
| Median Throughput | term | 1232.18 | 1255.39 | ops/s | 1.8836534 |
| Max Throughput | term | 1309.56 | 1306.32 | ops/s | -0.24741134 |
| 50th percentile latency | term | 57.3508 | 59.5699 | ms | 3.8693445 |
| 90th percentile latency | term | 173.486 | 121.373 | ms | -30.038735 |
| 99th percentile latency | term | 279.74 | 296.849 | ms | 6.1160363 |
| 99.9th percentile latency | term | 337.124 | 405.088 | ms | 20.159941 |
| 99.99th percentile latency | term | 421.182 | 439.819 | ms | 4.4249279 |
| 100th percentile latency | term | 423.586 | 452.4 | ms | 6.8023967 |
| 50th percentile service time | term | 57.3508 | 59.5699 | ms | 3.8693445 |
| 90th percentile service time | term | 173.486 | 121.373 | ms | -30.038735 |
| 99th percentile service time | term | 279.74 | 296.849 | ms | 6.1160363 |
| 99.9th percentile service time | term | 337.124 | 405.088 | ms | 20.159941 |
| 99.99th percentile service time | term | 421.182 | 439.819 | ms | 4.4249279 |
| 100th percentile service time | term | 423.586 | 452.4 | ms | 6.8023967 |
| error rate | term | 0 | 0 | % | 0 |
| Min Throughput | phrase | 1095.37 | 854.846 | ops/s | -21.958242 |
| Median Throughput | phrase | 1321.89 | 1062.03 | ops/s | -19.658217 |
| Max Throughput | phrase | 1380.84 | 1179.74 | ops/s | -14.563599 |
| 50th percentile latency | phrase | 56.475 | 63.5914 | ms | 12.600974 |
| 90th percentile latency | phrase | 134.319 | 149.337 | ms | 11.180846 |
| 99th percentile latency | phrase | 265.491 | 295.219 | ms | 11.197366 |
| 99.9th percentile latency | phrase | 351.242 | 344.19 | ms | -2.0077326 |
| 99.99th percentile latency | phrase | 412.284 | 383.846 | ms | -6.8976725 |
| 100th percentile latency | phrase | 426.213 | 401.919 | ms | -5.6999669 |
| 50th percentile service time | phrase | 56.475 | 63.5914 | ms | 12.600974 |
| 90th percentile service time | phrase | 134.319 | 149.337 | ms | 11.180846 |
| 99th percentile service time | phrase | 265.491 | 295.219 | ms | 11.197366 |
| 99.9th percentile service time | phrase | 351.242 | 344.19 | ms | -2.0077326 |
| 99.99th percentile service time | phrase | 412.284 | 383.846 | ms | -6.8976725 |
| 100th percentile service time | phrase | 426.213 | 401.919 | ms | -5.6999669 |
| error rate | phrase | 0 | 0 | % | 0 |
| Min Throughput | articles_monthly_agg_uncached | 489.986 | 398.845 | ops/s | -18.600736 |
| Median Throughput | articles_monthly_agg_uncached | 731.845 | 659.2 | ops/s | -9.9262822 |
| Max Throughput | articles_monthly_agg_uncached | 762.921 | 679.818 | ops/s | -10.892740 |
| 50th percentile latency | articles_monthly_agg_uncached | 12.4133 | 13.6625 | ms | 10.063400 |
| 90th percentile latency | articles_monthly_agg_uncached | 13.9322 | 15.0212 | ms | 7.8164253 |
| 99th percentile latency | articles_monthly_agg_uncached | 19.5369 | 48.8689 | ms | 150.13641 |
| 99.9th percentile latency | articles_monthly_agg_uncached | 73.4286 | 80.5955 | ms | 9.7603659 |
| 99.99th percentile latency | articles_monthly_agg_uncached | 75.2789 | 82.1339 | ms | 9.1061373 |
| 100th percentile latency | articles_monthly_agg_uncached | 75.42 | 85.6355 | ms | 13.544816 |
| 50th percentile service time | articles_monthly_agg_uncached | 12.4133 | 13.6625 | ms | 10.063400 |
| 90th percentile service time | articles_monthly_agg_uncached | 13.9322 | 15.0212 | ms | 7.8164253 |
| 99th percentile service time | articles_monthly_agg_uncached | 19.5369 | 48.8689 | ms | 150.13641 |
| 99.9th percentile service time | articles_monthly_agg_uncached | 73.4286 | 80.5955 | ms | 9.7603659 |
| 99.99th percentile service time | articles_monthly_agg_uncached | 75.2789 | 82.1339 | ms | 9.1061373 |
| 100th percentile service time | articles_monthly_agg_uncached | 75.42 | 85.6355 | ms | 13.544816 |
| error rate | articles_monthly_agg_uncached | 0 | 0 | % | 0 |
| Min Throughput | articles_monthly_agg_cached | 3915.87 | 3697.59 | ops/s | -5.5742402 |
| Median Throughput | articles_monthly_agg_cached | 3998.18 | 3805.4 | ops/s | -4.8216939 |
| Max Throughput | articles_monthly_agg_cached | 4080.48 | 3913.21 | ops/s | -4.0992726 |
| 50th percentile latency | articles_monthly_agg_cached | 2.12909 | 2.11471 | ms | -0.67540592 |
| 90th percentile latency | articles_monthly_agg_cached | 2.69022 | 2.6613 | ms | -1.0750050 |
| 99th percentile latency | articles_monthly_agg_cached | 4.11145 | 4.87149 | ms | 18.485936 |
| 99.9th percentile latency | articles_monthly_agg_cached | 58.0363 | 63.997 | ms | 10.270641 |
| 99.99th percentile latency | articles_monthly_agg_cached | 66.5185 | 79.0046 | ms | 18.770868 |
| 100th percentile latency | articles_monthly_agg_cached | 66.655 | 79.2316 | ms | 18.868202 |
| 50th percentile service time | articles_monthly_agg_cached | 2.12909 | 2.11471 | ms | -0.67540592 |
| 90th percentile service time | articles_monthly_agg_cached | 2.69022 | 2.6613 | ms | -1.0750050 |
| 99th percentile service time | articles_monthly_agg_cached | 4.11145 | 4.87149 | ms | 18.485936 |
| 99.9th percentile service time | articles_monthly_agg_cached | 58.0363 | 63.997 | ms | 10.270641 |
| 99.99th percentile service time | articles_monthly_agg_cached | 66.5185 | 79.0046 | ms | 18.770868 |
| 100th percentile service time | articles_monthly_agg_cached | 66.655 | 79.2316 | ms | 18.868202 |
| error rate | articles_monthly_agg_cached | 0 | 0 | % | 0 |

Benchmarks with throughput limiting

And here's the original benchmark, which caps throughput at 20 ops/s:

| Metric | Operation | Without LLC | With LLC | Unit | % Change |
| --- | --- | --- | --- | --- | --- |
| Indexing time | | 77.7095 | 67.8821 | min | -12.646330 |
| Merge time | | 47.3979 | 45.3001 | min | -4.4259345 |
| Refresh time | | 7.17313 | 7.14718 | min | -0.36176676 |
| Flush time | | 10.6962 | 9.6371 | min | -9.9016473 |
| Merge throttle time | | 29.5265 | 27.4508 | min | -7.0299561 |
| Total Young Gen GC | | 92.173 | 51.868 | s | -43.727556 |
| Total Old Gen GC | | 0.038 | 0 | s | -100. |
| Heap used for segments | | 80.0692 | 79.184 | MB | -1.1055437 |
| Heap used for doc values | | 0.0355225 | 0.0582314 | MB | 63.928215 |
| Heap used for terms | | 66.7801 | 65.8814 | MB | -1.3457602 |
| Heap used for norms | | 0.0418091 | 0.0366211 | MB | -12.408782 |
| Heap used for points | | 6.07152 | 6.07146 | MB | -0.0009882204 |
| Heap used for stored fields | | 7.14022 | 7.13637 | MB | -0.053919907 |
| Segment count | | 142 | 125 | | -11.971831 |
| Min Throughput | index-append | 1045.24 | 1086.32 | docs/s | 3.9301978 |
| Median Throughput | index-append | 1129.41 | 1145.54 | docs/s | 1.4281793 |
| Max Throughput | index-append | 1269.59 | 1293.22 | docs/s | 1.8612308 |
| 50th percentile latency | index-append | 1638.46 | 1650.5 | ms | 0.73483637 |
| 90th percentile latency | index-append | 8871.93 | 8910.91 | ms | 0.43936325 |
| 99th percentile latency | index-append | 21092.1 | 21100.9 | ms | 0.041721782 |
| 100th percentile latency | index-append | 27369.6 | 23305.9 | ms | -14.847495 |
| 50th percentile service time | index-append | 1638.46 | 1650.5 | ms | 0.73483637 |
| 90th percentile service time | index-append | 8871.93 | 8910.91 | ms | 0.43936325 |
| 99th percentile service time | index-append | 21092.1 | 21100.9 | ms | 0.041721782 |
| 100th percentile service time | index-append | 27369.6 | 23305.9 | ms | -14.847495 |
| error rate | index-append | 0 | 0 | % | 0 |
| Min Throughput | force-merge | 0.305927 | 0.302477 | ops/s | -1.1277200 |
| Median Throughput | force-merge | 0.305927 | 0.302477 | ops/s | -1.1277200 |
| Max Throughput | force-merge | 0.305927 | 0.302477 | ops/s | -1.1277200 |
| 100th percentile latency | force-merge | 3268.73 | 3306.02 | ms | 1.1408100 |
| 100th percentile service time | force-merge | 3268.73 | 3306.02 | ms | 1.1408100 |
| error rate | force-merge | 0 | 0 | % | 0 |
| Min Throughput | default | 20.0108 | 20.0106 | ops/s | -0.0009994603 |
| Median Throughput | default | 20.0161 | 20.016 | ops/s | -0.0004995978 |
| Max Throughput | default | 20.0315 | 20.0315 | ops/s | 0. |
| 50th percentile latency | default | 10.1292 | 10.2704 | ms | 1.3939897 |
| 90th percentile latency | default | 10.8391 | 10.9334 | ms | 0.86999843 |
| 99th percentile latency | default | 12.2628 | 12.4249 | ms | 1.3218841 |
| 99.9th percentile latency | default | 29.3506 | 14.5497 | ms | -50.427930 |
| 100th percentile latency | default | 69.6888 | 17.3164 | ms | -75.151818 |
| 50th percentile service time | default | 10.014 | 10.1641 | ms | 1.4989015 |
| 90th percentile service time | default | 10.7325 | 10.8229 | ms | 0.84230142 |
| 99th percentile service time | default | 12.0121 | 12.3162 | ms | 2.5316140 |
| 99.9th percentile service time | default | 19.5976 | 14.4526 | ms | -26.253215 |
| 100th percentile service time | default | 69.5873 | 17.2111 | ms | -75.266895 |
| error rate | default | 0 | 0 | % | 0 |
| Min Throughput | term | 20.0107 | 20.011 | ops/s | 0.001499198 |
| Median Throughput | term | 20.0161 | 20.0163 | ops/s | 0.000999196 |
| Max Throughput | term | 20.0321 | 20.0326 | ops/s | 0.002495994 |
| 50th percentile latency | term | 9.77052 | 9.59881 | ms | -1.7574295 |
| 90th percentile latency | term | 10.5002 | 10.1316 | ms | -3.5104093 |
| 99th percentile latency | term | 11.4375 | 11.3978 | ms | -0.34710383 |
| 99.9th percentile latency | term | 67.178 | 56.846 | ms | -15.380035 |
| 100th percentile latency | term | 85.7844 | 61.6633 | ms | -28.118283 |
| 50th percentile service time | term | 9.66603 | 9.49013 | ms | -1.8197750 |
| 90th percentile service time | term | 10.3846 | 10.017 | ms | -3.5398571 |
| 99th percentile service time | term | 11.3048 | 10.8906 | ms | -3.6639304 |
| 99.9th percentile service time | term | 67.0728 | 56.7324 | ms | -15.416682 |
| 100th percentile service time | term | 85.6728 | 61.5588 | ms | -28.146623 |
| error rate | term | 0 | 0 | % | 0 |
| Min Throughput | phrase | 20.011 | 20.0111 | ops/s | 0.000499725 |
| Median Throughput | phrase | 20.0166 | 20.0167 | ops/s | 0.000499585 |
| Max Throughput | phrase | 20.0321 | 20.0334 | ops/s | 0.006489584 |
| 50th percentile latency | phrase | 9.01531 | 8.7157 | ms | -3.3233466 |
| 90th percentile latency | phrase | 9.79927 | 9.24202 | ms | -5.6866481 |
| 99th percentile latency | phrase | 17.848 | 10.6875 | ms | -40.119341 |
| 99.9th percentile latency | phrase | 71.0084 | 75.4452 | ms | 6.2482749 |
| 100th percentile latency | phrase | 86.8217 | 79.5964 | ms | -8.3219978 |
| 50th percentile service time | phrase | 8.90464 | 8.60509 | ms | -3.3639765 |
| 90th percentile service time | phrase | 9.64173 | 9.12288 | ms | -5.3812957 |
| 99th percentile service time | phrase | 11.0576 | 10.0716 | ms | -8.9169440 |
| 99.9th percentile service time | phrase | 70.9039 | 75.3308 | ms | 6.2435212 |
| 100th percentile service time | phrase | 86.7137 | 79.4924 | ms | -8.3277498 |
| error rate | phrase | 0 | 0 | % | 0 |
| Min Throughput | articles_monthly_agg_uncached | 20.0125 | 20.0116 | ops/s | -0.0044971893 |
| Median Throughput | articles_monthly_agg_uncached | 20.0218 | 20.0211 | ops/s | -0.0034961892 |
| Max Throughput | articles_monthly_agg_uncached | 20.0729 | 20.0728 | ops/s | -0.0004981841 |
| 50th percentile latency | articles_monthly_agg_uncached | 12.6751 | 13.0424 | ms | 2.8978075 |
| 90th percentile latency | articles_monthly_agg_uncached | 15.7435 | 16.4689 | ms | 4.6076158 |
| 99th percentile latency | articles_monthly_agg_uncached | 17.8811 | 21.1732 | ms | 18.411060 |
| 99.9th percentile latency | articles_monthly_agg_uncached | 19.1911 | 26.0265 | ms | 35.617552 |
| 100th percentile latency | articles_monthly_agg_uncached | 19.4911 | 33.2746 | ms | 70.716891 |
| 50th percentile service time | articles_monthly_agg_uncached | 12.5728 | 12.9147 | ms | 2.7193624 |
| 90th percentile service time | articles_monthly_agg_uncached | 15.6359 | 16.3626 | ms | 4.6476378 |
| 99th percentile service time | articles_monthly_agg_uncached | 17.7741 | 21.0653 | ms | 18.516831 |
| 99.9th percentile service time | articles_monthly_agg_uncached | 19.0881 | 25.9145 | ms | 35.762596 |
| 100th percentile service time | articles_monthly_agg_uncached | 19.3756 | 33.1679 | ms | 71.183860 |
| error rate | articles_monthly_agg_uncached | 0 | 0 | % | 0 |
| Min Throughput | articles_monthly_agg_cached | 20.0157 | 20.0157 | ops/s | 0. |
| Median Throughput | articles_monthly_agg_cached | 20.0267 | 20.0268 | ops/s | 0.000499333 |
| Max Throughput | articles_monthly_agg_cached | 20.0924 | 20.0921 | ops/s | -0.0014931019 |
| 50th percentile latency | articles_monthly_agg_cached | 3.83617 | 3.77054 | ms | -1.7108209 |
| 90th percentile latency | articles_monthly_agg_cached | 4.1231 | 4.08703 | ms | -0.87482719 |
| 99th percentile latency | articles_monthly_agg_cached | 4.45246 | 4.58526 | ms | 2.9826208 |
| 99.9th percentile latency | articles_monthly_agg_cached | 7.40099 | 41.7944 | ms | 464.71364 |
| 100th percentile latency | articles_monthly_agg_cached | 40.8482 | 50.4352 | ms | 23.469822 |
| 50th percentile service time | articles_monthly_agg_cached | 3.72916 | 3.65878 | ms | -1.8872883 |
| 90th percentile service time | articles_monthly_agg_cached | 4.00954 | 3.97538 | ms | -0.85196806 |
| 99th percentile service time | articles_monthly_agg_cached | 4.34392 | 4.46619 | ms | 2.8147388 |
| 99.9th percentile service time | articles_monthly_agg_cached | 7.27338 | 41.6745 | ms | 472.97295 |
| 100th percentile service time | articles_monthly_agg_cached | 40.7493 | 50.2926 | ms | 23.419543 |
| error rate | articles_monthly_agg_cached | 0 | 0 | % | 0 |

@jpountz
Contributor Author

jpountz commented Aug 21, 2017

I'm ignoring the 99.9th and 100th percentiles, which look a bit noisy, but otherwise the performance hit looks very low.

@polyfractal
Contributor

So @dakrone ran a bunch more benchmarks for me and I crunched the search numbers. The differences were not statistically significant in most cases. The values that are statistically significantly different (p<0.05) are below. Note that this is still n=5, so not a huge sample and probably still noisy:

With Throttling Enabled

Interestingly, all of these values went down slightly with LLC enabled, although they all differ by only 4-5%, so this may still be noise.

| Metric | Operation | Off Mean | On Mean | Percent Change | p |
| --- | --- | --- | --- | --- | --- |
| 50th percentile latency | articles_monthly_agg_uncached | 12.34896 | 11.78486 | -4.57% | 0.004345 |
| 90th percentile latency | articles_monthly_agg_uncached | 14.42484 | 13.92992 | -3.43% | 0.005527 |
| 99th percentile latency | articles_monthly_agg_uncached | 16.96462 | 16.24404 | -4.25% | 0.026407 |
| 50th percentile service time | articles_monthly_agg_uncached | 12.23786 | 11.6722 | -4.62% | 0.004196 |
| 90th percentile service time | articles_monthly_agg_uncached | 14.31138 | 13.82378 | -3.41% | 0.005657 |
| 99th percentile service time | articles_monthly_agg_uncached | 16.84864 | 16.10544 | -4.41% | 0.021026 |

With Throttling Disabled

With throttling disabled, it looks like there are some real effects on phrase and agg_cached. Throughput goes down, latency is up:

| Metric | Operation | Off Mean | On Mean | Percent Change | p |
| --- | --- | --- | --- | --- | --- |
| Min Throughput | phrase | 110.5914 | 97.37636 | -11.95% | 0.000924 |
| Median Throughput | phrase | 113.5752 | 103.6116 | -8.77% | 0.002422 |
| Max Throughput | phrase | 115.0246 | 105.9648 | -7.88% | 0.008309 |
| 50th percentile latency | phrase | 8.302894 | 8.952224 | 7.82% | 0.006021 |
| 90th percentile latency | phrase | 9.133522 | 10.32883 | 13.09% | 0.002684 |
| 99th percentile latency | phrase | 11.56266 | 13.4402 | 16.24% | 0.015439 |
| 50th percentile service time | phrase | 8.302894 | 8.952224 | 7.82% | 0.006021 |
| 90th percentile service time | phrase | 9.133522 | 10.32883 | 13.09% | 0.002684 |
| 99th percentile service time | phrase | 11.56266 | 13.4402 | 16.24% | 0.015439 |

| Metric | Operation | Off Mean | On Mean | Percent Change | p |
| --- | --- | --- | --- | --- | --- |
| Min Throughput | articles_monthly_agg_cached | 375.41 | 294.1092 | -21.66% | 0.001043 |
| Median Throughput | articles_monthly_agg_cached | 394.1252 | 330.3738 | -16.18% | 0.000537 |
| Max Throughput | articles_monthly_agg_cached | 407.5668 | 337.36 | -17.23% | 0.000930 |
| 50th percentile latency | articles_monthly_agg_cached | 2.404374 | 2.912346 | 21.13% | 0.000386 |
| 90th percentile latency | articles_monthly_agg_cached | 2.810786 | 3.480044 | 23.81% | 0.000033 |
| 50th percentile service time | articles_monthly_agg_cached | 2.404374 | 2.912346 | 21.13% | 0.000386 |
| 90th percentile service time | articles_monthly_agg_cached | 2.810786 | 3.480044 | 23.81% | 0.000033 |

@polyfractal
Contributor

Oh, and note this is n=5 over the summary values (i.e. 5 laps of each benchmark), not the raw latencies. I didn't realize Rally could save all the response latencies when I asked @dakrone to run the test. If we want, I can set up an environment, rerun with Rally configured to save latencies, and analyze the whole population. That would give a better analysis.

@bleskes bleskes added :Search/Search Search-related issues that do not fall into other categories and removed :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. labels Mar 20, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@jpountz
Contributor Author

jpountz commented Mar 5, 2019

One additional data point: Kibana is starting to think about attaching an X-Opaque-Id header to its search requests in order to be able to cancel ongoing requests, e.g. if the user moves to a different tab. Doing so would work much better if low-level cancellation were enabled by default.

If the penalty feels too significant to enable low-level cancellation by default for all requests, another option could be to only enable it when an X-Opaque-Id is provided, which is a good indication that the user wants to be able to cancel ongoing requests.
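For illustration, the Kibana-style flow could look roughly like this: tag the search with an `X-Opaque-Id` header, then find and cancel the matching tasks through the task management API. The header and the `_tasks` / `_tasks/{task_id}/_cancel` endpoints are real Elasticsearch APIs; the cluster URL and helper function names are assumptions made for this sketch:

```python
import json
from urllib import request

ES = "http://localhost:9200"  # assumption: local dev cluster

def search_with_opaque_id(index, body, opaque_id):
    """Send a search tagged with an X-Opaque-Id header so the request
    can later be located (and cancelled) via the task management API."""
    req = request.Request(
        f"{ES}/{index}/_search",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json",
                 "X-Opaque-Id": opaque_id},
    )
    return request.urlopen(req)

def find_cancellable_tasks(tasks_response, opaque_id):
    """From a GET _tasks?detailed&actions=*search response body, pick
    out the task ids whose X-Opaque-Id header matches."""
    ids = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            if task.get("headers", {}).get("X-Opaque-Id") == opaque_id:
                # to cancel: POST {ES}/_tasks/{task_id}/_cancel
                ids.append(task_id)
    return ids
```

The point of the per-request header is that the cluster only needs the extra cancellation checks for requests the client has declared it may want to abort.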

I'm adding the team-discuss label to rediscuss this issue.

@jpountz
Contributor Author

jpountz commented Mar 11, 2019

We discussed this issue in the search meeting, some points were made:

  • some queries show a non-negligible slowdown in the above benchmarks (or is it noise?), so changing the default might come as a bad surprise to some users
  • a per-request switch is possible and would let users opt in to low-level cancellation only for the requests they might need to cancel, at the cost of a greater API surface.

We agreed that the next step should be to run the benchmarks again, as this might interact in interesting ways with the changes to query execution now that we no longer count all hits by default.

@jpountz
Contributor Author

jpountz commented Mar 11, 2019

A related thought from this discussion: we should also look into integrating ExitableDirectoryReader to be able to cancel query rewriting, multi-term queries, and points-based queries. We should perhaps also add additional task-cancellation checks for frozen indices.
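Conceptually, ExitableDirectoryReader wraps the index reader so that long enumerations (e.g. the terms visited while rewriting a multi-term query) periodically check whether the query has been cancelled or timed out. A rough Python sketch of that idea (hypothetical names, not the Lucene implementation):

```python
import time

class TimeExceededError(Exception):
    """Raised when an enumeration runs past its deadline."""

def exitable_terms(terms, deadline, check_every=64):
    """Wrap a terms enumeration so a long multi-term rewrite checks a
    deadline every `check_every` terms instead of running to completion."""
    for i, term in enumerate(terms):
        if i % check_every == 0 and time.monotonic() > deadline:
            raise TimeExceededError("query rewrite timed out")
        yield term
```

Because the check sits inside the enumeration itself, even a query that never reaches the collection phase (it is still being rewritten) can be aborted.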

markharwood added a commit to markharwood/elasticsearch that referenced this issue May 22, 2019
Benchmarking on worst-case queries (max agg on match_all or popular-term query with large index) was not noticeably slower.

Closes elastic#26258
markharwood added a commit that referenced this issue May 22, 2019
Benchmarking on worst-case queries (max agg on match_all or popular-term query with large index) was not noticeably slower.

Closes #26258
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this issue May 27, 2019
Benchmarking on worst-case queries (max agg on match_all or popular-term query with large index) was not noticeably slower.

Closes elastic#26258
markharwood added a commit to markharwood/elasticsearch that referenced this issue Jun 7, 2019
Benchmarking on worst-case queries (max agg on match_all or popular-term query with large index) was not noticeably slower.

Closes elastic#26258
markharwood added a commit that referenced this issue Jun 7, 2019
Benchmarking on worst-case queries (max agg on match_all or popular-term query with large index) was not noticeably slower.

Closes #26258