Indexed symbol search: definitive performance regression #5364
Some notes on resource consumption of zoekt-webserver in this case:
During the testing:
- CPU usage (graph): the spike shows 100% of the 7.7 available CPUs being consumed.
- Memory usage in GiB (graph): resident memory was only about half of that, indicating maybe we can further tune GOGC or something.
- No swapping was occurring, though.
- Network bytes in/out, in Mbps (graph).
Note: I suspect the easiest way to make forward progress on this will be to run vegeta against k8s.sgdev.org or a local dev server and find a similarly poor-performing query.
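For reference, a minimal vegeta run against a single search URL might look something like the sketch below; the endpoint, query, rate, and duration are placeholders rather than the actual targets used in these tests.

```sh
# Sketch only: the endpoint, query, rate, and duration are placeholders,
# not the actual targets used in these load tests.
echo "GET https://k8s.sgdev.org/search?q=repo:example+foo" | \
  vegeta attack -rate=10/1s -duration=60s | \
  vegeta report
```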
This is surprising. My best guess is we are using more RAM due to storing symbol information offsets in memory now (previously we never ran ctags). This leaves less RAM for the FS cache => more often paging in memory when reading the posting lists. But that is just a guess. The best thing to do is to run a profile on a regressed pod.

What exactly are the CPU and RAM limits/requests, btw? Any chance we were indexing while these tests were running (in either case)? Another thing is you don't specify

An interesting thing to first rule out is to run the new zoekt-webserver, but only on v15 indexes, and see if we still have a perf regression. This may indicate some other surprising change that happened, rather than it being due to symbols.

@kzh can you comment on this and create a plan of action? We can sync sometime this week and do some investigation together if you like. I see you are on this from some chat on Slack, but some visibility for others would be great (so please comment here :) ). I suggest we have a disk snapshot of v15 and v16 indexes so we can test both relatively easily. Alternatively we can temporarily run indexed-search twice, and just update the
Not saying this isn't a possible explanation, but the graph above shows there is no swapping taking place (at least according to cAdvisor / Docker).
Since this is a deploy-sourcegraph-docker deployment, they are both the same (just limits).
Given the CPU bump, it's especially surprising I am seeing these results.
I am not specifying
If I understand correctly, this isn't possible without a code change because the new zoekt-webserver deletes the v15 indexes and generates the v16 ones, is that right? I assume we just need to add an env var or create a debug image with that disabled? This update may also interest you:
Also note that:
Initially, I thought this was the case too. However, we did another load test after Stephen turned off the feature flag and waited for all the repositories to reindex. This would mean the bulk of what was added to the index file (the ctags information) would no longer be present. There were similar results showing a performance regression for text search. I'm not sure how long it would take for
The next step will be to run profiling via pprof while the vegeta load tests are happening. This will probably be done for the
From this, we should be able to figure out what is hanging during the high QPS (100) tests.
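As a rough sketch of that plan, assuming zoekt-webserver exposes Go's standard net/http/pprof endpoints on its listen address (shown here as :6070, which is an assumption):

```sh
# Assumes the standard net/http/pprof endpoints are exposed on :6070;
# adjust the host/port to point at the regressed pod.

# 30s CPU profile captured while vegeta is applying load:
go tool pprof -seconds 30 http://localhost:6070/debug/pprof/profile

# In-use heap snapshot, to compare memory before/after the symbols change:
go tool pprof http://localhost:6070/debug/pprof/heap
```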
I'll try this out. However, the v16 indexes with ctags disabled would be very similar in storage/memory/cpu to the v15 indexes anyways.
Plan sounds good. I'd caution against making the vegeta concurrency too high, since profiles become harder to read. Use the lowest QPS rate with a regression. cc @sourcegraph/core-services
Oh hello there, new tool, I like it!
Definitely agree, I plan to redo these load tests with this actually to eliminate that concern since our behavior can be quite erratic with even 1 QPS on very slow / heavy searches. I found @tsenart added support for sub-1 QPS rates in tsenart/vegeta#423 last night which I've found quite useful -- i.e.:
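Something along these lines, for illustration (the exact target and flag values here are placeholders, not the command I actually ran):

```sh
# Illustrative only: a rate below one request per second, expressed as
# requests per minute. The target URL and values are placeholders.
echo "GET https://k8s.sgdev.org/search?q=repo:example+foo" | \
  vegeta attack -rate=30/1m -duration=30m | \
  vegeta report
```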
maybe I'll send a PR to note that in vegeta's README.
It's not sub-1 QPS. It simply sends requests as fast as possible, when |
yeah, that was poor wording on my part -- thanks for the clarification. I said sub-1 QPS because I was thinking specifically about
Update: I'm currently working on a bunch of memory optimizations for Zoekt symbol search. This regression is probably due to what Keegan mentioned about increased paging, since there are now more contenders for memory (namely the symbols data). This regression can probably be fixed simply by increasing memory, but before falling back to that option, I want to see if the memory optimizations will do the trick, or at least significantly lessen the necessary memory increase. Some random observations that I found interesting:
Update: Kevin and I spent a while re-running some independent benchmarks, and it appears that the cases where performance looked worse were entirely due to factors I hadn't accounted for in my benchmarks. After accounting for the too-high QPS, avoiding running the requests in parallel, and a few other things, it does appear perf is generally the same. I'll post a further update here tomorrow with numbers from an overnight load test.
Took me longer than I thought to make sense of the numbers, but I am finally confident enough in them to close this out. "Definitive" isn't so definitive, and load testing is H.A.R.D. Final load test comparison between v3.6.2 and v3.7.1: https://docs.google.com/spreadsheets/d/1oPzePjD8YLrnppLm3nk46h48_Cxipz4_QqRMBYaIOYQ/edit?usp=sharing Quoting my summary in there:
Thank you @kzh and @keegancsmith for your diligence here in helping me look into this (and sorry this turned out to be a waste of your time!) -- I've learned a lot here so I can do this better in the future, and I have some ideas for making our search more seamlessly load-testable that I'll propose later.
Workaround
When upgrading to Sourcegraph v3.7.1, set the following in your site configuration to avoid opting into the new symbol search implementation:
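For illustration, the setting would look roughly like the snippet below; the exact key name (`search.index.symbols.enabled`) is an assumption here, so check the v3.7 site configuration documentation for the precise flag.

```jsonc
{
  // Assumed key name; verify against the v3.7 site configuration docs.
  "search.index.symbols.enabled": false
}
```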
Problem
After we introduced indexed symbol search in v3.7.1, both symbol and text search performance appears to have regressed. This issue is to investigate and fix that.
Note in the following that:

- The P95 change (orange) column indicates the percentage change in 95th percentile query duration after migrating from Sourcegraph v3.5.2 to v3.7.1. Positive numbers are performance regressions, negative numbers are performance gains.
- All queries were run against the oa.sgdev.org instance over a 10h period using vegeta, and the CSV comparison report was generated using kingkai.
- The instance has 12,981 repositories in total, cloned and indexed fully, and 42GB of data was generated by vegeta (happy to upload this somewhere if anyone is interested).
- I eliminated all external factors that I could not attribute to the code / design itself, i.e.:
Results: https://docs.google.com/spreadsheets/d/12YpKTP58FOIUkFqrdBotlDJHyinu7-itiY-Hd0xuCoY/edit#gid=2136071970
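For context on how per-version numbers like these can be collected with vegeta (the kingkai invocation is omitted, and the targets file, rate, duration, and output names below are placeholders rather than the exact ones used):

```sh
# Placeholders throughout: targets file, rate, duration, and output names.
# Run the same target list against each version and keep the raw results.
vegeta attack -targets=targets.txt -rate=1/1s -duration=10h > v3.5.2.bin
vegeta attack -targets=targets.txt -rate=1/1s -duration=10h > v3.7.1.bin

# Per-version summaries and latency histograms:
vegeta report v3.5.2.bin
vegeta report -type='hist[0,100ms,500ms,1s,5s,10s]' v3.7.1.bin
```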