Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rustdoc search is excruciatingly slow on very large crates #131156

Open
zopsicle opened this issue Oct 2, 2024 · 4 comments
Open

rustdoc search is excruciatingly slow on very large crates #131156

zopsicle opened this issue Oct 2, 2024 · 4 comments
Labels
A-rustdoc-search Area: Rustdoc's search feature C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue.

Comments

@zopsicle
Copy link
Contributor

zopsicle commented Oct 2, 2024

Steps to reproduce:

  1. Use Firefox on a computer with Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz or comparable CPU.
  2. Go to https://microsoft.github.io/windows-docs-rs. This rustdoc has a search index of 38 MB.
  3. Focus on the search bar and slowly type "ID3D12GraphicsPipelineState".

Expected behavior:

rustdoc quickly shows matching results while typing and when done typing.

Actual behavior:

Web page slows down to a crawl after every delayed keystroke. This is not just due to the search index download, which takes me 1.4 s and can probably not be improved much, but also due to matching the search query against the search index once the download is finished.

Here is Firefox Profiler output: https://share.firefox.dev/4eOgrcE. Of interest are the yellow areas in the graph (the salmon ones are mostly idle GC). Some observations:

  • buildIndex takes about a second.
  • A lot of time seems to be spent in calculating edit distance. This is definitely a useful feature (as can be seen in the example query, "ID3D12GraphicsPipelineState", which has relevant but no exact matches).
  • Some of the edit distance calculations are done by convertNameToId, which is documented to be used only for the In Parameters and In Return Types tabs. If the user is not going to click on those tabs then perhaps these calls are not necessary.
@zopsicle zopsicle added the C-bug Category: This is a bug. label Oct 2, 2024
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Oct 2, 2024
@fmease fmease added I-slow Issue: Problems and improvements with respect to performance of generated code. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. A-rustdoc-search Area: Rustdoc's search feature and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Oct 2, 2024
@lolbinarycat
Copy link
Contributor

We should probably do automatic benchmarking of rustdoc search similar to how its done with rustc, it's very easy to accidentally cause huge perf regressions by adding something significant to an inner loop.

i've also often wondered if it would be worth it to start using wasm in rustdoc search...

@GuillaumeGomez
Copy link
Member

There is benchmarking for rustdoc search, but it's only run when we make changes to it (and its also mostly handled by @notriddle ).

@notriddle
Copy link
Contributor

There is benchmarking for rustdoc search

https://gitlab.com/notriddle/rustdoc-js-profile is where the current effort is.

@notriddle
Copy link
Contributor

notriddle commented Nov 2, 2024

GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this issue Nov 14, 2024
…=GuillaumeGomez

rustdoc: use a trie for name-based search

Potentially rust-lang#131156 — need to try reproducing the problem with `windows`

Preview and profiler results
----------------------------

Here's some quick profiling in Firefox done on the rust compiler docs:

- Before: https://share.firefox.dev/3UPm3M8
- After: https://share.firefox.dev/40LXvYb

Here's the results for the node.js profiler:

- https://notriddle.com/rustdoc-html-demo-15/trie-perf/index.html

Here's a copy that you can use to try it out. Compare it with [the nightly]. Try typing `typecheckercontext` one character at a time, slowly.

- https://notriddle.com/rustdoc-html-demo-15/compiler-doc-trie/index.html

[the nightly]: https://doc.rust-lang.org/nightly/nightly-rustc/

The fuzzy match algo is based on [Fast String Correction with Levenshtein-Automata] and the corresponding implementation code in [moman] and [Lucene]; the bit-packing representation comes from Lucene, but the actual matcher is more based on `fsc.py`. As suggested in the paper, a trie is used to represent the FSA dictionary.

The same trie is used for prefix matching. Substring matching is done with a side table of three-character[^1] windows that point into the trie.

[Fast String Correction with Levenshtein-Automata]: https://github.com/tpn/pdfs/blob/master/Fast%20String%20Correction%20with%20Levenshtein-Automata%20(2002)%20(10.1.1.16.652).pdf
[Lucene]: https://fossies.org/linux/lucene/lucene/core/src/java/org/apache/lucene/util/automaton/Lev1TParametricDescription.java
[moman]: https://gitlab.com/notriddle/moman-rustdoc

User-visible changes
--------------------

I don't expect anybody to notice anything, but it does cause two changes:

- Substring matches, in the middle of a name, only apply if there's three or more characters in the search query.
- Levenshtein distance limit now maxes out at two. In the old version, the limit was w/3, so you could get looser matches for queries with 9 or more characters[^1] in them.
- It uses more RAM.
- It's faster (assuming you don't swap thrash).

[^1]: technically utf-16 code units
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Nov 14, 2024
Rollup merge of rust-lang#133005 - notriddle:notriddle/trie-search, r=GuillaumeGomez

rustdoc: use a trie for name-based search

Potentially rust-lang#131156 — need to try reproducing the problem with `windows`

Preview and profiler results
----------------------------

Here's some quick profiling in Firefox done on the rust compiler docs:

- Before: https://share.firefox.dev/3UPm3M8
- After: https://share.firefox.dev/40LXvYb

Here's the results for the node.js profiler:

- https://notriddle.com/rustdoc-html-demo-15/trie-perf/index.html

Here's a copy that you can use to try it out. Compare it with [the nightly]. Try typing `typecheckercontext` one character at a time, slowly.

- https://notriddle.com/rustdoc-html-demo-15/compiler-doc-trie/index.html

[the nightly]: https://doc.rust-lang.org/nightly/nightly-rustc/

The fuzzy match algo is based on [Fast String Correction with Levenshtein-Automata] and the corresponding implementation code in [moman] and [Lucene]; the bit-packing representation comes from Lucene, but the actual matcher is more based on `fsc.py`. As suggested in the paper, a trie is used to represent the FSA dictionary.

The same trie is used for prefix matching. Substring matching is done with a side table of three-character[^1] windows that point into the trie.

[Fast String Correction with Levenshtein-Automata]: https://github.com/tpn/pdfs/blob/master/Fast%20String%20Correction%20with%20Levenshtein-Automata%20(2002)%20(10.1.1.16.652).pdf
[Lucene]: https://fossies.org/linux/lucene/lucene/core/src/java/org/apache/lucene/util/automaton/Lev1TParametricDescription.java
[moman]: https://gitlab.com/notriddle/moman-rustdoc

User-visible changes
--------------------

I don't expect anybody to notice anything, but it does cause two changes:

- Substring matches, in the middle of a name, only apply if there's three or more characters in the search query.
- Levenshtein distance limit now maxes out at two. In the old version, the limit was w/3, so you could get looser matches for queries with 9 or more characters[^1] in them.
- It uses more RAM.
- It's faster (assuming you don't swap thrash).

[^1]: technically utf-16 code units
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-rustdoc-search Area: Rustdoc's search feature C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

6 participants