Switch from bigrams to trigrams for search #342

Merged 3 commits on Oct 25, 2023

jonathanhefner
Member

Trigrams can provide more accurate search results than bigrams. For example, using bigrams, searching for "sel" would attempt to match the ngrams " s", "se", and "el". For the Rails API (at `7c65a4b83b583f4f`), the top result is `ActiveModel::Serializers` due to "Model" matching "el" and ":Serial" matching " s" and "se". However, using trigrams, "sel" would attempt to match " se" and "sel". In that case, for the Rails API, the top result is `ActiveRecord::QueryMethods#select`.
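To make the example above concrete, here is a minimal sketch of ngram extraction and matching. It is not the code from this PR: `ngrams`, `normalize`, and `matchCount` are hypothetical helpers, and the single leading space, the lowercasing, and the mapping of punctuation to spaces are assumptions inferred from the "sel" example.

```javascript
// Hypothetical helper: split a query into ngrams of length n.
// Assumption: the query is prefixed with one space so that the first
// ngram marks a word start, as in the "sel" example above.
function ngrams(query, n) {
  const padded = " " + query;
  const result = [];
  for (let i = 0; i + n <= padded.length; i++) {
    result.push(padded.slice(i, i + n));
  }
  return result;
}

// Hypothetical normalization: lowercase, and turn each run of
// non-alphanumeric characters (such as "::" or "#") into one space.
function normalize(name) {
  return " " + name.toLowerCase().replace(/[^a-z0-9]+/g, " ");
}

// Count how many of the query's ngrams occur in the candidate name.
function matchCount(query, name, n) {
  const haystack = normalize(name);
  return ngrams(query.toLowerCase(), n).filter((g) => haystack.includes(g)).length;
}

console.log(ngrams("sel", 2)); // [ ' s', 'se', 'el' ]
console.log(ngrams("sel", 3)); // [ ' se', 'sel' ]

// Bigrams: both names contain all three ngrams, so they tie on this count.
console.log(matchCount("sel", "ActiveModel::Serializers", 2)); // 3
console.log(matchCount("sel", "ActiveRecord::QueryMethods#select", 2)); // 3

// Trigrams: only the name containing "sel" matches both ngrams.
console.log(matchCount("sel", "ActiveModel::Serializers", 3)); // 1
console.log(matchCount("sel", "ActiveRecord::QueryMethods#select", 3)); // 2
```

Under these assumptions, bigrams cannot separate the two names by match count alone, while trigrams can. The actual ranking in the search code may weigh matches differently.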

The downside to using trigrams is that the search index increases from 2.9 MB to 8.6 MB. But the data compresses well, so when gzipped the size only increases from 474 kB to 670 kB. And browser heap snapshot size stays reasonably small, increasing from 6.8 MB to 11.1 MB in Firefox and 8.0 MB to 22.2 MB in Chrome.

This uses `Uint8Array` to represent byte arrays in the search index,
reducing heap snapshot size in Firefox from 12.2 MB to 6.8 MB.  Chrome,
however, appears to have a similar optimization already built in, so its
heap snapshot increases marginally, from 6.1 MB to 8.0 MB.
This is in preparation for switching from bigrams to trigrams, reducing
the size of the subsequent diff.
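
As a rough illustration of the `Uint8Array` change described above (a sketch only; the byte values are made up, and the real index layout is not shown in this PR), a plain array of small integers can be packed into a typed array that stores exactly one byte per element:

```javascript
// Hypothetical byte values standing in for one entry of the search index.
const plain = [0, 17, 255, 3];

// Uint8Array.from copies the values into a fixed-size buffer,
// one byte per element.
const packed = Uint8Array.from(plain);

console.log(packed.byteLength); // 4
console.log(packed[2]); // 255

// Values outside 0..255 wrap modulo 256, so the data must already be bytes.
console.log(Uint8Array.from([256])[0]); // 0
```

How much memory a plain array of the same values uses depends on the engine, which is consistent with the different Firefox and Chrome results reported above.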
@jonathanhefner jonathanhefner merged commit de49d9e into rails:main Oct 25, 2023
8 checks passed