Rewrite search #298

Merged
merged 1 commit into from
Sep 9, 2023

Member

@jonathanhefner jonathanhefner commented Sep 1, 2023

Prior to this commit, SDoc's search algorithm was implemented by searcher.js, which builds a regular expression for each token in the query. For example, the query "foo bar" generates the regular expressions /([f])([^f]*?)([o])([^o]*?)([o])([^o]*?)/i and /([b])([^b]*?)([a])([^a]*?)([r])([^r]*?)/i. These regular expressions fuzzy match missing letters, but fail for any other kind of typo, such as added letters or swapped letters. They can also produce surprising results. For example, the query "ActiveRecord::Base" returns ActiveRecord::AttributeAssignment as the top result due to matching "activerecord" attri"b"ute"a"s"s"ignm"e"nt, and there are six(!) other results that appear before ActiveRecord::Base.
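To make the failure modes above concrete, here is a minimal Ruby sketch that rebuilds the pattern shape quoted in this description. It is not the searcher.js code (which is JavaScript); `fuzzy_regexp` is a hypothetical helper name, and the sketch only assumes that each character of a token becomes a `([c])([^c]*?)` pair.

```ruby
# Sketch only: mimics the per-token pattern shape quoted above.
# `fuzzy_regexp` is a hypothetical name, not part of searcher.js or SDoc.
def fuzzy_regexp(token)
  source = token.each_char.map do |char|
    escaped = Regexp.escape(char)
    # One capture for the character, one lazy capture for whatever follows it.
    "([#{escaped}])([^#{escaped}]*?)"
  end.join
  Regexp.new(source, Regexp::IGNORECASE)
end

fuzzy_regexp("foo").source
# => "([f])([^f]*?)([o])([^o]*?)([o])([^o]*?)"

fuzzy_regexp("adress").match?("Address")            # missing letter: matches
fuzzy_regexp("base").match?("AttributeAssignment")  # scattered letters: matches
fuzzy_regexp("existance").match?("existence")       # wrong letter: no match
fuzzy_regexp("foriegn").match?("foreign_key")       # swapped letters: no match
```

Because every query character must appear in order, a single wrong or transposed letter rules out the intended target, while unrelated names that happen to contain the letters in sequence still match.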

This commit implements a new search algorithm based on character-level bigrams. For example, the query "foo bar" will look for results that match "fo", "oo", "o ", " b", "ba", and "ar". Shorthand bigrams for CamelCase names are also included in the search index. For example, entries containing "ActiveRecord" are also associated with the bigram "ar". Bigrams are weighted such that some contribute more to the match score, and results are ordered by match score.
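As a rough illustration of the bigram idea, the following sketch splits a string into character-level bigrams and derives a CamelCase shorthand. Both helper names are hypothetical, and taking the uppercase initials is only an assumption about how the shorthand could be derived; the actual index generation in this commit may differ.

```ruby
# Sketch only: hypothetical helpers, not the code added by this commit.

# Character-level bigrams of a string, compared case-insensitively.
def bigrams(string)
  string.downcase.each_char.each_cons(2).map(&:join)
end

# Assumed shorthand: the uppercase initials of a CamelCase name.
def camel_case_shorthand(name)
  name.scan(/[A-Z]/).join.downcase
end

bigrams("foo bar")                    # => ["fo", "oo", "o ", " b", "ba", "ar"]
camel_case_shorthand("ActiveRecord")  # => "ar"
```

With the shorthand stored alongside the regular bigrams, a query such as "ar base" shares the bigram "ar" with entries containing "ActiveRecord".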

Here are some example queries and their top results with rails/rails@7c65a4b both before and after this commit:

  • "ActiveRecord::Base"
    • top result before: ActiveRecord::AttributeAssignment
    • top result after: ActiveRecord::Base
  • "ar base"
    • top result before: Rails::Generators::Testing::Behavior::ClassMethods#arguments
    • top result after: ActiveRecord::Base
  • "hasmany"
    • top result before: ActiveRecord::Associations::ClassMethods#has_and_belongs_to_many
    • top result after: ActiveRecord::Associations::ClassMethods#has_many
  • "adress"
    • top result before: ActiveSupport::HashWithIndifferentAccess
    • top result after: Mail::Address
  • "existance"
    • top result before: no results
    • top result after: Pathname#existence
  • "foriegn"
    • top result before: no results
    • top result after: String#foreign_key
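
The typo tolerance in the examples above can be sketched by counting shared bigrams. This is a simplification: it assumes every bigram has equal weight, whereas the algorithm in this commit weights bigrams differently. `bigrams` and `bigram_overlap` are hypothetical names.

```ruby
# Sketch only: equal-weight bigram overlap, not SDoc's actual scoring.
def bigrams(string)
  string.downcase.each_char.each_cons(2).map(&:join)
end

# Number of query bigrams that also occur in the candidate name.
def bigram_overlap(query, candidate)
  candidate_bigrams = bigrams(candidate)
  bigrams(query).count { |bigram| candidate_bigrams.include?(bigram) }
end

bigram_overlap("adress", "Address")                    # => 5
bigram_overlap("adress", "HashWithIndifferentAccess")  # => 3
bigram_overlap("existance", "existence")               # => 6
```

Even with this crude scoring, a misspelled query still shares most of its bigrams with the intended name, so the intended name outranks candidates that merely contain the same letters in order.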

This commit also redesigns the presentation of search results. Prior to this commit, result names were cut off at ~43 characters, and result descriptions were cut off at ~53 characters. Result descriptions also included headings, which further reduced the relevant visible text. For example, the visible description for ActionCable::Connection::Base, which has the heading "Action Cable Connection Base", was "Action Cable Connection Base For every WebSocket". In addition, result descriptions included code blocks, which were then mangled by Searchdoc.Panel's stripHTML function. For example, the description for ActiveModel::API::new was

<p>Initializes a new model with the given <code>params</code>.

<pre><code>class Person
  include ActiveModel::API
  attr_accessor ...
</code></pre>

which was transformed to

Initializes a new model with the given params.

<codeclass Person
  include ActiveModel::API
  attr_accessor ...
</pre

With this commit, search results always display the full name. Result descriptions are also fully displayed, including non-link HTML, and now consist of (up to) the first 130 characters of the leading paragraph of the RDoc comment. For example, the description of ActiveModel::API::new becomes "Initializes a new model with the given <code>params</code>."


If you find any queries that give unexpected results, please share, and I will see if they can be improved.

Member

@robin850 robin850 left a comment

Tremendous improvement, thank you Jonathan! 🎉 The method search is something I've always been craving.

# before long methods of short modules. For example, when searching for
# "find_by", this prioritizes ActiveRecord::FinderMethods#find_by before
# ActiveRecord::Querying#find_by_sql.
entries.last[1] *= 0.95 ** rdoc_object.name.length
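
For a sense of scale, here is the arithmetic behind that multiplier. This is a sketch that assumes `rdoc_object.name` is the bare method name and that both entries start from the same weight; `length_penalty` is a hypothetical helper, not part of the diff.

```ruby
# Sketch only: the factor applied by the line above, as a function of the name.
def length_penalty(name)
  0.95**name.length
end

length_penalty("find_by").round(3)      # => 0.698
length_penalty("find_by_sql").round(3)  # => 0.569
```

All else being equal, the shorter name keeps more of its weight, which is why #find_by ends up ahead of #find_by_sql.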
Member

Can't we favor the position of objects with a more "sophisticated" description as well?

For instance, I always need to check ActionController::Rendering#render's documentation but there's a dozen #render methods that I don't care about, some of them are better ranked but have a very succinct description.

Member Author

That's an interesting idea! It's tricky though because the bonus is per bigram instead of per entry. And in cases like ActiveRecord::FinderMethods#find_by vs ActiveRecord::Querying#find_by_sql, we definitely want "find_by" to find #find_by, but #find_by_sql has much longer documentation.

I've pushed a commit that tries to balance the constraints, but it might be overtuned to the specific examples I used. Please play around with it and let me know if you encounter any odd results!

@jonathanhefner jonathanhefner force-pushed the search-rewrite branch 3 times, most recently from dae979b to d00ca33 Compare September 9, 2023 22:15
[`searcher.js`]: https://github.com/ruby/rdoc/blob/v6.5.0/lib/rdoc/generator/template/json_index/js/searcher.js
@jonathanhefner jonathanhefner marked this pull request as ready for review September 9, 2023 22:22
@jonathanhefner jonathanhefner merged commit 6a202e1 into rails:main Sep 9, 2023
10 checks passed