Rewrite search #298
Tremendous improvement, thank you Jonathan! 🎉 The method search is something I've always been craving.
lib/sdoc/search_index.rb (outdated)
    # before long methods of short modules. For example, when searching for
    # "find_by", this prioritizes ActiveRecord::FinderMethods#find_by before
    # ActiveRecord::Querying#find_by_sql.
    entries.last[1] *= 0.95 ** rdoc_object.name.length
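
To make the effect of that multiplier concrete, here is a minimal sketch of the `0.95 ** name.length` factor on its own. The helper name `length_penalized` and the base weight of `1.0` are assumptions for illustration; they are not taken from `search_index.rb`.

```ruby
# Hypothetical helper (not from the PR): scales a base weight by 0.95 for
# every character in the name, mirroring `0.95 ** rdoc_object.name.length`.
def length_penalized(weight, name)
  weight * 0.95**name.length
end

short = length_penalized(1.0, "find_by")     # 7 characters
long  = length_penalized(1.0, "find_by_sql") # 11 characters

puts format("find_by:     %.4f", short)
puts format("find_by_sql: %.4f", long)
puts short > long # the shorter name keeps more of its weight
```

With equal starting weights, the shorter method name always ends up with the larger value, which is the prioritization the code comment describes.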
Can't we also favor objects with a more "sophisticated" description? For instance, I always need to check the documentation for `ActionController::Rendering#render`, but there are a dozen `#render` methods that I don't care about. Some of them are ranked higher but have a very succinct description.
That's an interesting idea! It's tricky though, because the bonus is per bigram instead of per entry. And in cases like `ActiveRecord::FinderMethods#find_by` vs `ActiveRecord::Querying#find_by_sql`, we definitely want "find_by" to find `#find_by`, but `#find_by_sql` has much longer documentation.
I've pushed a commit that tries to balance the constraints, but it might be overtuned to the specific examples I used. Please play around with it and let me know if you encounter any odd results!
Force-pushed from dae979b to d00ca33.
Prior to this commit, SDoc's search algorithm was implemented by [`searcher.js`][]. `searcher.js` builds a regular expression for each token in the query. For example, the query "foo bar" generates the regular expressions `/([f])([^f]*?)([o])([^o]*?)([o])([^o]*?)/i` and `/([b])([^b]*?)([a])([^a]*?)([r])([^r]*?)/i`. These regular expressions fuzzy match missing letters, but fail for any other kind of typo, such as added letters or swapped letters. They can also produce surprising results. For example, the query "ActiveRecord::Base" returns `ActiveRecord::AttributeAssignment` as the top result due to matching "activerecord" attri"b"ute"a"s"s"ignm"e"nt, and there are six(!) other results that appear before `ActiveRecord::Base`.

This commit implements a new search algorithm based on character-level bigrams. For example, the query "foo bar" will look for results that match "fo", "oo", "o ", " b", "ba", and "ar". Shorthand bigrams for CamelCase names are also included in the search index. For example, entries containing "ActiveRecord" are also associated with the bigram "ar". Bigrams are weighted such that some contribute more to the match score, and results are ordered by match score.
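
As a rough illustration of the bigram idea described above (not SDoc's actual indexing code), the following sketch extracts character-level bigrams from a string and derives the CamelCase shorthand. The method names `bigrams` and `camel_case_shorthand` are hypothetical, and lowercasing before comparison is an assumption.

```ruby
# Illustrative only: character-level bigrams of a string.
# Lowercasing is an assumption made for this sketch.
def bigrams(text)
  text.downcase.each_char.each_cons(2).map(&:join)
end

# Hypothetical shorthand: bigrams built from the capital letters of a
# CamelCase name, so that "ActiveRecord" also yields "ar".
def camel_case_shorthand(name)
  bigrams(name.scan(/[A-Z]/).join)
end

p bigrams("foo bar")                   # => ["fo", "oo", "o ", " b", "ba", "ar"]
p camel_case_shorthand("ActiveRecord") # => ["ar"]

# A swapped-letter typo still shares several bigrams with the intended name,
# which is why a bigram index can tolerate it where the regexp cannot.
p bigrams("foriegn") & bigrams("foreign_key") # => ["fo", "or", "gn"]
```

The sketch covers only bigram extraction. Weighting and scoring are left out because the text above does not specify them.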

Here are some example queries and their top results with rails/rails@7c65a4b both before and after this commit:

* "ActiveRecord::Base"
  * top result before: `ActiveRecord::AttributeAssignment`
  * top result after: `ActiveRecord::Base`
* "ar base"
  * top result before: `Rails::Generators::Testing::Behavior::ClassMethods#arguments`
  * top result after: `ActiveRecord::Base`
* "hasmany"
  * top result before: `ActiveRecord::Associations::ClassMethods#has_and_belongs_to_many`
  * top result after: `ActiveRecord::Associations::ClassMethods#has_many`
* "adress"
  * top result before: `ActiveSupport::HashWithIndifferentAccess`
  * top result after: `Mail::Address`
* "existance"
  * top result before: no results
  * top result after: `Pathname#existence`
* "foriegn"
  * top result before: no results
  * top result after: `String#foreign_key`

This commit also redesigns the presentation of search results. Prior to this commit, result names were cut off at ~43 characters, and result descriptions were cut off at ~53 characters. Result descriptions also included headings, further reducing the relevant visible text. For example, the visible description for `ActionCable::Connection::Base`, which has the heading "Action Cable Connection Base", was "Action Cable Connection Base For every WebSocket". In addition, result descriptions included code blocks, which were then mangled by `Searchdoc.Panel`'s `stripHTML` function. For example, the description for `ActiveModel::API::new` was

```html
<p>Initializes a new model with the given <code>params</code>.
<pre><code>class Person
  include ActiveModel::API
  attr_accessor ...
</code></pre>
```

which was transformed to

```
Initializes a new model with the given params.
<codeclass Person
  include ActiveModel::API
  attr_accessor ...
</pre
```

With this commit, search results now always display the full name. Result descriptions are also fully displayed, including non-link HTML, and now consist of (up to) the first 130 characters of the leading paragraph of the RDoc comment.
For example, the description of `ActiveModel::API::new` becomes "Initializes a new model with the given <code>params</code>."

[`searcher.js`]: https://github.com/ruby/rdoc/blob/v6.5.0/lib/rdoc/generator/template/json_index/js/searcher.js
Force-pushed from d00ca33 to f940e7d.
If you find any queries that give unexpected results, please share, and I will see if they can be improved.