Searches that don't work well #482

aslagle · 2017-03-23T14:09:30Z

I'm making this issue to keep track of specific searches that don't work well.

Here's a report from an app review:
"I looked up books by M.J. Rose and it gave me one, but when I looked up specific titles by the author they had them! I looked up Steve Berry, it gave me a few of his books and then others by authors whose names were not even close to it."

jbdalton · 2017-03-24T20:05:37Z

"python programming"

Lots of good results in the first two dozen, then hits a bumpy long tail with matches on titles "Phthor" and "Pythias" sorting above titles like "Think Python: How to think like a Computer," "Python Data Science Handbook," etc.

jbdalton · 2017-03-27T20:04:19Z

Looking more closely at "Steve Berry" and other search examples, I think the following changes, combined or separately tried, might help filter out some of the louder noise in the Elasticsearch results.

Change the fuzzy_prefix_length from 0 to 1.

A fuzzy_prefix_length=1 will stop "tennis" from matching Dennis Rodman or the title "Kenny" (currently ranked at 14 in a "tennis" search). It requires the first character of a term to match exactly, so fuzzy matching applies only to the rest of the term; the effect is similar to forcing an edge n-gram.

Change the maximum fuzziness parameter to 1 from its AUTO setting.

This will stop "tennis" from matching on the author "Shyalpa Tenzin Rinpoche." And vice versa, a search for "Tenzin" won't return the non-relevant "tennis," since terms longer than 5 characters will be allowed a Levenshtein distance of only 1 edit from the original input, instead of the 2 edits allowed when fuzziness is set to AUTO.
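To make the effect of the two settings concrete, here is a minimal pure-Python sketch. It is not Elasticsearch's implementation: it uses plain Levenshtein distance on whole terms (no transpositions, no analyzers), and the helper names `levenshtein`, `auto_max_edits`, and `fuzzy_match` are made up for illustration. The AUTO thresholds follow the documented defaults (0 edits for 1-2 characters, 1 edit for 3-5, 2 edits for longer terms).

```python
def levenshtein(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def auto_max_edits(term):
    """Approximation of the AUTO setting: allowed edits grow with term length."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_match(query, candidate, max_edits=None, prefix_length=0):
    """Hypothetical helper: would `candidate` be accepted as a fuzzy match?"""
    query, candidate = query.lower(), candidate.lower()
    if max_edits is None:  # None stands in for AUTO
        max_edits = auto_max_edits(query)
    # The first `prefix_length` characters must match exactly.
    if query[:prefix_length] != candidate[:prefix_length]:
        return False
    return levenshtein(query, candidate) <= max_edits


# Current settings (AUTO, prefix length 0) accept both noisy matches.
print(fuzzy_match("tennis", "dennis"))                   # one substitution
print(fuzzy_match("tennis", "tenzin"))                   # two substitutions
# Proposed settings reject them.
print(fuzzy_match("tennis", "dennis", prefix_length=1))
print(fuzzy_match("tennis", "tenzin", max_edits=1))
```

Either change alone removes one class of noise: the prefix requirement handles "Dennis" and "Kenny", and the lower edit limit handles "Tenzin".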

Reviewing some of the options for the tokenizers being used when indexing might help weed out other tangential query results. For example, from https://www.elastic.co/blog/found-fuzzy-search:

If one were to use a fuzzy query over an ngram analyzed field, the results would likely be bizarre, as ngrams break words up into many small letter combinations, many of which are only an edit or two away, though the actual words involved are quite dissimilar.
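As a rough illustration of the point in that quote, the sketch below lists the character trigrams that dissimilar words share. It assumes simple lowercase trigrams; the actual tokenizer settings in our index may differ, and `char_ngrams` and `shared_ngrams` are hypothetical helpers.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams in a word (lowercased)."""
    word = word.lower()
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def shared_ngrams(a, b, n=3):
    """N-grams that two words have in common."""
    return char_ngrams(a, n) & char_ngrams(b, n)


# Words that look unrelated still overlap once broken into trigrams.
for query, title_word in [("tennis", "Kenny"),
                          ("python", "Phthor"),
                          ("python", "Pythias")]:
    print(query, title_word, sorted(shared_ngrams(query, title_word)))
```

Even before fuzziness is applied, "tennis" and "Kenny" already share a trigram, so allowing an edit or two on each n-gram widens the overlap further.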

leonardr · 2017-04-14T16:30:37Z

Searches for book titles that include genre names don't work well. For example, searching for "modern romance" turns up several pages of romance books before it turns up the book called Modern Romance. The romance books generally have "modern" in their subtitle and/or series names.

leonardr · 2017-11-16T18:22:53Z

The searches "the mountain man", "mountain man", and "law of the mountain man" get dramatically different results.

leonardr · 2017-11-28T16:02:47Z

Searching for "game of thrones" generally gives good results but exposes a number of issues:

'A Game of Thrones: The Graphic Novel, Volume 4' shows up before 'A Game of Thrones', probably because it mentions "game of thrones" in the summary. An 88% title match should be ranked higher than a 33% title match plus a summary match.
Result #9, "Skyborn", has the subtitle "Thrones and Bones Series, Book 3" and its description mentions the word "game" twice. This should not take precedence over a book whose subtitle or description contains the complete phrase "game of thrones".
Result #10, "Pokemon Black and White Walkthrough,Ultımate Game Guides", seems to show up because both the title and author include the word "game". "thrones" is nowhere to be found, but the result is weighted as an author search.
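The 88% and 33% figures above are consistent with dividing the length of the query by the length of the title (15/17 and 15/46 characters). A minimal sketch of that measure, assuming a case-insensitive substring match and a hypothetical `title_coverage` helper rather than anything in the current search code:

```python
def title_coverage(query, title):
    """Fraction of the title's characters covered by the query phrase.

    Returns 0.0 when the query does not appear in the title at all.
    """
    q = query.lower().strip()
    t = title.lower().strip()
    if not q or q not in t:
        return 0.0
    return len(q) / len(t)


titles = [
    "A Game of Thrones",
    "A Game of Thrones: The Graphic Novel, Volume 4",
    "Skyborn",
]
# Sort so that the title most fully covered by the query comes first.
for title in sorted(titles, key=lambda t: title_coverage("game of thrones", t),
                    reverse=True):
    print(round(title_coverage("game of thrones", title), 2), title)
```

A score like this could be one input to ranking, so that a near-complete title match outranks a partial title match that is boosted by a summary match.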

leonardr · 2017-11-28T16:06:41Z

The first result for "girl on the train" is a short story called "The Girl in the Train". Stopwords are relevant in an exact title match.
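A small sketch of why stopwords matter here. The stopword set is a tiny illustrative subset, not the list our analyzer uses, and `tokens`, `content_tokens`, and `contains_phrase` are hypothetical helpers.

```python
import re

# Illustrative subset only; the real analyzer's stopword list is an assumption.
STOPWORDS = {"a", "an", "the", "in", "on", "of"}


def tokens(text):
    """Lowercased word tokens, stopwords kept."""
    return re.findall(r"[a-z0-9]+", text.lower())


def content_tokens(text):
    """Word tokens with stopwords removed."""
    return [t for t in tokens(text) if t not in STOPWORDS]


def contains_phrase(title, query):
    """True if the query's tokens appear contiguously in the title, stopwords included."""
    t, q = tokens(title), tokens(query)
    if not q:
        return False
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


query = "girl on the train"
for title in ["The Girl in the Train", "The Girl on the Train"]:
    print(title,
          content_tokens(title) == content_tokens(query),  # same once stopwords are dropped
          contains_phrase(title, query))                   # differs when they are kept
```

With stopwords removed, both titles reduce to the same two tokens and cannot be told apart; a phrase check that keeps them separates the two.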

leonardr · 2017-11-28T16:32:09Z

Amy suggests rearchitecting search to use a suggester rather than fixing these problems by changing the search query.

leonardr · 2017-12-01T15:09:09Z

Searching for an age range like "age 3-5" privileges works that fit that age range, but it also introduces titles that match '3' or '5', e.g. in a subtitle that says "X Series Volume 3". This can introduce titles that are not appropriate for the age range. It's worth investigating whether we should short-circuit our usual technique of treating a search query like "age 3-5" or "romance" as either an advanced search directive or a normal full-text search term.
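One way to picture the short-circuit: pull a recognized age-range directive out of the query so its digits never reach the full-text match. This is only a sketch under assumptions; the regular expression and the `split_age_directive` helper are hypothetical and not how the query parser currently works.

```python
import re

# Matches "age 3-5" or "ages 3 - 5"; the accepted syntax is an assumption.
AGE_RANGE = re.compile(r"\bages?\s+(\d{1,2})\s*-\s*(\d{1,2})\b", re.IGNORECASE)


def split_age_directive(query):
    """Return ((low, high), remaining_text), or (None, query) if no range is found."""
    match = AGE_RANGE.search(query)
    if match is None:
        return None, query.strip()
    age_range = (int(match.group(1)), int(match.group(2)))
    # Drop the directive so "3" and "5" are not searched as full-text terms.
    remainder = query[:match.start()] + " " + query[match.end():]
    return age_range, " ".join(remainder.split())


print(split_age_directive("age 3-5"))
print(split_age_directive("dinosaurs age 3-5"))
print(split_age_directive("romance"))
```

The age range would then be applied as a filter only, and whatever text remains would go through the normal full-text search.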

leonardr · 2018-01-10T17:38:45Z

I'm in a position where I can directly compare the new search algorithm to the old one, so I'm giving an update on how we're doing so far and how much of the improvement is due to the new algorithm.

"m. j. rose" and "m j rose" find numerous works by M. J. Rose, but "m.j. rose" and "mj rose" only find works with 'rose' in the title. This one is greatly improved but needs a little tweak, I think.

"steve berry" privileges every work by Steve Berry over a work by anyone else. This one looks solved.

"python programming" still seems to have a long-tail problem: in the twenties, books about Python programming begin to be mixed in with other books like "Burmese Pythons" and "Chthon". The new algorithm performs slightly better than the old one, though that might be due to sharding differences.

"tennis" still gets "Kenny", "Tenzin" and other bad fuzzy matches.

In the new algorithm, "modern romance" privileges two Romance titles in the 'Modern Battles' series over "Modern Romance", but "Modern Romance" is the third result, and it's nowhere to be seen in the old algorithm.

In the new algorithm, "the mountain man" privileges works with "the mountain man" in the title over works with "mountain man" in the series, which is better than the old algorithm, which worked the other way around.

Both algorithms treat "mountain man" more or less the same, privileging works with "mountain man" in the title.

The old algorithm handles "law of the mountain man" by putting up 19 bad results (classified under Law) and then the title "Law of the Mountain Man". The new algorithm handles "law of the mountain man" by putting "Law of the Mountain Man" first with the 19 bad results under it. This is a significant improvement, but most people would expect other "Mountain Man" titles to occupy the subsequent results.

Both algorithms perform well on "a game of thrones". The new algorithm has the ebook of "A Game of Thrones" as the first result, instead of one of the graphic novels, but for some reason doesn't include the graphic novel at all. Both algorithms start introducing irrelevant works around result #10.

The new algorithm performs significantly worse on "girl on the train", to the point where I suspect there's some other problem with the dataset. Neither "The Girl In the Train" nor "The Girl On the Train" shows up.
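On the "m.j. rose" variants earlier in this update: one possible tweak is to normalize initials before searching, so that all four spellings become the same query. A minimal sketch, assuming the same normalization would also be applied to author names at index time; `normalize_author_query` is a hypothetical helper, not part of the new algorithm.

```python
def normalize_author_query(query):
    """Collapse runs of single-letter initials into one token.

    "m. j. rose", "m j rose", "m.j. rose" and "mj rose" all become "mj rose".
    """
    words = query.lower().replace(".", " ").split()
    merged = []
    initials = ""
    for word in words:
        if len(word) == 1 and word.isalpha():
            initials += word  # keep collecting adjacent initials
            continue
        if initials:
            merged.append(initials)
            initials = ""
        merged.append(word)
    if initials:
        merged.append(initials)
    return " ".join(merged)


for variant in ["m. j. rose", "m j rose", "m.j. rose", "mj rose"]:
    print(repr(variant), "->", repr(normalize_author_query(variant)))
```

A lone single-letter word such as the "a" in "a game of thrones" is left as its own token, so ordinary title queries pass through unchanged.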

leonardr · 2018-01-10T18:54:39Z

Rebuilding my test search index fixes "a game of thrones" and "girl on the train". The GoT graphic novel now shows up in position #9, just above the Pokemon guide, and "The Girl on the Train" now shows up as well as (and above) "The Girl in the Train" and "The Girl from the Train".

leonardr · 2018-01-10T19:01:28Z

In general, if you search for a specific title/author/series, you now get a bunch of really good results and then suddenly the results become awful. There is an abrupt dropoff in result quality.

If you search for a topic like "python programming", you start off with good results and eventually see poor-quality results being mixed in with the good ones. Here the dropoff in result quality is gradual rather than abrupt.

If you search for a single word you're likely to get fuzz errors.

So overall, an improvement, but obviously not the last word in search.

leonardr · 2018-01-10T19:17:36Z

I figured out why items were missing from my index. Suffice it to say that the cause was user error, not a problem in the code, and rebuilding the index was the right thing to do.

leonardr · 2018-03-16T19:09:37Z

cf. #443

leonardr · 2018-03-16T19:12:28Z

cf. #159

leonardr mentioned this issue Dec 1, 2017

Full title match is better than split match #739

Merged

leonardr mentioned this issue Mar 16, 2018

Investigate improving search for "the girls" #326

Closed
