This repository has been archived by the owner on Apr 11, 2022. It is now read-only.

Searches that don't work well #482

Open
aslagle opened this issue Mar 23, 2017 · 14 comments

Comments

@aslagle
Collaborator

aslagle commented Mar 23, 2017

I'm making this issue to keep track of specific searches that don't work well.

Here's a report from an app review:
"I looked up books by M.J. Rose and it gave me one, but when I looked up specific titles by the author they had them! I looked up Steve Berry, it gave me a few of his books and then others by authors whose names were not even close to it."

@jbdalton

"python programming"

Lots of good results in the first two dozen, then hits a bumpy long tail with matches on titles "Phthor" and "Pythias" sorting above titles like "Think Python: How to think like a Computer," "Python Data Science Handbook," etc.

@jbdalton

Looking more closely at "Steve Berry" and other search examples, I think the following changes, combined or separately tried, might help filter out some of the louder noise in the Elasticsearch results.

  • Change the fuzzy_prefix_length from 0 to 1.

    A fuzzy_prefix_length=1 will stop "tennis" from matching Dennis Rodman or the title "Kenny" (currently ranked at 14 in a "tennis" search). It requires the first character of a term to match exactly, so fuzzy matching only applies to the rest of the term.

  • Change the maximum fuzziness parameter to 1 from its AUTO setting.

    This will stop "tennis" from matching on the author "Shyalpa Tenzin Rinpoche." And vice versa, a search for "Tenzin" won't return the irrelevant "tennis," since matches will be limited to a Levenshtein distance of 1 from the original input for strings longer than 5 characters, instead of the 2 edits allowed when it is set to AUTO.
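To make the effect of the two settings concrete, here is a rough Python simulation. It is only a sketch: it uses plain Levenshtein distance (Elasticsearch also counts transpositions by default), the helper names are made up for illustration, and the `title` field in the query body is an assumption about the index. In a `match` query the prefix setting is spelled `prefix_length`.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def auto_fuzziness(term):
    """Edit budget for fuzziness=AUTO: 0 for 1-2 chars, 1 for 3-5, 2 for longer."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_matches(query_term, indexed_term, max_edits, prefix_length):
    """Hypothetical helper: would a fuzzy query accept this indexed term?"""
    # The first prefix_length characters must match exactly.
    if query_term[:prefix_length] != indexed_term[:prefix_length]:
        return False
    return levenshtein(query_term, indexed_term) <= max_edits


# Current behaviour: AUTO fuzziness (2 edits for "tennis"), no fixed prefix.
current = auto_fuzziness("tennis")
print(fuzzy_matches("tennis", "dennis", current, 0))  # accepted today
print(fuzzy_matches("tennis", "tenzin", current, 0))  # accepted today

# Proposed behaviour: one edit at most, first character fixed.
print(fuzzy_matches("tennis", "dennis", 1, 1))  # rejected by the prefix rule
print(fuzzy_matches("tennis", "tenzin", 1, 1))  # rejected by the edit budget

# Assumed shape of a query body using both settings ("title" is a guess).
query_body = {
    "query": {
        "match": {
            "title": {"query": "tennis", "fuzziness": 1, "prefix_length": 1}
        }
    }
}
```

Each change removes a different false positive: the prefix rule handles "Dennis", and the smaller edit budget handles "Tenzin".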

Reviewing some of the options for the tokenizers being used when indexing might help weed out other tangential query results. For example, from https://www.elastic.co/blog/found-fuzzy-search:

If one were to use a fuzzy query over an ngram analyzed field, the results would likely be bizarre, as ngrams break words up into many small letter combinations, many of which are only an edit or two away, though the actual words involved are quite dissimilar.
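A small sketch of what that quote describes, using the "python" vs. "Phthor" case from above. The trigram size and the helper names are assumptions for illustration; I have not checked which n-gram settings our index actually uses.

```python
def char_ngrams(word, n=3):
    """All character n-grams of a word, lowercased."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def positions_differing(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def near_gram_pairs(word_a, word_b, n=3, max_diff=1):
    """Pairs of n-grams, one from each word, within max_diff substitutions."""
    return [
        (gram_a, gram_b)
        for gram_a in char_ngrams(word_a, n)
        for gram_b in char_ngrams(word_b, n)
        if positions_differing(gram_a, gram_b) <= max_diff
    ]


pairs = near_gram_pairs("python", "phthor")
print(pairs)
# [('pyt', 'pht'), ('yth', 'hth'), ('tho', 'tho'), ('hon', 'hor')]
```

Every trigram of "python" has a counterpart in "phthor" at most one substitution away, even though the two words are three edits apart overall.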

@leonardr
Contributor

leonardr commented Apr 14, 2017

Searches for book titles that include genre names don't work well. For example, searching for "modern romance" turns up several pages of romance books before it turns up the book called Modern Romance. The romance books generally have "modern" in their subtitle and/or series names.

@leonardr
Contributor

The searches "the mountain man", "mountain man", and "law of the mountain man" get dramatically different results.

@leonardr
Contributor

leonardr commented Nov 28, 2017

Searching for "game of thrones" generally gives good results but exposes a number of issues:

  • 'A Game of Thrones: The Graphic Novel, Volume 4' shows up before 'A Game of Thrones', probably because it mentions "game of thrones" in the summary. An 88% title match should be ranked higher than a 33% title match plus a summary match.
  • Result #9, "Skyborn" has subtitle "Thrones and Bones Series, Book 3" and description mentions the word "game" twice. This should not take precedence over a book whose subtitle or description contains the complete phrase "game of thrones".
  • Result #10, "Pokemon Black and White Walkthrough,Ultımate Game Guides" seems to show up because both the title and author include the word "game". "thrones" is nowhere to be found, but the result is weighted as an author search.
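For reference, the 88% and 33% figures in the first bullet are consistent with dividing the length of the query by the length of the title. Here is a sketch of ranking on that ratio alone. The `title_coverage` helper is hypothetical and is not how Elasticsearch scores anything; it only illustrates the ordering I would expect.

```python
def title_coverage(query, title):
    """Fraction of the title's characters covered by the query phrase.

    Returns 0.0 when the query does not appear in the title as a phrase.
    """
    query, title = query.lower(), title.lower()
    if query not in title:
        return 0.0
    return len(query) / len(title)


def rank_by_title_coverage(query, titles):
    """Order titles so the one the query covers most completely comes first."""
    return sorted(titles, key=lambda title: title_coverage(query, title), reverse=True)


titles = [
    "A Game of Thrones: The Graphic Novel, Volume 4",
    "Skyborn",
    "A Game of Thrones",
]
for title in rank_by_title_coverage("game of thrones", titles):
    print(f"{title_coverage('game of thrones', title):.0%}  {title}")
```

A real fix would need to fold this signal into the overall score next to summary and author matches rather than replace them.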

@leonardr
Contributor

The first result for "girl on the train" is a short story called "The Girl in the Train". Stopwords are relevant in an exact title match.
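A quick illustration of why, assuming stopwords are stripped at analysis time (I have not confirmed that this is what our analyzer does). The stopword set below is a small hand-picked subset, and the exact-match helper is hypothetical.

```python
import re

# Illustrative subset of English stopwords, not the analyzer's real list.
STOPWORDS = {"a", "an", "the", "in", "on", "of"}
LEADING_ARTICLES = {"a", "an", "the"}


def tokens(text):
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def without_stopwords(text):
    """Tokens that survive stopword removal."""
    return [token for token in tokens(text) if token not in STOPWORDS]


def is_exact_title_match(query, title):
    """Compare with stopwords kept, ignoring only a leading article."""
    def strip_leading_article(words):
        return words[1:] if words and words[0] in LEADING_ARTICLES else words

    return strip_leading_article(tokens(query)) == strip_leading_article(tokens(title))


query = "girl on the train"
for title in ("The Girl on the Train", "The Girl in the Train"):
    same_after_stopping = without_stopwords(query) == without_stopwords(title)
    print(title, same_after_stopping, is_exact_title_match(query, title))
```

Once "on" and "in" are stripped, both titles look identical to the query. Only a comparison that keeps the stopwords can tell them apart.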

@leonardr
Contributor

Amy suggests rearchitecting search to use a suggester rather than fixing these problems by changing the search query.

@leonardr
Contributor

leonardr commented Dec 1, 2017

Searching for an age range like "age 3-5" privileges works that fit that age range, but it also introduces titles that match '3' or '5', e.g. in a subtitle that says "X Series Volume 3". This can introduce titles that are not appropriate for the age range. It's worth investigating whether we should short-circuit our usual technique of treating a search query like "age 3-5" or "romance" as either an advanced search directive or a normal full-text search term.
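One possible shape for that short-circuit, as a sketch only. The regular expression and the `split_age_directive` helper are hypothetical and are not taken from our query parser.

```python
import re

# Matches "age 3-5" or "ages 3 - 5"; the accepted syntax is an assumption.
AGE_RANGE = re.compile(r"\bages?\s+(\d+)\s*-\s*(\d+)\b", re.IGNORECASE)


def split_age_directive(query):
    """Split a query into an (lower, upper) age range and the leftover text.

    Returns (None, query) when no age range is present.
    """
    match = AGE_RANGE.search(query)
    if match is None:
        return None, query.strip()
    age_range = (int(match.group(1)), int(match.group(2)))
    leftover = query[:match.start()] + " " + query[match.end():]
    return age_range, " ".join(leftover.split())


for query in ("age 3-5", "age 3-5 dinosaurs", "romance"):
    age_range, text = split_age_directive(query)
    if age_range and not text:
        print(query, "-> filter only on", age_range)
    elif age_range:
        print(query, "-> filter on", age_range, "plus full-text search for", repr(text))
    else:
        print(query, "-> full-text search for", repr(text))
```

With this split, the digits in "age 3-5" never reach the full-text query, so "X Series Volume 3" would no longer match on '3'.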

@leonardr
Contributor

leonardr commented Jan 10, 2018

I'm in a position where I can directly compare the new search algorithm to the old one, so I'm giving an update on how we're doing so far and how much of the improvement is due to the new algorithm.

"m. j. rose" and "m j rose" find numerous works by M. J. Rose, but "m.j. rose" and "mj rose" only find works with 'rose' in the title. This one is greatly improved but needs a little tweak, I think.
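One candidate for that tweak, sketched in Python: normalize initials the same way at index time and at query time so all four spellings collapse to one form. The `normalize_initials` helper is hypothetical, and I have not tested the idea against the real analyzer.

```python
import re


def normalize_initials(name):
    """Collapse runs of single-letter tokens so that every spelling of
    an author's initials produces the same string."""
    words = re.sub(r"\.", " ", name.lower()).split()
    normalized = []
    initials = ""
    for word in words:
        if len(word) == 1 and word.isalpha():
            initials += word          # keep collecting a run of initials
            continue
        if initials:
            normalized.append(initials)
            initials = ""
        normalized.append(word)
    if initials:
        normalized.append(initials)
    return " ".join(normalized)


for spelling in ("m. j. rose", "m j rose", "m.j. rose", "mj rose", "M. J. Rose"):
    print(repr(spelling), "->", repr(normalize_initials(spelling)))
```

All five spellings come out as 'mj rose', so an author field indexed with the same normalization would match any of them.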

"steve berry" privileges every work by Steve Berry over a work by anyone else. This one looks solved.

"python programming" still seems to have a long-tail problem: in the twenties, books about Python programming begin to be mixed in with other books like "Burmese Pythons" and "Chthon". The new algorithm performs slightly better than the old one, though that might be due to sharding differences.

"tennis" still gets "Kenny", "Tenzin" and other bad fuzzy matches.

In the new algorithm, "modern romance" privileges two Romance titles in the 'Modern Battles' series over "Modern Romance", but "Modern Romance" is the third result, and it's nowhere to be seen in the old algorithm.

In the new algorithm, "the mountain man" privileges works with "the mountain man" in the title over works with "mountain man" in the series, which is better than the old algorithm, which worked the other way around.

Both algorithms treat "mountain man" more or less the same, privileging works with "mountain man" in the title.

The old algorithm handles "law of the mountain man" by putting up 19 bad results (classified under Law) and then the title "Law of the Mountain Man". The new algorithm handles "law of the mountain man" by putting "Law of the Mountain Man" first with the 19 bad results under it. This is a significant improvement, but most people would expect other "Mountain Man" titles to occupy the subsequent results.

Both algorithms perform well on "a game of thrones". The new algorithm has the ebook of "A Game of Thrones" as the first result, instead of one of the graphic novels, but for some reason doesn't include the graphic novel at all. Both algorithms start introducing irrelevant works around result #10.

The new algorithm performs significantly worse on "girl on the train", to the point where I suspect there's some other problem with the dataset. Neither "The Girl In the Train" nor "The Girl On the Train" shows up.

@leonardr
Contributor

Rebuilding my test search index fixes "a game of thrones" and "girl on the train". The GoT graphic novel now shows up in position #9, just above the Pokemon guide, and "The Girl on the Train" now shows up as well as (and above) "The Girl in the Train" and "The Girl from the Train".

@leonardr
Contributor

leonardr commented Jan 10, 2018

In general, if you search for a specific title/author/series, you now get a bunch of really good results and then suddenly the results become awful. There is an abrupt dropoff in result quality.

If you search for a topic like "python programming", you start off with good results and eventually see poor-quality results being mixed in with the good ones. Here the dropoff in result quality is gradual rather than abrupt.

If you search for a single word you're likely to get fuzz errors.

So overall, an improvement, but obviously not the last word in search.

@leonardr
Contributor

I figured out why items were missing from my index. Suffice to say the cause was user error, not a problem in the code, and rebuilding the index was the right thing to do.

@leonardr
Contributor

cf. #443

@leonardr
Contributor

cf. #159
