removal_least_used parameter is improperly used in document-similarity-s1-rank_filter #430

Open
marekhorst opened this issue Dec 4, 2018 · 0 comments
marekhorst commented Dec 4, 2018

After comparing the two ranking scripts, document-similarity-s1-rank_filter.pig and document-similarity-s1-ship-rank_filter.pig, and inspecting document-similarity-s1-rank_filter.pig more closely, it seems that removal_least_used is used improperly: it should be compared against the number of referenced docs ($1) instead of the rank position ($0).

Because of this bug, far fewer terms are currently filtered out than intended. In most cases only terms referenced once are discarded, because the rank index is not dense and there are almost always more than 20 terms with a single document reference. In the current OpenAIRE documents similarity configuration removal_least_used is set to 20, so all terms referenced in fewer than 20 documents should be filtered out.
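
To make the effect concrete, here is a small Python sketch of the two comparisons. It is not the Pig script itself: the function names, the sample terms and the exact comparison operators (`>` on the rank, `>=` on the document count) are assumptions made for illustration. It only models a non-dense rank computed in ascending order of document count, as described above.

```python
def sparse_rank(doc_counts):
    """Non-dense rank (1, 1, 1, 4, ...) in ascending order of document count.

    Tied terms share a rank and the next distinct count jumps to its
    position in the sorted list, so the rank sequence has gaps.
    """
    ordered = sorted(doc_counts.items(), key=lambda kv: kv[1])
    ranks = {}
    prev_count = None
    prev_rank = 0
    for position, (term, count) in enumerate(ordered, start=1):
        if count != prev_count:
            prev_rank = position
            prev_count = count
        ranks[term] = prev_rank
    return ranks


def keep_by_rank(doc_counts, removal_least_used):
    """Buggy variant (assumed form): threshold compared to the rank position."""
    ranks = sparse_rank(doc_counts)
    return {term for term in doc_counts if ranks[term] > removal_least_used}


def keep_by_doc_count(doc_counts, removal_least_used):
    """Intended variant: threshold compared to the number of referencing docs."""
    return {term for term, count in doc_counts.items()
            if count >= removal_least_used}


# Hypothetical sample: 25 terms seen in one document, plus a few others.
doc_counts = {f"single_{i}": 1 for i in range(25)}
doc_counts.update({"rare": 2, "mid": 19, "common": 20, "frequent": 50})

kept_buggy = keep_by_rank(doc_counts, 20)
kept_fixed = keep_by_doc_count(doc_counts, 20)
print(sorted(kept_buggy))
print(sorted(kept_fixed))
```

With this sample all 25 single-reference terms share rank 1, so the term referenced twice already sits at rank 26 and passes the rank-based check, while the count-based check keeps only the terms referenced in at least 20 documents.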

@marekhorst marekhorst self-assigned this Dec 4, 2018
marekhorst added a commit that referenced this issue Dec 4, 2018