After comparing the two ranking scripts, `document-similarity-s1-rank_filter.pig` and `document-similarity-s1-ship-rank_filter.pig`, and inspecting `document-similarity-s1-rank_filter.pig` more closely, it seems that `removal_least_used` is used improperly: it should be compared against the number of documents referencing the term (`$1`) instead of the rank position (`$0`).

Because of this bug, far fewer terms are currently filtered out than intended. In most cases only terms referenced once are discarded, because the rank index is not dense and there are almost always more than 20 terms with a single document reference. In the current OpenAIRE document similarity configuration `removal_least_used` was set to 20, so all terms referenced in fewer than 20 documents should be filtered out.
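
To illustrate the effect, here is a minimal Python sketch of the two filters. It is not the Pig script itself: the helper names, the toy data, and the exact comparison operators (`>` on the rank, `>=` on the document count) are assumptions made for illustration. The only things taken from the report are the non-dense rank, the threshold of 20, and the column meanings (`$0` is the rank position, `$1` is the document count).

```python
REMOVAL_LEAST_USED = 20  # threshold from the configuration described above


def competition_rank(doc_counts):
    """Rank terms by ascending document count with a non-dense rank.

    Ties share a rank and the next distinct count gets its 1-based position,
    so ranks have gaps. Returns rows shaped like (rank, doc_count, term),
    mirroring ($0, $1, ...) in the report.
    """
    ordered = sorted(doc_counts.items(), key=lambda kv: kv[1])
    rows = []
    prev_count = None
    rank = 0
    for position, (term, count) in enumerate(ordered, start=1):
        if count != prev_count:
            rank = position
            prev_count = count
        rows.append((rank, count, term))
    return rows


def keep_by_rank(rows, threshold=REMOVAL_LEAST_USED):
    # Reported behaviour (assumed form): threshold compared to the rank position.
    return [term for rank, _count, term in rows if rank > threshold]


def keep_by_doc_count(rows, threshold=REMOVAL_LEAST_USED):
    # Intended behaviour: threshold compared to the number of documents.
    return [term for _rank, count, term in rows if count >= threshold]


# Hypothetical toy data: 30 terms seen once, 5 terms seen 3 times, 2 seen 25 times.
doc_counts = {f"rare{i}": 1 for i in range(30)}
doc_counts.update({f"mid{i}": 3 for i in range(5)})
doc_counts.update({f"common{i}": 25 for i in range(2)})

rows = competition_rank(doc_counts)
kept_buggy = keep_by_rank(rows)
kept_fixed = keep_by_doc_count(rows)
```

In this toy data the 30 single-reference terms all share rank 1, so the next rank is already 31. Filtering on the rank therefore drops only the single-reference terms and keeps the terms referenced 3 times, while filtering on the document count keeps only the two terms referenced in at least 20 documents.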