Fix DenseRetrievalExactSearch evaluation (#154)

base: main
@@ -28,6 +28,7 @@ def search(self,
                top_k: int,
                score_function: str,
                return_sorted: bool = False,
+               ignore_identical_ids: bool = True,
                **kwargs) -> Dict[str, Dict[str, float]]:
         # Create embeddings for all queries using model.encode_queries()
         # Runs semantic search against the corpus embeddings
@@ -45,6 +46,9 @@ def search(self,
         logger.info("Sorting Corpus by document length (Longest first)...")

         corpus_ids = sorted(corpus, key=lambda k: len(corpus[k].get("title", "") + corpus[k].get("text", "")), reverse=True)
+        if ignore_identical_ids:
+            # We remove the query from results if it exists in corpus
+            corpus_ids = [cid for cid in corpus_ids if cid not in query_ids]
         corpus = [corpus[cid] for cid in corpus_ids]

         logger.info("Encoding Corpus in batches... Warning: This might take a while!")
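The filtering added in the hunk above can be sketched in isolation. This is a minimal stand-alone example, not the BEIR implementation itself; `corpus`, `query_ids`, and `ignore_identical_ids` are toy stand-ins for the values inside `search()`:

```python
# Toy corpus: the query "q1" also appears as a document, as in Quora-style
# duplicate-question datasets.
corpus = {
    "q1": {"title": "", "text": "what is a duplicate question"},
    "d1": {"title": "Dup", "text": "is this question a duplicate"},
    "d2": {"title": "", "text": "unrelated"},
}
query_ids = ["q1"]
ignore_identical_ids = True

# Sort corpus ids by document length, longest first, as search() does.
corpus_ids = sorted(
    corpus,
    key=lambda k: len(corpus[k].get("title", "") + corpus[k].get("text", "")),
    reverse=True,
)
if ignore_identical_ids:
    # Drop documents whose id matches a query id *before* encoding,
    # so a query can never retrieve itself.
    corpus_ids = [cid for cid in corpus_ids if cid not in query_ids]

print(corpus_ids)  # "q1" is filtered out
```

Filtering before encoding (rather than after retrieval) is the key design point of the PR: the retrieved top-k then never needs post-hoc cleanup.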
@@ -70,21 +74,21 @@ def search(self,
             cos_scores[torch.isnan(cos_scores)] = -1

             # Get top-k values
-            cos_scores_top_k_values, cos_scores_top_k_idx = torch.topk(cos_scores, min(top_k+1, len(cos_scores[1])), dim=1, largest=True, sorted=return_sorted)
+            cos_scores_top_k_values, cos_scores_top_k_idx = torch.topk(cos_scores, min(top_k, len(cos_scores[1])), dim=1, largest=True, sorted=return_sorted)
Review thread on lines -73 to +77:

- You write that …
- IIUC, the problem comes from this line … So we only keep the top_k (which sometimes includes the query inside the retrieved docs).
- I see, I thought the …
- Hmm, then why do we get different results? 🧐
- It's easy to check: we just have to assert that the number of results for each query is …
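The behavior the thread is debating can be reproduced without torch. This is a hypothetical sketch with made-up scores: the old code takes `top_k + 1` candidates and skips the query id afterwards, while the new code takes exactly `top_k` candidates from a corpus that no longer contains the query.

```python
# Hypothetical toy scores: the query "q1" is also in the corpus and
# scores highest against itself.
scores = {"q1": 1.0, "d1": 0.9, "d2": 0.8, "d3": 0.7}
query_id = "q1"
top_k = 2

# Old logic: take top_k + 1 candidates, then drop the query id itself.
old_candidates = sorted(scores, key=scores.get, reverse=True)[: top_k + 1]
old_results = [cid for cid in old_candidates if cid != query_id]

# New logic: the query id was removed from the corpus before scoring,
# so taking exactly top_k candidates suffices.
filtered = {cid: s for cid, s in scores.items() if cid != query_id}
new_results = sorted(filtered, key=filtered.get, reverse=True)[:top_k]

print(old_results, new_results)  # both: ['d1', 'd2']
```

In this toy case both paths agree; the reviewer's suggested assertion (each query ends up with exactly `top_k` results) is what would expose any case where they diverge.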
             cos_scores_top_k_values = cos_scores_top_k_values.cpu().tolist()
             cos_scores_top_k_idx = cos_scores_top_k_idx.cpu().tolist()

             for query_itr in range(len(query_embeddings)):
                 query_id = query_ids[query_itr]
                 for sub_corpus_id, score in zip(cos_scores_top_k_idx[query_itr], cos_scores_top_k_values[query_itr]):
                     corpus_id = corpus_ids[corpus_start_idx+sub_corpus_id]
-                    if corpus_id != query_id:
-                        if len(result_heaps[query_id]) < top_k:
-                            # Push item on the heap
-                            heapq.heappush(result_heaps[query_id], (score, corpus_id))
-                        else:
-                            # If item is larger than the smallest in the heap, push it on the heap then pop the smallest element
-                            heapq.heappushpop(result_heaps[query_id], (score, corpus_id))
+                    assert ignore_identical_ids is False or corpus_id != query_id, "Query id and corpus id should not be the same if ignore_identical_ids is set to True"
+                    if len(result_heaps[query_id]) < top_k:
+                        # Push item on the heap
+                        heapq.heappush(result_heaps[query_id], (score, corpus_id))
+                    else:
+                        # If item is larger than the smallest in the heap, push it on the heap then pop the smallest element
+                        heapq.heappushpop(result_heaps[query_id], (score, corpus_id))

         for qid in result_heaps:
             for score, corpus_id in result_heaps[qid]:
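The heap bookkeeping in the loop above can be demonstrated standalone. A minimal sketch for a single query (toy scores, not the BEIR code): `result_heaps[query_id]` is a min-heap holding at most `top_k` `(score, corpus_id)` pairs, so the smallest retained score is always at the root and gets evicted when a better candidate arrives.

```python
import heapq

top_k = 2
scored = [(0.3, "d1"), (0.9, "d2"), (0.5, "d3"), (0.7, "d4")]

heap = []  # min-heap: the smallest retained score sits at heap[0]
for score, corpus_id in scored:
    if len(heap) < top_k:
        # Heap not full yet: push unconditionally.
        heapq.heappush(heap, (score, corpus_id))
    else:
        # Push the new item and pop the smallest, keeping only top_k entries.
        heapq.heappushpop(heap, (score, corpus_id))

print(sorted(heap, reverse=True))  # [(0.9, 'd2'), (0.7, 'd4')]
```

`heappushpop` is a single O(log k) operation, which is why this streaming top-k stays cheap even when the corpus is scored in large batches.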
Doesn't this make the task "easier" by removing all other queries as options for each query? I.e. previously, given query1 the model could wrongly retrieve query2 (if it was also in the corpus). Now the model cannot retrieve any of the other queries, which makes the task easier, assuming the answer is never another query.
I think this option was for Quora: you want to find paraphrases of a query, but not the original query itself. But this original query will always be ranked first, as it is also part of the corpus.
Which is why we have the ignore_identical_ids option, I think. This PR only tries to fix the ignore_identical_ids=True case.
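The sanity check suggested in the review thread, asserting that every query gets a full result set and never retrieves itself, could look like this. The `results` dict and `top_k` value are hypothetical stand-ins for the output of `search()`:

```python
# Hypothetical retrieval output: query id -> {corpus id: score}
results = {
    "q1": {"d1": 0.9, "d2": 0.8},
    "q2": {"d3": 0.7, "d4": 0.6},
}
top_k = 2

for qid, hits in results.items():
    # Each query should have exactly top_k results when the corpus is
    # large enough, and should never contain the query's own id.
    assert len(hits) == top_k, f"{qid} has {len(hits)} results, expected {top_k}"
    assert qid not in hits, f"{qid} retrieved itself"
print("all queries have", top_k, "results")
```

If the old `top_k + 1`-then-skip logic ever left a query with fewer than `top_k` results, a check like this would catch the divergence the reviewers are discussing.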