Fix DenseRetrievalExactSearch evaluation #154

Open · wants to merge 2 commits into main
20 changes: 12 additions & 8 deletions beir/retrieval/search/dense/exact_search.py
@@ -28,6 +28,7 @@ def search(self,
top_k: int,
score_function: str,
return_sorted: bool = False,
ignore_identical_ids: bool = True,
**kwargs) -> Dict[str, Dict[str, float]]:
# Create embeddings for all queries using model.encode_queries()
# Runs semantic search against the corpus embeddings
@@ -45,6 +46,9 @@ def search(self,
logger.info("Sorting Corpus by document length (Longest first)...")

corpus_ids = sorted(corpus, key=lambda k: len(corpus[k].get("title", "") + corpus[k].get("text", "")), reverse=True)
if ignore_identical_ids:
# We remove the query from results if it exists in corpus
corpus_ids = [cid for cid in corpus_ids if cid not in query_ids]
Contributor

Doesn't this make the task "easier" by removing all other queries as candidates for each query?

I.e. previously, given query1 the model could wrongly retrieve query2 (if it was also in the corpus). Now the model cannot retrieve any of the other queries, which makes the task easier, assuming the answer is never another query.

Member


I think this option was for Quora: you want to find paraphrases of queries, but not the original query itself. But this original query will always be ranked first, as it is also part of the corpus.

Contributor Author


Which is why we have the ignore_identical_ids option, I think. This PR only tries to fix the ignore_identical_ids=True case.
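
For context, here is a minimal sketch (toy ids, not the repository code) of the behavioural difference being discussed: the old per-query skip versus the corpus-level filter this PR adds.

query_ids = ["q1", "q2"]
corpus_ids = ["q1", "q2", "d1", "d2"]  # queries also appear in the corpus (e.g. Quora)

# Old behaviour: every corpus id stays a candidate; each query only skips its
# own id when results are collected, so q2 can still be retrieved for q1.
old_candidates_q1 = [cid for cid in corpus_ids if cid != "q1"]  # ["q2", "d1", "d2"]

# This PR: all query ids are dropped from the corpus up front, so no query can
# be retrieved for any other query either.
filtered_corpus_ids = [cid for cid in corpus_ids if cid not in set(query_ids)]
new_candidates_q1 = filtered_corpus_ids  # ["d1", "d2"]

print(old_candidates_q1, new_candidates_q1)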

corpus = [corpus[cid] for cid in corpus_ids]

logger.info("Encoding Corpus in batches... Warning: This might take a while!")
@@ -70,21 +74,21 @@ def search(self,
cos_scores[torch.isnan(cos_scores)] = -1

# Get top-k values
cos_scores_top_k_values, cos_scores_top_k_idx = torch.topk(cos_scores, min(top_k+1, len(cos_scores[1])), dim=1, largest=True, sorted=return_sorted)
cos_scores_top_k_values, cos_scores_top_k_idx = torch.topk(cos_scores, min(top_k, len(cos_scores[1])), dim=1, largest=True, sorted=return_sorted)
Comment on lines -73 to +77
Contributor


You write that "Which means some queries would have top_k retrieved documents, while others have top_k-1 retrieved documents.", but didn't this +1 ensure that that does not happen, because we retrieve top_k+1 candidates but then only allow top_k later on?

Contributor Author


IIUC, the problem comes from this line:

if len(result_heaps[query_id]) < top_k:

So we only keep the top_k (which sometimes includes the query among the retrieved docs).

Contributor


I see, I thought the if corpus_id != query_id: check would ensure that the query would never be added to result_heaps[query_id] 🧐
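
Spelling that reasoning out with toy numbers (a sketch, not the repository code): with the old +1, each batch yields top_k + 1 candidates, at most one of which is the query itself, so the corpus_id != query_id guard should still leave at least top_k candidates.

top_k = 3
candidates = ["q1", "d7", "d2", "d9"]  # hypothetical top_k + 1 candidates for query "q1"
kept = [cid for cid in candidates if cid != "q1"]  # the old corpus_id != query_id guard
assert len(kept) >= top_k  # at most one candidate is dropped, so top_k remain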

Contributor Author


Hmm, then why do we get different results? 🧐

Contributor Author


It's easy to check: we just have to assert that the number of results for each query is top_k. Can you check that please @Muennighoff?
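
A quick way to run that check (a sketch, assuming a standard BEIR setup where model, corpus, queries and top_k are already defined, and the filtered corpus holds at least top_k documents):

from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

retriever = DRES(model, batch_size=128)
results = retriever.search(corpus, queries, top_k=top_k, score_function="cos_sim")

for qid, docs in results.items():
    assert len(docs) == top_k, f"{qid}: got {len(docs)} results, expected {top_k}"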

cos_scores_top_k_values = cos_scores_top_k_values.cpu().tolist()
cos_scores_top_k_idx = cos_scores_top_k_idx.cpu().tolist()

for query_itr in range(len(query_embeddings)):
query_id = query_ids[query_itr]
for sub_corpus_id, score in zip(cos_scores_top_k_idx[query_itr], cos_scores_top_k_values[query_itr]):
corpus_id = corpus_ids[corpus_start_idx+sub_corpus_id]
if corpus_id != query_id:
if len(result_heaps[query_id]) < top_k:
# Push item on the heap
heapq.heappush(result_heaps[query_id], (score, corpus_id))
else:
# If item is larger than the smallest in the heap, push it on the heap then pop the smallest element
heapq.heappushpop(result_heaps[query_id], (score, corpus_id))
assert ignore_identical_ids is False or corpus_id != query_id, "Query id and corpus id should not be the same if ignore_identical_ids is set to True"
if len(result_heaps[query_id]) < top_k:
# Push item on the heap
heapq.heappush(result_heaps[query_id], (score, corpus_id))
else:
# If item is larger than the smallest in the heap, push it on the heap then pop the smallest element
heapq.heappushpop(result_heaps[query_id], (score, corpus_id))

for qid in result_heaps:
for score, corpus_id in result_heaps[qid]: