how was the hotpot_qa dataset preprocessed? #181

DanielSchuhmacher · 2024-09-02T14:12:37Z

I am curious how you created the list of documents (the corpus). The original hotpot_qa does not come with such a list. Instead, each query comes with only 10 documents: 2 gold documents containing the content needed for the answer and 8 distractor documents. My current assumption is the following: you took the distractor dataset and extracted the documents of all queries to build the corpus. The 2 gold documents from the original hotpot_qa were then marked as the relevant documents for each query.
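To make that assumption concrete, here is a minimal sketch of the preprocessing I have in mind. It is only my guess, not necessarily what this repository does. The toy records are made up; the field names (`_id`, `question`, `context`, `supporting_facts`) follow the layout of the original HotpotQA distractor JSON as I understand it, and `build_corpus_and_qrels` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

# Made-up toy records, shaped like the original HotpotQA distractor JSON
# (assumed layout: "context" is a list of [title, sentences] pairs and
# "supporting_facts" is a list of [title, sentence_index] pairs).
examples = [
    {
        "_id": "q1",
        "question": "In which country was the author of Book A born?",
        "context": [
            ["Book A", ["Book A was written by Jane Doe."]],
            ["Jane Doe", ["Jane Doe was born in France."]],
            ["Book B", ["Book B is an unrelated novel."]],
        ],
        "supporting_facts": [["Book A", 0], ["Jane Doe", 0]],
    },
    {
        "_id": "q2",
        "question": "Which river flows through the capital City C?",
        "context": [
            ["Jane Doe", ["Jane Doe was born in France."]],
            ["City C", ["City C is a capital city."]],
            ["River D", ["River D flows through City C."]],
        ],
        "supporting_facts": [["City C", 0], ["River D", 0]],
    },
]


def build_corpus_and_qrels(
    records: List[dict],
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Pool the documents of all queries into one corpus and mark gold ones."""
    corpus: Dict[str, str] = {}       # document title -> document text
    qrels: Dict[str, List[str]] = {}  # query id -> titles of relevant documents
    for rec in records:
        # Every document attached to a query (gold or distractor) joins the
        # corpus; a title seen under several queries is stored only once.
        for title, sentences in rec["context"]:
            corpus.setdefault(title, " ".join(s.strip() for s in sentences))
        # Documents named in the supporting facts are the relevant ones.
        qrels[rec["_id"]] = sorted({title for title, _ in rec["supporting_facts"]})
    return corpus, qrels


corpus, qrels = build_corpus_and_qrels(examples)
```

In this sketch documents are deduplicated by title, so a document that appears as a distractor for one query and as a gold document for another ends up in the corpus only once.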

Please let me know how it works, since this confuses me quite a lot. Thank you very much!

If my assumption is correct, you could also have a look at the multi-hop-rag dataset, which was created in that format from the start (the corpus is separated from the queries and answers). Its documents are also longer, which I think is a more realistic use case for a retrieval system, especially a RAG system.
