
how was the hotpot_qa dataset preprocessed? #181

Open
DanielSchuhmacher opened this issue Sep 2, 2024 · 0 comments

DanielSchuhmacher commented Sep 2, 2024

I am curious how you created the list of documents (the corpus). The original hotpot_qa dataset does not come with such a list. Instead, each query comes with a list of only 10 documents: 2 gold documents containing the content needed for the answer and 8 distractor documents. My current assumption is the following: you took the distractor dataset and pooled the documents across all queries to build the corpus, and the 2 gold documents from the original HotpotQA were then marked as the relevant documents for their query.
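To make my assumption concrete, here is a rough sketch of the preprocessing I have in mind. This is only my guess at your pipeline, not your actual script; the field names (`context`, `supporting_facts`, `id`) follow the HuggingFace `hotpot_qa` distractor config, and the `examples` below are a tiny synthetic stand-in for the real data:

```python
def build_corpus_and_qrels(examples):
    """Pool documents from all queries into one corpus (deduplicated by
    title) and record the gold document titles as relevant per query."""
    corpus = {}  # title -> full document text
    qrels = {}   # query id -> set of relevant (gold) titles
    for ex in examples:
        titles = ex["context"]["title"]
        sentences = ex["context"]["sentences"]
        for title, sents in zip(titles, sentences):
            corpus.setdefault(title, "".join(sents))
        # the 2 gold documents are the ones named in supporting_facts
        qrels[ex["id"]] = set(ex["supporting_facts"]["title"])
    return corpus, qrels

# Tiny synthetic example in the same shape (gold + distractor docs per query)
examples = [
    {
        "id": "q1",
        "question": "Where was A born?",
        "context": {
            "title": ["Gold A", "Gold B", "Distractor C"],
            "sentences": [
                ["A was born in Y."],
                ["B relates to A."],
                ["C is unrelated."],
            ],
        },
        "supporting_facts": {"title": ["Gold A", "Gold B"], "sent_id": [0, 0]},
    },
]

corpus, qrels = build_corpus_and_qrels(examples)
print(len(corpus))          # 3 documents pooled into the corpus
print(sorted(qrels["q1"]))  # ['Gold A', 'Gold B']
```

Is that roughly what happened, or did the corpus come from somewhere else entirely?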

Please let me know how it actually works, since this confuses me quite a lot. Thank you very much!

If my assumption is correct, you could also have a look at the multi-hop-rag dataset, which was specifically created in that format already (the corpus is separated from the queries and answers). Its documents are also longer, which I think is a more realistic use case for a retrieval system, especially a RAG system.
