Add PubMedQA dataset #740

Merged on Jan 21, 2025 (26 commits)

@kurbanrita kurbanrita commented Dec 27, 2024

Add PubMedQA dataset

PubMedQA is a biomedical question-answering dataset collected from PubMed abstracts. The task of PubMedQA is to answer research questions with yes/no/maybe. The subset that is used for validation has 1k expert-annotated QA instances.

  • Add LanguageDataset
  • Add TextClassification dataset
  • Add PubMedQA dataset, loaded from HuggingFace (this also adds HuggingFace datasets as a new language dependency)
  • Add tests
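
A minimal sketch of how a yes/no/maybe text classification dataset of this kind could be structured. This is not the code from this PR: the class name `ToyYesNoMaybeDataset`, the `QASample` fields, the `load_text`/`load_target` method names, and the label order are all assumptions made for illustration, and the in-memory rows are placeholders standing in for the records that the PR loads from HuggingFace.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

# Assumed label order; the class list used in the PR may differ.
CLASSES: Tuple[str, ...] = ("no", "yes", "maybe")


@dataclass(frozen=True)
class QASample:
    """One question-answering record (field names are illustrative)."""

    question: str
    context: str
    answer: str  # expected to be one of CLASSES


class ToyYesNoMaybeDataset:
    """Hypothetical map-style dataset returning (text, class index) pairs."""

    def __init__(self, samples: Sequence[QASample]) -> None:
        self._class_to_idx: Dict[str, int] = {
            name: idx for idx, name in enumerate(CLASSES)
        }
        # Fail early on labels outside the three expected answers.
        for sample in samples:
            if sample.answer not in self._class_to_idx:
                raise ValueError(f"Unexpected answer label: {sample.answer!r}")
        self._samples: List[QASample] = list(samples)

    def __len__(self) -> int:
        return len(self._samples)

    def load_text(self, index: int) -> str:
        """Builds the model input from the question and its context."""
        sample = self._samples[index]
        return f"Question: {sample.question}\nContext: {sample.context}"

    def load_target(self, index: int) -> int:
        """Maps the textual answer to its class index."""
        return self._class_to_idx[self._samples[index].answer]

    def __getitem__(self, index: int) -> Tuple[str, int]:
        return self.load_text(index), self.load_target(index)


# Placeholder rows, not real PubMedQA records.
dataset = ToyYesNoMaybeDataset(
    [
        QASample("Placeholder question 1?", "Placeholder abstract 1.", "yes"),
        QASample("Placeholder question 2?", "Placeholder abstract 2.", "maybe"),
    ]
)
text, target = dataset[1]
```

Keeping text loading and target loading in separate methods mirrors the split between a generic language dataset and a classification dataset listed above; whether the PR draws the line in the same place is not stated here.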

Remaining questions:

  • Vision validators could probably be reused for language. Maybe refactor some common functions into core?
  • This PR won't be merged to main; development will continue on this branch.

@kurbanrita kurbanrita self-assigned this Dec 27, 2024
@kurbanrita kurbanrita requested a review from ioangatop December 30, 2024 13:54
@kurbanrita kurbanrita marked this pull request as ready for review December 30, 2024 14:01

@ioangatop ioangatop left a comment

Great work @kurbanrita 🎉 here are some initial comments

src/eva/language/data/datasets/classification/base.py (outdated, resolved)
src/eva/language/data/datasets/language.py (outdated, resolved)
src/eva/language/data/datasets/classification/pubmedqa.py (outdated, resolved)

@ioangatop ioangatop left a comment

Some more comments 🤗

@kurbanrita kurbanrita changed the base branch from main to language January 17, 2025 16:23
@kurbanrita kurbanrita requested a review from ioangatop January 17, 2025 16:24

@nkaenzig nkaenzig left a comment

Very nice work @kurbanrita! Left a couple of nitpicks, after those I think this should be ready to go.

src/eva/language/data/datasets/language.py (outdated, resolved)
src/eva/language/data/datasets/classification/pubmedqa.py (outdated, resolved)
src/eva/language/data/datasets/classification/pubmedqa.py (outdated, resolved)

@ioangatop ioangatop left a comment

Great start @kurbanrita 🎉 Let's address @nkaenzig's comments and merge :D

@kurbanrita kurbanrita merged commit f84881c into language Jan 21, 2025
4 checks passed
@kurbanrita kurbanrita deleted the nlp_integration branch January 21, 2025 21:32
3 participants