Clarification on Dataset Files for Each Task #1

Open
Shinning-Zhou opened this issue Sep 29, 2024 · 1 comment
Comments

@Shinning-Zhou

Hi,
In the "Merging for Generative Models" step, I see 20 files uploaded in the finetune dataset link (https://huggingface.co/datasets/lu-vae/natural-dataset/tree/main), but I'm not sure which task each file corresponds to. Can you upload the test dataset configuration file?

Thanks!

@LZY-the-boys
Owner

Thanks for your interest.

The finetuning datasets are detailed in the paper's Appendix D.2. For MMLU and TruthfulQA, which lack official training sets, we used the Dolly-15k dataset for MMLU and the BigBench-sampled dataset for TruthfulQA. For GSM8k and CNN-DailyMail, we use the original training datasets, such as here. I forgot to upload the BigBench dataset; I will do that shortly.
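
As a minimal sketch of how the public training sets mentioned above could be pulled from the Hugging Face Hub (the repository IDs and configs below are my assumptions and may not be the exact versions used for the paper):

```python
from datasets import load_dataset

# MMLU proxy: Dolly-15k instruction data
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

# GSM8k and CNN-DailyMail: original training splits
gsm8k = load_dataset("gsm8k", "main", split="train")
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="train")

print(len(dolly), len(gsm8k), len(cnn_dm))
```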

The test dataset is contained in the HELM evaluation framework; we have actually uploaded a subset here, and its source is configured by this file.
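
In the meantime, to map the uploaded finetuning files to tasks, you can list the repository contents and match the file names against the tasks described in Appendix D.2. A minimal sketch (the `huggingface_hub` call is standard; the name-to-task matching is still manual):

```python
from huggingface_hub import list_repo_files

# List every file in the finetuning dataset repo so the file names
# can be compared against the tasks in Appendix D.2.
files = list_repo_files("lu-vae/natural-dataset", repo_type="dataset")
for f in sorted(files):
    print(f)
```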

Hope this helps.
