LLM Data
- Survey
- LLM Data
- Multi Modal
- Alignment
- Synthetic
- Reasoning
- Action
- Toolkits
- Misc

Survey

A Survey on Data Synthesis and Augmentation for Large Language Models, arXiv, 2410.12896, arxiv, pdf, citation: -1

Ke Wang, Jiahui Zhu, Minjie Ren, ..., Qingjie Liu, Yunhong Wang

LLM Data

Improving Foundation Models Using Expert Human Data
Bridging the Data Provenance Gap Across Text, Speech and Video, arXiv, 2412.17847, arxiv, pdf, citation: -1

Shayne Longpre, Nikhil Singh, Manuel Cherep, ..., Sara Hooker, Jad Kabbara
FineMath consists of 34B tokens (FineMath-3+) and 54B tokens (FineMath-3+ with InfiMM-WebMath-3+) of mathematical educational content filtered from CommonCrawl. 🤗

· (𝕏)
🌟 fineweb-2 - huggingface

· (huggingface)
MultimodalUniverse - MultimodalUniverse

Enabling Large-Scale Machine Learning with 100TBs of Astronomical Scientific Data · (𝕏)
Zyda-2: a 5 Trillion Token High-Quality Dataset, arXiv, 2411.06068, arxiv, pdf, citation: -1

Yury Tokpanov, Paolo Glorioso, Quentin Anthony, ..., Beren Millidge · (huggingface)
dolma - allenai

An open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
🌟 RedPajama: an Open Dataset for Training Large Language Models, arXiv, 2411.12372, arxiv, pdf, citation: -1

Maurice Weber, Daniel Fu, Quentin Anthony, ..., Irina Rish, Ce Zhang
🌟 smollm - huggingface

135M, 360M, and 1.7B parameters. · (smollm - huggingface) · (𝕏) · (huggingface)
Zyda-2 is a 5 trillion token language modeling dataset created by collecting open, high-quality datasets and combining them through cross-deduplication and model-based quality filtering. 🤗
Common Corpus is the largest open and permissively licensed text dataset, comprising over 2 trillion tokens. 🤗
Releasing the largest multilingual open pretraining dataset 🤗
🌟 Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination, arXiv, 2411.03823, arxiv, pdf, citation: -1

Dingjie Song, Sicheng Lai, Shunian Chen, ..., Lichao Sun, Benyou Wang · (MM-Detect - MLLM-Data-Contamination)
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction, arXiv, 2410.21169, arxiv, pdf, citation: -1

Qintong Zhang, Victor Shea-Jay Huang, Bin Wang, ..., Conghui He, Wentao Zhang
Dolma is a dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials. 🤗

· (arxiv)
Arxiver consists of 63,357 arXiv papers converted to multi-markdown (.mmd) format. 🤗
Compute-Constrained Data Selection, arXiv, 2410.16208, arxiv, pdf, citation: -1

Junjie Oscar Yin, Alexander M. Rush

Multi Modal

🌟 A set of vision-language datasets built by Ai2 and used to train the Molmo family of models. 🤗
PangeaIns is a 6M multilingual multicultural multimodal instruction tuning dataset spanning 39 languages. 🤗

· (Pangea - neulab)

Alignment

Open-O1 - Open-Source-O1

A Model Matching Proprietary Power with Open-Source Innovation
🌟 Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing, arXiv, 2406.08464, arxiv, pdf, citation: -1

Zhangchen Xu, Fengqing Jiang, Luyao Niu, ..., Yejin Choi, Bill Yuchen Lin · (huggingface) · (magpie - magpie-align)
Tulu V2.5 Suite: a suite of models trained using DPO and PPO across a wide variety (up to 14) of preference datasets. 🤗
Llama 3.1 Tulu 3 70B Preference Mixture 🤗
Tulu 3 SFT Mixture 🤗
LongReward: Improving Long-context Large Language Models with AI Feedback, arXiv, 2410.21252, arxiv, pdf, citation: -1

Jiajie Zhang, Zhongni Hou, Xin Lv, ..., Ling Feng, Juanzi Li · (LongReward - THUDM) · (huggingface)

Synthetic

Best of 2024: Synthetic Data / Smol Models, Loubna Ben Allal, HuggingFace [LS Live! @ NeurIPS 2024] 🎬
How to Synthesize Text Data without Model Collapse?, arXiv, 2412.14689, arxiv, pdf, citation: -1

Xuekai Zhu, Daixuan Cheng, Hengli Li, ..., Zilong Zheng, Bowen Zhou
Evaluating Language Models as Synthetic Data Generators, arXiv, 2412.03679, arxiv, pdf, citation: -1

Seungone Kim, Juyoung Suk, Xiang Yue, ..., Sean Welleck, Graham Neubig
WizardArena: Post-training Large Language Models via Simulated Offline Chatbot Arena

· (arxiv)
A synthetic set of instruction pairs in which both the prompts and the responses were generated with the AgentInstruct framework. 🤗
🌟 WizardLM: Empowering Large Language Models to Follow Complex Instructions, arXiv, 2304.12244, arxiv, pdf, citation: -1

Can Xu, Qingfeng Sun, Kai Zheng, ..., Chongyang Tao, Daxin Jiang
Fictitious Synthetic Data Can Improve LLM Factuality via Prerequisite Learning, arXiv, 2410.19290, arxiv, pdf, citation: -1

Yujian Liu, Shiyu Chang, Tommi Jaakkola, ..., Yang Zhang · (Prereq_tune.git - UCSB-NLP-Chang)
promptwright - StacklokLabs

· (reddit)
Scaling Synthetic Data Creation with 1,000,000,000 Personas, arXiv, 2406.20094, arxiv, pdf, citation: 17

Tao Ge, Xin Chan, Xiaoyang Wang, ..., Haitao Mi, Dong Yu
Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch, arXiv, 2410.18693, arxiv, pdf, citation: -1

Yuyang Ding, Xinyu Shi, Xiaobo Liang, ..., Qiaoming Zhu, Min Zhang
