-
A Survey on Data Synthesis and Augmentation for Large Language Models,
arXiv, 2410.12896
, arxiv, pdf, citation: -1
Ke Wang, Jiahui Zhu, Minjie Ren, ..., Qingjie Liu, Yunhong Wang
-
Bridging the Data Provenance Gap Across Text, Speech and Video,
arXiv, 2412.17847
, arxiv, pdf, citation: -1
Shayne Longpre, Nikhil Singh, Manuel Cherep, ..., Sara Hooker, Jad Kabbara
-
🌟 fineweb-2 - huggingface
· (huggingface)
-
MultimodalUniverse - MultimodalUniverse
Enabling Large-Scale Machine Learning with 100TBs of Astronomical Scientific Data · (𝕏)
-
Zyda-2: a 5 Trillion Token High-Quality Dataset,
arXiv, 2411.06068
, arxiv, pdf, citation: -1
Yury Tokpanov, Paolo Glorioso, Quentin Anthony, ..., Beren Millidge · (huggingface)
-
dolma - allenai
An open dataset of 3 trillion tokens drawn from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
-
🌟 RedPajama: an Open Dataset for Training Large Language Models,
arXiv, 2411.12372
, arxiv, pdf, citation: -1
Maurice Weber, Daniel Fu, Quentin Anthony, ..., Irina Rish, Ce Zhang
-
🌟 smollm - huggingface
135M, 360M, and 1.7B parameters. · (smollm - huggingface) · (𝕏) · (huggingface)
-
Releasing the largest multilingual open pretraining dataset 🤗
-
🌟 Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination,
arXiv, 2411.03823
, arxiv, pdf, citation: -1
Dingjie Song, Sicheng Lai, Shunian Chen, ..., Lichao Sun, Benyou Wang · (MM-Detect - MLLM-Data-Contamination)
-
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction,
arXiv, 2410.21169
, arxiv, pdf, citation: -1
Qintong Zhang, Victor Shea-Jay Huang, Bin Wang, ..., Conghui He, Wentao Zhang
-
· (arxiv)
-
Arxiver consists of 63,357 arXiv papers converted to multi-markdown (.mmd) format. 🤗
-
Compute-Constrained Data Selection,
arXiv, 2410.16208
, arxiv, pdf, citation: -1
Junjie Oscar Yin, Alexander M. Rush
-
🌟 A set of vision-language datasets built by Ai2 and used to train the Molmo family of models. 🤗
-
· (Pangea - neulab)
-
Open-O1 - Open-Source-O1
A Model Matching Proprietary Power with Open-Source Innovation
-
🌟 Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing,
arXiv, 2406.08464
, arxiv, pdf, citation: -1
Zhangchen Xu, Fengqing Jiang, Luyao Niu, ..., Yejin Choi, Bill Yuchen Lin · (huggingface) · (magpie - magpie-align)
-
LongReward: Improving Long-context Large Language Models with AI Feedback,
arXiv, 2410.21252
, arxiv, pdf, citation: -1
Jiajie Zhang, Zhongni Hou, Xin Lv, ..., Ling Feng, Juanzi Li · (LongReward - THUDM) · (huggingface)
-
Best of 2024: Synthetic Data / Smol Models, Loubna Ben Allal, HuggingFace [LS Live! @ NeurIPS 2024] 🎬
-
How to Synthesize Text Data without Model Collapse?,
arXiv, 2412.14689
, arxiv, pdf, citation: -1
Xuekai Zhu, Daixuan Cheng, Hengli Li, ..., Zilong Zheng, Bowen Zhou
-
Evaluating Language Models as Synthetic Data Generators,
arXiv, 2412.03679
, arxiv, pdf, citation: -1
Seungone Kim, Juyoung Suk, Xiang Yue, ..., Sean Welleck, Graham Neubig
-
WizardArena: Post-training Large Language Models via Simulated Offline Chatbot Arena
· (arxiv)
-
🌟 WizardLM: Empowering Large Language Models to Follow Complex Instructions,
arXiv, 2304.12244
, arxiv, pdf, citation: -1
Can Xu, Qingfeng Sun, Kai Zheng, ..., Chongyang Tao, Daxin Jiang
-
Fictitious Synthetic Data Can Improve LLM Factuality via Prerequisite Learning,
arXiv, 2410.19290
, arxiv, pdf, citation: -1
Yujian Liu, Shiyu Chang, Tommi Jaakkola, ..., Yang Zhang · (Prereq_tune.git - UCSB-NLP-Chang)
-
promptwright - StacklokLabs
· (reddit)
-
Scaling Synthetic Data Creation with 1,000,000,000 Personas,
arXiv, 2406.20094
, arxiv, pdf, citation: 17
Tao Ge, Xin Chan, Xiaoyang Wang, ..., Haitao Mi, Dong Yu
-
Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch,
arXiv, 2410.18693
, arxiv, pdf, citation: -1
Yuyang Ding, Xinyu Shi, Xiaobo Liang, ..., Qiaoming Zhu, Min Zhang