- Enabling Scalable Oversight via Self-Evolving Critic, arXiv, 2501.05727 · (arxiv) · (pdf)
  Zhengyang Tang, Ziniu Li, Zhenyang Xiao, ..., Bowen Yu, Junyang Lin
- 🌟 REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models, arXiv, 2501.03262 · (arxiv) · (pdf)
  Jian Hu · (OpenRLHF - OpenRLHF)
- 🌟 DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, arXiv, 2402.03300, citations: 155 · (arxiv) · (pdf)
  Zhihong Shao, Peiyi Wang, Qihao Zhu, ..., Y. Wu, Daya Guo · (𝕏)
- 🌟 VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment, arXiv, 2410.01679 · (arxiv) · (pdf)
  Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, ..., Aaron Courville, Nicolas Le Roux · (VinePPO - McGill-NLP)
- Analyzing OpenAI’s Reinforcement Fine-Tuning: Less Data, Better Results
- Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization, arXiv, 2410.09302 · (arxiv) · (pdf)
  Guanlin Liu, Kaixuan Ji, Renjie Zheng, ..., Quanquan Gu, Lin Yan
- Offline Reinforcement Learning for LLM Multi-Step Reasoning, arXiv, 2412.16145 · (arxiv) · (pdf)
  Huaijie Wang, Shibo Hao, Hanze Dong, ..., Ziran Yang, Yi Wu · (OREO - jwhj)
- Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering, arXiv, 2411.11504 · (arxiv) · (pdf)
  Xinyan Guan, Yanjiang Liu, Xinyu Lu, ..., Yaojie Lu, Hongyu Lin
- Evaluating the role of Constitutions for learning from AI feedback, arXiv, 2411.10168 · (arxiv) · (pdf)
  Saskia Redgate, Andrew M. Bean, Adam Mahdi
- 🌟 Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback, arXiv, 2406.09279 · (arxiv) · (pdf)
  Hamish Ivison, Yizhong Wang, Jiacheng Liu, ..., Yejin Choi, Hannaneh Hajishirzi · (EasyLM - hamishivi) · (open-instruct - allenai) · (huggingface)
- 🌟 Tülu 3: The next era in open post-training · (hf) · (hf) · (hf) · (open-instruct - allenai) · (olmes - allenai) · (playground.allenai)
- 🌟 Everything You Wanted to Know About LLM Post-Training, with Nathan Lambert of Allen Institute for AI 🎬
- Direct Preference Optimization Using Sparse Feature-Level Constraints, arXiv, 2411.07618 · (arxiv) · (pdf)
  Qingyu Yin, Chak Tou Leong, Hongbo Zhang, ..., Yue Zhang, Linyi Yang
- Mira: A Decentralized Network for Trustless AI Output Verification · (mira) · (huggingface)
- Self-Evolved Reward Learning for LLMs, arXiv, 2411.00418 · (arxiv) · (pdf)
  Chenghua Huang, Zhizhen Fan, Lu Wang, ..., Saravan Rajmohan, Qi Zhang
- Self-Consistency Preference Optimization, arXiv, 2411.04109 · (arxiv) · (pdf)
  Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, ..., Jason Weston, Jane Yu
- Evolving Alignment via Asymmetric Self-Play, arXiv, 2411.00062 · (arxiv) · (pdf)
  Ziyu Ye, Rishabh Agarwal, Tianqi Liu, ..., Qijun Tan, Yuan Liu · (jiqizhixin)
- Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback, arXiv, 2410.19133 · (arxiv) · (pdf)
  Lester James V. Miranda, Yizhong Wang, Yanai Elazar, ..., Hannaneh Hajishirzi, Pradeep Dasigi
- LongReward: Improving Long-context Large Language Models with AI Feedback, arXiv, 2410.21252 · (arxiv) · (pdf)
  Jiajie Zhang, Zhongni Hou, Xin Lv, ..., Ling Feng, Juanzi Li · (LongReward - THUDM) · (huggingface)
- Thinking LLMs: General Instruction Following With Thought Generation 𝕏
- MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models, arXiv, 2410.17637 · (arxiv) · (pdf)
  Ziyu Liu, Yuhang Zang, Xiaoyi Dong, ..., Dahua Lin, Jiaqi Wang
- Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs, arXiv, 2410.18451 · (arxiv) · (pdf)
  Chris Yuhao Liu, Liang Zeng, Jiacai Liu, ..., Yang Liu, Yahui Zhou · (huggingface) · (huggingface) · (huggingface)
- benchmark: Preference Proxy Evaluations (PPE) 𝕏 · (blog.lmarena) · (arxiv) · (PPE - lmarena)
- RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style, arXiv, 2410.16184 · (arxiv) · (pdf)
  Yantao Liu, Zijun Yao, Rui Min, ..., Lei Hou, Juanzi Li · (RM-Bench - THU-KEG)
- 🌟 OpenRLHF - OpenRLHF · (arxiv) · (docs.google)
- verl - volcengine
  Volcano Engine Reinforcement Learning for LLMs · (arxiv)
- Why RLHF (and Other RL-Like Methods) Don’t Bring “True RL” to LLMs—and Why It Matters
- Advanced Tricks for Training Large Language Models with Proximal Policy Optimization
- Tulu 3: Exploring Frontiers in Open Language Model Post-Training - Nathan Lambert (AI2) 🎬
- 🎬 Generative Reward Models: Merging the Power of RLHF and RLAIF for Smarter AI