O1-CODER: An O1 Replication for Coding (Paper)
O1-CODER is an attempt to replicate OpenAI's O1 model, focused on coding tasks. The approach combines Reinforcement Learning (RL) and Monte Carlo Tree Search (MCTS) to enhance the model’s System-2 thinking capabilities, aiming to generate more efficient and logical code.
The core components of O1-CODER are:
- Test Case Generator (TCG): Automatically generates standardized test cases that are used to evaluate the correctness of the generated code (a sketch of how such test-case feedback could be turned into a reward appears after this list).
- Self-Play and Reinforcement Learning: The model generates reasoning data through self-play and uses RL together with MCTS to iteratively optimize the policy model (see the MCTS sketch below). The two components run in an iterative cycle, continuously refining the model's systematic reasoning on coding tasks.
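To make the TCG's role concrete, here is a minimal sketch of how generated test cases could be turned into an outcome reward: run the candidate program against each case and use the pass rate as the score. The test-case format (stdin text plus expected stdout) and the subprocess-based harness are assumptions for illustration, not the repository's actual interface.

```python
import os
import subprocess
import tempfile

def pass_rate(candidate_code: str, test_cases: list[dict], timeout: float = 5.0) -> float:
    """Hypothetical outcome reward: fraction of generated test cases the candidate passes.

    Each test case is assumed to be a dict with 'input' (stdin text) and
    'expected' (expected stdout); this is one common convention, not
    necessarily the one used by O1-CODER's Test Case Generator.
    """
    passed = 0
    # Write the candidate program to a temporary file so it can be executed.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code)
        path = f.name
    try:
        for case in test_cases:
            try:
                result = subprocess.run(
                    ["python", path],
                    input=case["input"],
                    capture_output=True,
                    text=True,
                    timeout=timeout,
                )
                if result.stdout.strip() == case["expected"].strip():
                    passed += 1
            except subprocess.TimeoutExpired:
                continue  # timed-out runs count as failures
    finally:
        os.unlink(path)
    return passed / max(len(test_cases), 1)
```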
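And here is a generic sketch of the MCTS loop such a reward could plug into, with partial reasoning/code steps as tree nodes. `propose_steps` (candidate next steps from the policy model) and `evaluate_terminal` (e.g. the pass-rate reward above) are placeholder hooks, and the exploration constant is illustrative rather than taken from the paper.

```python
import math
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    state: list[str]                       # partial reasoning/code steps so far
    parent: "Node | None" = None
    children: list["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0                     # accumulated reward

def ucb(node: Node, c: float = 1.4) -> float:
    """Upper confidence bound used for child selection."""
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits
    )

def mcts(root: Node, propose_steps, evaluate_terminal, is_terminal, n_sims: int = 100) -> Node:
    """Generic MCTS loop: select by UCB, expand with policy proposals,
    roll out to a terminal state, and back up the test-case reward.
    Assumes the root state is non-terminal."""
    for _ in range(n_sims):
        # Selection: descend to a leaf by UCB.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: add policy-proposed next steps.
        if not is_terminal(node.state):
            for step in propose_steps(node.state):
                node.children.append(Node(state=node.state + [step], parent=node))
            node = random.choice(node.children)
        # Simulation: roll out randomly to a terminal state and score it.
        state = node.state
        while not is_terminal(state):
            state = state + [random.choice(propose_steps(state))]
        reward = evaluate_terminal(state)
        # Backpropagation: propagate the reward up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the most-visited child as the chosen next step.
    return max(root.children, key=lambda n: n.visits)
```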
Recent updates:
- Updated the Reward Aggregator.
- Updated the training code for the process reward model and the Test Case Generator.
- Updated the MCTS-based data synthesis code for O1-CODER.
- Updated the technical report for O1-CODER.
TODO: Reinforcement Learning code, curated datasets, and derived models.
TODO: Reinforcement Fine-Tuning (RFT) version of O1-Coder. Because the test case generator can produce diverse process supervision data from only a small amount of ground-truth code, the RFT version will skip using CoT data to initialize the policy model.
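As a rough illustration of what such an RFT-style pipeline could look like, the sketch below performs rejection sampling: sample several candidate solutions per problem, score them with TCG-generated test cases, and keep only the passing ones as fine-tuning data. All function names, argument layouts, and thresholds here are hypothetical placeholders, not the repository's actual API.

```python
from typing import Callable

def build_rft_dataset(
    problems: list[dict],
    sample_solutions: Callable[[str, int], list[str]],
    generate_tests: Callable[[str], list[dict]],
    score: Callable[[str, list[dict]], float],
    n_samples: int = 16,
    keep_threshold: float = 1.0,
) -> list[dict]:
    """Hypothetical rejection-sampling loop for building RFT data.

    For each problem, sample candidate solutions from the current policy,
    score them against TCG-generated test cases (e.g. with a pass-rate
    reward like the sketch above), and keep only candidates that clear the
    threshold as fine-tuning targets. Names and data layout are assumptions.
    """
    dataset = []
    for problem in problems:
        tests = generate_tests(problem["prompt"])            # TCG-style generated tests
        for solution in sample_solutions(problem["prompt"], n_samples):
            if score(solution, tests) >= keep_threshold:
                dataset.append({"prompt": problem["prompt"], "completion": solution})
    return dataset
```

A threshold of 1.0 keeps only solutions that pass every generated test case; a lower value would trade data quality for quantity.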
This work is released under the MIT License. See the LICENSE file for more details. By using this code or associated materials, you agree to comply with the terms outlined in the license.
If you use O1-CODER or parts of this work in your research or applications, please cite the following paper:
@misc{zhang2024o1codero1replicationcoding,
  title={O1-Coder: An O1 Replication for Coding},
  author={Yuxiang Zhang and Shangxi Wu and Yuqi Yang and Jiangming Shu and Jinlin Xiao and Chao Kong and Jitao Sang},
  year={2024},
  eprint={2412.00154},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2412.00154},
}