diff --git a/README.md b/README.md
index 9ef2dee..22e842d 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@
-
+
 [![PyPI - Version](https://img.shields.io/pypi/v/oat-llm.svg)](https://pypi.org/project/oat-llm)
@@ -34,7 +34,7 @@ LLM alignment is essentially an online learning and decision making problem wher
 In our [paper](https://arxiv.org/abs/2411.01493), we formalize LLM alignment as a **contextual dueling bandit (CDB)** problem (see illustration below) and propose a sample-efficient alignment approach based on Thompson sampling.
-
+
 The CDB framework necessitates an efficient online training system to validate the proposed method and compare it with other baselines. Oat 🌾 is developed as part of this research initiative.
@@ -42,7 +42,7 @@ The CDB framework necessitates an efficient online training system to validate t
 Using the CDB framework, existing LLM alignment paradigms can be summarized as follows:
-
+
 For more details, please check out our [paper](https://arxiv.org/abs/2411.01493)!
@@ -128,7 +128,7 @@ python -m oat.experiment.main \
 ```
-
+
 Check out this [tutorial](./examples/) for more examples covering:
@@ -140,11 +140,11 @@ Check out this [tutorial](./examples/) for more examples covering:
 The benchmarking compares oat with the online DPO implementation from [huggingface/trl](https://huggingface.co/docs/trl/main/en/online_dpo_trainer). Below, we outline the configurations used for oat and present the benchmarking results. Notably, oat 🌾 achieves up to **2.5x** computational efficiency compared to trl 🤗.
-
+
-
+
 Please refer to [Appendix C of our paper](https://arxiv.org/pdf/2411.01493#page=17.64) for a detailed discussion of the benchmarking methods and results.
@@ -175,4 +175,4 @@ We thank the following awesome projects that have contributed to the development
 ## Disclaimer
-This is not an official Sea Limited or Garena Online Private Limited product.
\ No newline at end of file
+This is not an official Sea Limited or Garena Online Private Limited product.