From bd76f1d072b0031a0b29f3599b9df94ae950bf3a Mon Sep 17 00:00:00 2001
From: zclzc <38581401+lkevinzc@users.noreply.github.com>
Date: Tue, 5 Nov 2024 19:48:15 +0800
Subject: [PATCH 1/4] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 9ef2dee..ad3ca97 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@
-
+
 [![PyPI - Version](https://img.shields.io/pypi/v/oat-llm.svg)](https://pypi.org/project/oat-llm)
@@ -175,4 +175,4 @@ We thank the following awesome projects that have contributed to the development
 ## Disclaimer
-This is not an official Sea Limited or Garena Online Private Limited product.
\ No newline at end of file
+This is not an official Sea Limited or Garena Online Private Limited product.

From cde0b050920c5f242385d2d4217199066d884be5 Mon Sep 17 00:00:00 2001
From: zclzc <38581401+lkevinzc@users.noreply.github.com>
Date: Tue, 5 Nov 2024 19:50:36 +0800
Subject: [PATCH 2/4] Update README.md

---
 README.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index ad3ca97..6f1b34e 100644
--- a/README.md
+++ b/README.md
@@ -34,7 +34,7 @@ LLM alignment is essentially an online learning and decision making problem wher
 In our [paper](https://arxiv.org/abs/2411.01493), we formalize LLM alignment as a **contextual dueling bandit (CDB)** problem (see illustration below) and propose a sample-efficient alignment approach based on Thompson sampling.
-
+
 The CDB framework necessitates an efficient online training system to validate the proposed method and compare it with other baselines. Oat 🌾 is developed as part of this research initiative.
@@ -42,7 +42,7 @@ The CDB framework necessitates an efficient online training system to validate t
 Using the CDB framework, existing LLM alignment paradigms can be summarized as follows:
-
+
 For more details, please check out our [paper](https://arxiv.org/abs/2411.01493)!
@@ -128,7 +128,7 @@ python -m oat.experiment.main \
 ```
-
+
 Check out this [tutorial](./examples/) for more examples covering:
@@ -140,11 +140,11 @@ Check out this [tutorial](./examples/) for more examples covering:
 The benchmarking compares oat with the online DPO implementation from [huggingface/trl](https://huggingface.co/docs/trl/main/en/online_dpo_trainer). Below, we outline the configurations used for oat and present the benchmarking results. Notably, oat 🌾 achieves up to **2.5x** computational efficiency compared to trl 🤗.
-
+
-
+
 Please refer to [Appendix C of our paper](https://arxiv.org/pdf/2411.01493#page=17.64) for a detailed discussion of the benchmarking methods and results.

From e48203ea940563056b5cc9f4e7cb0114a606474d Mon Sep 17 00:00:00 2001
From: zclzc <38581401+lkevinzc@users.noreply.github.com>
Date: Tue, 5 Nov 2024 19:52:53 +0800
Subject: [PATCH 3/4] Update README.md

---
 README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 6f1b34e..f8d8260 100644
--- a/README.md
+++ b/README.md
@@ -34,7 +34,7 @@ LLM alignment is essentially an online learning and decision making problem wher
 In our [paper](https://arxiv.org/abs/2411.01493), we formalize LLM alignment as a **contextual dueling bandit (CDB)** problem (see illustration below) and propose a sample-efficient alignment approach based on Thompson sampling.
-
+
 The CDB framework necessitates an efficient online training system to validate the proposed method and compare it with other baselines. Oat 🌾 is developed as part of this research initiative.
@@ -128,7 +128,7 @@ python -m oat.experiment.main \
 ```
-
+
 Check out this [tutorial](./examples/) for more examples covering:
@@ -140,11 +140,11 @@ Check out this [tutorial](./examples/) for more examples covering:
 The benchmarking compares oat with the online DPO implementation from [huggingface/trl](https://huggingface.co/docs/trl/main/en/online_dpo_trainer). Below, we outline the configurations used for oat and present the benchmarking results. Notably, oat 🌾 achieves up to **2.5x** computational efficiency compared to trl 🤗.
-
+
-
+
 Please refer to [Appendix C of our paper](https://arxiv.org/pdf/2411.01493#page=17.64) for a detailed discussion of the benchmarking methods and results.

From 451ebfafe5925a580cfb8b5068af772cc934ba7a Mon Sep 17 00:00:00 2001
From: zclzc <38581401+lkevinzc@users.noreply.github.com>
Date: Tue, 5 Nov 2024 19:53:55 +0800
Subject: [PATCH 4/4] Update README.md

---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index f8d8260..22e842d 100644
--- a/README.md
+++ b/README.md
@@ -128,7 +128,7 @@ python -m oat.experiment.main \
 ```
-
+
 Check out this [tutorial](./examples/) for more examples covering:
@@ -140,11 +140,11 @@ Check out this [tutorial](./examples/) for more examples covering:
 The benchmarking compares oat with the online DPO implementation from [huggingface/trl](https://huggingface.co/docs/trl/main/en/online_dpo_trainer). Below, we outline the configurations used for oat and present the benchmarking results. Notably, oat 🌾 achieves up to **2.5x** computational efficiency compared to trl 🤗.
-
+
-
+
 Please refer to [Appendix C of our paper](https://arxiv.org/pdf/2411.01493#page=17.64) for a detailed discussion of the benchmarking methods and results.