update API to use latest TRL #182

Merged · 17 commits · Jul 30, 2024
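
The recurring change across the recipe configs below is a pair of renamed training arguments. A minimal before/after sketch for orientation (a hypothetical config excerpt, not a file from this repository):

```yaml
# Before this PR (older argument names, hypothetical excerpt)
evaluation_strategy: steps
use_flash_attention_2: true

# After this PR (names used throughout the diff below)
eval_strategy: steps
attn_implementation: flash_attention_2
```
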
4 changes: 2 additions & 2 deletions README.md
@@ -49,8 +49,8 @@ If you would like to train chat models on your own datasets, we recommend follow

The initial release of the handbook will focus on the following techniques:

-* **Continued pretraining:** adapt language models to a new language or domain, or simply improve it by continue pretraning (causal language modeling) on a new dataset.
-* **Supervised fine-tuning:** teach language models to follow instructions and tips on how to collect and curate your own training dataset.
+* **Continued pretraining:** adapt language models to a new language or domain, or simply improve it by continued pretraining (causal language modeling) on a new dataset.
+* **Supervised fine-tuning:** teach language models to follow instructions and tips on how to collect and curate your training dataset.
* **Reward modeling:** teach language models to distinguish model responses according to human or AI preferences.
* **Rejection sampling:** a simple, but powerful technique to boost the performance of your SFT model.
* **Direct preference optimisation (DPO):** a powerful and promising alternative to PPO.
2 changes: 1 addition & 1 deletion recipes/constitutional-ai/README.md
@@ -21,4 +21,4 @@ ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_con

## Advanced: generating you own dataset

-To generate the constitutional AI dataset, see https://github.com/huggingface/llm-swarm/tree/main/examples/constitutional-ai for detailed instructions if you want build or customize the dataset.
+To generate the constitutional AI dataset, see https://github.com/huggingface/llm-swarm/tree/main/examples/constitutional-ai for detailed instructions if you want to build or customize the dataset.
2 changes: 1 addition & 1 deletion recipes/constitutional-ai/dpo/config_anthropic.yaml
@@ -17,7 +17,7 @@ bf16: true
beta: 0.1
do_eval: true
do_train: true
-evaluation_strategy: steps
+eval_strategy: steps
eval_steps: 1000
gradient_accumulation_steps: 1
gradient_checkpointing: true
4 changes: 2 additions & 2 deletions recipes/constitutional-ai/sft/config_anthropic.yaml
@@ -2,7 +2,7 @@
model_name_or_path: mistralai/Mistral-7B-v0.1
model_revision: main
torch_dtype: bfloat16
-use_flash_attention_2: true
+attn_implementation: flash_attention_2

# Data training arguments
chat_template: "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n' + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"
@@ -18,7 +18,7 @@ preprocessing_num_workers: 12
bf16: true
do_eval: true
do_train: true
-evaluation_strategy: epoch # One of ["no", "steps", "epoch"]
+eval_strategy: epoch # One of ["no", "steps", "epoch"]
gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
4 changes: 2 additions & 2 deletions recipes/constitutional-ai/sft/config_grok.yaml
@@ -2,7 +2,7 @@
model_name_or_path: mistralai/Mistral-7B-v0.1
model_revision: main
torch_dtype: bfloat16
-use_flash_attention_2: true
+attn_implementation: flash_attention_2

# Data training arguments
chat_template: "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n' + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"
@@ -18,7 +18,7 @@ preprocessing_num_workers: 12
bf16: true
do_eval: true
do_train: true
-evaluation_strategy: epoch # One of ["no", "steps", "epoch"]
+eval_strategy: epoch # One of ["no", "steps", "epoch"]
gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
4 changes: 2 additions & 2 deletions recipes/gpt2-nl/README.md
@@ -2,7 +2,7 @@

This directory shows a base example of how to use continued pretraining and further tuning to adapt a language model to new data (e.g. a new language or domain).

-Three steps are needed: continued pretraining (`cpt`), supervised finetuning (`sft`), and direct preference optimisation (`dpo`). In this dummy example we'll continue pretraining gpt2 on Dutch raw data, then sft-tuning it, and finally aligning it with DPO. Note that no extensive hyperparameters were tested in this example and that the output models are bad - it is just to show you how you can use the scripts for LM adaptation. The scripts work on 4x 3090s (24GB VRAM). If you have less powerful hardware you may need to reduce the batch size.
+Three steps are needed: continued pretraining (`cpt`), supervised finetuning (`sft`), and direct preference optimisation (`dpo`). In this dummy example, we'll continue pretraining gpt2 on Dutch raw data, then sft-tuning it, and finally aligning it with DPO. Note that no extensive hyperparameters were tested in this example and that the output models are bad - it is just to show you how you can use the scripts for LM adaptation. The scripts work on 4x 3090s (24GB VRAM). If you have less powerful hardware you may need to reduce the batch size.

## Continued pretraining

@@ -18,7 +18,7 @@ ACCELERATE_LOG_LEVEL=info accelerate launch \

## Supervised finetuning

-As other recipes, such as the famous zephyr-7b-beta recipe, have shown, we can then teach our model how to hold a conversation by finetuning it on chat-formatted data. As a base model we'll make use of the output of the previous step.
+As other recipes, such as the famous zephyr-7b-beta recipe, have shown, we can then teach our model how to hold a conversation by finetuning it on chat-formatted data. As a base model, we'll make use of the output of the previous step.

```shell
ACCELERATE_LOG_LEVEL=info accelerate launch \
2 changes: 1 addition & 1 deletion recipes/gpt2-nl/cpt/config_full.yaml
@@ -15,7 +15,7 @@ preprocessing_num_workers: 12
# SFT trainer config
bf16: true
do_eval: False
-evaluation_strategy: "no"
+eval_strategy: "no"
gradient_accumulation_steps: 1
gradient_checkpointing: true
gradient_checkpointing_kwargs:
2 changes: 1 addition & 1 deletion recipes/gpt2-nl/dpo/config_full.yaml
@@ -16,7 +16,7 @@ preprocessing_num_workers: 12
bf16: true
beta: 0.1
do_eval: true
-evaluation_strategy: steps
+eval_strategy: steps
eval_steps: 100
gradient_accumulation_steps: 8
gradient_checkpointing: true
2 changes: 1 addition & 1 deletion recipes/gpt2-nl/sft/config_full.yaml
@@ -15,7 +15,7 @@ preprocessing_num_workers: 12
# SFT trainer config
bf16: true
do_eval: true
-evaluation_strategy: epoch
+eval_strategy: epoch
gradient_accumulation_steps: 1
gradient_checkpointing: true
gradient_checkpointing_kwargs:
5 changes: 3 additions & 2 deletions recipes/pref_align_scan/README.md
@@ -5,13 +5,14 @@ This directory contains various comparisons for three algorithms: DPO, IPO, and
- OpenHermes-2.5 and the OpenOrca datasets

We release a collection containing the datasets and models used for these experiments, if you require the other trained models, we can release them on request.
-You can find a longer decription of there results in our [blogpost](https://huggingface.co/blog/pref-tuning)
+You can find a longer description of these results in our [blogpost](https://huggingface.co/blog/pref-tuning)

## Comparisons
For each algorithm, we aim to tune the beta parameter for a fixed learning rate. We vary beta from 0.1-0.9 in steps of 0.1, we have also found that in certain configurations a tiny value of beta, 0.01, can be effective. So we have included this smaller value in all our comparisons.

## Usage
The experiments can be launched with the following bash script:
-```
+```bash
#!/bin/bash

# Define an array containing the base configs we wish to fine tune
2 changes: 1 addition & 1 deletion recipes/pref_align_scan/dpo/config_openhermes.yaml
@@ -16,7 +16,7 @@ beta: 0.01
loss_type: sigmoid
do_eval: true
do_train: true
-evaluation_strategy: steps
+eval_strategy: steps
eval_steps: 100
gradient_accumulation_steps: 2
gradient_checkpointing: true
2 changes: 1 addition & 1 deletion recipes/pref_align_scan/dpo/config_zephyr.yaml
@@ -15,7 +15,7 @@ bf16: true
beta: 0.01
loss_type: sigmoid
do_eval: true
-evaluation_strategy: steps
+eval_strategy: steps
eval_steps: 100
gradient_accumulation_steps: 2
gradient_checkpointing: true
2 changes: 1 addition & 1 deletion recipes/starchat2-15b/dpo/config_v0.1.yaml
@@ -16,7 +16,7 @@ preprocessing_num_workers: 12
bf16: true
beta: 0.05
do_eval: true
-evaluation_strategy: steps
+eval_strategy: steps
eval_steps: 100
gradient_accumulation_steps: 8
gradient_checkpointing: true
4 changes: 2 additions & 2 deletions recipes/starchat2-15b/sft/config_v0.1.yaml
@@ -2,7 +2,7 @@
model_name_or_path: bigcode/starcoder2-15b
model_revision: main
torch_dtype: bfloat16
-use_flash_attention_2: true
+attn_implementation: flash_attention_2

# Data training arguments
chat_template: "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
@@ -20,7 +20,7 @@ preprocessing_num_workers: 24
# SFT trainer config
bf16: true
do_eval: true
-evaluation_strategy: epoch
+eval_strategy: epoch
gradient_accumulation_steps: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
2 changes: 1 addition & 1 deletion recipes/zephyr-141b-A35b/orpo/config_full.yaml
@@ -2,7 +2,7 @@
model_name_or_path: mistral-community/Mixtral-8x22B-v0.1
model_revision: main
torch_dtype: bfloat16
-use_flash_attention_2: true
+attn_implementation: flash_attention_2

# Data training arguments
chat_template: "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n' + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"
10 changes: 5 additions & 5 deletions recipes/zephyr-7b-beta/README.md
@@ -4,9 +4,9 @@
As described in the Zephyr [technical report](https://huggingface.co/papers/2310.16944), training this model proceeds in two steps:

1. Apply SFT to fine-tune Mistral 7B on a filtered version of the UltraChat dataset ([link](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)). The result is an SFT model like [`zephyr-7b-sft-full`](https://huggingface.co/alignment-handbook/zephyr-7b-sft-full) or [`zephyr-7b-sft-qlora`](https://huggingface.co/alignment-handbook/zephyr-7b-sft-qlora).
-2. Align the SFT model to AI feedback via DPO on a preprocessed version of the UltraFeedback dataset ([link](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized)). The result is an DPO model like [`zephyr-7b-dpo-full`](https://huggingface.co/alignment-handbook/zephyr-7b-dpo-full) or [`zephyr-7b-dpo-qlora`](https://huggingface.co/alignment-handbook/zephyr-7b-dpo-qlora).
+2. Align the SFT model to AI feedback via DPO on a preprocessed version of the UltraFeedback dataset ([link](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized)). The result is a DPO model like [`zephyr-7b-dpo-full`](https://huggingface.co/alignment-handbook/zephyr-7b-dpo-full) or [`zephyr-7b-dpo-qlora`](https://huggingface.co/alignment-handbook/zephyr-7b-dpo-qlora).

-**Note:** after the release of Zephyr, the team at [Argilla](https://argilla.io) found that the source UltraFeedback dataset had a few thousand incorrect preference labels from GPT-4. Additionally, TRL's `SFTTrainer` had a bug in the learning rate scheduler which terminated training early. Accounting for these changes led us to find a better set of hyperparameters from those described in the technical report. In particular, for DPO training we found that training for 1 epoch with `beta=0.01` was suffucient to achieve comparable performance to `zephyr-7b-beta` (vs. 3 epochs with `beta=0.1`).
+**Note:** after the release of Zephyr, the team at [Argilla](https://argilla.io) found that the source UltraFeedback dataset had a few thousand incorrect preference labels from GPT-4. Additionally, TRL's `SFTTrainer` had a bug in the learning rate scheduler which terminated training early. Accounting for these changes led us to find a better set of hyperparameters from those described in the technical report. In particular, for DPO training we found that training for 1 epoch with `beta=0.01` was sufficient to achieve comparable performance to `zephyr-7b-beta` (vs. 3 epochs with `beta=0.1`).

See below for commands to train these models using either DeepSpeed ZeRO-3 or LoRA.

@@ -34,11 +34,11 @@ ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_con

P.S. Using Flash Attention also allows you to drastically increase the batch size (x2 in my case)

-Train without flash-attention:
+Train without flash-attention (i.e. via PyTorch's scaled dot product attention):
```````shell
# Step 1 - SFT
-ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_qlora.yaml --load_in_4bit=true --use_flash_attention_2=false
+ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_qlora.yaml --load_in_4bit=true --attn_implementation=sdpa

# Step 2 - DPO
-ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_qlora.yaml --use_flash_attention_2=false
+ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_qlora.yaml --attn_implementation=sdpa
```````
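
If you prefer to pin the attention backend in the recipe config rather than override it on the command line, here is a minimal sketch of the model-arguments block (hypothetical excerpt; assumes the YAML field accepts the same values as the `--attn_implementation` override):

```yaml
# Hypothetical model-arguments excerpt
model_name_or_path: mistralai/Mistral-7B-v0.1
model_revision: main
torch_dtype: bfloat16
attn_implementation: sdpa  # or flash_attention_2 if your hardware and install support it
```
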
2 changes: 1 addition & 1 deletion recipes/zephyr-7b-beta/dpo/config_full.yaml
@@ -15,7 +15,7 @@ preprocessing_num_workers: 12
bf16: true
beta: 0.01
do_eval: true
-evaluation_strategy: steps
+eval_strategy: steps
eval_steps: 100
gradient_accumulation_steps: 2
gradient_checkpointing: true
4 changes: 2 additions & 2 deletions recipes/zephyr-7b-beta/dpo/config_qlora.yaml
@@ -1,7 +1,7 @@
# Model arguments
model_name_or_path: alignment-handbook/zephyr-7b-sft-qlora
torch_dtype: bfloat16
-use_flash_attention_2: true
+attn_implementation: flash_attention_2

# LoRA arguments
use_peft: true
@@ -31,7 +31,7 @@ preprocessing_num_workers: 12
bf16: true
beta: 0.01
do_eval: true
-evaluation_strategy: steps
+eval_strategy: steps
eval_steps: 100
gradient_accumulation_steps: 4
gradient_checkpointing: true
4 changes: 2 additions & 2 deletions recipes/zephyr-7b-beta/sft/config_full.yaml
@@ -2,7 +2,7 @@
model_name_or_path: mistralai/Mistral-7B-v0.1
model_revision: main
torch_dtype: bfloat16
-use_flash_attention_2: true
+attn_implementation: flash_attention_2

# Data training arguments
chat_template: "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n' + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"
@@ -16,7 +16,7 @@ preprocessing_num_workers: 12
# SFT trainer config
bf16: true
do_eval: true
-evaluation_strategy: epoch
+eval_strategy: epoch
gradient_accumulation_steps: 1
gradient_checkpointing: true
gradient_checkpointing_kwargs:
4 changes: 2 additions & 2 deletions recipes/zephyr-7b-beta/sft/config_qlora.yaml
@@ -2,7 +2,7 @@
model_name_or_path: mistralai/Mistral-7B-v0.1
model_revision: main
torch_dtype: bfloat16
-use_flash_attention_2: true
+attn_implementation: flash_attention_2

# LoRA arguments
load_in_4bit: true
@@ -31,7 +31,7 @@ preprocessing_num_workers: 12
# SFT trainer config
bf16: true
do_eval: true
-evaluation_strategy: epoch
+eval_strategy: epoch
gradient_accumulation_steps: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
2 changes: 1 addition & 1 deletion recipes/zephyr-7b-gemma/dpo/config_full.yaml
@@ -15,7 +15,7 @@ preprocessing_num_workers: 12
bf16: true
beta: 0.05
do_eval: true
-evaluation_strategy: steps
+eval_strategy: steps
eval_steps: 100
gradient_accumulation_steps: 8
gradient_checkpointing: true
4 changes: 2 additions & 2 deletions recipes/zephyr-7b-gemma/sft/config_full.yaml
@@ -3,7 +3,7 @@ model_name_or_path: google/gemma-7b
model_revision: main
tokenizer_name_or_path: philschmid/gemma-tokenizer-chatml # Custom tokenizer with <|im_start|> and <|im_end|> tokens
torch_dtype: bfloat16
-use_flash_attention_2: true
+attn_implementation: flash_attention_2

# Data training arguments
dataset_mixer:
@@ -19,7 +19,7 @@ dataset_kwargs:
add_special_tokens: false # We already wrap <bos> and <eos> in the chat template
append_concat_token: false # No need to add <eos> across samples
do_eval: true
-evaluation_strategy: epoch
+eval_strategy: epoch
gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
9 changes: 4 additions & 5 deletions scripts/README.md
@@ -28,7 +28,7 @@ ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_con
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/fsdp+qlora.yaml --num_processes={num_gpus} scripts/run_{task}.py recipes/{model_name}/{task}/config_qlora.yaml --torch_dtype=bfloat16 --bnb_4bit_quant_storage=bfloat16
```

-Here `{task}` refers to the type of training you wish to run. Currently the following tasks are supported:
+Here `{task}` refers to the type of training you wish to run. Currently, the following tasks are supported:
* continued pretraining `cpt` (note that `cpt` is only present in the `gpt-nl` example recipe)
* supervised finetuning `sft`
* direct preference optimisation `dpo`
@@ -54,8 +54,7 @@ ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_con
```

## Logging with Weights and Biases
-
-By default all training metrics are logged with TensorBoard. If you have a [Weights and Biases](https://wandb.ai/site) account and are logged in, you can view the training metrics by appending `--report_to=wandb`, e.g.
+By default, all training metrics are logged with TensorBoard. If you have a [Weights and Biases](https://wandb.ai/site) account and are logged in, you can view the training metrics by appending `--report_to=wandb`, e.g.

```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_{task}.py recipes/{model_name}/{task}/config_full.yaml --report_to=wandb
@@ -120,7 +119,7 @@ If you format your dataset in the same way, our training scripts should work out
We recommend benchmarking chat models on:

* [MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench): a multi-turn benchmark spanning 80 dialogues and 10 domains.
-* [AlpacaEval](https://github.com/tatsu-lab/alpaca_eval): a single-turn benchmark which evaluates the helpfulness of chat and instruct models against `text-davinci-003`.
+* [AlpacaEval](https://github.com/tatsu-lab/alpaca_eval): a single-turn benchmark that evaluates the helpfulness of chat and instruct models against `text-davinci-003`.

For both benchmarks, we have added support for the [Zephyr chat template](https://huggingface.co/alignment-handbook/zephyr-7b-sft-full/blob/ac6e600eefcce74f5e8bae1035d4f66019e93190/tokenizer_config.json#L30) (which is the default produced by our scripts), so you can evaluate models produced by our scripts as follows:

@@ -137,6 +136,6 @@ For both benchmarks, we have added support for the [Zephyr chat template](https:
* Next, update the [config name](https://github.com/tatsu-lab/alpaca_eval/blob/2daa6e11b194653043ca74f735728dc068e04aae/src/alpaca_eval/models_configs/zephyr-7b-beta/configs.yaml#L1) and [Hub model ID](https://github.com/tatsu-lab/alpaca_eval/blob/2daa6e11b194653043ca74f735728dc068e04aae/src/alpaca_eval/models_configs/zephyr-7b-beta/configs.yaml#L5) to match your model name.
* Follow the steps to evaluate your model [here](https://github.com/tatsu-lab/alpaca_eval/tree/main#evaluating-a-model).

-Note that MT-Bench and AlpacaEval rely on LLMs like GPT-4 to judge the quality of the model responses, and thus the ranking exhibit various biases including a preference for models distilled from GPTs. For that reason, we also recommend submitting your best models for human evaluation in:
+Note that MT-Bench and AlpacaEval rely on LLMs like GPT-4 to judge the quality of the model responses, and thus the ranking exhibits various biases including a preference for models distilled from GPTs. For that reason, we also recommend submitting your best models for human evaluation in:

* [Chatbot Arena](https://chat.lmsys.org): a live, human evaluation of chat models in head-to-head comparisons.