Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Qwen - supervised finetuning script and guide for SFT. #660

Open
1 task
irthomasthomas opened this issue Feb 28, 2024 · 1 comment
Open
1 task

Qwen - supervised finetuning script and guide for SFT. #660

irthomasthomas opened this issue Feb 28, 2024 · 1 comment
Labels
AI-Chatbots Topics related to advanced chatbot platforms integrating multiple AI models chat-templates llm prompt templates for chat models Code-Interpreter OpenAI Code-Interpreter data-validation Validating data structures and formats dataset public datasets and embeddings finetuning Tools for finetuning of LLMs e.g. SFT or RLHF MachineLearning ML Models, Training and Inference shell-script shell scripting in Bash, ZSH, POSIX etc

Comments

@irthomasthomas
Copy link
Owner

Example - Qwen

DESCRIPTION:
Here we provide a very simple script for supervised finetuning, which is revised from the training script in Fastchat. The script is used to finetune Qwen with Hugging Face Trainer. You can check the script here. This script for supervised finetuning (SFT) has the following features:

  • Support single-GPU and multi-GPU training;
  • Support full-parameter tuning, LoRA, and Q-LoRA.

In the following, we introduce more details about the usage of the script.

Installation
Before you start, make sure you have installed the following packages:

pip install peft deepspeed optimum accelerate

Data Preparation
For data preparation, we advise you to organize the data in a jsonl file, where each line is a dictionary as demonstrated below:

{
    "type": "chatml",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "Tell me something about large language models."
        },
        {
            "role": "assistant",
            "content": "Large language models are a type of language model that is trained on a large corpus of text data. They are capable of generating human-like text and are used in a variety of natural language processing tasks..."
        }
    ],
    "source": "unknown"
}
{
    "type": "chatml",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is your name?"
        },
        {
            "role": "assistant",
            "content": "My name is Qwen."
        }
    ],
    "source": "self-made"
}

Above are two examples of each data sample in the dataset. Each sample is a JSON object with the following fields: type, messages, and source. messages is required while the others are optional for you to label your data format and data source. The messages field is a list of JSON objects, each of which has two fields: role and content. role can be system, user, or assistant. content is the text of the message. source is the source of the data, which can be self-made, alpaca, open-hermes, or any other string.

To make the jsonl file, you can use json to save a list of dictionaries to the jsonl file:

import json

with open('data.jsonl', 'w') as f:
    for sample in samples:
        f.write(json.dumps(sample) + '\n')

Quickstart
For you to start finetuning quickly, we directly provide a shell script for you to run without paying attention to details. You need different hyperparameters for different types of training, e.g., single-GPU / multi-GPU training, full-parameter tuning, LoRA, or Q-LoRA.

cd examples/sft
bash finetune.sh -m <model_path> -d <data_path> --deepspeed <config_path> [--use_lora True] [--q_lora True]

Specify the <model_path> for your model, <data_path> for your data, and <config_path> for your deepspeed configuration. If you use LoRA or Q-LoRA, just add --use_lora True or --q_lora True based on your requirements. This is the simplest way to start finetuning. If you want to change more hyperparameters, you can dive into the script and modify those parameters.

Advanced Usages
In this section, we introduce the details of the scripts, including the core python script as well as the corresponding shell script.

Shell Script
Before we introduce the python code, we provide a brief introduction to the shell script with commands. We provide some guidance inside the shell script and here we take finetune.sh as an example.

To set up the environment variables for distributed training (or single-GPU training), specify the following variables: GPUS_PER_NODE, NNODES, NODE_RANK, MASTER_ADDR, and MASTER_PORT. No need to worry too much about them as we provide the default settings for you. In the command, you can pass in the argument -m and -d to specify the model path and data path, respectively. You can also pass in the argument --deepspeed to specify the deepspeed configuration file. We provide two configuration files for ZeRO2 and ZeRO3, and you can choose one based on your requirements. In most cases, we recommend using ZeRO3 for multi-GPU training except for Q-LoRA, where we recommend using ZeRO2.

There are a series of hyperparameters to tune. Passing in --bf16 or --fp16 to specify the precision for mixed precision training. The other significant hyperparameters include:

  • --output_dir: the path of your output models or adapters.
  • --num_train_epochs: the number of training epochs.
  • --gradient_accumulation_steps: the number of gradient accumulation steps.
  • --per_device_train_batch_size: the batch size per GPU for training, and the total batch size is equal to per_device_train_batch_size * number_of_gpus * gradient_accumulation_steps.
  • --learning_rate: the learning rate.
  • --warmup_steps: the number of warmup steps.
  • --lr_scheduler_type: the type of learning rate scheduler.
  • --weight_decay: the value of weight decay.
  • --adam_beta2: the value of in Adam.
  • --model_max_length: the maximum sequence length.
  • --use_lora: whether to use LoRA. Adding --q_lora can enable Q-LoRA.
  • --gradient_checkpointing: whether to use gradient checkpointing.

URL: https://qwen.readthedocs.io/en/latest/training/SFT/example.html

Suggested labels

@irthomasthomas irthomasthomas added AI-Chatbots Topics related to advanced chatbot platforms integrating multiple AI models Code-Interpreter OpenAI Code-Interpreter data-validation Validating data structures and formats MachineLearning ML Models, Training and Inference shell-script shell scripting in Bash, ZSH, POSIX etc labels Feb 28, 2024
@irthomasthomas
Copy link
Owner Author

Related issues

#324: bigcode/tiny_starcoder_py · Hugging Face

### DetailsSimilarity score: 0.9 > **Note:** > > [bigcode/tiny_starcoder_py · Hugging Face](https://huggingface.co/bigcode/tiny_starcoder_py) > > TinyStarCoderPy > > This is a 164M parameters model with the same architecture as StarCoder (8k context length, MQA & FIM). It was trained on the Python data from StarCoderData for ~6 epochs which amounts to 100B tokens. > > Use > > Intended use > > The model was trained on GitHub code, to assist with some tasks like Assisted Generation. For pure code completion, we advise using our 15B models StarCoder or StarCoderBase. > > Generation > > ```python > # pip install -q transformers > from transformers import AutoModelForCausalLM, AutoTokenizer > > checkpoint = "bigcode/tiny_starcoder_py" > device = "cuda" # for GPU usage or "cpu" for CPU usage > > tokenizer = AutoTokenizer.from_pretrained(checkpoint) > model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device) > > inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device) > outputs = model.generate(inputs) > print(tokenizer.decode(outputs[0])) > ``` > > Fill-in-the-middle > > Fill-in-the-middle uses special tokens to identify the prefix/middle/suffix part of the input and output: > > ```python > input_text = "def print_one_two_three():\n print('one')\n \n print('three')" > inputs = tokenizer.encode(input_text, return_tensors="pt").to(device) > outputs = model.generate(inputs) > print(tokenizer.decode(outputs[0])) > ``` > > Training > > Model > > - Architecture: GPT-2 model with multi-query attention and Fill-in-the-Middle objective > - Pretraining steps: 50k > - Pretraining tokens: 100 billion > - Precision: bfloat16 > > Hardware > > - GPUs: 32 Tesla A100 > - Training time: 18 hours > > Software > > - Orchestration: Megatron-LM > - Neural networks: PyTorch > - BP16 if applicable: apex > > License > > The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/bigcode/tiny_starcoder_py/blob/main/LICENSE). > > #### Suggested labels > > - { "key": "llm-pretraining", "value": "Information related to the pretraining process of Large Language Models" }

#167: Model training code from NousResearch/StripedHyenaTrainer

### DetailsSimilarity score: 0.89 - [ ] [NousResearch/StripedHyenaTrainer](https://github.com/NousResearch/StripedHyenaTrainer)
This is the training code used to train StripedHyena-Nous-7B.

First, tokenize your data

python tokenization.py
--dataset your-super-cool-sharegpt-format-dataset
--tokenizer togethercomputer/StripedHyena-Hessian-7B
--output tokenized
--num-proc 32
--pad-to-length 4096
--truncate
Make sure you have done accelerate config -- we used the provided DeepSpeed configuration. Then, train!

accelerate launch finetune.py
--model togethercomputer/StripedHyena-Hessian-7B
--dataset tokenized
--output-dir output
--epochs 4
--batch-size 12
--gradient-accumulate-every 12
--warmup-steps 350
--learning-rate 0.000004
--lr-schedule linear
--weight-decay 0.1
--checkpointing-steps 1000
--no-decay poles residues
The --no-decay option disables weight decay on only the specified parameters. For StripedHyena, we've found that disabling weight decay on the Hyena operator's poles and residues parameters improves performance. There is also an option --frozen that can completely freeze select parameter groups.

#389: AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT

### DetailsSimilarity score: 0.89 - [ ] [AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT](https://forum.opennmt.net/t/awq-quantization-support-new-generic-converter-for-all-hf-llama-like-models/5569)

Quantization and Acceleration

We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not. Here's an example of the syntax:

python tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors
  • TheBloke/Nous-Hermes-Llama2-AWQ: The name of the repository/model on the Hugging Face Hub.
  • output: Specifies the target directory and model name you want to save.
  • format: Optionally, you can save as safetensors.

For llama-like models, we download the tokenizer.model and generate a vocab file during the process. If the model is a AWQ quantized model, we will convert it to an OpenNMT-py AWQ quantized model.

After converting, you will need a config file to run translate.py or run_mmlu_opnenmt.py. Here's an example of the config:

transforms: [sentencepiece]

#### Subword
src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"
tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"

# Model info
model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt"

# Inference
# ...

When considering your priority:

  • For small model files to fit VRAM of your GPU, try AWQ, but it will be slow for large batch sizes.
  • AWQ models are faster than FP16 for batch size 1.

Please read more here: GitHub - casper-hansen/AutoAWQ

Important Note:

  • There are two AWQ toolkits (llm-awq and AutoAWQ) and AutoAWQ supports two flavors: GEMM / GEMV.
  • The original llm-awq from MIT is not maintained periodically, so we default to AutoAWQ.
  • If a model is tagged llm-awq on the HF hub, we use AutoAWQ/GEMV, which is compatible.

Offline Quantizer Script:

  • We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT.

Enjoy!


VS: Fast Inference with vLLM

Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows:

  • Batch size 1: 80.5 tokens/second
  • Batch size 60: 98 tokens/second, with GEMV being 20-25% faster.

This was with a GEMM model. To make a fair comparison, adjust the throughput for the step0 (prompt prefill) time.

Suggested labels

{ "key": "llm-quantization", "value": "Discussions and tools for handling quantized large language models" }

#431: awq llama quantization

### DetailsSimilarity score: 0.89 - [ ] [awq llama quantization](huggingface.co)

Quantization and Acceleration

We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not.

Model Conversion

Here's an example of the syntax for converting a model:

tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors
  • TheBloke/Nous-Hermes-Llama2-AWQ: The name of the repository/model on the Hugging Face Hub.
  • output: Specifies the target directory and model name you want to save.
  • format: Optionally, you can save as safetensors.

For llama-like models, we download the tokenizer.model and generate a vocab file during the process. If the model is a AWQ quantized model, we will convert it to an OpenNMT-py AWQ quantized model.

Config File

After converting, you will need a config file to run translate.py or run_mmlu_opnenmt.py. Here's an example of the config:

transforms: [sentencepiece]

Subword:
  src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"
  tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"

Model info:
  model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt"

Inference:
  # ...

Priority

When considering your priority:

  • For small model files to fit VRAM of your GPU, try AWQ, but it will be slow for large batch sizes.
  • AWQ models are faster than FP16 for batch size 1.
  • Read more: GitHub - casper-hansen/AutoAWQ

Important Note

  • There are two AWQ toolkits (llm-awq and AutoAWQ) and AutoAWQ supports two flavors: GEMM / GEMV.
  • The original llm-awq from MIT is not maintained periodically, so we default to AutoAWQ.
  • If a model is tagged llm-awq on the HF hub, we use AutoAWQ/GEMV, which is compatible.

Offline Quantizer Script

We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT.

vLLM Performance

Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows:

  • Batch size 1: 80.5 tokens/second
  • Batch size 60: 98 tokens/second, with GEMV being 20-25% faster.
  • This was with a GEMM model. To make a fair comparison, adjust the throughput for the step0 (prompt prefill) time.

Suggested labels

null

#383: deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face

### DetailsSimilarity score: 0.89 - [ ] [deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face](https://huggingface.co/deepseek-ai/deepseek-coder-5.7bmqa-base)

Deepseek Coder Introduction

Deepseek Coder is a series of code language models, each trained from scratch on 2T tokens with a composition of 87% code and 13% natural language in both English and Chinese. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus with a window size of 16K and an extra fill-in-the-blank task, supporting project-level code completion and infilling. Deepseek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks.

Key Features

  • Massive Training Data: Trained from scratch on 2T tokens, including 87% code and 13% linguistic data in both English and Chinese languages.
  • Highly Flexible & Scalable: Offered in model sizes of 1.3B, 5.7B, 6.7B, and 33B, enabling users to choose the setup most suitable for their requirements.
  • Superior Model Performance: State-of-the-art performance among publicly available code models on HumanEval, MultiPL-E, MBPP, DS-1000, and APPS benchmarks.
  • Advanced Code Completion Capabilities: A window size of 16K and a fill-in-the-blank task, supporting project-level code completion and infilling tasks.

Model Summary

How to Use

This section provides examples of how to use the Deepseek Coder model for code completion, code insertion, and repository-level code completion tasks.

Code Completion

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = "#write a quick sort algorithm"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Code Insertion

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = """<|begin|>def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
<|hole|>
    if arr[i] < pivot:
        left.append(arr[i])
    else:
        right.append(arr[i])
return quick_sort(left) + [pivot] + quick_sort(right)<|end|>"""

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):])

Repository Level Code Completion

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = """#utils.py
import torch
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

def load_data():
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # Standardize the data
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Convert numpy data to PyTorch tensors
    X_train = torch.tensor(X_train, dtype=torch.float32)
    X_test = torch.tensor(X_test, dtype=torch.float32)
    y_train = torch.tensor(y_train, dtype=torch.int64)
    y_test = torch.tensor(y_test, dtype=torch.int64)

     return X_train, X_test, y_train, y_test

def evaluate_predictions(y_test, y_pred):
    return accuracy_score(y_test, y_pred)
#model.py
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class IrisClassifier(nn.Module):
    def __init__(self):
        super(IrisClassifier, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(4, 16),
            nn.ReLU(),
            nn.Linear(16, 3)
        )

    def forward(self, x):
        return self.fc(x)

    def train_model(self, X_train, y_train, epochs, lr, batch_size):
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(self.parameters(), lr=lr)

        # Create DataLoader for batches
        dataset = TensorDataset(X_train, y_train)
        dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

        for epoch in range(epochs):
            for batch_X, batch_y in dataloader:
                optimizer.zero_grad()
                outputs = self(batch_X)
                loss = criterion(outputs, batch_y)
                loss.backward()
                optimizer.step()

    def predict(self, X_test):
        with torch.no_grad():
            outputs = self(X_test)
            _, predicted = outputs.max(1)
        return predicted.numpy()
#main.py
from utils import load_data, evaluate_predictions
from model import IrisClassifier as Classifier

def main():
    # Model training and evaluation
"""

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=140)
print(tokenizer.decode(outputs[0]))

License

This code repository is licensed under the MIT License. The use of Deepseek Coder models is subject to the Model License. DeepSeek Coder supports commercial use.

See the LICENSE-MODEL for more details.

Contact

If you have any questions, please raise an issue or contact us at [email protected].

Suggested labels

{ "key": "llm-experiments", "value": "Experiments and results related to Large Language Models" } { "key": "AI-Chatbots", "value": "Topics related to advanced chatbot platforms integrating multiple AI models" }

#309: openai/human-eval: Code for the paper "Evaluating Large Language Models Trained on Code"

### DetailsSimilarity score: 0.88 - [ ] [openai/human-eval: Code for the paper "Evaluating Large Language Models Trained on Code"](https://github.com/openai/human-eval)

HumanEval: Hand-Written Evaluation Set

This is an evaluation harness for the HumanEval problem solving dataset described in the paper "Evaluating Large Language Models Trained on Code".

Installation

Make sure to use python 3.7 or later:

$ conda create -n codex python=3.7
$ conda activate codex
Check out and install this repository:

$ git clone https://github.com/openai/human-eval
$ pip install -e human-eval
Usage

This program exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. The execution call in execution.py is deliberately commented out to ensure users read this disclaimer before running code in a potentially unsafe manner. See the comment in execution.py for more information and instructions.

After following the above instructions to enable execution, generate samples and save them in the following JSON Lines (jsonl) format, where each sample is formatted into a single line like so:

{"task_id": "Corresponding HumanEval task ID", "completion": "Completion only without the prompt"}
We provide example_problem.jsonl and example_solutions.jsonl under data to illustrate the format and help with debugging.

Here is nearly functional example code (you just have to provide generate_one_completion to make it work) that saves generated completions to samples.jsonl.

from human_eval.data import write_jsonl, read_problems

problems = read_problems()

num_samples_per_task = 200
samples = [
dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
for task_id in problems
for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)
To evaluate the samples, run

$ evaluate_functional_correctness samples.jsonl
Reading samples...
32800it [00:01, 23787.50it/s]
Running test suites...
100%|...| 32800/32800 [16:11<00:00, 33.76it/s]
Writing results to samples.jsonl_results.jsonl...
100%|...| 32800/32800 [00:00<00:00, 42876.84it/s]
{'pass@1': ..., 'pass@10': ..., 'pass@100': ...}
This script provides more fine-grained information in a new file ending in <input_path>_results.jsonl. Each row now contains whether the completion passed along with the execution result which is one of "passed", "timed out", or "failed".

As a quick sanity-check, the example samples should yield 0.5 pass@1.

$ evaluate_functional_correctness data/example_samples.jsonl --problem_file=data/example_problem.jsonl
Reading samples...
6it [00:00, 3397.11it/s]
Running example suites...
100%|...| 6/6 [00:03<00:00, 1.96it/s]
Writing results to data/example_samples.jsonl_results.jsonl...
100%|...| 6/6 [00:00<00:00, 6148.50it/s]
{'pass@1': 0.4999999999999999}
Because there is no unbiased way of estimating pass@k when there are fewer samples than k, the script does not evaluate pass@k for these cases. To evaluate with other k values, pass --k=. For other options, see

$ evaluate_functional_correctness --help
However, we recommend that you use the default values for the rest.

Known Issues

While evaluation uses very little memory, you might see the following error message when the system is running out of RAM. Since this may cause some correct programs to fail, we recommend that you free some memory and try again.

malloc: can't allocate region
Citation

Please cite using the following bibtex entry:

@Article{chen2021codex,
title={Evaluating Large Language Models Trained on Code},
author={Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan and Henrique Ponde de Oliveira Pinto and Jared Kaplan and Harri Edwards and Yuri Burda and Nicholas Joseph and Greg Brockman and Alex Ray and Raul Puri and Gretchen Krueger and Michael Petrov and Heidy Khlaaf and Girish Sastry and Pamela Mishkin and Brooke Chan and Scott Gray and Nick Ryder and Mikhail Pavlov and Alethea Power and Lukasz Kaiser and Mohammad Bavarian and Clemens Winter and Philippe Tillet and Felipe Petroski Such and Dave Cummings and Matthias Plappert and Fotios Chantzis and Elizabeth Barnes and Ariel Herbert-Voss and William Hebgen Guss and Alex Nichol and Alex Paino and Nikolas Tezak and Jie Tang and Igor Babuschkin and Suchir Balaji and Shantanu Jain and William Saunders and Christopher Hesse and Andrew N. Carr and Jan Leike and Josh Achiam and Vedant Misra and Evan Morikawa and Alec Radford and Matthew Knight and Miles Brundage and Mira Murati and Katie Mayer and Peter Welinder and Bob McGrew and Dario Amodei and Sam McCandlish and Ilya Sutskever and Wojciech Zaremba},
year={2021},
eprint={2107.03374},
archivePrefix={arXiv},
primaryClass={cs.LG}
}

Suggested labels

{ "key": "llm-evaluation", "value": "Evaluating Large Language Models performance and behavior through human-written evaluation sets" }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
AI-Chatbots Topics related to advanced chatbot platforms integrating multiple AI models chat-templates llm prompt templates for chat models Code-Interpreter OpenAI Code-Interpreter data-validation Validating data structures and formats dataset public datasets and embeddings finetuning Tools for finetuning of LLMs e.g. SFT or RLHF MachineLearning ML Models, Training and Inference shell-script shell scripting in Bash, ZSH, POSIX etc
Projects
None yet
Development

No branches or pull requests

1 participant