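FastChat is an open platform for training, serving, and evaluating chatbots based on large language models (LLMs). Through Hugging Face transformers, it supports multi-node, multi-GPU LLM fine-tuning with DeepSpeed or FSDP. The steps below describe how to fine-tune the Yuan2.0 model with FastChat.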
- docker pull nvcr.io/nvidia/pytorch:23.08-py3
- docker run -v HOST_WORK_PATH:/workspace/ --ipc=host --gpus all -p host-port:container-port --shm-size='64g' -it nvcr.io/nvidia/pytorch:23.08-py3 /bin/bash
- git clone https://github.com/lm-sys/FastChat.git
- cd FastChat
- pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
- pip install -e ".[model_worker,webui,train]"
- pip install deepspeed "bitsandbytes>=0.39.0" "transformers==4.31.0" plotly openai
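- Obtain the Yuan2.0 model files in Hugging Face format.
- Prepare the data: FastChat is built to support chatbot training and serving, so its standard dataset format is single-turn or multi-turn conversation data.
(1) For a custom dataset, use the data format FastChat requires, for example a JSON file like the ones below that defines single-turn or multi-turn conversations.
(2) To reuse an existing instruction dataset, convert it into single-turn conversations; the English or Chinese alpaca-data datasets can be converted this way.
(3) Use an open-source multi-turn conversation dataset, such as the 0.8M user/assistant multi-turn conversation dataset released by the BELLE project.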
# multi-turn example
[
{
"id": "identity_0",
"conversations": [
{
"from": "human",
"value": "Who are you?"
},
{
"from": "gpt",
"value": "I am Vicuna, a language model trained by researchers from Large Model Systems Organization (LMSYS)."
},
{
"from": "human",
"value": "Have a nice day!"
},
{
"from": "gpt",
"value": "You too!"
}
]
},
{
"id": "identity_1",
"conversations": [
{
"from": "human",
"value": "Who are you?"
},
{
"from": "gpt",
"value": "My name is Vicuna, and I'm a language model developed by Large Model Systems Organization (LMSYS)."
}
]
}
]
# single turn example
[
{
"id": "1",
"conversations": [
{
"from": "human",
"value": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:"
},
{
"from": "gpt",
"value": "1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."
}
]
},
{
"id": "2",
"conversations": [
{
"from": "human",
"value": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat are the three primary colors?\n\n### Response:"
},
{
"from": "gpt",
"value": "The three primary colors are red, blue, and yellow."
}
]
}
]
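Option (2) above, converting an Alpaca-style instruction dataset into single-turn conversations, could be sketched roughly as follows. The no-input prompt wording is copied from the single-turn example; the wording for records with an `input` field is assumed to be the standard Alpaca prompt, and the function names `alpaca_to_conversation` and `convert_file` are hypothetical helpers, not part of FastChat.

```python
import json

# Prompt wording taken from the single-turn example above (records without input).
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)
# Assumption: standard Alpaca wording for records that carry an "input" field.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)


def alpaca_to_conversation(record, sample_id):
    """Turn one Alpaca record into one single-turn FastChat sample."""
    if record.get("input"):
        prompt = PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=record["instruction"])
    return {
        "id": str(sample_id),
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": record["output"]},
        ],
    }


def convert_file(src_path, dst_path):
    """Convert a whole Alpaca JSON file; returns the number of samples written."""
    with open(src_path, encoding="utf-8") as f:
        records = json.load(f)
    samples = [alpaca_to_conversation(r, i + 1) for i, r in enumerate(records)]
    with open(dst_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese text readable in the output file.
        json.dump(samples, f, ensure_ascii=False, indent=2)
    return len(samples)
```

Calling `convert_file` with the path of a downloaded Alpaca JSON file and an output path of your choice produces a file in the single-turn layout shown above.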
In the fastchat/train/train_mem.py script, change the import:
- from fastchat.train.train import train
+ from fastchat.train.train_yuan2 import train
In the fastchat/train/train_lora.py script, change the import:
-from fastchat.train.train import (
- DataArguments,
- ModelArguments,
- make_supervised_data_module,
-)
+from fastchat.train.train_yuan2 import (
+ DataArguments,
+ ModelArguments,
+ make_supervised_data_module,
+)
Copy the special-token registration from the fastchat/train/train_yuan2.py script into train_lora.py:
+tokenizer.add_tokens(
+ [
+ "<eod>",
+ "<sep>",
+ "<pad>",
+ "<mask>",
+ "<predict>",
+ "<FIM_SUFFIX>",
+ "<FIM_PREFIX>",
+ "<FIM_MIDDLE>",
+ "<commit_before>",
+ "<commit_msg>",
+ "<commit_after>",
+ "<jupyter_start>",
+ "<jupyter_text>",
+ "<jupyter_code>",
+ "<jupyter_output>",
+ "<empty_output>",
+ ],
+ special_tokens=True,
+ )
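FastChat already includes the yuan2 template code shown below. Nothing here needs to be changed, but developers with special requirements can adjust or replace these templates.
# yuan template information
The fastchat/conversation.py script contains the chat template customized for Yuan2.0: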
# Yuan2.0 chat template
# source: https://huggingface.co/IEITYuan/Yuan2-2B-Janus-hf/blob/main/tokenizer_config.json#L6
register_conv_template(
    Conversation(
        name="yuan2",
        roles=("user", "assistant"),
        sep_style=SeparatorStyle.YUAN2,
        sep="<sep>",
        sep2="\n",
        stop_token_ids=[
            77185,
        ],  # "<eod>"
        stop_str="<eod>",
    )
)
The fastchat/model/model_adapter.py script contains the functions that load the Yuan2.0 chat model and tokenizer:
class Yuan2Adapter(BaseModelAdapter):
    """The model adapter for Yuan2.0"""

    def match(self, model_path: str):
        return "yuan2" in model_path.lower()

    def load_model(self, model_path: str, from_pretrained_kwargs: dict):
        revision = from_pretrained_kwargs.get("revision", "main")
        # from_pretrained_kwargs["torch_dtype"] = torch.bfloat16
        tokenizer = LlamaTokenizer.from_pretrained(
            model_path,
            add_eos_token=False,
            add_bos_token=False,
            eos_token='<eod>',
            eod_token='<eod>',
            sep_token='<sep>',
            revision=revision,
        )
        tokenizer.add_tokens(
            ['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>', '<commit_before>',
             '<commit_msg>', '<commit_after>', '<jupyter_start>', '<jupyter_text>', '<jupyter_code>',
             '<jupyter_output>', '<empty_output>'], special_tokens=True)
        model = AutoModelForCausalLM.from_pretrained(
            model_path,
            # device_map='auto',
            trust_remote_code=True,
            **from_pretrained_kwargs
        )
        return model, tokenizer

    def get_default_conv_template(self, model_path: str) -> Conversation:
        return get_conv_template("yuan2")
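The fastchat/model/model_yuan2.py script contains the default generation settings for the Yuan2.0 chat model.
A reference command for full-parameter fine-tuning: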
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 --master_port=20001 fastchat/train/train_mem.py \
--model_name_or_path path-to-huggingface-models \
--trust_remote_code True \
--data_path ./data/alpaca_data_zh_conversion.json \
--bf16 True \
--output_dir ./test_yuan2b_full \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1200 \
--save_total_limit 10 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 1024 \
--gradient_checkpointing True \
--lazy_preprocess True \
--deepspeed playground/zero2_ds_woloading.json \
--efficient_loss False \
--split_example_loss True \
--last_response_loss False
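--model_max_length sets the maximum length, in tokens, of a single sample during fine-tuning.
--efficient_loss, --split_example_loss, and --last_response_loss select one of three ways of computing the loss on multi-turn conversations: (1) efficient_loss computes the loss over the assistant's reply portions; (2) last_response_loss computes the loss only over the assistant's reply in the final turn; (3) split_example_loss splits a multi-turn conversation into several samples and computes the loss over the final assistant reply of each sample. Exactly one of the three must be True; the other two must be False.
For the remaining parameters, refer to the FastChat source code and the transformers documentation.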
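One way to read the three loss options described above is sketched below. It is an illustration of that description only, not the implementation in train_yuan2.py; the function name `loss_targets` and its return shape (a list of samples, each paired with the indices of the turns that contribute to the loss) are hypothetical.

```python
def loss_targets(conversations, efficient_loss=False,
                 split_example_loss=False, last_response_loss=False):
    """Return a list of (turns, loss_turn_indices) training samples."""
    # The three flags are mutually exclusive: exactly one must be True.
    if sum([efficient_loss, split_example_loss, last_response_loss]) != 1:
        raise ValueError("exactly one loss mode must be True")

    # Indices of the assistant replies inside the conversation.
    gpt_idx = [i for i, turn in enumerate(conversations) if turn["from"] == "gpt"]

    if efficient_loss:
        # One sample; every assistant reply contributes to the loss.
        return [(conversations, gpt_idx)]
    if last_response_loss:
        # One sample; only the final assistant reply contributes.
        return [(conversations, gpt_idx[-1:])]
    # split_example_loss: one sample per assistant reply, each truncated so
    # that the reply is the last turn and the only one contributing.
    return [(conversations[: i + 1], [i]) for i in gpt_idx]
```

For a two-round dialogue (human, gpt, human, gpt), this sketch yields one sample with two target turns under efficient_loss, one sample with one target turn under last_response_loss, and two samples with one target turn each under split_example_loss.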
- Reference ZeRO-2 config file
{
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"overlap_comm": false,
"contiguous_gradients": true
},
"bf16": {
"enabled": "auto",
"loss_scale": 0,
"initial_scale_power": 16,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"flops_profiler": {
"enabled": true,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
},
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true
}
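The "auto" entries in this config are filled in by the Hugging Face Trainer from the command-line arguments, so the effective global batch size is the per-device batch size times the gradient accumulation steps times the number of GPUs. The small sketch below works through that arithmetic for the full fine-tuning command above; the helper names are hypothetical, and the sample count of 52,000 is only an illustrative assumption.

```python
import math


def global_batch_size(per_device_batch, grad_accum_steps, num_gpus):
    """Effective number of samples consumed per optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus


def approx_steps_per_epoch(num_samples, global_batch):
    """Rough optimizer steps per epoch (the Trainer's exact rounding may differ)."""
    return math.ceil(num_samples / global_batch)


# Values from the full fine-tuning command above:
# --per_device_train_batch_size 4, --gradient_accumulation_steps 4, 8 GPUs.
full_batch = global_batch_size(4, 4, 8)
print(full_batch)                                  # 128
print(approx_steps_per_epoch(52_000, full_batch))  # 407
```

Under these assumptions one epoch is roughly 400 optimizer steps, which is worth keeping in mind when choosing --save_steps.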
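Full-parameter fine-tuning of a large model is expensive. Parameter-efficient fine-tuning methods such as LoRA and QLoRA instead add a small number of extra parameters to the model and train only those to improve its performance.
LoRA is essentially a reparameterization method: it adds a side branch to a weight matrix and fine-tunes the model through that branch. Adding branches to only some of the weight matrices reduces computation, and updating only the branch parameters reduces GPU memory usage and parallel communication volume.
QLoRA builds on LoRA by quantizing the model weights to 4 bits and then quantizing the scale parameters once more (double quantization) to save further GPU memory. Note that QLoRA typically adds branches to more weight matrices than LoRA does; it does not speed up computation and in fact costs some efficiency.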
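To make the savings concrete, the sketch below counts the parameters of one LoRA branch, which is the product of two low-rank matrices B (d_out x r) and A (r x d_in), and estimates the storage of the base weights at different bit widths. The layer size 2048 x 2048, rank 8, and the 2-billion parameter count are illustrative assumptions, and the memory estimate ignores quantization constants, activations, and optimizer state.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in one LoRA branch: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the frozen weight matrix the branch is attached to."""
    return d_in * d_out


def weight_bytes(num_params, bits):
    """Storage for the weights alone at the given bit width."""
    return num_params * bits // 8


# Hypothetical 2048 x 2048 projection with rank 8.
branch = lora_trainable_params(2048, 2048, 8)
print(branch)                            # 32768
print(branch / full_params(2048, 2048))  # 0.0078125

# Hypothetical 2-billion-parameter model: bf16 weights vs. 4-bit weights.
print(weight_bytes(2_000_000_000, 16))   # 4000000000
print(weight_bytes(2_000_000_000, 4))    # 1000000000
```

In this example the branch holds under 1% of the parameters of the matrix it adapts, and 4-bit storage cuts the weight footprint to a quarter of bf16.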
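With FastChat, Yuan2.0 can be fine-tuned with LoRA or QLoRA as follows.
- Use the train_lora.py script: torchrun --nproc_per_node=8 --master_port=XXXX fastchat/train/train_lora.py .....
- Use --lora_target_modules to choose which modules receive LoRA branches. One or more of "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj" can be given; the default is "q_proj", "v_proj".
- Use --lora_r to set the rank of the LoRA matrices.
- Use --q_lora (True or False) to choose whether to fine-tune with QLoRA.
- For multi-turn conversations, the loss is computed the same way as in full fine-tuning: choose one of the three modes defined for Yuan2.0.
A reference script for parameter-efficient fine-tuning: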
CUDA_VISIBLE_DEVICES=0 python fastchat/train/train_lora.py \
--model_name_or_path hf-to-yuan-path \
--trust_remote_code True \
--data_path ./data/alpaca-data-conversation.json \
--bf16 True \
--output_dir ./checkpoints_yuan2_2b_lora \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 16 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1200 \
--save_total_limit 10 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--model_max_length 512 \
--gradient_checkpointing True \
--lazy_preprocess True \
--q_lora True \
--efficient_loss False \
--split_example_loss True \
--last_response_loss False
Fine-tuning method | Sequence length | Model | Precision (load/compute) | GPU | Batch size (micro/global) | GPU memory (per GPU) | Time per epoch |
---|---|---|---|---|---|---|---|
ds_zero2_full | 2048 | Yuan-2 2B | bf16/bf16 | 8*L20 | 1/128 | 16G | 1.68h |
ds_zero3_lora | 2048 | Yuan-2 51B | bf16/bf16 | 8*L20 | 1/128 | 43G | 23h |
ds_zero3_lora | 2048 | Yuan-2 102B | bf16/bf16 | 8*L20 | 1/128 | 45G | 47h |
ds_zero2_full | 1024 | Yuan-2 2B | bf16/bf16 | 8*L20 | 1/128 | 15G | 1.3h |
ds_zero3_lora | 1024 | Yuan-2 51B | bf16/bf16 | 8*L20 | 1/128 | 43G | 18h |
ds_zero3_lora | 1024 | Yuan-2 102B | bf16/bf16 | 8*L20 | 1/128 | 42G | 40h |
Qlora | 1024 | Yuan-2 2B | int4/bf16 | 1*L20 | 1/16 | 4.5G | 3.4h |
The tests above use 52K alpaca samples converted to single-turn conversations; the time per epoch is the time needed to fine-tune for one epoch.
A chat model fine-tuned from Yuan2.0 can be easily deployed and served with FastChat.
- Command line
Deploy the chat model on N GPUs:
python3 -m fastchat.serve.cli --model PATH-TO_CHATMODELS --num-gpus N
- Web GUI
python3 -m fastchat.serve.controller --host 0.0.0.0 &
python3 -m fastchat.serve.model_worker --model-path PATH-TO_CHATMODELS --host 0.0.0.0 &
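# add --gpus 0,1,2,3 --num-gpus 4 to model_worker to load the model on 4 GPUs for inference
# MAPPED_PORT is the container port published with -p in the docker run command
python3 -m fastchat.serve.gradio_web_server --host 0.0.0.0 --port MAPPED_PORT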
- OpenAI-Compatible RESTful APIs
python3 -m fastchat.serve.controller --host 0.0.0.0 &
CUDA_VISIBLE_DEVICES=0 python3 -m fastchat.serve.model_worker --model-path PATH-TO_CHATMODELS --host 0.0.0.0 --port 31000 --worker-address http://0.0.0.0:31000 &
python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 1234
# client call
import openai
openai.api_key = "EMPTY"
openai.base_url = "http://0.0.0.0:1234/v1/"
model = "yuan2-2B"
prompt = "Once upon a time"
# create a completion
completion = openai.completions.create(model=model, prompt=prompt, max_tokens=64)
# print the completion
print(prompt + completion.choices[0].text)
# create a chat completion
completion = openai.chat.completions.create(
model=model,
messages=[{"role": "user", "content": "Hello! What is your name?"}]
)
# print the completion
print(completion.choices[0].message.content)
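If the openai package is not available, the same chat endpoint can in principle be called with the Python standard library alone. The sketch below assumes the API server is reachable at localhost on port 1234, as in the openai_api_server command above; the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

# Assumption: openai_api_server from the command above, reached via localhost.
BASE_URL = "http://localhost:1234/v1"


def build_chat_request(model, user_message, base_url=BASE_URL):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer EMPTY",  # same placeholder key as above
        },
        method="POST",
    )


def chat(model, user_message, base_url=BASE_URL, timeout=60):
    """Send one user message and return the assistant's reply text."""
    req = build_chat_request(model, user_message, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the controller, model worker, and API server running, `chat("yuan2-2B", "Hello! What is your name?")` should return the reply text, provided the model name matches the one registered by the worker.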
The OpenAI-compatible RESTful APIs can also be used from LangChain to build LLM-based applications.