From f8f894786ea40fce998d1b56d66b62523f5e6d43 Mon Sep 17 00:00:00 2001
From: shibing624
Date: Mon, 28 Aug 2023 00:02:16 +0800
Subject: [PATCH] update dpo pynb

---
 README.md                       |   2 +-
 run_training_dpo_pipeline.ipynb | 712 ++++++++++++++++++++++++++++++++
 run_training_pipeline.ipynb     |   9 +-
 3 files changed, 721 insertions(+), 2 deletions(-)
 create mode 100644 run_training_dpo_pipeline.ipynb

diff --git a/README.md b/README.md
index 7eee878..664a57f 100644
--- a/README.md
+++ b/README.md
@@ -52,7 +52,7 @@ Supervised Finetuning, RLHF(Reward Modeling and Reinforcement Learning) and DPO(
 - Stage 1: PT (Continue PreTraining), incremental pretraining: further pretrain the GPT model on large volumes of domain documents to inject domain knowledge (optional)
 - Stage 2: SFT (Supervised Fine-tuning): build an instruction fine-tuning dataset and instruction-tune the pretrained model to align it with instruction intent
 - Stage 3
-  - RLHF (Reinforcement Learning from Human Feedback): reinforcement learning on the language model from human feedback, in two steps: 1) RM (Reward Model): build a human preference ranking dataset and train a reward model that captures human preferences, mainly the "HHH" principle, i.e. "helpful, honest, harmless";2) RL (Reinforcement Learning): use the reward model to train the SFT model; the generation model updates its policy from rewards or penalties so that it produces higher-quality text better aligned with human preferences
+  - RLHF (Reinforcement Learning from Human Feedback): reinforcement learning on the language model from human feedback, in two steps: 1) RM (Reward Model): build a human preference ranking dataset and train a reward model that captures human preferences, mainly the "HHH" principle, i.e. "helpful, honest, harmless"; 2) RL (Reinforcement Learning): use the reward model to train the SFT model; the generation model updates its policy from rewards or penalties so that it produces higher-quality text better aligned with human preferences
   - [DPO (Direct Preference Optimization)](https://arxiv.org/pdf/2305.18290.pdf): directly optimizes the language model to control its behavior precisely, without complex reinforcement learning, and still learns human preferences effectively; compared with RLHF, DPO is easier to implement and train, and gives better results
diff --git a/run_training_dpo_pipeline.ipynb b/run_training_dpo_pipeline.ipynb
new file mode 100644
index 0000000..df50c4c
--- /dev/null
+++ b/run_training_dpo_pipeline.ipynb
@@ -0,0 +1,712 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "source": [
+    "# Training Pipeline\n",
+    "[run_training_pipeline.ipynb](https://github.com/shibing624/MedicalGPT/blob/main/run_training_pipeline.ipynb) | [Open In Colab](https://colab.research.google.com/github/shibing624/MedicalGPT/blob/main/run_training_pipeline.ipynb)"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type":
"markdown", + "metadata": { + "tags": [] + }, + "source": [ + "# Stage 1: Continue Pretraining\n", + "\n", + "第一阶段:PT(Continue PreTraining)增量预训练,在海量领域文本数据上二次预训练GPT模型,以注入领域知识\n", + "\n", + "| Stage 1: Continue Pretraining | [pretraining.py](https://github.com/shibing624/MedicalGPT/blob/main/pretraining.py) | [run_pt.sh](https://github.com/shibing624/MedicalGPT/blob/main/run_pt.sh) |" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 说明:\n", + "以下 notebook/colab 代码为了快速验证训练代码可用,我们使用了小size的生成模型和小样本数据集,实际使用时,需要使用更大的模型和数据集,以获得更好的效果。\n", + "\n", + "1. 生成模型:使用的是Bloom的`bigscience/bloomz-560m`\n", + "2. 数据集:PT阶段使用的是中文天龙八部小说部分文本和英文书籍部分文本,位于`data/pretrain`文件夹" + ] + }, + { + "cell_type": "markdown", + "source": [], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 配置运行环境\n", + "\n", + "本地执行可注释以下配置环境的命令,colab执行要打开注释,用于配置环境\n", + "\n", + "colab建议使用T4 GPU训练,设置方式:`代码执行程序 -> 更改运行时类型 -> 运行时类型:Python3,硬件加速器:GPU,GPU类型:T4 -> 保存`\n", + "\n", + "步骤:\n", + "1. 下载最新代码到本地\n", + "2. 安装依赖包\n", + "\n", + "依赖包如下,保证最新版本:\n", + "\n", + "```\n", + "loguru\n", + "transformers\n", + "sentencepiece\n", + "datasets\n", + "tensorboard\n", + "tqdm\n", + "peft\n", + "trl\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!git clone --depth 1 https://github.com/shibing624/MedicalGPT.git\n", + "%cd MedicalGPT\n", + "%ls\n", + "!pip install -r requirements.txt" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Stage1 咱们开始吧\n", + "\n", + "训练步骤如下:\n", + "\n", + "1. 确认训练集\n", + "2. 执行训练脚本\n", + "\n", + "训练脚本的执行逻辑如下:\n", + "1. 导入依赖包\n", + "2. 设置参数\n", + "3. 定义各函数并加载训练集\n", + "4. 加载模型和tokenizer\n", + "5. 开始训练并评估\n", + "6. 
Review the training results\n",
+    "\n",
+    "**You can adjust the parameters below to fit your GPU; the current values are configured for a single Colab T4 GPU (16 GB of memory)**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%ls ./data/pretrain/"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "!python pretraining.py \\\n",
+    " --model_type bloom \\\n",
+    " --model_name_or_path bigscience/bloomz-560m \\\n",
+    " --train_file_dir ./data/pretrain \\\n",
+    " --validation_file_dir ./data/pretrain \\\n",
+    " --per_device_train_batch_size 3 \\\n",
+    " --per_device_eval_batch_size 3 \\\n",
+    " --do_train \\\n",
+    " --do_eval \\\n",
+    " --use_peft True \\\n",
+    " --seed 42 \\\n",
+    " --fp16 \\\n",
+    " --max_train_samples 10000 \\\n",
+    " --max_eval_samples 10 \\\n",
+    " --num_train_epochs 1 \\\n",
+    " --learning_rate 2e-4 \\\n",
+    " --warmup_ratio 0.05 \\\n",
+    " --weight_decay 0.01 \\\n",
+    " --logging_strategy steps \\\n",
+    " --logging_steps 10 \\\n",
+    " --eval_steps 50 \\\n",
+    " --evaluation_strategy steps \\\n",
+    " --save_steps 500 \\\n",
+    " --save_strategy steps \\\n",
+    " --save_total_limit 3 \\\n",
+    " --gradient_accumulation_steps 1 \\\n",
+    " --preprocessing_num_workers 1 \\\n",
+    " --block_size 1024 \\\n",
+    " --output_dir outputs-pt-v1 \\\n",
+    " --overwrite_output_dir \\\n",
+    " --ddp_timeout 30000 \\\n",
+    " --logging_first_step True \\\n",
+    " --target_modules all \\\n",
+    " --lora_rank 8 \\\n",
+    " --lora_alpha 16 \\\n",
+    " --lora_dropout 0.05 \\\n",
+    " --torch_dtype float16 \\\n",
+    " --device_map auto \\\n",
+    " --report_to tensorboard \\\n",
+    " --ddp_find_unused_parameters False \\\n",
+    " --gradient_checkpointing True"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%ls -lh outputs-pt-v1"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Training results:\n",
+    "- When the model is trained with LoRA, the saved LoRA weights are `adapter_model.bin`, 
the LoRA config file is `adapter_config.json`; see `merge_peft_adapter.py` for how to merge them into the base model\n",
+    "- Logs are saved under `output_dir/runs` and can be viewed with TensorBoard, started as follows: `tensorboard --logdir output_dir/runs --host 0.0.0.0 --port 8009`"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Merge the LoRA weights into the base model; the merged model is saved in the `--output_dir` directory. To merge:"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "!python merge_peft_adapter.py --model_type bloom \\\n",
+    " --base_model_name_or_path bigscience/bloomz-560m --peft_model_path outputs-pt-v1 --output_dir merged-pt/"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "%ls -lh merged-pt/"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "%cat merged-pt/config.json"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Stage1 incremental pretraining is complete."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "ExecuteTime": {
+     "start_time": "2023-06-15T13:56:17.032821Z",
+     "end_time": "2023-06-15T13:56:17.081153Z"
+    }
+   },
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "# Stage 2: Supervised FineTuning\n",
+    "\n",
+    "Stage 2: SFT (Supervised Fine-tuning): build an instruction fine-tuning dataset and instruction-tune the pretrained model to align it with instruction intent\n",
+    "\n",
+    "| Stage 2: Supervised Fine-tuning | [supervised_finetuning.py](https://github.com/shibing624/MedicalGPT/blob/main/supervised_finetuning.py) | [run_sft.sh](https://github.com/shibing624/MedicalGPT/blob/main/run_sft.sh) |"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "#### Notes:\n",
+    "To quickly verify that the training code works, the notebook/Colab code below uses a small generation model and a small sample dataset; in real use, you need a larger model and dataset to get better results.\n",
+    "\n",
+    "1. 
Generation model: Bloom's `bigscience/bloomz-560m`, or the pretrained model obtained in Stage 1\n",
+    "2. Dataset: the SFT stage uses a 1,000-sample subset of Belle, located in the `data/finetune` folder"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Stage2: Let's get started\n",
+    "\n",
+    "The training steps are:\n",
+    "\n",
+    "1. Check the training set\n",
+    "2. Run the training script\n",
+    "\n",
+    "The training script runs as follows:\n",
+    "1. Import dependencies\n",
+    "2. Set the parameters\n",
+    "3. Define the functions and load the training set\n",
+    "4. Load the model and tokenizer\n",
+    "5. Train and evaluate\n",
+    "6. Review the training results"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "%ls ./data/finetune"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "ExecuteTime": {
+     "start_time": "2023-06-15T13:58:38.778132Z",
+     "end_time": "2023-06-15T13:58:38.966506Z"
+    }
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "!python supervised_finetuning.py \\\n",
+    " --model_type bloom \\\n",
+    " --model_name_or_path merged-pt \\\n",
+    " --train_file_dir ./data/finetune \\\n",
+    " --validation_file_dir ./data/finetune \\\n",
+    " --per_device_train_batch_size 4 \\\n",
+    " --per_device_eval_batch_size 4 \\\n",
+    " --do_train \\\n",
+    " --do_eval \\\n",
+    " --use_peft True \\\n",
+    " --fp16 \\\n",
+    " --max_train_samples 1000 \\\n",
+    " --max_eval_samples 10 \\\n",
+    " --num_train_epochs 1 \\\n",
+    " --learning_rate 2e-5 \\\n",
+    " --warmup_ratio 0.05 \\\n",
+    " --weight_decay 0.05 \\\n",
+    " --logging_strategy steps \\\n",
+    " --logging_steps 10 \\\n",
+    " --eval_steps 50 \\\n",
+    " --evaluation_strategy steps \\\n",
+    " --save_steps 500 \\\n",
+    " --save_strategy steps \\\n",
+    " --save_total_limit 3 \\\n",
+    " --gradient_accumulation_steps 1 \\\n",
+    " --preprocessing_num_workers 1 \\\n",
+    " --output_dir outputs-sft-v1 \\\n",
+    " --overwrite_output_dir \\\n",
+    " --ddp_timeout 30000 \\\n",
+    " --logging_first_step True \\\n",
+    " --target_modules all \\\n",
+    " --lora_rank 8 \\\n",
+    " --lora_alpha 16 \\\n",
+    " --lora_dropout 0.05 \\\n",
+    " 
--torch_dtype float16 \\\n",
+    " --device_map auto \\\n",
+    " --report_to tensorboard \\\n",
+    " --ddp_find_unused_parameters False \\\n",
+    " --gradient_checkpointing True"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "%ls -lh outputs-sft-v1"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Training results:\n",
+    "- When the model is trained with LoRA, the saved LoRA weights are `adapter_model.bin`, the LoRA config file is `adapter_config.json`; see `merge_peft_adapter.py` for how to merge them into the base model\n",
+    "- Logs are saved under `output_dir/runs` and can be viewed with TensorBoard, started as follows: `tensorboard --logdir output_dir/runs --host 0.0.0.0 --port 8009`"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Merge the LoRA weights into the base model; the merged model is saved in the `--output_dir` directory. To merge:"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "!python merge_peft_adapter.py --model_type bloom \\\n",
+    " --base_model_name_or_path merged-pt --peft_model_path outputs-sft-v1 --output_dir merged-sft/"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "%ls -lh merged-sft/"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "%cat merged-sft/config.json"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Stage2 SFT training is complete."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [],
+   "metadata": {
+    "collapsed": false,
+    "ExecuteTime": {
+     "start_time": "2023-06-15T14:07:40.731186Z",
+     "end_time": "2023-06-15T14:07:40.752635Z"
+    }
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "# Stage 3: DPO(Direct Preference Optimization)\n",
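+    "\n",
+    "Stage 3: DPO (Direct Preference Optimization): directly optimizes the language model to control its behavior precisely, without complex reinforcement learning, and still learns human preferences effectively; compared with RLHF, DPO is easier to implement and train, and gives better results\n",
+    "\n",
+    "| Stage 3: Direct Preference Optimization | [dpo_training.py](https://github.com/shibing624/MedicalGPT/blob/main/dpo_training.py) | [run_dpo.sh](https://github.com/shibing624/MedicalGPT/blob/main/run_dpo.sh) |"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "#### Notes:\n",
+    "To quickly verify that the training code works, the notebook/Colab code below uses a small generation model and a small sample dataset; in real use, you need a larger model and dataset to get better results.\n",
+    "\n",
+    "1. Generation model: Bloom's `bigscience/bloomz-560m`, or the SFT model obtained in Stage 2\n",
+    "2. Dataset: the DPO stage uses medical reward data, a 500-sample subset, located in the `data/reward` folder"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Stage3: Let's get started\n",
+    "\n",
+    "The training steps are:\n",
+    "\n",
+    "1. Check the training set\n",
+    "2. Run the training script\n",
+    "\n",
+    "The training script runs as follows:\n",
+    "1. Import dependencies\n",
+    "2. Set the parameters\n",
+    "3. Define the functions and load the training set\n",
+    "4. Load the model and tokenizer\n",
+    "5. Train and evaluate\n",
+    "6. 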
Review the training results"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "%ls ./data/reward/"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "!python dpo_training.py \\\n",
+    " --model_type bloom \\\n",
+    " --model_name_or_path merged-sft \\\n",
+    " --train_file_dir ./data/reward \\\n",
+    " --validation_file_dir ./data/reward \\\n",
+    " --per_device_train_batch_size 3 \\\n",
+    " --per_device_eval_batch_size 1 \\\n",
+    " --do_train \\\n",
+    " --do_eval \\\n",
+    " --use_peft True \\\n",
+    " --max_train_samples 1000 \\\n",
+    " --max_eval_samples 10 \\\n",
+    " --max_steps 500 \\\n",
+    " --eval_steps 50 \\\n",
+    " --save_steps 50 \\\n",
+    " --eval_strategy steps \\\n",
+    " --max_source_length 128 \\\n",
+    " --max_target_length 128 \\\n",
+    " --output_dir outputs-dpo-v1 \\\n",
+    " --target_modules all \\\n",
+    " --lora_rank 8 \\\n",
+    " --lora_alpha 16 \\\n",
+    " --lora_dropout 0.05 \\\n",
+    " --torch_dtype float16 \\\n",
+    " --fp16 True \\\n",
+    " --device_map auto \\\n",
+    " --report_to tensorboard \\\n",
+    " --remove_unused_columns False \\\n",
+    " --gradient_checkpointing True \\\n",
+    " --cache_dir ./cache"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "%ls -lh outputs-dpo-v1"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Training results:\n",
+    "- When the model is trained with LoRA, the saved LoRA weights are `adapter_model.bin`, the LoRA config file is `adapter_config.json`; see `merge_peft_adapter.py` for how to merge them into the base model\n",
+    "- Logs are saved under `output_dir/runs` and can be viewed with TensorBoard, started as follows: `tensorboard --logdir output_dir/runs --host 0.0.0.0 --port 8009`"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Merge the LoRA weights into the base model; the merged model is saved in the `--output_dir` directory. To merge:"
+   ],
+   "metadata":
{
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "!python merge_peft_adapter.py --model_type bloom \\\n",
+    " --base_model_name_or_path merged-sft --peft_model_path outputs-dpo-v1 --output_dir merged-dpo/"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "%ls -lh merged-dpo/"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "%cat merged-dpo/config.json"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Stage3 DPO training is complete."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "**This completes the demonstration of a full training pipeline.**"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [],
+   "metadata": {
+    "collapsed": false,
+    "ExecuteTime": {
+     "start_time": "2023-06-26T12:34:29.620609Z",
+     "end_time": "2023-06-26T12:34:29.658428Z"
+    }
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "# Test"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "!python inference.py --model_type bloom --base_model merged-dpo --interactive"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "ExecuteTime": {
+     "start_time": "2023-06-26T12:34:47.802087Z",
+     "end_time": "2023-06-26T12:35:00.864463Z"
+    }
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Input:介绍下南京\n",
+    "Response: 南京市位于江苏省西南部,是全国首批历史文化名城、国家中心城市和自由贸易试验区。\n",
+    "\n",
+    "End.\n"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [],
+   "metadata": {
+    "collapsed": false
+   }
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "name": "python3",
+   "language": "python",
+
"display_name": "Python 3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.13" + }, + "vscode": { + "interpreter": { + "hash": "f34eed0bebedfc4b6ee51ced43d2c030fe3b92f13c149d072205ca200a67b1ec" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/run_training_pipeline.ipynb b/run_training_pipeline.ipynb index 6b7060d..aa89656 100644 --- a/run_training_pipeline.ipynb +++ b/run_training_pipeline.ipynb @@ -844,12 +844,19 @@ "collapsed": false } }, + { + "cell_type": "markdown", + "source": [], + "metadata": { + "collapsed": false + } + }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ - "!python inference.py --model_type bloom --base_model merged-rl --with_prompt --interactive" + "!python inference.py --model_type bloom --base_model merged-rl --interactive" ], "metadata": { "collapsed": false,