The goal of this project is to develop a method for fine-tuning LLMs to deceive LLM detectors using RL. The method is model- and detector-agnostic: in principle it works with any model and against any detector. The idea is to train an arbitrary LLM to adapt its outputs so that they deceive an arbitrary LLM detector, using RL with the detector as the reward (punishment) model. Please take a look at the "Useful Links" and "Related Literature" sections for more on this topic.
- If the fine-tuned model deceives one detector, how well does it transfer to other detectors?
- How reliable are LLM detectors in the first place?
- Does this fine-tuning make the LLM output more natural and human-like?
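To make the detector-as-reward idea concrete, here is a minimal sketch that scores generated text with an off-the-shelf detector and uses the probability of the "human-written" class as the training signal. The detector checkpoint, the label index, and the function name are illustrative assumptions and not part of this repository; any sequence-classification detector can be swapped in.

```python
# Minimal sketch of the "detector as reward model" idea.
# The checkpoint name and label index below are assumptions, not part of this repo.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

DETECTOR = "openai-community/roberta-base-openai-detector"  # assumed example detector

tokenizer = AutoTokenizer.from_pretrained(DETECTOR)
detector = AutoModelForSequenceClassification.from_pretrained(DETECTOR)
detector.eval()

@torch.no_grad()
def detector_reward(texts: list[str]) -> torch.Tensor:
    """Return the probability the detector assigns to the 'human-written' class."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    probs = detector(**inputs).logits.softmax(dim=-1)
    # The human/AI label order depends on the detector; check detector.config.id2label.
    human_idx = 1
    return probs[:, human_idx]

if __name__ == "__main__":
    print(detector_reward(["A short example paragraph to score."]))
```

Higher scores mean the detector is being fooled, so in a PPO-style loop this value can serve directly as the per-sample reward; in the DPO setup used here it can instead rank pairs of completions into chosen/rejected examples.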
The main training script is `train_dpo.py` and can be used as follows:
```bash
python train_dpo.py \
    --dataset_name=hassanjbara/LONG-DPO \
    --model_name=mistralai/Mistral-Nemo-Instruct-2407 \
    --per_device_train_batch_size=1 \
    --learning_rate=1e-6 \
    --beta=0.6 \
    --gradient_accumulation_steps=8 \
    --warmup_steps=150 \
    --bf16 \
    --use_peft \
    --quantize \
    --num_train_epochs=1 \
    --dataset_train_split=1
```
The script also supports Hugging Face Accelerate and can be used with the DeepSpeed configuration included in the repository.
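For example, a multi-GPU run could be launched as sketched below; the config file path is a placeholder and should be replaced with the actual DeepSpeed/Accelerate config shipped in this repository.

```bash
# Hypothetical config path: point this at the DeepSpeed config in the repo.
accelerate launch --config_file=configs/deepspeed_zero3.yaml train_dpo.py \
    --dataset_name=hassanjbara/LONG-DPO \
    --model_name=mistralai/Mistral-Nemo-Instruct-2407 \
    --bf16 \
    --use_peft
```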
- Are LLMs the Beginning or End of NLP?
- Illustrating Reinforcement Learning from Human Feedback (RLHF)
- RLHF: Reinforcement Learning from Human Feedback (Ms Aerin, Towards Data Science, 2023)
- TRL - Transformer Reinforcement Learning (how-to guides)
- Teach Llamas to Talk: Recent Progress in Instruction Tuning
- huggingface/alignment-handbook: Robust recipes to align language models with human and AI preferences