HassanJbara/llm-detector-counter

Introduction

Blog Post

The goal of this project is to implement a method for fine-tuning LLMs to deceive any LLM detector using reinforcement learning (RL). The method is model- and detector-agnostic: in theory it should work with any model and against any detector. The idea is to train an arbitrary LLM to adapt its outputs so that they fool an arbitrary LLM detector, using RL with the detector as the reward (punishment) model. Please take a look at the "Useful Links" and "Related Literature" sections for more on this topic.
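
The repository trains with DPO (see the script below), but the underlying signal is the same either way: the detector's verdict on the generated text. Here is a minimal sketch of turning a detector into that signal, assuming a hypothetical detector checkpoint on the Hugging Face Hub; the checkpoint name and label order are placeholders, not the detector used in this project:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

detector_name = "some-org/ai-text-detector"  # placeholder, not this project's detector
tokenizer = AutoTokenizer.from_pretrained(detector_name)
detector = AutoModelForSequenceClassification.from_pretrained(detector_name)
detector.eval()

def detector_reward(texts):
    # Score a batch of candidate generations with the detector.
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = detector(**inputs).logits
    # Assumes label 0 = "human", label 1 = "AI-generated"; check the detector's config.
    p_human = logits.softmax(dim=-1)[:, 0]
    # Higher "human" probability means the detector is fooled, so it acts as the reward.
    return p_human

In a classic RL setup this score would be fed to the trainer as the per-sample reward; with DPO it would presumably instead rank pairs of completions into chosen/rejected preferences.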

Project Questions

  1. If the method works against one detector, how well does it transfer to another?
  2. How good are detectors, really?
  3. Would this make the LLM's output more natural and human-like?

Scripts

The main training script is train_dpo.py and can be used as follows:

python train_dpo.py \
        --dataset_name=hassanjbara/LONG-DPO \
        --model_name=mistralai/Mistral-Nemo-Instruct-2407 \
        --per_device_train_batch_size=1 \
        --learning_rate=1e-6 \
        --beta=0.6 \
        --gradient_accumulation_steps=8 \
        --warmup_steps=150 \
        --bf16 \
        --use_peft \
        --quantize \
        --num_train_epochs=1 \
        --dataset_train_split=1

The script also supports Hugging Face accelerate and can be used with the DeepSpeed configuration included in the repository.
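
For multi-GPU runs, a sketch of launching the same script through accelerate with DeepSpeed; the config filename deepspeed.yaml is an assumption for illustration, so use the actual config file shipped in the repository:

accelerate launch --config_file deepspeed.yaml train_dpo.py \
        --dataset_name=hassanjbara/LONG-DPO \
        --model_name=mistralai/Mistral-Nemo-Instruct-2407 \
        --per_device_train_batch_size=1 \
        --gradient_accumulation_steps=8 \
        --bf16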

Useful Links

Runs

W&B Project

Related Literature
