HassanJbara/llm-detector-counter

Introduction

Blog Post

The goal of this project is to implement a method for fine-tuning LLMs to deceive any LLM detector using reinforcement learning (RL). The method is model- and detector-agnostic: in theory it should work with any model and against any detector. The idea is to train an arbitrary LLM to adapt its outputs so that they fool an arbitrary LLM detector, using RL with the detector as the reward (punishment) model. Please take a look at the "Useful Links" and "Related Literature" sections for more on this topic.
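
The repository trains with DPO (see the script below), but the underlying signal is the same either way: the detector's verdict on the generated text. Here is a minimal sketch of turning a detector into that signal, assuming a hypothetical detector checkpoint on the Hugging Face Hub; the checkpoint name and label order are placeholders, not the detector used in this project:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

detector_name = "some-org/ai-text-detector"  # placeholder, not this project's detector
tokenizer = AutoTokenizer.from_pretrained(detector_name)
detector = AutoModelForSequenceClassification.from_pretrained(detector_name)
detector.eval()

def detector_reward(texts):
    # Score a batch of candidate generations with the detector.
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = detector(**inputs).logits
    # Assumes label 0 = "human", label 1 = "AI-generated"; check the detector's config.
    p_human = logits.softmax(dim=-1)[:, 0]
    # Higher "human" probability means the detector is fooled, so it acts as the reward.
    return p_human

In a classic RL setup this score would be fed to the trainer as the per-sample reward; with DPO it would presumably instead rank pairs of completions into chosen/rejected preferences.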

Project Questions

  1. If the method works against one detector, how well does it transfer to another?
  2. How good are detectors, really?
  3. Would this make the LLM's output more natural and human-like?

Scripts

The main training script is train_dpo.py and can be used as follows:

python train_dpo.py \
        --dataset_name=hassanjbara/LONG-DPO \
        --model_name=mistralai/Mistral-Nemo-Instruct-2407 \
        --per_device_train_batch_size=1 \
        --learning_rate=1e-6 \
        --beta=0.6 \
        --gradient_accumulation_steps=8 \
        --warmup_steps=150 \
        --bf16 \
        --use_peft \
        --quantize \
        --num_train_epochs=1 \
        --dataset_train_split=1

The script also supports Hugging Face accelerate and can be used with the DeepSpeed configuration included in the repository.
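
For multi-GPU runs, a sketch of launching the same script through accelerate with DeepSpeed; the config filename deepspeed.yaml is an assumption for illustration, so use the actual config file shipped in the repository:

accelerate launch --config_file deepspeed.yaml train_dpo.py \
        --dataset_name=hassanjbara/LONG-DPO \
        --model_name=mistralai/Mistral-Nemo-Instruct-2407 \
        --per_device_train_batch_size=1 \
        --gradient_accumulation_steps=8 \
        --bf16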

Useful Links

Runs

W&B Project

Related Literature
