
FuzzyAI Fuzzer



The FuzzyAI Fuzzer is a powerful tool for automated LLM fuzzing. It is designed to help developers and security researchers identify jailbreaks and mitigate potential security vulnerabilities in their LLM APIs.


Features

  • Fuzzing Techniques: The FuzzyAI Fuzzer supports various fuzzing techniques, including mutation-based fuzzing, generation-based fuzzing, and intelligent fuzzing.
  • Input Generation: It provides built-in input generation capabilities to generate valid and invalid inputs for testing.
  • Integration: The FuzzyAI Fuzzer can be easily integrated into existing development and testing workflows.
  • Extensibility: It provides an extensible architecture, allowing users to customize and extend the fuzzer's functionality (a hypothetical sketch follows this list).
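
As a purely hypothetical illustration of that extensibility (the class shape, the `name` attribute, and the registration call below are our assumptions, not FuzzyAI's actual API), a custom attack handler might look something like this:

```python
# Hypothetical sketch of a custom attack handler.
# The class shape and the registration step are illustrative assumptions;
# consult the FuzzyAI documentation for the real extension points.

class ReversePromptAttack:
    """Toy mutation: reverse the prompt and ask the model to decode it."""

    name = "rev"  # hypothetical CLI identifier, e.g. -a rev

    def mutate(self, prompt: str) -> str:
        return (
            "Decode the following reversed text and respond to it: "
            + prompt[::-1]
        )

# Hypothetical registration with the fuzzer's attack registry:
# fuzzer.register_attack(ReversePromptAttack())
```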

Attacks we've already implemented

| Attack Type | Title | Reference |
|---|---|---|
| ArtPrompt | ASCII art-based jailbreak attacks against aligned LLMs | arXiv:2402.11753 |
| Taxonomy-based paraphrasing | Uses persuasive language techniques like emotional appeal and social proof to jailbreak LLMs | arXiv:2401.06373 |
| PAIR (Prompt Automatic Iterative Refinement) | Automates the generation of adversarial prompts by pairing two LLMs ("attacker" and "target") to iteratively refine prompts until achieving jailbreak | arXiv:2310.08419 |
| Many-shot jailbreaking | Exploits large context windows in language models by embedding multiple fake dialogue examples, gradually weakening the model's safety responses | Anthropic Research |
| Genetic | A genetic algorithm iteratively modifies prompts to generate an adversarial suffix that coerces large language models into producing restricted content | arXiv:2309.01446 |
| Hallucinations | Using Hallucinations to Bypass RLHF Filters | arXiv:2403.04769 |
| DAN (Do Anything Now) | Prompts the LLM to adopt an unrestricted persona that ignores standard content filters, allowing it to "Do Anything Now" | GitHub Repo |
| WordGame | Disguises harmful prompts as word puzzles | arXiv:2405.14023 |
| Crescendo | Engages the model in a series of escalating conversational turns, starting with innocuous queries and gradually steering the dialogue toward restricted or sensitive topics | arXiv:2404.01833 |
| ActorAttack | Inspired by actor-network theory, it builds semantic networks of "actors" to subtly guide conversations toward harmful targets while concealing malicious intent | arXiv:2410.10700 |
| Back To The Past | Modifies the prompt by adding a profession-based prefix and a past-related suffix | |
| Please | Modifies the prompt by adding "please" as a prefix and suffix | |
| Thought Experiment | Modifies the prompt by adding a thought-experiment-related prefix and a "precautions have been taken care of" suffix | |
| Default | Sends the prompt to the model as-is | |
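
To make the mechanics of one of these techniques concrete, here is a minimal sketch of how a many-shot style prompt can be assembled by embedding fabricated dialogue turns ahead of the real query. This is not FuzzyAI's implementation; the function name and placeholder turns are ours.

```python
# Illustrative sketch only -- not FuzzyAI's actual implementation.
# Many-shot jailbreaking fills the context window with fabricated
# user/assistant exchanges so the final, real query looks like just
# another turn the model should answer in the same style.

def build_many_shot_prompt(fake_turns: list[tuple[str, str]], target_prompt: str) -> str:
    """Concatenate fabricated dialogue turns, then append the real target prompt."""
    lines = []
    for user_msg, assistant_msg in fake_turns:
        lines.append(f"User: {user_msg}")
        lines.append(f"Assistant: {assistant_msg}")
    lines.append(f"User: {target_prompt}")
    lines.append("Assistant:")
    return "\n".join(lines)

# Placeholder turns; a real attack would use many (dozens of) examples.
shots = [("example question 1", "example compliant answer 1"),
         ("example question 2", "example compliant answer 2")]
print(build_many_shot_prompt(shots, "Harmful_Prompt"))
```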

Supported models

We've tested the attacks above using the models listed below. Out of the box, FuzzyAI supports a fixed set of models per provider, but you can always add support for a new model by adding its name to the relevant LLM provider's list.

| Provider | Models |
|---|---|
| Anthropic | claude-3-5-sonnet-latest<br>claude-3-opus-latest<br>claude-3-haiku-20240307<br>claude-2.1 |
| OpenAI | o1-preview<br>o1-mini<br>gpt-4o<br>gpt-4-turbo<br>gpt-4<br>gpt-3.5-turbo |
| Gemini | gemini-1.5-pro<br>gemini-pro |
| Azure | gpt-4o<br>gpt-4<br>gpt-35-turbo |
| Bedrock | anthropic.claude-3-5-sonnet-20241022-v2:0<br>anthropic.claude-3-sonnet-20240229-v1:0<br>anthropic.claude-3-opus-20240229-v1:0<br>anthropic.claude-3-haiku-20240307-v1:0<br>anthropic.claude-v2:1 |
| AI21 | jamba-1.5-mini<br>jamba-1.5-large |
| Ollama | llama3.2<br>llama3.1<br>dolphin-llama3<br>llama3<br>llama2:70b<br>llama2-uncensored<br>llama2<br>vicuna<br>gemma2<br>gemma<br>phi3<br>phi<br>mistral<br>mixtral<br>qwen<br>zephyr |

Adding support for newer models

Please note that the models listed above are the ones we've tested against each cloud API. If you attempt to use a model that isn't listed, you'll receive an error indicating that the provider does not support it. To fix this, add the model's name to the provider implementation's list of supported models, and it will then function as expected. A hypothetical sketch of the idea follows.
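
As a hedged sketch (the class, attribute, and method names below are illustrative assumptions; FuzzyAI's actual code layout may differ), the change usually amounts to appending the new identifier to a provider's whitelist:

```python
# Hypothetical sketch: the names below are illustrative assumptions,
# not FuzzyAI's actual code. Each provider keeps a whitelist of model
# identifiers; enabling a new model means extending that whitelist.

class OpenAIProvider:
    SUPPORTED_MODELS = [
        "o1-preview",
        "o1-mini",
        "gpt-4o",
        "gpt-4-turbo",
        "gpt-4",
        "gpt-3.5-turbo",
        "gpt-4o-mini",  # newly added model identifier
    ]

    def validate_model(self, model: str) -> None:
        if model not in self.SUPPORTED_MODELS:
            raise ValueError(f"Provider does not support model: {model}")
```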

Supported Cloud APIs

  • OpenAI
  • Anthropic
  • Gemini
  • Azure Cloud
  • AWS Bedrock
  • AI21
  • Huggingface (Downloading models)
  • Ollama
  • Custom REST API

Datasets

We've included a few datasets you can use; they can be found under the resources/ folder.
Note: Some of the prompts may be grammatically incorrect; this is intentional, as it appears to be more effective against the models.

| File name | Description |
|---|---|
| pandoras_prompts.txt | Harmful prompts |
| adv_prompts.txt | Harmful prompts |
| benign_prompts.txt | Regular prompts |
| history_prompts.txt | Harmful prompts phrased as in the "Back To The Past" attack |
| harmful_behaviors.csv | Harmful prompts |
| adv_suffixes.txt | Random prompt suffixes |
| alpaca_data_instructions.json | Alpaca benign-queries dataset |
| taxonomy_gpt35_harmful_behaviors_first26.json | Persuasive prompts |
| finetuned_summarizer_train_dataset.jsonl | Dataset used to train a GPT fine-tuned summarizer (see paper, page 20) |

Documentation

For more detailed instructions, please refer to the documentation.

Getting Started

To get started with the FuzzyAI Fuzzer, follow these steps:

  1. Clone the repository: git clone git@github.com:cyberark/FuzzyAI.git
  2. Install the required dependencies using Poetry: poetry update
    This will create a virtual environment; if you're not using an IDE, make sure you activate it by invoking poetry shell
  3. Run the fuzzer: python run.py -h

Usage

Just run the following and follow the flags in the help output:

python run.py -h

Note: To run models using Ollama, make sure Ollama is installed (you can download it from https://ollama.com), then follow the instructions to pull the desired model.

Examples

A default attack (simply evaluates the prompt 'Harmful_Prompt') using Mistral over Ollama:

python run.py -m ollama/mistral -a def -t "Harmful_Prompt"

Attack the prompt utilizing the ManyShot and Taxonomy attacks, using GPT-3.5 over OpenAI with one worker:

OPENAI_API_KEY=sk-3fa1... python run.py -m openai/gpt-3.5-turbo -a man -a tax -t "Harmful_Prompt"

Attack the prompts in resources/adv_prompts.txt utilizing the ArtPrompt attack, using Claude 3.5 Sonnet over Anthropic:

ANTHROPIC_API_KEY=deadbeef... python run.py -m anthropic/claude-3-5-sonnet-20240620 -a art -T resources/adv_prompts.txt -e blacklisted_words="bomb,garbage,acid,ugly,self-harm,kill your"

Persisting Your Settings

To save your configuration, you can create a JSON-formatted config file where the keys correspond to the long-form command-line flags. For example, see config_example.json:

{
  "model": [
    "ollama/mistral"
  ],
  "attack_modes": [
    "def",
    "art"
  ],
  "classifier": [
    "har"
  ],
  "extra": [
    "blacklisted_words=acid"
  ]
}

Once you've customized the configuration to your needs, you can apply these settings by running the following command:

python run.py -C config_example.json -t "Harmful_Prompt"

Caveats

  • Some classifiers do more than just evaluate a single output. For example, the cosine-similarity classifier compares two outputs by measuring the angle between them, while a 'harmfulness' classifier checks whether a given output is harmful. As a result, not all classifiers are compatible with the attack methods we've implemented, as those methods are designed for single-output classifiers (a sketch of the distinction follows this list).
  • When using the -m option with Ollama models, ensure that all Ollama models are added first, before adding any other models. Use the -e port=... option to specify the port number for Ollama (the default is 11434).
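
Below is a minimal sketch of what a two-output, cosine-similarity style check looks like. It is not FuzzyAI's classifier; embed() is a toy placeholder for a real sentence-embedding model, and only the similarity math is genuine.

```python
# Minimal sketch of a two-output, cosine-similarity style classifier.
# NOT FuzzyAI's implementation: embed() is a placeholder for any
# sentence-embedding model; only the similarity math is real.
import math

def embed(text: str) -> list[float]:
    # Placeholder embedding: character-frequency vector over a-z.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# A classifier like this needs TWO outputs to compare, so it cannot be
# paired with attack modes that produce a single output to evaluate.
score = cosine_similarity(embed("output from run A"), embed("output from run B"))
print(f"cosine similarity: {score:.3f}")
```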

Contributing

Contributions are welcome! If you would like to contribute to the FuzzyAI Fuzzer, please follow the guidelines outlined in the CONTRIBUTING.md file.

License

The FuzzyAI Fuzzer is released under the Apache License. See the LICENSE file for more details.

Contact

If you have any questions or suggestions regarding the FuzzyAI Fuzzer, please feel free to contact us at [email protected].