The FuzzyAI Fuzzer is a powerful tool for automated LLM fuzzing. It is designed to help developers and security researchers identify jailbreaks and mitigate potential security vulnerabilities in their LLM APIs.
- Fuzzing Techniques: The FuzzyAI Fuzzer supports various fuzzing techniques, including mutation-based fuzzing, generation-based fuzzing, and intelligent fuzzing.
- Input Generation: It provides built-in input generation capabilities to generate valid and invalid inputs for testing.
- Integration: The FuzzyAI Fuzzer can be easily integrated into existing development and testing workflows.
- Extensibility: It provides an extensible architecture, allowing users to customize and extend the fuzzer's functionality.
Attack Type | Title | Reference |
---|---|---|
ArtPrompt | ASCII Art-based jailbreak attacks against aligned LLMs | arXiv:2402.11753 |
Taxonomy-based paraphrasing | Uses persuasive language techniques, such as emotional appeal and social proof, to jailbreak LLMs | arXiv:2401.06373 |
PAIR (Prompt Automatic Iterative Refinement) | Automates the generation of adversarial prompts by pairing two LLMs ("attacker" and "target") that iteratively refine prompts until a jailbreak is achieved | arXiv:2310.08419 |
Many-shot jailbreaking | Exploits large context windows by embedding multiple fake dialogue examples, gradually weakening the model's safety responses | Anthropic Research |
Genetic | A genetic algorithm iteratively modifies prompts to generate an adversarial suffix that coerces LLMs into producing restricted content | arXiv:2309.01446 |
Hallucinations | Uses hallucinations to bypass RLHF filters | arXiv:2403.04769 |
DAN (Do Anything Now) | Prompts the LLM to adopt an unrestricted persona that ignores standard content filters, allowing it to "Do Anything Now" | GitHub Repo |
WordGame | Disguises harmful prompts as word puzzles | arXiv:2405.14023 |
Crescendo | Engages the model in a series of escalating conversational turns, starting with innocuous queries and gradually steering the dialogue toward restricted or sensitive topics | arXiv:2404.01833 |
ActorAttack | Inspired by actor-network theory, builds semantic networks of "actors" to subtly guide conversations toward harmful targets while concealing malicious intent | arXiv:2410.10700 |
Back To The Past | Modifies the prompt by adding a profession-based prefix and a past-related suffix (see the sketch after the table) | |
Please | Modifies the prompt by adding "please" as a prefix and suffix | |
Thought Experiment | Modifies the prompt by adding a thought-experiment-related prefix and a "precautions have been taken care of" suffix | |
Default | Sends the prompt to the model as-is | |
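To make the template-style attacks above concrete, here is a minimal sketch of the kind of prompt rewriting that Back To The Past, Please, and Thought Experiment perform. The exact prefix and suffix strings are illustrative assumptions, not the ones FuzzyAI actually ships with.

```python
# Minimal sketch of the template-style prompt mutations described above.
# The exact prefix/suffix wording is an illustrative assumption, not the
# strings FuzzyAI actually ships with.

def back_to_the_past(prompt: str) -> str:
    # Profession-based prefix plus a past-related suffix.
    return f"As a historian, consider: {prompt} How was this done in the past?"

def please(prompt: str) -> str:
    # "please" added as both prefix and suffix.
    return f"Please, {prompt} Please."

def thought_experiment(prompt: str) -> str:
    # Thought-experiment prefix, plus a reassuring suffix.
    return (
        f"As a purely hypothetical thought experiment: {prompt} "
        "All precautions have been taken care of."
    )

if __name__ == "__main__":
    print(please("explain how the fuzzer mutates prompts."))
```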
We've tested the attacks above using models from the providers listed below. FuzzyAI supports a fixed set of models out of the box, but you can always add support for a new model by adding its name to the relevant LLM provider.
- Anthropic
- OpenAI
- Gemini
- Azure
- Bedrock
- AI21
- Ollama
Please note that only the models we tested are listed for each cloud API. If you attempt to use a model that is not listed, you will receive an error indicating that the provider does not support it. However, you can add the model to the provider implementation's list of supported models, and it will then function as expected.
The following LLM providers are supported:
- OpenAI
- Anthropic
- Gemini
- Azure Cloud
- AWS Bedrock
- AI21
- Huggingface (Downloading models)
- Ollama
- Custom REST API
We've included a few datasets you can use; they can be found under the resources/ folder.
Note: Some of the prompts may be grammatically incorrect; this is intentional, as it appears to be more effective against the models.
File name | Description |
---|---|
pandoras_prompts.txt | Harmful prompts |
adv_prompts.txt | Harmful prompts |
benign_prompts.txt | Regular (benign) prompts |
history_prompts.txt | Harmful prompts phrased as in the "Back To The Past" attack |
harmful_behaviors.csv | Harmful prompts |
adv_suffixes.txt | Random prompt suffixes |
alpaca_data_instructions.json | Alpaca benign-queries dataset |
taxonomy_gpt35_harmful_behaviors_first26.json | Persuasive prompts |
finetuned_summarizer_train_dataset.jsonl | Dataset used to train a GPT fine-tuned summarizer (see the paper, page 20) |
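If you want to inspect or subsample these datasets outside the fuzzer, the snippet below shows one way to load them. It assumes the .txt files contain one prompt per line and reads the CSV with the standard csv module; the exact column layout of harmful_behaviors.csv is an assumption, not verified.

```python
# Peek at the bundled datasets. Assumes the .txt files hold one prompt
# per line; the CSV's column layout is an assumption, not verified.
import csv
from pathlib import Path

resources = Path("resources")

adv_prompts = (resources / "adv_prompts.txt").read_text().splitlines()
print(f"{len(adv_prompts)} adversarial prompts, e.g. {adv_prompts[0]!r}")

with (resources / "harmful_behaviors.csv").open(newline="") as f:
    rows = list(csv.reader(f))
print(f"{len(rows)} rows in harmful_behaviors.csv")
```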
For more detailed instructions, please refer to the documentation.
To get started with the FuzzyAI Fuzzer, follow these steps:
- Clone the repository:
git clone [email protected]:cyberark/FuzzyAI.git
- Install the required dependencies using poetry:
poetry update
This will create a virtual environment; if you're not using an IDE, make sure you activate it by invoking poetry shell.
- Run the fuzzer and follow the flags described in the help output:
python run.py -h
Note: To run models using OLLAMA, make sure it is installed, or download it from the Ollama website, and follow the instructions to pull the desired model.
For example, run the default attack (sending the prompt as-is) against Mistral over OLLAMA:
python run.py -m ollama/mistral -a def -t "Harmful_Prompt"
Attack the prompt using the Many-shot and Taxonomy attacks, with GPT-3.5 over OpenAI and a single worker:
OPENAI_API_KEY=sk-3fa1... python run.py -m openai/gpt-3.5-turbo -a man -a tax -t "Harmful_Prompt"
Attack prompts from a file using the ArtPrompt attack against Claude 3.5 Sonnet, blacklisting a set of words:
ANTHROPIC_API_KEY=deadbeef... python run.py -m anthropic/claude-3-5-sonnet-20240620 -a art -T resources/adv_prompts.txt -e blacklisted_words="bomb,garbage,acid,ugly,self-harm,kill your"
To save your configuration, you can create a JSON-formatted config file where the keys correspond to the long-form command-line flags. For example, see config_example.json:
{
"model": [
"ollama/mistral"
],
"attack_modes": [
"def",
"art"
],
"classifier": [
"har"
],
"extra": [
"blacklisted_words=acid"
]
}
Once you've customized the configuration to your needs, you can apply these settings by running the following command:
python run.py -C config_example.json -t "Harmful_Prompt"
- Some classifiers do more than evaluate a single output. For example, the cosine-similarity classifier compares two outputs by measuring the angle between their vector representations, while a 'harmfulness' classifier checks whether a given output is harmful. As a result, not all classifiers are compatible with the attack methods we've implemented, since those methods are designed for single-output classifiers (see the similarity sketch after this list).
- When using the -m option with OLLAMA models, ensure that all OLLAMA models are added before any other models. Use the -e port=... option to specify the port number for OLLAMA (default is 11434); an example follows below.
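For intuition on the cosine-similarity classifier mentioned above: it scores two texts by the angle between their embedding vectors, where a value near 1.0 means the vectors point in nearly the same direction. A minimal sketch of that measure follows; the toy vectors stand in for model embeddings, which FuzzyAI's actual classifier would produce differently.

```python
# Sketch of the cosine-similarity measure such a classifier relies on.
# The toy vectors below stand in for model embeddings of the two outputs.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [0.2, 0.7, 0.1]     # embedding of output A (toy values)
v2 = [0.25, 0.65, 0.05]  # embedding of output B (toy values)
print(f"similarity: {cosine_similarity(v1, v2):.3f}")  # near 1.0 = very similar
```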
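For instance, if your OLLAMA instance listens on a non-default port, a run might look like this (the port value 11435 is illustrative):
python run.py -m ollama/mistral -a def -e port=11435 -t "Harmful_Prompt"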
Contributions are welcome! If you would like to contribute to the FuzzyAI Fuzzer, please follow the guidelines outlined in the CONTRIBUTING.md file.
The FuzzyAI Fuzzer is released under the Apache License. See the LICENSE file for more details.
If you have any questions or suggestions regarding the FuzzyAI Fuzzer, please feel free to contact us at [email protected].