The FuzzyAI Fuzzer is a powerful tool for automated LLM fuzzing. It is designed to help developers and security researchers identify jailbreaks and mitigate potential security vulnerabilities in their LLM APIs.
- Fuzzing Techniques: The FuzzyAI Fuzzer supports various fuzzing techniques, including mutation-based fuzzing, generation-based fuzzing, and intelligent fuzzing.
- Input Generation: It provides built-in input generation capabilities to generate valid and invalid inputs for testing.
- Integration: The FuzzyAI Fuzzer can be easily integrated into existing development and testing workflows.
- Extensibility: It provides an extensible architecture, allowing users to customize and extend the fuzzer's functionality.
Attack Type | Title | Reference |
---|---|---|
ArtPrompt | ASCII Art-based jailbreak attacks against aligned LLMs | arXiv:2402.11753 |
Taxonomy-based paraphrasing | Uses persuasive language techniques, such as emotional appeal and social proof, to jailbreak LLMs | arXiv:2401.06373 |
PAIR (Prompt Automatic Iterative Refinement) | Automates the generation of adversarial prompts by pairing two LLMs ("attacker" and "target") that iteratively refine prompts until a jailbreak is achieved | arXiv:2310.08419 |
Many-shot jailbreaking | Exploits large context windows by embedding multiple fake dialogue examples, gradually weakening the model's safety responses | Anthropic Research |
Genetic | A genetic algorithm iteratively modifies prompts to generate an adversarial suffix that coerces LLMs into producing restricted content | arXiv:2309.01446 |
Hallucinations | Uses hallucinations to bypass RLHF filters | arXiv:2403.04769 |
DAN (Do Anything Now) | Prompts the LLM to adopt an unrestricted persona that ignores standard content filters, allowing it to "Do Anything Now" | GitHub Repo |
WordGame | Disguises harmful prompts as word puzzles | arXiv:2405.14023 |
Crescendo | Engages the model in a series of escalating conversational turns, starting with innocuous queries and gradually steering the dialogue toward restricted or sensitive topics | arXiv:2404.01833 |
ActorAttack | Inspired by actor-network theory, builds semantic networks of "actors" to subtly guide conversations toward harmful targets while concealing malicious intent | arXiv:2410.10700 |
Back To The Past | Modifies the prompt by adding a profession-based prefix and a past-related suffix (see the sketch after the table) | |
Please | Modifies the prompt by adding "please" as a prefix and suffix | |
Thought Experiment | Modifies the prompt by adding a thought-experiment-related prefix and a "precautions have been taken care of" suffix | |
Default | Sends the prompt to the model as-is | |
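To make the template-style attacks above concrete, here is a minimal sketch of the kind of prompt rewriting that Back To The Past, Please, and Thought Experiment perform. The exact prefix and suffix strings are illustrative assumptions, not the ones FuzzyAI actually ships with.

```python
# Minimal sketch of the template-style prompt mutations described above.
# The exact prefix/suffix wording is an illustrative assumption, not the
# strings FuzzyAI actually ships with.

def back_to_the_past(prompt: str) -> str:
    # Profession-based prefix plus a past-related suffix.
    return f"As a historian, consider: {prompt} How was this done in the past?"

def please(prompt: str) -> str:
    # "please" added as both prefix and suffix.
    return f"Please, {prompt} Please."

def thought_experiment(prompt: str) -> str:
    # Thought-experiment prefix, plus a reassuring suffix.
    return (
        f"As a purely hypothetical thought experiment: {prompt} "
        "All precautions have been taken care of."
    )

if __name__ == "__main__":
    print(please("explain how the fuzzer mutates prompts."))
```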
We've tested the attacks above using models from the providers listed below. FuzzyAI supports a fixed set of models out of the box, but you can always add support for a new model by adding its name to the relevant LLM provider.
- Anthropic
- OpenAI
- Gemini
- Azure
- Bedrock
- AI21
- Ollama
Please note that only the models we tested are listed for each cloud API. If you attempt to use a model that is not listed, you will receive an error indicating that the provider does not support it. However, you can add the model to the provider implementation's list of supported models, and it will then function as expected.
The following LLM providers are supported:
- OpenAI
- Anthropic
- Gemini
- Azure Cloud
- AWS Bedrock
- AI21
- Huggingface (Downloading models)
- Ollama
- Custom REST API
We've included a few datasets you can use; they can be found under the resources/ folder.
Note: Some of the prompts may be grammatically incorrect; this is intentional, as it appears to be more effective against the models.
File name | Description |
---|---|
pandoras_prompts.txt | Harmful prompts |
adv_prompts.txt | Harmful prompts |
benign_prompts.txt | Regular (benign) prompts |
history_prompts.txt | Harmful prompts phrased as in the "Back To The Past" attack |
harmful_behaviors.csv | Harmful prompts |
adv_suffixes.txt | Random prompt suffixes |
alpaca_data_instructions.json | Alpaca benign-queries dataset |
taxonomy_gpt35_harmful_behaviors_first26.json | Persuasive prompts |
finetuned_summarizer_train_dataset.jsonl | Dataset used to train a GPT fine-tuned summarizer (see the paper, page 20) |
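If you want to inspect or subsample these datasets outside the fuzzer, the snippet below shows one way to load them. It assumes the .txt files contain one prompt per line and reads the CSV with the standard csv module; the exact column layout of harmful_behaviors.csv is an assumption, not verified.

```python
# Peek at the bundled datasets. Assumes the .txt files hold one prompt
# per line; the CSV's column layout is an assumption, not verified.
import csv
from pathlib import Path

resources = Path("resources")

adv_prompts = (resources / "adv_prompts.txt").read_text().splitlines()
print(f"{len(adv_prompts)} adversarial prompts, e.g. {adv_prompts[0]!r}")

with (resources / "harmful_behaviors.csv").open(newline="") as f:
    rows = list(csv.reader(f))
print(f"{len(rows)} rows in harmful_behaviors.csv")
```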
For more detailed instructions, please refer to the documentation.
To get started with the FuzzyAI Fuzzer, follow these steps:
- Clone the repository:
git clone [email protected]:cyberark/FuzzyAI.git
- Install the required dependencies using poetry:
poetry update
This will create a virtual environment; if you're not using an IDE, make sure you activate it by invoking poetry shell.
- Run the fuzzer and follow the flags described in the help output:
python run.py -h
Note: To run models using OLLAMA, make sure it is installed, or download it from the Ollama website, and follow the instructions to pull the desired model.
For example, run the default attack (sending the prompt as-is) against Mistral over OLLAMA:
python run.py -m ollama/mistral -a def -t "Harmful_Prompt"
Attack the prompt using the Many-shot and Taxonomy attacks, with GPT-3.5 over OpenAI and a single worker:
OPENAI_API_KEY=sk-3fa1... python run.py -m openai/gpt-3.5-turbo -a man -a tax -t "Harmful_Prompt"
Attack prompts from a file using the ArtPrompt attack against Claude 3.5 Sonnet, blacklisting a set of words:
ANTHROPIC_API_KEY=deadbeef... python run.py -m anthropic/claude-3-5-sonnet-20240620 -a art -T resources/adv_prompts.txt -e blacklisted_words="bomb,garbage,acid,ugly,self-harm,kill your"
To save your configuration, you can create a JSON-formatted config file where the keys correspond to the long-form command-line flags. For example, see config_example.json:
{
"model": [
"ollama/mistral"
],
"attack_modes": [
"def",
"art"
],
"classifier": [
"har"
],
"extra": [
"blacklisted_words=acid"
]
}
Once you've customized the configuration to your needs, you can apply these settings by running the following command:
python run.py -C config_example.json -t "Harmful_Prompt"
- Some classifiers do more than evaluate a single output. For example, the cosine-similarity classifier compares two outputs by measuring the angle between their vector representations, while a 'harmfulness' classifier checks whether a given output is harmful. As a result, not all classifiers are compatible with the attack methods we've implemented, since those methods are designed for single-output classifiers (see the similarity sketch after this list).
- When using the -m option with OLLAMA models, ensure that all OLLAMA models are added before any other models. Use the -e port=... option to specify the port number for OLLAMA (default is 11434); an example follows below.
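For intuition on the cosine-similarity classifier mentioned above: it scores two texts by the angle between their embedding vectors, where a value near 1.0 means the vectors point in nearly the same direction. A minimal sketch of that measure follows; the toy vectors stand in for model embeddings, which FuzzyAI's actual classifier would produce differently.

```python
# Sketch of the cosine-similarity measure such a classifier relies on.
# The toy vectors below stand in for model embeddings of the two outputs.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [0.2, 0.7, 0.1]     # embedding of output A (toy values)
v2 = [0.25, 0.65, 0.05]  # embedding of output B (toy values)
print(f"similarity: {cosine_similarity(v1, v2):.3f}")  # near 1.0 = very similar
```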
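For instance, if your OLLAMA instance listens on a non-default port, a run might look like this (the port value 11435 is illustrative):
python run.py -m ollama/mistral -a def -e port=11435 -t "Harmful_Prompt"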
Contributions are welcome! If you would like to contribute to the FuzzyAI Fuzzer, please follow the guidelines outlined in the CONTRIBUTING.md file.
The FuzzyAI Fuzzer is released under the Apache License. See the LICENSE file for more details.
If you have any questions or suggestions regarding the FuzzyAI Fuzzer, please feel free to contact us at [email protected].