Add Intel/toxic-prompt-roberta to toxicity detection microservice #749

Merged
8 changes: 3 additions & 5 deletions comps/guardrails/toxicity_detection/README.md
@@ -4,11 +4,9 @@

Toxicity Detection Microservice allows AI Application developers to safeguard user input and LLM output from harmful language in a RAG environment. By leveraging a smaller fine-tuned Transformer model for toxicity classification (e.g., DistilBERT, RoBERTa, etc.), we maintain a lightweight guardrails microservice without significantly sacrificing performance, making it readily deployable on both Intel Gaudi and Xeon.

-Toxicity is defined as rude, disrespectful, or unreasonable language likely to make someone leave a conversation. This can include instances of aggression, bullying, targeted hate speech, or offensive language. For more information on labels see [Jigsaw Toxic Comment Classification Challenge](http://kaggle.com/c/jigsaw-toxic-comment-classification-challenge).
-
-## Future Development
+This microservice uses [`Intel/toxic-prompt-roberta`](https://huggingface.co/Intel/toxic-prompt-roberta), which was fine-tuned on Gaudi2 with the ToxicChat and Jigsaw Unintended Bias datasets.

-- Add a RoBERTa (125M params) toxicity model fine-tuned on Gaudi2 with ToxicChat and Jigsaw dataset in an optimized serving framework.
+Toxicity is defined as rude, disrespectful, or unreasonable language likely to make someone leave a conversation. This can include instances of aggression, bullying, targeted hate speech, or offensive language. For more information on labels, see the [Jigsaw Toxic Comment Classification Challenge](http://kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

## 🚀1. Start Microservice with Python(Option 1)
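For reference, a minimal sketch of exercising the new model directly with the same `transformers` pipeline call the service makes; the sample input and the printed label/score are illustrative only and not taken from this PR:

```python
from transformers import pipeline

# Same call as in toxicity_detection.py, pointed at the model this PR introduces.
model_id = "Intel/toxic-prompt-roberta"
classifier = pipeline("text-classification", model=model_id, tokenizer=model_id)

# Illustrative input; the exact label string and score depend on the model.
result = classifier("You are worthless and everyone hates you.")
print(result)  # e.g. [{'label': 'toxic', 'score': 0.97}]
```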

@@ -65,7 +63,7 @@ curl localhost:9091/v1/toxicity
Example Output:

```bash
-"\nI'm sorry, but your query or LLM's response is TOXIC with an score of 0.97 (0-1)!!!\n"
+"Violated policies: toxicity, please check your input."
```

**Python Script:**
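A minimal sketch of such a script, assuming the service accepts a JSON body with a `text` field (mirroring the `TextDoc` input in `toxicity_detection.py` below); the script shipped in the repository may differ:

```python
import requests

# Hypothetical payload shape: a JSON body with a "text" field, mirroring TextDoc.
url = "http://localhost:9091/v1/toxicity"
payload = {"text": "He is a disgusting person and deserves to be insulted."}

response = requests.post(url, json=payload)
print(response.json())
# A toxic input is expected to yield the guardrail message:
# "Violated policies: toxicity, please check your input."
```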
4 changes: 2 additions & 2 deletions comps/guardrails/toxicity_detection/toxicity_detection.py
@@ -19,13 +19,13 @@ def llm_generate(input: TextDoc):
    input_text = input.text
    toxic = toxicity_pipeline(input_text)
    print("done")
-   if toxic[0]["label"] == "toxic":
+   if toxic[0]["label"].lower() == "toxic":
        return TextDoc(text="Violated policies: toxicity, please check your input.", downstream_black_list=[".*"])
    else:
        return TextDoc(text=input_text)


if __name__ == "__main__":
-   model = "citizenlab/distilbert-base-multilingual-cased-toxicity"
+   model = "Intel/toxic-prompt-roberta"
    toxicity_pipeline = pipeline("text-classification", model=model, tokenizer=model)
    opea_microservices["opea_service@toxicity_detection"].start()
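For context, a small illustrative sketch of why the label is now lower-cased before comparison: `text-classification` pipelines return a list of `{"label": ..., "score": ...}` dicts, and label casing can vary between models; the exact labels emitted by either model are not shown in this diff:

```python
# Two hypothetical pipeline outputs differing only in label casing.
outputs = [
    [{"label": "toxic", "score": 0.97}],
    [{"label": "TOXIC", "score": 0.97}],
]

for toxic in outputs:
    # .lower() makes the guardrail check robust to either casing.
    print(toxic[0]["label"].lower() == "toxic")  # True for both
```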