Add ML detection strategy to PII detection guardrail (#292)
* first code of business safety classifier

* allow strategy options and update readme

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* disable ray for ml strategy and update test script

* add log to pii test

Signed-off-by: minmin-intel <[email protected]>

* update logging in test guardrail

* rm llm strategy and change url in test

Signed-off-by: minmin-intel <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* delete file check in test and update readme

Signed-off-by: minmin-intel <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: minmin-intel <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: chen, suyue <[email protected]>
3 people authored Aug 1, 2024
1 parent ee5b0f6 commit de27e6b
Showing 6 changed files with 109 additions and 65 deletions.
42 changes: 37 additions & 5 deletions comps/guardrails/pii_detection/README.md
@@ -1,6 +1,31 @@
# PII Detection Microservice

PII Detection a method to detect Personal Identifiable Information in text. This microservice provides users a unified API to either upload your files or send a list of text, and return with a list following original sequence of labels marking if it contains PII or not.
This microservice provides a unified API to detect whether text contains Personally Identifiable Information (PII) or Business Sensitive Information.

We provide 2 detection strategies:

1. Regular expression matching + named entity recognition (NER) - pass "ner" as strategy in your request to the microservice.
2. Logistic regression classifier - pass "ml" as strategy in your request to the microservice. **Note**: Currently this strategy is for demo only, and only supports using `nomic-ai/nomic-embed-text-v1` as the embedding model and the `Intel/business_safety_logistic_regression_classifier` model as the classifier. Please read the [full disclaimers in the model card](https://huggingface.co/Intel/business_safety_logistic_regression_classifier) before using this strategy.

## NER strategy

We adopted the [pii detection code](https://github.com/bigcode-project/bigcode-dataset/tree/main/pii) of the [BigCode](https://www.bigcode-project.org/) project and use the bigcode/starpii model for NER. Currently this strategy can detect IP addresses, emails, phone numbers, alphanumeric keys, names, and passwords. IP addresses, emails, phone numbers, and alphanumeric keys are detected with regular expression matching; names and passwords are detected with NER. Please refer to the starpii [model card](https://huggingface.co/bigcode/starpii) for more information on detection performance.
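
As a rough illustration of the regular-expression half of this strategy, the sketch below flags text that contains an email address, an IPv4 address, or a US-style phone number. The patterns and the helper name `regex_contains_pii` are assumptions made for this example: they are deliberately simplified, they are not the patterns used by the BigCode code, and the NER step for names and passwords is omitted.

```python
import re

# Simplified, illustrative patterns (assumptions, not the BigCode patterns).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def regex_contains_pii(text: str) -> bool:
    """Return True if any of the illustrative patterns matches the text."""
    return any(p.search(text) for p in (EMAIL_RE, IPV4_RE, PHONE_RE))


if __name__ == "__main__":
    for sample in ["Mary Smith, 123-456-7890,", "John is a good team leader"]:
        print(sample, "->", regex_contains_pii(sample))
```

A pattern-only check like this cannot recognize names or passwords, which is why the strategy pairs it with an NER model.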

## ML strategy

We trained a classifier model on the [Patronus EnterprisePII dataset](https://www.patronus.ai/announcements/patronus-ai-launches-enterprisepii-the-industrys-first-llm-dataset-for-detecting-business-sensitive-information) for demo purposes only. Please note that the demo model has not been extensively tested and is not intended for use in production environments. Please read the [full disclaimers in the model card](https://huggingface.co/Intel/business_safety_logistic_regression_classifier).

The classifier model is used together with an embedding model to make predictions. The embedding model used for the demo is the `nomic-ai/nomic-embed-text-v1` [model](https://blog.nomic.ai/posts/nomic-embed-text-v1), available on the Hugging Face Hub. We picked this open-source embedding model for the demo because it is one of the top-performing long-context encoder models (max sequence length = 8192 vs. 512 for other BERT-based encoders): it does well on the [Huggingface MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) as well as on the long-context [LoCo benchmark](https://hazyresearch.stanford.edu/blog/2024-01-11-m2-bert-retrieval). The long-context capability is useful when the text is long (>512 tokens).
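
The two-stage flow described above (embed the text into a fixed-length vector, then let a logistic regression classifier decide) can be sketched without any model downloads. Everything in this block is a hypothetical stand-in: `toy_embed` counts a few cue words instead of calling the real embedding model, and `WEIGHTS` and `BIAS` are hand-picked rather than taken from the trained classifier. It only shows the shape of the computation, not the demo model's behavior.

```python
import math
from typing import List, Sequence

# Hypothetical stand-in for the embedding model: one feature per cue word.
CUES = ("revenue", "salary", "review", "confidential")


def toy_embed(text: str) -> List[float]:
    """Map text to a fixed-length vector (here: cue-word counts)."""
    lowered = text.lower()
    return [float(lowered.count(cue)) for cue in CUES]


# Hand-picked parameters for illustration only (not the trained classifier).
WEIGHTS = (2.0, 2.0, 1.5, 2.5)
BIAS = -1.0


def predict_sensitive(embedding: Sequence[float]) -> bool:
    """Logistic regression decision: sigmoid(w . x + b) >= 0.5."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, embedding))
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability >= 0.5


def detect(text: str) -> bool:
    """Embed first, then classify, mirroring the two-stage flow."""
    return predict_sensitive(toy_embed(text))


if __name__ == "__main__":
    print(detect("Q1 revenue was $1.23 billion, up 12% year over year."))
    print(detect("We are excited to announce the opening of our new office in Miami!"))
```

In the actual service the embedding comes from a sentence-embedding model and the weights from a fitted classifier, so the vector has many more dimensions, but the decision step has the same form.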

Currently this strategy can detect both personal sensitive and business sensitive information such as financial figures and performance reviews. Please refer to the [model card](https://huggingface.co/Intel/business_safety_logistic_regression_classifier) to see the performance of our demo model on the Patronus EnterprisePII dataset.

# Input and output

Users can send a list of files, a list of text strings, or a list of urls to the microservice, and the microservice will return a list of True or False for each piece of text following the original sequence.

For a concrete example of what input should look like, please refer to [Consume Microservice](#4-consume-microservice) section below.

The output will be a list of booleans, which can be parsed and used as conditions in a bigger application.
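
One way a larger application might use the returned flags is to keep only the texts that were not flagged. The sketch below assumes the flags arrive as a JSON array of booleans in the same order as the inputs; the exact response envelope of the service is not shown here, and the helper name `filter_safe_texts` is hypothetical.

```python
import json
from typing import List


def filter_safe_texts(texts: List[str], flags_json: str) -> List[str]:
    """Keep the texts whose flag is False (no PII detected).

    flags_json is assumed to be a JSON array of booleans, one per input text,
    in the original order.
    """
    flags = json.loads(flags_json)
    if len(flags) != len(texts):
        raise ValueError("expected one flag per input text")
    return [text for text, flagged in zip(texts, flags) if not flagged]


if __name__ == "__main__":
    texts = ["Mary Smith, 123-456-7890,", "John is a good team leader"]
    print(filter_safe_texts(texts, "[true, false]"))
```

Checking the lengths before zipping guards against silently dropping texts if the service returns fewer flags than inputs.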

# 🚀1. Start Microservice with Python(Option 1)

@@ -62,11 +87,18 @@ import requests
 import json

 proxies = {"http": ""}
-url = "http://localhost:6357/v1/dataprep"
-urls = [
-    "https://towardsdatascience.com/no-gpu-no-party-fine-tune-bert-for-sentiment-analysis-with-vertex-ai-custom-jobs-d8fc410e908b?source=rss----7f60cf5620c9---4"
+url = "http://localhost:6357/v1/piidetect"
+
+strategy = "ml"  # options: "ner", "ml"
+content = [
+    "Q1 revenue was $1.23 billion, up 12% year over year. ",
+    "We are excited to announce the opening of our new office in Miami! ",
+    "Mary Smith, 123-456-7890,",
+    "John is a good team leader",
+    "meeting minutes: sync up with sales team on the new product launch",
 ]
-payload = {"link_list": json.dumps(urls)}
+
+payload = {"text_list": json.dumps(content), "strategy": strategy}

 try:
     resp = requests.post(url=url, data=payload, proxies=proxies)
31 changes: 22 additions & 9 deletions comps/guardrails/pii_detection/pii/pii_utils.py
@@ -22,14 +22,6 @@ def detect_pii(self, data):
         return random.choice([True, False])


-class PIIDetectorWithLLM(PIIDetector):
-    def __init__(self):
-        super().__init__()
-
-    def detect_pii(self, text):
-        return True
-
-
 class PIIDetectorWithNER(PIIDetector):
     def __init__(self, model_path=None):
         super().__init__()
@@ -42,11 +34,13 @@ def __init__(self, model_path=None):
             self.pipeline = pipeline(
                 model=_model_key, task="token-classification", tokenizer=tokenizer, grouped_entities=True
             )
+            print("NER detector instantiated successfully!")
         except Exception as e:
             print("Failed to load model, skip NER classification", e)
             self.pipeline = None

     def detect_pii(self, text):
+        print("Scanning text with NER detector...")
         result = []
         # use a regex to detect ip addresses

@@ -71,7 +65,26 @@ def detect_pii(self, text):

 class PIIDetectorWithML(PIIDetector):
     def __init__(self):
+        import joblib
+        from huggingface_hub import hf_hub_download
+        from sentence_transformers import SentenceTransformer
+
         super().__init__()
+        print("Loading embedding model...")
+        embed_model_id = "nomic-ai/nomic-embed-text-v1"
+        self.model = SentenceTransformer(model_name_or_path=embed_model_id, trust_remote_code=True)
+
+        print("Loading classifier model...")
+        REPO_ID = "Intel/business_safety_logistic_regression_classifier"
+        FILENAME = "lr_clf.joblib"
+
+        self.clf = joblib.load(hf_hub_download(repo_id=REPO_ID, filename=FILENAME))
+
+        print("ML detector instantiated successfully!")

     def detect_pii(self, text):
-        return True
+        # text is a string
+        print("Scanning text with ML detector...")
+        embeddings = self.model.encode(text, convert_to_tensor=True).reshape(1, -1).cpu()
+        predictions = self.clf.predict(embeddings)
+        return True if predictions[0] == 1 else False
39 changes: 21 additions & 18 deletions comps/guardrails/pii_detection/pii_detection.py
@@ -20,12 +20,7 @@

 from comps import DocPath, opea_microservices, register_microservice
 from comps.guardrails.pii_detection.data_utils import document_loader, parse_html
-from comps.guardrails.pii_detection.pii.pii_utils import (
-    PIIDetector,
-    PIIDetectorWithLLM,
-    PIIDetectorWithML,
-    PIIDetectorWithNER,
-)
+from comps.guardrails.pii_detection.pii.pii_utils import PIIDetector, PIIDetectorWithML, PIIDetectorWithNER
 from comps.guardrails.pii_detection.ray_utils import ray_execute, ray_runner_initialization, rayds_initialization
 from comps.guardrails.pii_detection.utils import (
     Timer,
@@ -38,14 +33,13 @@

 def get_pii_detection_inst(strategy="dummy", settings=None):
     if strategy == "ner":
+        print("invoking NER detector.......")
         return PIIDetectorWithNER()
     elif strategy == "ml":
+        print("invoking ML detector.......")
         return PIIDetectorWithML()
-    elif strategy == "llm":
-        return PIIDetectorWithLLM()
     else:
-        # Default strategy - dummy
-        return PIIDetector()
+        raise ValueError(f"Invalid strategy: {strategy}")


 def file_based_pii_detect(file_list: List[DocPath], strategy, enable_ray=False, debug=False):
@@ -67,7 +61,7 @@ def file_based_pii_detect(file_list: List[DocPath], strategy, enable_ray=False,
         for file in tqdm(file_list, total=len(file_list)):
             with Timer(f"read document {file}."):
                 data = document_loader(file)
-            with Timer(f"detect pii on document {file} to Redis."):
+            with Timer(f"detect pii on document {file}"):
                 ret.append(pii_detector.detect_pii(data))
     return ret

@@ -95,7 +89,7 @@ def _parse_html(link):
             data = _parse_html(link)
             if debug:
                 print("content is: ", data)
-            with Timer(f"detect pii on document {link} to Redis."):
+            with Timer(f"detect pii on document {link}"):
                 ret.append(pii_detector.detect_pii(data))
     return ret

@@ -117,19 +111,28 @@ def text_based_pii_detect(text_list: List[str], strategy, enable_ray=False, debu
         for data in tqdm(text_list, total=len(text_list)):
             if debug:
                 print("content is: ", data)
-            with Timer(f"detect pii on document {data[:50]} to Redis."):
+            with Timer(f"detect pii on document {data[:50]}"):
                 ret.append(pii_detector.detect_pii(data))
     return ret


 @register_microservice(
     name="opea_service@guardrails-pii-detection", endpoint="/v1/piidetect", host="0.0.0.0", port=6357
 )
-async def pii_detection(files: List[UploadFile] = File(None), link_list: str = Form(None), text_list: str = Form(None)):
+async def pii_detection(
+    files: List[UploadFile] = File(None),
+    link_list: str = Form(None),
+    text_list: str = Form(None),
+    strategy: str = Form(None),
+):
     if not files and not link_list and not text_list:
         raise HTTPException(status_code=400, detail="Either files, link_list, or text_list must be provided.")

-    strategy = "ner"  # Default strategy
+    if strategy is None:
+        strategy = "ner"
+
+    print("PII detection using strategy: ", strategy)
+
     pip_requirement = ["detect-secrets", "phonenumbers", "gibberish-detector"]

     if files:
@@ -147,7 +150,7 @@ async def pii_detection(files: List[UploadFile] = File(None), link_list: str = F
                 await save_file_to_local_disk(save_path, file)
                 saved_path_list.append(DocPath(path=save_path))

-            enable_ray = False if len(saved_path_list) <= 10 else True
+            enable_ray = False if (len(saved_path_list) <= 10 or strategy == "ml") else True
             if enable_ray:
                 prepare_env(enable_ray=enable_ray, pip_requirements=pip_requirement, comps_path=comps_path)
             ret = file_based_pii_detect(saved_path_list, strategy, enable_ray=enable_ray)
@@ -160,7 +163,7 @@ async def pii_detection(files: List[UploadFile] = File(None), link_list: str = F
             text_list = json.loads(text_list)  # Parse JSON string to list
             if not isinstance(text_list, list):
                 text_list = [text_list]
-            enable_ray = False if len(text_list) <= 10 else True
+            enable_ray = False if (len(text_list) <= 10 or strategy == "ml") else True
             if enable_ray:
                 prepare_env(enable_ray=enable_ray, pip_requirements=pip_requirement, comps_path=comps_path)
             ret = text_based_pii_detect(text_list, strategy, enable_ray=enable_ray)
@@ -175,7 +178,7 @@ async def pii_detection(files: List[UploadFile] = File(None), link_list: str = F
             link_list = json.loads(link_list)  # Parse JSON string to list
             if not isinstance(link_list, list):
                 link_list = [link_list]
-            enable_ray = False if len(link_list) <= 10 else True
+            enable_ray = False if (len(link_list) <= 10 or strategy == "ml") else True
             if enable_ray:
                 prepare_env(enable_ray=enable_ray, pip_requirements=pip_requirement, comps_path=comps_path)
             ret = link_based_pii_detect(link_list, strategy, enable_ray=enable_ray)
2 changes: 2 additions & 0 deletions comps/guardrails/pii_detection/requirements.txt
@@ -2,6 +2,7 @@ beautifulsoup4
 detect_secrets
 docarray[full]
 easyocr
+einops
 fastapi
 gibberish-detector
 huggingface_hub
@@ -21,6 +22,7 @@ pymupdf
 python-docx
 ray
 redis
+scikit-learn
 sentence_transformers
 shortuuid
 virtualenv
47 changes: 18 additions & 29 deletions comps/guardrails/pii_detection/test.py
@@ -9,14 +9,13 @@
 from utils import Timer


-def test_html(ip_addr="localhost", batch_size=20):
+def test_html(ip_addr="localhost", batch_size=20, strategy=None):
     import pandas as pd

     proxies = {"http": ""}
     url = f"http://{ip_addr}:6357/v1/piidetect"
-    urls = pd.read_csv("data/ai_rss.csv")["Permalink"]
-    urls = urls[:batch_size].to_list()
-    payload = {"link_list": json.dumps(urls)}
+    urls = ["https://opea.dev/"] * batch_size
+    payload = {"link_list": json.dumps(urls), "strategy": strategy}

     with Timer(f"send {len(urls)} link to pii detection endpoint"):
         try:
@@ -28,33 +27,19 @@ def test_html(ip_addr="localhost", batch_size=20):
             print("An error occurred:", e)


-def test_text(ip_addr="localhost", batch_size=20):
+def test_text(ip_addr="localhost", batch_size=20, strategy=None):
     proxies = {"http": ""}
     url = f"http://{ip_addr}:6357/v1/piidetect"
-    if os.path.exists("data/ai_rss.csv"):
-        import pandas as pd
-
-        content = pd.read_csv("data/ai_rss.csv")["Description"]
-        content = content[:batch_size].to_list()
-    else:
-        content = (
-            [
-                """With new architectures, there comes a bit of a dilemma. After having spent billions of dollars training models with older architectures, companies rightfully wonder if it is worth spending billions more on a newer architecture that may itself be outmoded&nbsp;soon.
-One possible solution to this dilemma is transfer learning. The idea here is to put noise into the trained model and then use the output given to then backpropagate on the new model. The idea here is that you don’t need to worry about generating huge amounts of novel data and potentially the number of epochs you have to train for is also significantly reduced. This idea has not been perfected yet, so it remains to be seen the role it will play in the&nbsp;future.
-Nevertheless, as businesses become more invested in these architectures the potential for newer architectures that improve cost will only increase. Time will tell how quickly the industry moves to adopt&nbsp;them.
-For those who are building apps that allow for a seamless transition between models, you can look at the major strives made in throughput and latency by YOCO and have hope that the major bottlenecks your app is having may soon be resolved.
-It’s an exciting time to be building.
-With special thanks to Christopher Taylor for his feedback on this blog&nbsp;post.
-[1] Sun, Y., et al. “You Only Cache Once: Decoder-Decoder Architectures for Language Models” (2024),&nbsp;arXiv
-[2] Sun, Y., et al. “Retentive Network: A Successor to Transformer for Large Language Models” (2023),&nbsp;arXiv
-[3] Wikimedia Foundation, et al. “Hadamard product (matrices)” (2024), Wikipedia
-[4] Sanderson, G. et al., “Attention in transformers, visually explained | Chapter 6, Deep Learning” (2024),&nbsp;YouTube
-[5] A. Vaswani, et al., “Attention Is All You Need” (2017),&nbsp;arXiv
-Understanding You Only Cache Once was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story."""
-            ]
-            * batch_size
-        )
-    payload = {"text_list": json.dumps(content)}
+    content = [
+        "Q1 revenue was $1.23 billion, up 12% year over year. ",
+        "We are excited to announce the opening of our new office in Miami! ",
+        "Mary Smith, 123-456-7890,",
+        "John is a good team leader",
+        "meeting minutes: sync up with sales team on the new product launch",
+    ]
+
+    payload = {"text_list": json.dumps(content), "strategy": strategy}

     with Timer(f"send {len(content)} text to pii detection endpoint"):
         try:
@@ -90,13 +75,17 @@ def test_pdf(ip_addr="localhost", batch_size=20):
     parser.add_argument("--test_text", action="store_true", help="Test Text pii detection")
     parser.add_argument("--batch_size", type=int, default=20, help="Batch size for testing")
     parser.add_argument("--ip_addr", type=str, default="localhost", help="IP address of the server")
+    parser.add_argument("--strategy", type=str, default="ml", help="Strategy for pii detection")
+
     args = parser.parse_args()
+
+    print(args)

     if args.test_html:
         test_html(ip_addr=args.ip_addr, batch_size=args.batch_size)
     elif args.test_pdf:
         test_pdf(ip_addr=args.ip_addr, batch_size=args.batch_size)
     elif args.test_text:
-        test_text(ip_addr=args.ip_addr, batch_size=args.batch_size)
+        test_text(ip_addr=args.ip_addr, batch_size=args.batch_size, strategy=args.strategy)
     else:
         print("Please specify the test type")
13 changes: 9 additions & 4 deletions tests/test_guardrails_pii_detection.sh
@@ -25,11 +25,16 @@ function validate_microservice() {
     echo "Validate microservice started"
     export PATH="${HOME}/miniforge3/bin:$PATH"
     source activate
-    echo "test 1 - single task"
-    python comps/guardrails/pii_detection/test.py --test_text --batch_size 1 --ip_addr $ip_address
-    echo "test 2 - 20 tasks in parallel"
-    python comps/guardrails/pii_detection/test.py --test_text --batch_size 20 --ip_addr $ip_address
+    echo "test 1 - single task - ner"
+    python comps/guardrails/pii_detection/test.py --test_text --batch_size 1 --ip_addr $ip_address --strategy ner
+    echo "test 2 - 20 tasks in parallel - ner"
+    python comps/guardrails/pii_detection/test.py --test_text --batch_size 20 --ip_addr $ip_address --strategy ner
+    echo "test 3 - single task - ml"
+    python comps/guardrails/pii_detection/test.py --test_text --batch_size 1 --ip_addr $ip_address --strategy ml
+    echo "test 4 - 20 tasks in parallel - ml"
+    python comps/guardrails/pii_detection/test.py --test_text --batch_size 20 --ip_addr $ip_address --strategy ml
     echo "Validate microservice completed"
+    docker logs test-guardrails-pii-detection-endpoint
 }

 function stop_docker() {