Add ML detection strategy to PII detection guardrail #292

Merged: 15 commits, Aug 1, 2024
42 changes: 37 additions & 5 deletions comps/guardrails/pii_detection/README.md
@@ -1,6 +1,31 @@
# PII Detection Microservice

PII Detection is a method to detect Personally Identifiable Information in text. This microservice provides users a unified API to either upload files or send a list of text, and returns a list of labels, in the original order, marking whether each item contains PII.
This microservice provides a unified API to detect whether text contains Personally Identifiable Information or Business Sensitive Information.

We provide 2 detection strategies:

1. Regular expression matching + named entity recognition (NER) - pass "ner" as strategy in your request to the microservice.
2. Logistic regression classifier - pass "ml" as strategy in your request to the microservice. **Note**: Currently this strategy is for demo only, and only supports using `nomic-ai/nomic-embed-text-v1` as the embedding model and the `Intel/business_safety_logistic_regression_classifier` model as the classifier. Please read the [full disclaimers in the model card](https://huggingface.co/Intel/business_safety_logistic_regression_classifier) before using this strategy.

## NER strategy

We adopted the [PII detection code](https://github.com/bigcode-project/bigcode-dataset/tree/main/pii) of the [BigCode](https://www.bigcode-project.org/) project and use the `bigcode/starpii` model for NER. Currently this strategy can detect IP addresses, email addresses, phone numbers, alphanumeric keys, names, and passwords. IP addresses, email addresses, phone numbers, and alphanumeric keys are detected with regular expression matching; names and passwords are detected with NER. Please refer to the starpii [model card](https://huggingface.co/bigcode/starpii) for more information on detection performance.
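
The regular-expression half of this strategy can be illustrated with a minimal sketch. The patterns and the `find_regex_pii` and `contains_pii` helpers below are illustrative assumptions, not the BigCode patterns used by the microservice, and they are far less thorough (for example, no IPv6 and no international phone formats).

```python
import re

# Illustrative patterns only; NOT the ones shipped in the BigCode PII code.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def find_regex_pii(text):
    """Return a list of (label, matched_text) pairs found in `text`."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


def contains_pii(text):
    """Hypothetical helper: True if any of the patterns matches."""
    return len(find_regex_pii(text)) > 0
```

Names and passwords have no fixed surface form, which is why the strategy hands those to the NER model rather than to patterns like these.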

## ML strategy

We have trained a classifier model using the [Patronus EnterprisePII dataset](https://www.patronus.ai/announcements/patronus-ai-launches-enterprisepii-the-industrys-first-llm-dataset-for-detecting-business-sensitive-information) for demo purposes only. Please note that the demo model has not been extensively tested and is not intended for use in production environments. Please read the [full disclaimers in the model card](https://huggingface.co/Intel/business_safety_logistic_regression_classifier).

The classifier model is used together with an embedding model to make predictions. The embedding model used for the demo is the `nomic-ai/nomic-embed-text-v1` [model](https://blog.nomic.ai/posts/nomic-embed-text-v1), available on the Hugging Face Hub. We picked this open-source embedding model because it is one of the top-performing long-context encoder models (max sequence length of 8192, versus 512 for other BERT-based encoders): it does well on the [Hugging Face MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) as well as on the long-context [LoCo benchmark](https://hazyresearch.stanford.edu/blog/2024-01-11-m2-bert-retrieval). The long-context capability is useful when the text is long (>512 tokens).

Currently this strategy can detect both personally sensitive and business-sensitive information, such as financial figures and performance reviews. Please refer to the [model card](https://huggingface.co/Intel/business_safety_logistic_regression_classifier) to see the performance of our demo model on the Patronus EnterprisePII dataset.
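
To illustrate how an embedding and a logistic regression classifier combine into a yes/no decision, here is a minimal sketch in pure Python. The 3-dimensional "embedding", the weights, and the bias are made-up numbers chosen for illustration only; the demo model instead applies weights learned from the EnterprisePII dataset to `nomic-embed-text-v1` embeddings.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_sensitive(embedding, weights, bias, threshold=0.5):
    """Logistic regression decision: sigmoid(w . x + b) >= threshold."""
    score = sum(w * x for w, x in zip(weights, embedding)) + bias
    return sigmoid(score) >= threshold


# Made-up values for illustration only; not the demo model's parameters.
toy_weights = [2.0, -1.0, 0.5]
toy_bias = -0.25

flagged = predict_sensitive([1.0, 0.2, 0.1], toy_weights, toy_bias)
not_flagged = predict_sensitive([0.0, 1.0, 0.0], toy_weights, toy_bias)
```

The classifier itself is only a weighted sum and a threshold, so detection quality depends heavily on how well the embedding model separates sensitive from non-sensitive text.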

# Input and output

Users can send a list of files, a list of text strings, or a list of URLs to the microservice, and the microservice will return a list of True or False values, one for each piece of text, in the original order.

For a concrete example of what the input should look like, please refer to the [Consume Microservice](#4-consume-microservice) section below.

The output will be a list of booleans, which can be parsed and used as conditions in a bigger application.
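
As a sketch of how the boolean list might be consumed downstream, the snippet below pairs each input with its flag and separates flagged texts from the rest. The response body shown is a hypothetical example rather than captured service output, and `split_by_flag` is an illustrative helper name.

```python
import json


def split_by_flag(texts, flags):
    """Pair each text with its flag (same order) and split into two lists."""
    if len(texts) != len(flags):
        raise ValueError("texts and flags must have the same length")
    safe = [t for t, f in zip(texts, flags) if not f]
    flagged = [t for t, f in zip(texts, flags) if f]
    return safe, flagged


texts = ["Mary Smith, 123-456-7890,", "John is a good team leader"]
# Hypothetical response body; the real service output may differ.
response_body = "[true, false]"
flags = json.loads(response_body)

safe, flagged = split_by_flag(texts, flags)
```

Because the flags follow the original sequence, a simple `zip` is enough to match each result to its input.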

# 🚀1. Start Microservice with Python(Option 1)

@@ -62,11 +87,18 @@ import requests
import json

proxies = {"http": ""}
url = "http://localhost:6357/v1/dataprep"
urls = [
"https://towardsdatascience.com/no-gpu-no-party-fine-tune-bert-for-sentiment-analysis-with-vertex-ai-custom-jobs-d8fc410e908b?source=rss----7f60cf5620c9---4"
url = "http://localhost:6357/v1/piidetect"

strategy = "ml" # options: "ner", "ml"
content = [
"Q1 revenue was $1.23 billion, up 12% year over year. ",
"We are excited to announce the opening of our new office in Miami! ",
"Mary Smith, 123-456-7890,",
"John is a good team leader",
"meeting minutes: sync up with sales team on the new product launch",
]
payload = {"link_list": json.dumps(urls)}

payload = {"text_list": json.dumps(content), "strategy": strategy}

try:
resp = requests.post(url=url, data=payload, proxies=proxies)
31 changes: 22 additions & 9 deletions comps/guardrails/pii_detection/pii/pii_utils.py
@@ -22,14 +22,6 @@ def detect_pii(self, data):
return random.choice([True, False])


class PIIDetectorWithLLM(PIIDetector):
def __init__(self):
super().__init__()

def detect_pii(self, text):
return True


class PIIDetectorWithNER(PIIDetector):
def __init__(self, model_path=None):
super().__init__()
@@ -42,11 +34,13 @@ def __init__(self, model_path=None):
self.pipeline = pipeline(
model=_model_key, task="token-classification", tokenizer=tokenizer, grouped_entities=True
)
print("NER detector instantiated successfully!")
except Exception as e:
print("Failed to load model, skip NER classification", e)
self.pipeline = None

def detect_pii(self, text):
print("Scanning text with NER detector...")
result = []
# use a regex to detect ip addresses

@@ -71,7 +65,26 @@ def detect_pii(self, text):

class PIIDetectorWithML(PIIDetector):
def __init__(self):
import joblib
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer

super().__init__()
print("Loading embedding model...")
embed_model_id = "nomic-ai/nomic-embed-text-v1"
self.model = SentenceTransformer(model_name_or_path=embed_model_id, trust_remote_code=True)

print("Loading classifier model...")
REPO_ID = "Intel/business_safety_logistic_regression_classifier"
FILENAME = "lr_clf.joblib"

self.clf = joblib.load(hf_hub_download(repo_id=REPO_ID, filename=FILENAME))

print("ML detector instantiated successfully!")

def detect_pii(self, text):
return True
# text is a string
print("Scanning text with ML detector...")
embeddings = self.model.encode(text, convert_to_tensor=True).reshape(1, -1).cpu()
predictions = self.clf.predict(embeddings)
return True if predictions[0] == 1 else False
39 changes: 21 additions & 18 deletions comps/guardrails/pii_detection/pii_detection.py
@@ -20,12 +20,7 @@

from comps import DocPath, opea_microservices, register_microservice
from comps.guardrails.pii_detection.data_utils import document_loader, parse_html
from comps.guardrails.pii_detection.pii.pii_utils import (
PIIDetector,
PIIDetectorWithLLM,
PIIDetectorWithML,
PIIDetectorWithNER,
)
from comps.guardrails.pii_detection.pii.pii_utils import PIIDetector, PIIDetectorWithML, PIIDetectorWithNER
from comps.guardrails.pii_detection.ray_utils import ray_execute, ray_runner_initialization, rayds_initialization
from comps.guardrails.pii_detection.utils import (
Timer,
@@ -38,14 +33,13 @@

def get_pii_detection_inst(strategy="dummy", settings=None):
if strategy == "ner":
print("invoking NER detector.......")
return PIIDetectorWithNER()
elif strategy == "ml":
print("invoking ML detector.......")
return PIIDetectorWithML()
elif strategy == "llm":
return PIIDetectorWithLLM()
else:
# Default strategy - dummy
return PIIDetector()
raise ValueError(f"Invalid strategy: {strategy}")
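
One alternative way to express the same dispatch is a lookup table from strategy name to detector class, which keeps the error path in one place. The sketch below uses stand-in classes (`FakeNERDetector`, `FakeMLDetector`) so that it runs without downloading any model; these names and the `make_detector` helper are hypothetical and not part of this PR.

```python
class FakeNERDetector:
    """Stand-in for PIIDetectorWithNER; loads no model."""


class FakeMLDetector:
    """Stand-in for PIIDetectorWithML; loads no model."""


# Map each supported strategy name to the class that implements it.
DETECTORS = {"ner": FakeNERDetector, "ml": FakeMLDetector}


def make_detector(strategy):
    """Instantiate the detector for `strategy`, or raise ValueError."""
    try:
        detector_cls = DETECTORS[strategy]
    except KeyError:
        raise ValueError(f"Invalid strategy: {strategy}") from None
    return detector_cls()
```

With this shape, adding a strategy means adding one dictionary entry, and an unknown name still fails loudly with `ValueError`.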


def file_based_pii_detect(file_list: List[DocPath], strategy, enable_ray=False, debug=False):
@@ -67,7 +61,7 @@ def file_based_pii_detect(file_list: List[DocPath], strategy, enable_ray=False,
for file in tqdm(file_list, total=len(file_list)):
with Timer(f"read document {file}."):
data = document_loader(file)
with Timer(f"detect pii on document {file} to Redis."):
with Timer(f"detect pii on document {file}"):
ret.append(pii_detector.detect_pii(data))
return ret

@@ -95,7 +89,7 @@ def _parse_html(link):
data = _parse_html(link)
if debug:
print("content is: ", data)
with Timer(f"detect pii on document {link} to Redis."):
with Timer(f"detect pii on document {link}"):
ret.append(pii_detector.detect_pii(data))
return ret

@@ -117,19 +111,28 @@ def text_based_pii_detect(text_list: List[str], strategy, enable_ray=False, debu
for data in tqdm(text_list, total=len(text_list)):
if debug:
print("content is: ", data)
with Timer(f"detect pii on document {data[:50]} to Redis."):
with Timer(f"detect pii on document {data[:50]}"):
ret.append(pii_detector.detect_pii(data))
return ret


@register_microservice(
name="opea_service@guardrails-pii-detection", endpoint="/v1/piidetect", host="0.0.0.0", port=6357
)
async def pii_detection(files: List[UploadFile] = File(None), link_list: str = Form(None), text_list: str = Form(None)):
async def pii_detection(
files: List[UploadFile] = File(None),
link_list: str = Form(None),
text_list: str = Form(None),
strategy: str = Form(None),
):
if not files and not link_list and not text_list:
raise HTTPException(status_code=400, detail="Either files, link_list, or text_list must be provided.")

strategy = "ner" # Default strategy
if strategy is None:
strategy = "ner"

print("PII detection using strategy: ", strategy)

pip_requirement = ["detect-secrets", "phonenumbers", "gibberish-detector"]

if files:
@@ -147,7 +150,7 @@ async def pii_detection(files: List[UploadFile] = File(None), link_list: str = F
await save_file_to_local_disk(save_path, file)
saved_path_list.append(DocPath(path=save_path))

enable_ray = False if len(saved_path_list) <= 10 else True
# Measure this branch's own list: text_list is None when only files are uploaded.
enable_ray = False if (len(saved_path_list) <= 10 or strategy == "ml") else True
if enable_ray:
prepare_env(enable_ray=enable_ray, pip_requirements=pip_requirement, comps_path=comps_path)
ret = file_based_pii_detect(saved_path_list, strategy, enable_ray=enable_ray)
@@ -160,7 +163,7 @@ async def pii_detection(files: List[UploadFile] = File(None), link_list: str = F
text_list = json.loads(text_list) # Parse JSON string to list
if not isinstance(text_list, list):
text_list = [text_list]
enable_ray = False if len(text_list) <= 10 else True
enable_ray = False if (len(text_list) <= 10 or strategy == "ml") else True
if enable_ray:
prepare_env(enable_ray=enable_ray, pip_requirements=pip_requirement, comps_path=comps_path)
ret = text_based_pii_detect(text_list, strategy, enable_ray=enable_ray)
@@ -175,7 +178,7 @@ async def pii_detection(files: List[UploadFile] = File(None), link_list: str = F
link_list = json.loads(link_list) # Parse JSON string to list
if not isinstance(link_list, list):
link_list = [link_list]
enable_ray = False if len(link_list) <= 10 else True
# Measure this branch's own list: text_list is None when only links are sent.
enable_ray = False if (len(link_list) <= 10 or strategy == "ml") else True
if enable_ray:
prepare_env(enable_ray=enable_ray, pip_requirements=pip_requirement, comps_path=comps_path)
ret = link_based_pii_detect(link_list, strategy, enable_ray=enable_ray)
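
The Ray decision is repeated in the files, text, and link branches. A small helper is one way to keep the three branches consistent and to make sure each branch measures its own input list. `should_enable_ray` and its `threshold` parameter are hypothetical names used for illustration; the threshold of 10 and the exclusion of the "ml" strategy are taken from the diff above.

```python
def should_enable_ray(num_items, strategy, threshold=10):
    """Use Ray only for batches larger than `threshold`, and never for "ml"."""
    return num_items > threshold and strategy != "ml"
```

Each branch would then pass the length of its own list, for example `should_enable_ray(len(link_list), strategy)` in the link branch.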
2 changes: 2 additions & 0 deletions comps/guardrails/pii_detection/requirements.txt
@@ -2,6 +2,7 @@ beautifulsoup4
detect_secrets
docarray[full]
easyocr
einops
fastapi
gibberish-detector
huggingface_hub
@@ -21,6 +22,7 @@ pymupdf
python-docx
ray
redis
scikit-learn
sentence_transformers
shortuuid
virtualenv
47 changes: 18 additions & 29 deletions comps/guardrails/pii_detection/test.py
@@ -9,14 +9,13 @@
from utils import Timer


def test_html(ip_addr="localhost", batch_size=20):
def test_html(ip_addr="localhost", batch_size=20, strategy=None):
import pandas as pd

proxies = {"http": ""}
url = f"http://{ip_addr}:6357/v1/piidetect"
urls = pd.read_csv("data/ai_rss.csv")["Permalink"]
urls = urls[:batch_size].to_list()
payload = {"link_list": json.dumps(urls)}
urls = ["https://opea.dev/"] * batch_size
payload = {"link_list": json.dumps(urls), "strategy": strategy}

with Timer(f"send {len(urls)} link to pii detection endpoint"):
try:
@@ -28,33 +27,19 @@ def test_html(ip_addr="localhost", batch_size=20):
print("An error occurred:", e)


def test_text(ip_addr="localhost", batch_size=20):
def test_text(ip_addr="localhost", batch_size=20, strategy=None):
proxies = {"http": ""}
url = f"http://{ip_addr}:6357/v1/piidetect"
if os.path.exists("data/ai_rss.csv"):
import pandas as pd

content = pd.read_csv("data/ai_rss.csv")["Description"]
content = content[:batch_size].to_list()
else:
content = (
[
"""With new architectures, there comes a bit of a dilemma. After having spent billions of dollars training models with older architectures, companies rightfully wonder if it is worth spending billions more on a newer architecture that may itself be outmoded&nbsp;soon.
One possible solution to this dilemma is transfer learning. The idea here is to put noise into the trained model and then use the output given to then backpropagate on the new model. The idea here is that you don’t need to worry about generating huge amounts of novel data and potentially the number of epochs you have to train for is also significantly reduced. This idea has not been perfected yet, so it remains to be seen the role it will play in the&nbsp;future.
Nevertheless, as businesses become more invested in these architectures the potential for newer architectures that improve cost will only increase. Time will tell how quickly the industry moves to adopt&nbsp;them.
For those who are building apps that allow for a seamless transition between models, you can look at the major strives made in throughput and latency by YOCO and have hope that the major bottlenecks your app is having may soon be resolved.
It’s an exciting time to be building.
With special thanks to Christopher Taylor for his feedback on this blog&nbsp;post.
[1] Sun, Y., et al. “You Only Cache Once: Decoder-Decoder Architectures for Language Models” (2024),&nbsp;arXiv
[2] Sun, Y., et al. “Retentive Network: A Successor to Transformer for Large Language Models” (2023),&nbsp;arXiv
[3] Wikimedia Foundation, et al. “Hadamard product (matrices)” (2024), Wikipedia
[4] Sanderson, G. et al., “Attention in transformers, visually explained | Chapter 6, Deep Learning” (2024),&nbsp;YouTube
[5] A. Vaswani, et al., “Attention Is All You Need” (2017),&nbsp;arXiv
Understanding You Only Cache Once was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story."""
]
* batch_size
)
payload = {"text_list": json.dumps(content)}
content = [
"Q1 revenue was $1.23 billion, up 12% year over year. ",
"We are excited to announce the opening of our new office in Miami! ",
"Mary Smith, 123-456-7890,",
"John is a good team leader",
"meeting minutes: sync up with sales team on the new product launch",
]

payload = {"text_list": json.dumps(content), "strategy": strategy}

with Timer(f"send {len(content)} text to pii detection endpoint"):
try:
@@ -90,13 +75,17 @@ def test_pdf(ip_addr="localhost", batch_size=20):
parser.add_argument("--test_text", action="store_true", help="Test Text pii detection")
parser.add_argument("--batch_size", type=int, default=20, help="Batch size for testing")
parser.add_argument("--ip_addr", type=str, default="localhost", help="IP address of the server")
parser.add_argument("--strategy", type=str, default="ml", help="Strategy for pii detection")

args = parser.parse_args()

print(args)

if args.test_html:
test_html(ip_addr=args.ip_addr, batch_size=args.batch_size)
elif args.test_pdf:
test_pdf(ip_addr=args.ip_addr, batch_size=args.batch_size)
elif args.test_text:
test_text(ip_addr=args.ip_addr, batch_size=args.batch_size)
test_text(ip_addr=args.ip_addr, batch_size=args.batch_size, strategy=args.strategy)
else:
print("Please specify the test type")
13 changes: 9 additions & 4 deletions tests/test_guardrails_pii_detection.sh
@@ -25,11 +25,16 @@ function validate_microservice() {
echo "Validate microservice started"
export PATH="${HOME}/miniforge3/bin:$PATH"
source activate
echo "test 1 - single task"
python comps/guardrails/pii_detection/test.py --test_text --batch_size 1 --ip_addr $ip_address
echo "test 2 - 20 tasks in parallel"
python comps/guardrails/pii_detection/test.py --test_text --batch_size 20 --ip_addr $ip_address
echo "test 1 - single task - ner"
python comps/guardrails/pii_detection/test.py --test_text --batch_size 1 --ip_addr $ip_address --strategy ner
echo "test 2 - 20 tasks in parallel - ner"
python comps/guardrails/pii_detection/test.py --test_text --batch_size 20 --ip_addr $ip_address --strategy ner
echo "test 3 - single task - ml"
python comps/guardrails/pii_detection/test.py --test_text --batch_size 1 --ip_addr $ip_address --strategy ml
echo "test 4 - 20 tasks in parallel - ml"
python comps/guardrails/pii_detection/test.py --test_text --batch_size 20 --ip_addr $ip_address --strategy ml
echo "Validate microservice completed"
docker logs test-guardrails-pii-detection-endpoint
}

function stop_docker() {