Add ML detection strategy to PII detection guardrail (#292)

* first code of business safety classifier
* allow strategy options and update readme
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* disable ray for ml strategy and update test script
* add log to pii test
* update logging in test guardrail
* rm llm strategy and change url in test
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* delete file check in test and update readme
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)

---------

Signed-off-by: minmin-intel <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: chen, suyue <[email protected]>
opea-project · Aug 1, 2024 · de27e6b
1 parent ee5b0f6
commit de27e6b

Showing 6 changed files with 109 additions and 65 deletions.
diff --git a/comps/guardrails/pii_detection/README.md b/comps/guardrails/pii_detection/README.md
@@ -1,6 +1,31 @@
 # PII Detection Microservice
 
-PII Detection a method to detect Personal Identifiable Information in text. This microservice provides users a unified API to either upload your files or send a list of text, and return with a list following original sequence of labels marking if it contains PII or not.
+This microservice provides a unified API to detect whether text contains Personally Identifiable Information (PII) or business-sensitive information.
+
+We provide 2 detection strategies:
+
+1. Regular expression matching + named entity recognition (NER) - pass "ner" as strategy in your request to the microservice.
+2. Logistic regression classifier - pass "ml" as strategy in your request to the microservice. **Note**: Currently this strategy is for demo only, and only supports using `nomic-ai/nomic-embed-text-v1` as the embedding model and the `Intel/business_safety_logistic_regression_classifier` model as the classifier. Please read the [full disclaimers in the model card](https://huggingface.co/Intel/business_safety_logistic_regression_classifier) before using this strategy.
+
+## NER strategy
+
+We adopted the [pii detection code](https://github.com/bigcode-project/bigcode-dataset/tree/main/pii) of the [BigCode](https://www.bigcode-project.org/) project and use the `bigcode/starpii` model for NER. Currently this strategy can detect IP addresses, email addresses, phone numbers, alphanumeric keys, names, and passwords. IP addresses, email addresses, phone numbers, and alphanumeric keys are detected with regular expression matching. Names and passwords are detected with NER. Please refer to the starpii [model card](https://huggingface.co/bigcode/starpii) for more information on detection performance.
+
+## ML strategy
+
+We trained a classifier model on the [Patronus EnterprisePII dataset](https://www.patronus.ai/announcements/patronus-ai-launches-enterprisepii-the-industrys-first-llm-dataset-for-detecting-business-sensitive-information) for demo purposes only. The demo model has not been extensively tested and is not intended for use in production environments. Please read the [full disclaimers in the model card](https://huggingface.co/Intel/business_safety_logistic_regression_classifier).
+
+The classifier model is used together with an embedding model to make predictions. The embedding model used for the demo is the `nomic-ai/nomic-embed-text-v1` [model](https://blog.nomic.ai/posts/nomic-embed-text-v1), available on the Hugging Face Hub. We picked this open-source embedding model because it is one of the top-performing long-context encoder models (max sequence length = 8192 vs. 512 for other BERT-based encoders), doing well on both the [Hugging Face MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) and the long-context [LoCo benchmark](https://hazyresearch.stanford.edu/blog/2024-01-11-m2-bert-retrieval). The long-context capability is useful when the text is long (>512 tokens).
+
+Currently this strategy can detect both personal sensitive and business sensitive information, such as financial figures and performance reviews. Please refer to the [model card](https://huggingface.co/Intel/business_safety_logistic_regression_classifier) for the performance of our demo model on the Patronus EnterprisePII dataset.
+
+# Input and output
+
+Users can send a list of files, a list of text strings, or a list of URLs to the microservice. The microservice returns a list of True or False values, one per piece of text, in the original order.
+
+For a concrete example of what the input should look like, please refer to the [Consume Microservice](#4-consume-microservice) section below.
+
+The output will be a list of booleans, which can be parsed and used as conditions in a bigger application.
 
 # 🚀1. Start Microservice with Python（Option 1）
 
@@ -62,11 +87,18 @@ import requests
 import json
 
 proxies = {"http": ""}
-url = "http://localhost:6357/v1/dataprep"
-urls = [
-    "https://towardsdatascience.com/no-gpu-no-party-fine-tune-bert-for-sentiment-analysis-with-vertex-ai-custom-jobs-d8fc410e908b?source=rss----7f60cf5620c9---4"
+url = "http://localhost:6357/v1/piidetect"
+
+strategy = "ml"  # options: "ner", "ml"
+content = [
+    "Q1 revenue was $1.23 billion, up 12% year over year. ",
+    "We are excited to announce the opening of our new office in Miami! ",
+    "Mary Smith, 123-456-7890,",
+    "John is a good team leader",
+    "meeting minutes: sync up with sales team on the new product launch",
 ]
-payload = {"link_list": json.dumps(urls)}
+
+payload = {"text_list": json.dumps(content), "strategy": strategy}
 
 try:
     resp = requests.post(url=url, data=payload, proxies=proxies)
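
A minimal sketch of how the returned booleans might be used as conditions in a larger application. It assumes the response body is a JSON array with one boolean per input text (the README describes the output as a list of True or False values, but the exact wire format is not shown in this diff). The helper name `split_by_pii_flags` is hypothetical.

```python
import json
from typing import List, Tuple


def split_by_pii_flags(texts: List[str], flags: List[bool]) -> Tuple[List[str], List[str]]:
    """Partition texts into (clean, flagged) using one boolean per text."""
    if len(texts) != len(flags):
        raise ValueError("expected exactly one flag per input text")
    clean = [t for t, f in zip(texts, flags) if not f]
    flagged = [t for t, f in zip(texts, flags) if f]
    return clean, flagged


# Assumption: the service answers with a JSON array such as this one.
example_body = "[true, false, true]"
flags = json.loads(example_body)

clean, flagged = split_by_pii_flags(["a", "b", "c"], flags)
print(clean)    # ['b']
print(flagged)  # ['a', 'c']
```

With a real response, `example_body` would be replaced by `resp.text`, provided the body really has that shape.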

diff --git a/comps/guardrails/pii_detection/pii/pii_utils.py b/comps/guardrails/pii_detection/pii/pii_utils.py
@@ -22,14 +22,6 @@ def detect_pii(self, data):
         return random.choice([True, False])
 
 
-class PIIDetectorWithLLM(PIIDetector):
-    def __init__(self):
-        super().__init__()
-
-    def detect_pii(self, text):
-        return True
-
-
 class PIIDetectorWithNER(PIIDetector):
     def __init__(self, model_path=None):
         super().__init__()
@@ -42,11 +34,13 @@ def __init__(self, model_path=None):
             self.pipeline = pipeline(
                 model=_model_key, task="token-classification", tokenizer=tokenizer, grouped_entities=True
             )
+            print("NER detector instantiated successfully!")
         except Exception as e:
             print("Failed to load model, skip NER classification", e)
             self.pipeline = None
 
     def detect_pii(self, text):
+        print("Scanning text with NER detector...")
         result = []
         # use a regex to detect ip addresses
 
@@ -71,7 +65,26 @@ def detect_pii(self, text):
 
 class PIIDetectorWithML(PIIDetector):
     def __init__(self):
+        import joblib
+        from huggingface_hub import hf_hub_download
+        from sentence_transformers import SentenceTransformer
+
         super().__init__()
+        print("Loading embedding model...")
+        embed_model_id = "nomic-ai/nomic-embed-text-v1"
+        self.model = SentenceTransformer(model_name_or_path=embed_model_id, trust_remote_code=True)
+
+        print("Loading classifier model...")
+        REPO_ID = "Intel/business_safety_logistic_regression_classifier"
+        FILENAME = "lr_clf.joblib"
+
+        self.clf = joblib.load(hf_hub_download(repo_id=REPO_ID, filename=FILENAME))
+
+        print("ML detector instantiated successfully!")
 
     def detect_pii(self, text):
-        return True
+        # text is a string
+        print("Scanning text with ML detector...")
+        embeddings = self.model.encode(text, convert_to_tensor=True).reshape(1, -1).cpu()
+        predictions = self.clf.predict(embeddings)
+        return bool(predictions[0] == 1)
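
The embed-then-classify flow in `PIIDetectorWithML.detect_pii` can be sketched without downloading any models. In the sketch below, `toy_embed` and `ToyClassifier` are hypothetical stand-ins for the sentence-transformers encoder and the joblib-loaded logistic regression classifier. Only the shape of the flow (one text, one embedding row, one prediction, label 1 means sensitive) mirrors the diff.

```python
from typing import Callable, List, Sequence


class EmbedThenClassifyDetector:
    """Embed a single text, classify the embedding, return a plain bool."""

    def __init__(self, embed: Callable[[str], Sequence[float]], clf):
        self.embed = embed
        self.clf = clf  # any object exposing predict(rows) -> labels

    def detect_pii(self, text: str) -> bool:
        vector = list(self.embed(text))
        predictions = self.clf.predict([vector])  # batch containing one row
        return bool(predictions[0] == 1)


class ToyClassifier:
    """Hypothetical stand-in: predicts 1 when the first feature is positive."""

    def predict(self, rows: List[List[float]]) -> List[int]:
        return [1 if row[0] > 0 else 0 for row in rows]


def toy_embed(text: str) -> List[float]:
    # Hypothetical one-dimensional "embedding": the number of digits in the text.
    return [float(sum(ch.isdigit() for ch in text))]


detector = EmbedThenClassifyDetector(toy_embed, ToyClassifier())
print(detector.detect_pii("Mary Smith, 123-456-7890"))    # True
print(detector.detect_pii("John is a good team leader"))  # False
```

Passing the encoder and classifier in as arguments is a choice made for this sketch so the flow can be exercised with stubs. The class in the diff loads both models inside `__init__`.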
diff --git a/comps/guardrails/pii_detection/pii_detection.py b/comps/guardrails/pii_detection/pii_detection.py
@@ -20,12 +20,7 @@
 
 from comps import DocPath, opea_microservices, register_microservice
 from comps.guardrails.pii_detection.data_utils import document_loader, parse_html
-from comps.guardrails.pii_detection.pii.pii_utils import (
-    PIIDetector,
-    PIIDetectorWithLLM,
-    PIIDetectorWithML,
-    PIIDetectorWithNER,
-)
+from comps.guardrails.pii_detection.pii.pii_utils import PIIDetector, PIIDetectorWithML, PIIDetectorWithNER
 from comps.guardrails.pii_detection.ray_utils import ray_execute, ray_runner_initialization, rayds_initialization
 from comps.guardrails.pii_detection.utils import (
     Timer,
@@ -38,14 +33,13 @@
 
 def get_pii_detection_inst(strategy="dummy", settings=None):
     if strategy == "ner":
+        print("invoking NER detector.......")
         return PIIDetectorWithNER()
     elif strategy == "ml":
+        print("invoking ML detector.......")
         return PIIDetectorWithML()
-    elif strategy == "llm":
-        return PIIDetectorWithLLM()
     else:
-        # Default strategy - dummy
-        return PIIDetector()
+        raise ValueError(f"Invalid strategy: {strategy}")
 
 
 def file_based_pii_detect(file_list: List[DocPath], strategy, enable_ray=False, debug=False):
@@ -67,7 +61,7 @@ def file_based_pii_detect(file_list: List[DocPath], strategy, enable_ray=False,
         for file in tqdm(file_list, total=len(file_list)):
             with Timer(f"read document {file}."):
                 data = document_loader(file)
-            with Timer(f"detect pii on document {file} to Redis."):
+            with Timer(f"detect pii on document {file}"):
                 ret.append(pii_detector.detect_pii(data))
     return ret
 
@@ -95,7 +89,7 @@ def _parse_html(link):
                 data = _parse_html(link)
             if debug:
                 print("content is: ", data)
-            with Timer(f"detect pii on document {link} to Redis."):
+            with Timer(f"detect pii on document {link}"):
                 ret.append(pii_detector.detect_pii(data))
     return ret
 
@@ -117,19 +111,28 @@ def text_based_pii_detect(text_list: List[str], strategy, enable_ray=False, debu
         for data in tqdm(text_list, total=len(text_list)):
             if debug:
                 print("content is: ", data)
-            with Timer(f"detect pii on document {data[:50]} to Redis."):
+            with Timer(f"detect pii on document {data[:50]}"):
                 ret.append(pii_detector.detect_pii(data))
     return ret
 
 
 @register_microservice(
     name="opea_service@guardrails-pii-detection", endpoint="/v1/piidetect", host="0.0.0.0", port=6357
 )
-async def pii_detection(files: List[UploadFile] = File(None), link_list: str = Form(None), text_list: str = Form(None)):
+async def pii_detection(
+    files: List[UploadFile] = File(None),
+    link_list: str = Form(None),
+    text_list: str = Form(None),
+    strategy: str = Form(None),
+):
     if not files and not link_list and not text_list:
         raise HTTPException(status_code=400, detail="Either files, link_list, or text_list must be provided.")
 
-    strategy = "ner"  # Default strategy
+    if strategy is None:
+        strategy = "ner"
+
+    print("PII detection using strategy: ", strategy)
+
     pip_requirement = ["detect-secrets", "phonenumbers", "gibberish-detector"]
 
     if files:
@@ -147,7 +150,7 @@ async def pii_detection(files: List[UploadFile] = File(None), link_list: str = F
                 await save_file_to_local_disk(save_path, file)
                 saved_path_list.append(DocPath(path=save_path))
 
-            enable_ray = False if len(saved_path_list) <= 10 else True
+            enable_ray = False if (len(saved_path_list) <= 10 or strategy == "ml") else True
             if enable_ray:
                 prepare_env(enable_ray=enable_ray, pip_requirements=pip_requirement, comps_path=comps_path)
             ret = file_based_pii_detect(saved_path_list, strategy, enable_ray=enable_ray)
@@ -160,7 +163,7 @@ async def pii_detection(files: List[UploadFile] = File(None), link_list: str = F
             text_list = json.loads(text_list)  # Parse JSON string to list
             if not isinstance(text_list, list):
                 text_list = [text_list]
-            enable_ray = False if len(text_list) <= 10 else True
+            enable_ray = False if (len(text_list) <= 10 or strategy == "ml") else True
             if enable_ray:
                 prepare_env(enable_ray=enable_ray, pip_requirements=pip_requirement, comps_path=comps_path)
             ret = text_based_pii_detect(text_list, strategy, enable_ray=enable_ray)
@@ -175,7 +178,7 @@ async def pii_detection(files: List[UploadFile] = File(None), link_list: str = F
             link_list = json.loads(link_list)  # Parse JSON string to list
             if not isinstance(link_list, list):
                 link_list = [link_list]
-            enable_ray = False if len(link_list) <= 10 else True
+            enable_ray = False if (len(link_list) <= 10 or strategy == "ml") else True
             if enable_ray:
                 prepare_env(enable_ray=enable_ray, pip_requirements=pip_requirement, comps_path=comps_path)
             ret = link_based_pii_detect(link_list, strategy, enable_ray=enable_ray)
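
The Ray gating rule used in the files, text, and link branches can be read as a single predicate: Ray is used only for batches of more than 10 items, and never for the "ml" strategy. A minimal sketch follows. The helper name `should_enable_ray` and its `threshold` parameter are hypothetical and not part of this commit. Each branch would pass the length of its own list.

```python
def should_enable_ray(num_items: int, strategy: str, threshold: int = 10) -> bool:
    """Return True only for large batches handled by a non-"ml" strategy."""
    return num_items > threshold and strategy != "ml"


print(should_enable_ray(5, "ner"))   # False: small batch
print(should_enable_ray(20, "ner"))  # True: large batch, NER strategy
print(should_enable_ray(20, "ml"))   # False: Ray is disabled for "ml"
```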

diff --git a/comps/guardrails/pii_detection/requirements.txt b/comps/guardrails/pii_detection/requirements.txt
@@ -2,6 +2,7 @@ beautifulsoup4
 detect_secrets
 docarray[full]
 easyocr
+einops
 fastapi
 gibberish-detector
 huggingface_hub
@@ -21,6 +22,7 @@ pymupdf
 python-docx
 ray
 redis
+scikit-learn
 sentence_transformers
 shortuuid
 virtualenv
diff --git a/comps/guardrails/pii_detection/test.py b/comps/guardrails/pii_detection/test.py
@@ -9,14 +9,13 @@
 from utils import Timer
 
 
-def test_html(ip_addr="localhost", batch_size=20):
+def test_html(ip_addr="localhost", batch_size=20, strategy=None):
     import pandas as pd
 
     proxies = {"http": ""}
     url = f"http://{ip_addr}:6357/v1/piidetect"
-    urls = pd.read_csv("data/ai_rss.csv")["Permalink"]
-    urls = urls[:batch_size].to_list()
-    payload = {"link_list": json.dumps(urls)}
+    urls = ["https://opea.dev/"] * batch_size
+    payload = {"link_list": json.dumps(urls), "strategy": strategy}
 
     with Timer(f"send {len(urls)} link to pii detection endpoint"):
         try:
@@ -28,33 +27,19 @@ def test_html(ip_addr="localhost", batch_size=20):
             print("An error occurred:", e)
 
 
-def test_text(ip_addr="localhost", batch_size=20):
+def test_text(ip_addr="localhost", batch_size=20, strategy=None):
     proxies = {"http": ""}
     url = f"http://{ip_addr}:6357/v1/piidetect"
-    if os.path.exists("data/ai_rss.csv"):
-        import pandas as pd
 
-        content = pd.read_csv("data/ai_rss.csv")["Description"]
-        content = content[:batch_size].to_list()
-    else:
-        content = (
-            [
-                """With new architectures, there comes a bit of a dilemma. After having spent billions of dollars training models with older architectures, companies rightfully wonder if it is worth spending billions more on a newer architecture that may itself be outmoded&nbsp;soon.
-One possible solution to this dilemma is transfer learning. The idea here is to put noise into the trained model and then use the output given to then backpropagate on the new model. The idea here is that you don’t need to worry about generating huge amounts of novel data and potentially the number of epochs you have to train for is also significantly reduced. This idea has not been perfected yet, so it remains to be seen the role it will play in the&nbsp;future.
-Nevertheless, as businesses become more invested in these architectures the potential for newer architectures that improve cost will only increase. Time will tell how quickly the industry moves to adopt&nbsp;them.
-For those who are building apps that allow for a seamless transition between models, you can look at the major strives made in throughput and latency by YOCO and have hope that the major bottlenecks your app is having may soon be resolved.
-It’s an exciting time to be building.
-With special thanks to Christopher Taylor for his feedback on this blog&nbsp;post.
-[1] Sun, Y., et al. “You Only Cache Once: Decoder-Decoder Architectures for Language Models” (2024),&nbsp;arXiv
-[2] Sun, Y., et al. “Retentive Network: A Successor to Transformer for Large Language Models” (2023),&nbsp;arXiv
-[3] Wikimedia Foundation, et al. “Hadamard product (matrices)” (2024), Wikipedia
-[4] Sanderson, G. et al., “Attention in transformers, visually explained | Chapter 6, Deep Learning” (2024),&nbsp;YouTube
-[5] A. Vaswani, et al., “Attention Is All You Need” (2017),&nbsp;arXiv
-Understanding You Only Cache Once was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story."""
-            ]
-            * batch_size
-        )
-    payload = {"text_list": json.dumps(content)}
+    content = [
+        "Q1 revenue was $1.23 billion, up 12% year over year. ",
+        "We are excited to announce the opening of our new office in Miami! ",
+        "Mary Smith, 123-456-7890,",
+        "John is a good team leader",
+        "meeting minutes: sync up with sales team on the new product launch",
+    ]
+
+    payload = {"text_list": json.dumps(content), "strategy": strategy}
 
     with Timer(f"send {len(content)} text to pii detection endpoint"):
         try:
@@ -90,13 +75,17 @@ def test_pdf(ip_addr="localhost", batch_size=20):
     parser.add_argument("--test_text", action="store_true", help="Test Text pii detection")
     parser.add_argument("--batch_size", type=int, default=20, help="Batch size for testing")
     parser.add_argument("--ip_addr", type=str, default="localhost", help="IP address of the server")
+    parser.add_argument("--strategy", type=str, default="ml", help="Strategy for pii detection")
 
     args = parser.parse_args()
+
+    print(args)
+
     if args.test_html:
         test_html(ip_addr=args.ip_addr, batch_size=args.batch_size)
     elif args.test_pdf:
         test_pdf(ip_addr=args.ip_addr, batch_size=args.batch_size)
     elif args.test_text:
-        test_text(ip_addr=args.ip_addr, batch_size=args.batch_size)
+        test_text(ip_addr=args.ip_addr, batch_size=args.batch_size, strategy=args.strategy)
     else:
         print("Please specify the test type")
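
Since the service now raises `ValueError` for an unknown strategy, the test script could reject bad values before sending a request. A minimal sketch using the standard `argparse` `choices` option follows. The `build_parser` function is hypothetical, and the default of "ml" simply mirrors the diff above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="PII detection test client (sketch)")
    # choices makes argparse exit with a usage error for anything but "ner" or "ml"
    parser.add_argument(
        "--strategy",
        type=str,
        choices=["ner", "ml"],
        default="ml",
        help="Strategy for pii detection",
    )
    return parser


args = build_parser().parse_args(["--strategy", "ner"])
print(args.strategy)  # ner
```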
diff --git a/tests/test_guardrails_pii_detection.sh b/tests/test_guardrails_pii_detection.sh
@@ -25,11 +25,16 @@ function validate_microservice() {
     echo "Validate microservice started"
     export PATH="${HOME}/miniforge3/bin:$PATH"
     source activate
-    echo "test 1 - single task"
-    python comps/guardrails/pii_detection/test.py --test_text --batch_size 1 --ip_addr $ip_address
-    echo "test 2 - 20 tasks in parallel"
-    python comps/guardrails/pii_detection/test.py --test_text --batch_size 20 --ip_addr $ip_address
+    echo "test 1 - single task - ner"
+    python comps/guardrails/pii_detection/test.py --test_text --batch_size 1 --ip_addr $ip_address --strategy ner
+    echo "test 2 - 20 tasks in parallel - ner"
+    python comps/guardrails/pii_detection/test.py --test_text --batch_size 20 --ip_addr $ip_address --strategy ner
+    echo "test 3 - single task - ml"
+    python comps/guardrails/pii_detection/test.py --test_text --batch_size 1 --ip_addr $ip_address --strategy ml
+    echo "test 4 - 20 tasks in parallel - ml"
+    python comps/guardrails/pii_detection/test.py --test_text --batch_size 20 --ip_addr $ip_address --strategy ml
     echo "Validate microservice completed"
+    docker logs test-guardrails-pii-detection-endpoint
 }
 
 function stop_docker() {