diff --git a/router/README.md b/router/README.md new file mode 100644 index 00000000..7414e302 --- /dev/null +++ b/router/README.md @@ -0,0 +1,146 @@ + + + + SambaNova logo + + + +Router +====================== + +Questions? Just message us on Discord or create an issue in GitHub. We're happy to help live! + +Table of Contents: + +- [Router](#router) +- [Overview](#overview) +- [Before you begin](#before-you-begin) + - [Clone this repository](#clone-this-repository) + - [Set up the models, environment variables and config file](#set-up-the-models-environment-variables-and-config-file) + - [Set up the generative model](#set-up-the-generative-model) + - [Set up the embedding model](#set-up-the-embedding-model) + - [Install dependencies](#install-dependencies) + - [Windows requirements](#windows-requirements) +- [Use the starter kit](#use-the-starter-kit) +- [Customizing the starter kit](#customizing-the-starter-kit) +- [Third-party tools and data sources](#third-party-tools-and-data-sources) + + + +# Overview +This AI Starter Kit is an example of routing a user query to either a RAG pipeline or an LLM, based on keywords extracted from the datasource. + +The Kit includes: +- An implementation of a keyword extractor to extract keywords from documents +- An implementation of a workflow to route a user query to different pipelines + +# Before you begin + +You have to set up your environment before you can run or customize the starter kit. + +## Clone this repository + +Clone the starter kit repo. +```bash +git clone https://github.com/sambanova/ai-starter-kit.git +``` + +## Set up the models, environment variables and config file + +### Set up the generative model + +The next step is to set up your environment variables to use one of the inference models available from SambaNova. You can obtain a free API key through SambaNova Cloud. Alternatively, if you are a current SambaNova customer, you can deploy your models using SambaStudio. 
+ +- **SambaNova Cloud (Option 1)**: Follow the instructions [here](../README.md#use-sambanova-cloud-option-1) to set up your environment variables. + Then, in the [config file](./config.yaml), set the llm `api` variable to `"sncloud"` and set the `select_expert` config depending on the model you want to use. + +- **SambaStudio (Option 2)**: Follow the instructions [here](../README.md#use-sambastudio-option-2) to set up your endpoint and environment variables. + Then, in the [config file](./config.yaml), set the llm `api` variable to `"sambastudio"`, and set the `coe` and `select_expert` configs if you are using a CoE endpoint. + +### Set up the embedding model + +You have the following options to set up your embedding model: + +* **CPU embedding model (Option 1)**: In the [config file](./config.yaml), set the variable `type` in `embedding_model` to `"cpu"`. + +* **SambaStudio embedding model (Option 2)**: To increase inference speed, you can use a SambaStudio embedding model endpoint instead of using the default (CPU) Hugging Face embedding. Follow the instructions [here](../README.md#use-sambastudio-embedding-option-2) to set up your endpoint and environment variables. Then, in the [config file](./config.yaml), set the variable `type` in `embedding_model` to `"sambastudio"`, and set the configs `batch_size`, `coe` and `select_expert` according to your SambaStudio endpoint. + +## Install dependencies + +We recommend that you run the starter kit in a virtual environment. + +NOTE: Python 3.9 or higher is required to use this kit. + +Install the Python dependencies in your project environment. + +```bash +cd ai-starter-kit/router +python3 -m venv router_env +source router_env/bin/activate +pip install -r requirements.txt +``` + +## Windows requirements + +- If you are using Windows, make sure your system has Microsoft Visual C++ Redistributable installed. 
You can install it from [Microsoft Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/); make sure to check all boxes in the C++ section. (Compatible versions: 2015, 2017, 2019 or 2022) + + +# Use the starter kit + +After you've set up the environment, you can use the starter kit. Follow these steps: + +1. Put your documents under the [data](./data/) folder. + +2. Update the `keyword_path` under `router` in the [config file](./config.yaml). + +3. We provide an example that calls the router and connects it with a RAG pipeline in [notebook/RAG_with_router.ipynb](./notebook/RAG_with_router.ipynb). + +# Customizing the starter kit +You can further customize the starter kit based on the use case. + +## Customize the keyword extractor method + +The [keyword extractor](./src/keyword_extractor.py) provides two methods to extract keywords, plus an optional clustering step: + +* Use the [KeyBERT](https://github.com/MaartenGr/KeyBERT) library. It uses BERT embeddings and cosine similarity to find the sub-phrases in a document that are the most similar to the document itself. + +* Use a generative language model. It uses prompt engineering to guide the LLM to find keywords from documents. + +* Keywords can be extracted more efficiently by finding similarities between documents. We assume that highly similar documents will have the same keywords, so we extract keywords from only one document in each cluster and assign the keywords to all documents in the same cluster. To enable this feature, please set `use_clusters=True` under `router` in the [config file](./config.yaml). + +## Customize the embedding model + +By default, the keywords are extracted using a BERT-based embedding model. To change the embedding model, do the following: + +* If using CPU embedding (i.e., `type` in `embedding_model` is set to `"cpu"` in the [config file](./config.yaml)), [e5-large-v2](https://huggingface.co/intfloat/e5-large-v2) from HuggingFaceInstruct is used by default. 
If you want to use another model, you will need to manually modify the `EMBEDDING_MODEL` variable and the `load_embedding_model()` function in [api_gateway.py](../utils/model_wrappers/api_gateway.py). +* If using SambaStudio embedding (i.e., `type` in `embedding_model` is set to `"sambastudio"` in the [config file](./config.yaml)), you will need to change the SambaStudio endpoint and/or the configs `batch_size`, `coe` and `select_expert` in the config file. + +## Customize the LLM model and/or use it to extract keywords + +To change the LLM or modify the parameters for calling the model, make changes to the `router` section in the [config file](./config.yaml). + +The routing prompt for the model can be customized in [prompts/rag_routing_prompt.yaml](./prompts/rag_routing_prompt.yaml). + +You can also use your own yaml file by placing the file under the [prompts](./prompts) folder and changing the path of `router_prompt` in the [config file](./config.yaml). + +This LLM can also be used to extract keywords by setting `use_llm=True` and `use_bert=False` in the [config file](./config.yaml). + +The keyword extraction prompt for the model can be customized in [prompts/keyword_extractor_prompt.yaml](./prompts/keyword_extractor_prompt.yaml). + +## Customize the keyphrase extraction + +The keyword extractor uses [KeyphraseVectorizers](https://github.com/TimSchopf/KeyphraseVectorizers) to extract keyphrases from documents. You can choose other keyphrase extraction methods by changing the `vectorizer` in the `extract_keywords` function in [keyword_extractor.py](./src/keyword_extractor.py). + +```python +if use_vectorizer: + vectorizer = KeyphraseTfidfVectorizer() + keyphrase_ngram_range = None +``` + +## Customize the RAG pipeline + +The RAG pipeline uses functions in [document_retrieval.py](../enterprise_knowledge_retriever/src/document_retrieval.py). Please refer to [enterprise_knowledge_retriever](../enterprise_knowledge_retriever/README.md) for how to customize the RAG. 
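When customizing the RAG pipeline, it can help to keep the routing decision separate from the code that produces the answer. The following is a minimal sketch of that dispatch step, modeled on the pipeline cell in the notebook, where the router returns either `vectorstore` or `llm`. The names `answer_query`, `toy_route`, `toy_rag_answer`, and `toy_llm_answer` are hypothetical stand-ins introduced for illustration only; in the kit, the corresponding calls would be `router.routing(query)`, the retrieval chain, and the plain LLM.

```python
from typing import Callable, Tuple


def answer_query(
    query: str,
    route: Callable[[str], str],
    rag_answer: Callable[[str], str],
    llm_answer: Callable[[str], str],
) -> Tuple[str, str]:
    """Route a query, then answer it with the selected pipeline.

    Returns a (datasource, response) pair.
    """
    datasource = route(query)
    if datasource == "vectorstore":
        # The query matches a topic covered by the indexed documents.
        return datasource, rag_answer(query)
    # Anything else falls back to the LLM's own knowledge.
    return datasource, llm_answer(query)


# Hypothetical stand-ins so the sketch runs without any model or endpoint.
def toy_route(query: str) -> str:
    topics = {"sambatune"}  # assumed keyword list, not the kit's real one
    words = {word.strip("?.,!").lower() for word in query.split()}
    return "vectorstore" if topics & words else "llm"


def toy_rag_answer(query: str) -> str:
    return f"[RAG] answer to: {query}"


def toy_llm_answer(query: str) -> str:
    return f"[LLM] answer to: {query}"


if __name__ == "__main__":
    for question in ("How to use sambatune?", "What is LLM?"):
        source, response = answer_query(
            question, toy_route, toy_rag_answer, toy_llm_answer
        )
        print(f"datasource: {source} -> {response}")
```

Passing the three callables in explicitly means a customized retrieval chain can be swapped in without touching the routing logic.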
+ +# Third-party tools and data sources + +All the packages/tools are listed in the `requirements.txt` file in the project directory. \ No newline at end of file diff --git a/router/config.yaml b/router/config.yaml new file mode 100644 index 00000000..e950793a --- /dev/null +++ b/router/config.yaml @@ -0,0 +1,41 @@ +api: "sncloud" # set either sambastudio or sncloud + +embedding_model: + "type": "sambastudio" # set either sambastudio or cpu + "batch_size": 1 #set depending of your endpoint configuration (1 if CoE embedding expert) + "coe": False #set true if using Sambastudio embeddings in a CoE endpoint + "select_expert": "e5-mistral-7b-instruct" #set if using SambaStudio CoE embedding expert + +router: + "type": "sncloud" # set either sambastudio or sncloud + "temperature": 0.0 + "do_sample": False + "max_tokens_to_generate": 1200 + "coe": True #set as true if using Sambastudio CoE endpoint + "select_expert": "llama3-8b" #set if using sncloud, or SambaStudio CoE llm expert + "document_folder": "router/data" # path of documents + "keyword_path": "router/keywords/keywords_3.pkl" # path to save keywords + "use_clusters": False # set True if extract keywords from only one document in each cluster + "use_bert": True # set True if use embedding and cosine similarity to extract keywords + "use_llm": False # set True if use llm to extract keywords + +llm: + "temperature": 0.0 + "do_sample": False + "max_tokens_to_generate": 1200 + "coe": True #set as true if using Sambastudio CoE endpoint + "select_expert": "llama3-8b" #set if using sncloud, or SambaStudio CoE llm expert + #sncloud CoE expert name -> "llama3-8b" + +retrieval: + "k_retrieved_documents": 15 #set if rerank enabled + "score_threshold": 0.2 + "rerank": False # set if you want to rerank retriever results + "reranker": 'BAAI/bge-reranker-large' # set if you rerank enabled + "final_k_retrieved_documents": 5 + +prompts: + "router_prompt": "router/prompts/rag_routing_prompt.yaml" + "qa_prompt": 
"enterprise_knowledge_retriever/prompts/qa_prompt.yaml" + "kw_etr_prompt": "router/prompts/keyword_extractor_prompt.yaml" + diff --git a/router/data/sambatune_run-sambatune.txt b/router/data/sambatune_run-sambatune.txt new file mode 100644 index 00000000..c760e23c --- /dev/null +++ b/router/data/sambatune_run-sambatune.txt @@ -0,0 +1,183 @@ +# Run SambaTune and examine reports + +After installation, you can run SambaTune from the command line. + +__ | You have to run SambaTune with the application yaml file as input before +you can see results of the performance analysis. +---|--- + +## Overview + +Running Sambatune includes running `sambatune` and running `sambatune_ui`. + + 1. First you run `sambatune` and pass in a YAML file for your model. See Run the sample application. + + 2. Then you can run the `sambatune_ui` command. See Run the SambaTune GUI. + + 3. Finally, you can [explore with the SambaTune GUI](gui-index.html) and [examine SambaTune reports](reports-index.html). + +## Run the sample application + +A sample application, `linear_net.py` is included with your installation at +`/opt/sambaflow/apps/micros/linear_net.py`. The application requires that the +`sambaflow-apps-micros` package is installed. + +To run the `linear_net.py` sample application: + + 1. Log in to the Linux console of a host that is attached to the DataScale hardware. + + 2. Run the application. You have several options: + + * Run the application in benchmarking mode (the default): + + $ sambatune linear_net.yaml + +where `linear.yaml` is a user-specified configuration file that is included in +the `sambaflow-apps-micros` package. Here's an example: + + + app: /path/to/linear.py + model-args: -b 128 -mb 64 --in-features 512 --out-features 128 + compile-args: compile --plot + run-args: -n 10000 + + * Run the application in instrument-only mode. The space after `--` is required. + + $ sambatune --modes instrument -- /opt/sambaflow/sambatune/configs/linear_net.yaml + + * Run in all modes. 
The space after `--` is required. + + $ sambatune --modes benchmark instrument run -- /opt/sambaflow/sambatune/configs/linear_net.yaml + +Run `sambatune --help` for a list of all options. See SambaTune input +arguments for details on configuration options. + +## Understand how SambaTune collects data + +When you run the sample application: + + 1. SambaTune compiles the application with the user-specified `model-args` , `compile-args` and SambaFlow-supported instrumentation flags. + + 2. After successful compile, SambaTune: + + 1. Runs the application on the RDU and collects performance data. + + 2. Runs the application in benchmark mode with user-specified `run-args` to collect latency, throughput, and hardware utilization statistics. + + 3. At the end of a successful run, SambaTune: + + 1. Collates compile-time and run-time statistics. + + 2. Generates performance reports. See [Explore SambaTune Reports](reports-index.html). + + 3. Displays the reports in the SambaTune GUI to help you identify potential hotspots. See [Explore with the SambaTune GUI](gui-index.html). + +## SambaTune input arguments + +You can customize your SambaTune run with the following input arguments: + +Table 1. SambaTune input arguments Option | Description | Dependencies | Type +---|---|---|--- + +app + +Name of the application. + +string + +compile-args + +Arguments to pass to the SambaFlow compiler. `compile-args` \+ `model-args` +are used for compilation (generating the PEF file). + +app + +string + +model-args + +Arguments to pass for running a specific model, like batch size. `compile- +args` \+ `model-args` are used for compilation (generating the PEF file). + +string + +run-args + +Arguments to pass when running the app that are used in addition to model- +args, for example, learning rate. The `run-args` and `model-args` are both +used when you run the model (represented by the PEF file). + +string + +env + +Runtime environment variables (optional). 
See Table 2 + +dict + +For subprocesses that are created by SambaTune, you can configure the +following environment variables: + +Table 2. Environment variables for SambaTune subprocesses Option | Description +| Type +---|---|--- + +SF_RNT_FSM_POLL_BUSY_WAIT + +1 to enable Graph completion busy wait + +int + +SF_RNT_DMA_POLL_BUSY_WAIT + +1 to enable DMA completion busy wait + +int + +## Run the SambaTune GUI + +The SambaTune GUI allows you to read the reports that are generated by one or +more SambaTune runs in a web browser. + +__ | You install the SambaTune GUI on the **client** system where the web +browser runs. Unlike the **host** system, the client does not have direct +access to RDU. +---|--- + +For release 1.16 of SambaTune, contact SambaNova customer support through the +SambaNova support portal at for client install +instructions. + + 1. On the machine where you installed the SambaTune GUI package, call `sambatune_ui`. + +You can specify some arguments to this command. Run `sambatune_ui --help` to +see the list of arguments. + + 2. When the `sambatune_ui` command completes, you see a URL, username, and password for accessing the GUI. Note down the password, which changes each time you call the `sambatune_ui` command. + +You can now examine the results of the SambaTune run in the SambaTune GUI. See +[Explore with the SambaTune GUI](gui-index.html). + +## Troubleshooting + +This section has troubleshooting information. + +**Symptom** + +SambaTune encountered an error during the run. + +**Explanation** + +A SambaTune run may encounter errors due to any number of reasons, ranging +from incorrect input configuration to compile error to run or post-processing +error. + +All run related information is saved to the output directory +(`$DUMP_ROOT/artifact_root/sambatune_gen/_`). +The status of the run can be checked in run.log or status_summary.log. The +details of a failed step can be checked in status_debug.log. 
For assistance, +contact Customer Support and provide the compressed output directory for +further diagnosis. + + + diff --git a/router/notebook/RAG_with_router.ipynb b/router/notebook/RAG_with_router.ipynb new file mode 100644 index 00000000..a20f0bda --- /dev/null +++ b/router/notebook/RAG_with_router.ipynb @@ -0,0 +1,211 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import sys\n", + "import glob\n", + "from langchain_community.vectorstores import Chroma\n", + "from IPython.display import display, Markdown\n", + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "\n", + "current_dir = os.getcwd()\n", + "kit_dir = os.path.abspath(os.path.join(current_dir, \"..\"))\n", + "repo_dir = os.path.abspath(os.path.join(kit_dir, \"..\"))\n", + "\n", + "sys.path.append(kit_dir)\n", + "sys.path.append(repo_dir)\n", + "\n", + "from src.router import Router\n", + "from enterprise_knowledge_retriever.src.document_retrieval import DocumentRetrieval\n", + "\n", + "CONFIG_PATH = os.path.join(kit_dir,'config.yaml')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# init router" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "router = Router(CONFIG_PATH)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# init RAG" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2024-09-23 16:13:33,100 [INFO] - Load pretrained SentenceTransformer: intfloat/e5-large-v2\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "load INSTRUCTOR_Transformer\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2024-09-23 16:13:37,244 [INFO] - Use pytorch device: 
cpu\n", + "2024-09-23 16:13:37,245 [INFO] - This is the collection name: collection_b069af78-72b2-4bfb-af70-cc80159b693f\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "max_seq_length 512\n", + "Collection name is None\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2024-09-23 16:13:39,275 [INFO] - Vector store saved to data/vectordb/\n" + ] + } + ], + "source": [ + "folder_loc = os.path.join(kit_dir,'data/')\n", + "docx_files = list(glob.glob(f'{folder_loc}/*.txt'))\n", + "docs = []\n", + "for doc in docx_files:\n", + " with open(doc) as f:\n", + " docs.append(f.read())\n", + "text_splitter = RecursiveCharacterTextSplitter(\n", + " chunk_size = 800,\n", + " chunk_overlap = 20,\n", + " length_function = len,\n", + " add_start_index = True,\n", + " separators = [\"\\n\\n\\n\",\"\\n\\n\", \"\\n\", \"*\", \".\"]\n", + " )\n", + "text_chunks = text_splitter.create_documents(docs)\n", + "document_retrieval = DocumentRetrieval()\n", + "embeddings = document_retrieval.load_embedding_model()\n", + "vectorstore = document_retrieval.create_vector_store(text_chunks, embeddings, output_db=\"data/vectordb/\")\n", + "document_retrieval.init_retriever(vectorstore)\n", + "conversation = document_retrieval.get_qa_retrieval_chain()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# user input" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "query = \"How to use sambatune?\"\n", + "# query = \"What is LLM?\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# pipeline\n", + "The router returns the correct datasource based on the user's query and then uses the appropriate pipeline to answer the query." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "To use SambaTune, you need to run it with the application YAML file as input. Here are the steps:\n", + "\n", + "1. First, run `sambatune` and pass in a YAML file for your model. You can see an example of how to do this in the \"Run the sample application\" section.\n", + "2. Then, run the `sambatune_ui` command.\n", + "3. Finally, you can explore the results of the SambaTune run in the SambaTune GUI.\n", + "\n", + "You can also run SambaTune with specific modes, such as instrument-only mode or all modes, by using the `--modes` option. For example, to run in instrument-only mode, you can use the command `sambatune --modes instrument -- /path/to/config.yaml`.\n", + "\n", + "You can also specify other arguments, such as `model-args`, `compile-args`, and `run-args`, to customize the run. You can see a list of all options by running `sambatune --help`.\n", + "\n", + "It's worth noting that you need to install the SambaTune GUI on the client system where the web browser runs, and you can get client install instructions from SambaNova customer support through the SambaNova support portal.\n", + "--------------------------\n", + "datasource: vectorstore\n" + ] + } + ], + "source": [ + "datasource = router.routing(query)\n", + "if datasource == \"vectorstore\":\n", + " output = conversation.invoke({\"question\":query})\n", + " response = output['answer']\n", + "else:\n", + " response = document_retrieval.llm(query)\n", + "\n", + "print(response)\n", + "print(\"--------------------------\")\n", + "print(f\"datasource: {datasource}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + 
"codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/router/prompts/keyword_extractor_prompt.yaml b/router/prompts/keyword_extractor_prompt.yaml new file mode 100644 index 00000000..f82b2906 --- /dev/null +++ b/router/prompts/keyword_extractor_prompt.yaml @@ -0,0 +1,24 @@ +_type: prompt +input_types: {} +input_variables: +- DOCUMENTS +name: null +output_parser: null +partial_variables: {} +template: | + <|begin_of_text|><|start_header_id|>system<|end_header_id|> + You are a helpful assistant in retrieving keywords in documents. + You are tasked with analyzing a collection of documents and identifying the most relevant keywords. + For each document, select the top 5 unique keywords that best summarize the main themes, topics, and concepts discussed. + Consider words that appear frequently and capture the essence of the content. Ensure that the keywords are distinct from each other, avoiding repetition or overly similar terms. + Provide the output as a single string of the top 5 unique keywords, separated by commas, without any bullet points or numbering. Make sure the keywords are precise, meaningful, and representative of the document's subject matter. + Make sure you to only return the keywords and say nothing else. 
For example, don't say: "Here are the keywords present in the document" + + Documents: + [DOCUMENTS] + <|eot_id|><|start_header_id|>user<|end_header_id|> + + Answer: <|eot_id|><|start_header_id|>assistant<|end_header_id|> + +template_format: f-string +validate_template: false \ No newline at end of file diff --git a/router/prompts/rag_routing_prompt.yaml b/router/prompts/rag_routing_prompt.yaml new file mode 100644 index 00000000..25f21367 --- /dev/null +++ b/router/prompts/rag_routing_prompt.yaml @@ -0,0 +1,25 @@ +_type: prompt +input_types: {} +input_variables: +- format_instructions +- keywords +- query +name: null +output_parser: null +partial_variables: {} +template: | + <|begin_of_text|><|start_header_id|>system<|end_header_id|> + You are an expert at routing a user question to a vectorstore or simple answering based on your own knowledge (llm). + The vectorstore contains documents related to a list of topics. + Use the vectorstore for questions on these topics. Otherwise, use llm. + The routing result should be either `vectorstore`or `llm`. 
+ {format_instructions} + + Here are the topics in the vectorstore: + {keywords} + <|eot_id|><|start_header_id|>user<|end_header_id|> + Question: {query} + Answer: <|eot_id|><|start_header_id|>assistant<|end_header_id|> + +template_format: f-string +validate_template: false \ No newline at end of file diff --git a/router/requirements.txt b/router/requirements.txt new file mode 100644 index 00000000..d6baf9cf --- /dev/null +++ b/router/requirements.txt @@ -0,0 +1,13 @@ +langchain==0.2.8 +python-dotenv==1.0.1 +langchain_community==0.2.7 +langchain_core==0.2.19 +torch==2.1.1 +keybert==0.8.5 +keyphrase_vectorizers==0.0.13 + +sseclient-py==1.8.0 +instructorembedding==1.0.1 +sentence_transformers==2.2.2 +chromadb==0.5.5 +streamlit==1.36.0 \ No newline at end of file diff --git a/router/src/custom_keyLLM.py b/router/src/custom_keyLLM.py new file mode 100644 index 00000000..0b3c9afc --- /dev/null +++ b/router/src/custom_keyLLM.py @@ -0,0 +1,156 @@ +from typing import List, Union +import numpy as np + +try: + from sentence_transformers import util + HAS_SBERT = True +except ModuleNotFoundError: + HAS_SBERT = False + + +class CustomKeyLLM: + """ + A minimal method for keyword extraction with Large Language Models (LLM) + + The keyword extraction is done by simply asking the LLM to extract a + number of keywords from a single piece of text. + """ + + def __init__(self, llm): + """KeyBERT initialization + + Arguments: + llm: The Large Language Model to use + """ + self.llm = llm + + def extract_keywords( + self, + docs: Union[str, List[str]], + check_vocab: bool = False, + candidate_keywords: List[List[str]] = None, + threshold: float = None, + embeddings=None, + **model_params + ) -> Union[List[str], List[List[str]]]: + """Extract keywords and/or keyphrases + + To get the biggest speed-up, make sure to pass multiple documents + at once instead of iterating over a single document. 
+ + NOTE: The resulting keywords are expected to be separated by commas so + any changes to the prompt will have to make sure that the resulting + keywords are comma-separated. + + Arguments: + docs: The document(s) for which to extract keywords/keyphrases + check_vocab: Only return keywords that appear exactly in the documents + candidate_keywords: Candidate keywords for each document + + Returns: + keywords: The top n keywords for a document with their respective distances + to the input document. + + Usage: + + To extract keywords from a single document: + + ```python + import openai + from keybert.llm import OpenAI + from keybert import KeyLLM + + # Create your LLM + client = openai.OpenAI(api_key=MY_API_KEY) + llm = OpenAI(client) + + # Load it in KeyLLM + kw_model = KeyLLM(llm) + + # Extract keywords + document = "The website mentions that it only takes a couple of days to deliver but I still have not received mine." + keywords = kw_model.extract_keywords(document) + ``` + """ + # Check for a single, empty document + if isinstance(docs, str): + if docs: + docs = [docs] + else: + return [] + + if HAS_SBERT and threshold is not None and embeddings is not None: + + # Find similar documents + clusters = util.community_detection(embeddings, min_community_size=2, threshold=threshold) + in_cluster = set([cluster for cluster_set in clusters for cluster in cluster_set]) + out_cluster = set(list(range(len(docs)))).difference(in_cluster) + + # Extract keywords for all documents not in a cluster + if out_cluster: + selected_docs = [docs[index] for index in out_cluster] + if candidate_keywords is not None: + selected_keywords = [candidate_keywords[index] for index in out_cluster] + else: + selected_keywords = None + selected_embeddings = np.array([embeddings[index].tolist() for index in out_cluster]) + out_cluster_keywords = self.llm.extract_keywords( + selected_docs, + **model_params + ) + if isinstance(out_cluster_keywords[0], tuple): + out_cluster_keywords = 
[out_cluster_keywords] + out_cluster_keywords = {index: words for words, index in zip(out_cluster_keywords, out_cluster)} + uniq_out_cluster_keywords = [out_cluster_keywords[index] for index in out_cluster_keywords] + # Extract keywords for only the first document in a cluster + if in_cluster: + selected_docs = [docs[cluster[0]] for cluster in clusters] + if candidate_keywords is not None: + selected_keywords = [candidate_keywords[cluster[0]] for cluster in clusters] + else: + selected_keywords = None + selected_embeddings = np.array([embeddings[cluster[0]].tolist() for cluster in clusters]) + in_cluster_keywords = self.llm.extract_keywords( + selected_docs, + **model_params + ) + if isinstance(in_cluster_keywords[0], tuple): + in_cluster_keywords = [in_cluster_keywords] + # Keep one keyword set per cluster before expanding to every document + uniq_in_cluster_keywords = in_cluster_keywords + in_cluster_keywords = { + doc_id: in_cluster_keywords[index] + for index, cluster in enumerate(clusters) + for doc_id in cluster + } + + # Update out cluster keywords with in cluster keywords + if out_cluster: + if in_cluster: + out_cluster_keywords.update(in_cluster_keywords) + keywords = [out_cluster_keywords[index] for index in range(len(docs))] + else: + keywords = [in_cluster_keywords[index] for index in range(len(docs))] + + uniq_keywords = [] + if out_cluster: + uniq_keywords.extend(uniq_out_cluster_keywords) + if in_cluster: + uniq_keywords.extend(uniq_in_cluster_keywords) + + else: + # Extract keywords using a Large Language Model (LLM) + keywords = self.llm.extract_keywords(docs, candidate_keywords) + # Without clustering, each document has its own keyword set; + # define uniq_keywords so the final return does not raise a NameError + uniq_keywords = keywords + + # Only extract keywords that appear in the input document + if check_vocab: + updated_keywords = [] + for keyword_set, document in zip(keywords, docs): + updated_keyword_set = [] + for keyword in keyword_set: + if keyword in document: + updated_keyword_set.append(keyword) + updated_keywords.append(updated_keyword_set) + return updated_keywords + + return keywords, uniq_keywords \ No newline at end of file 
diff --git a/router/src/custom_models.py b/router/src/custom_models.py new file mode 100644 index 00000000..b934b796 --- /dev/null +++ b/router/src/custom_models.py @@ -0,0 +1,73 @@ +import numpy as np +from keybert.backend import BaseEmbedder +from keybert.llm import BaseLLM +from keybert.llm._utils import process_candidate_keywords +from tqdm import tqdm + +class CustomEmbedder(BaseEmbedder): + def __init__(self, embedding_model): + super().__init__() + self.embedding_model = embedding_model + + def embed(self, documents, verbose=False): + if isinstance(documents, str): + embeddings = self.embedding_model.embed_query(documents) + elif isinstance(documents, list): + embeddings = self.embedding_model.embed_documents(documents) + elif isinstance(documents, np.ndarray): + embeddings = self.embedding_model.embed_documents(documents.tolist()) + return np.array(embeddings) + +DEFAULT_PROMPT = """ + <|begin_of_text|><|start_header_id|>system<|end_header_id|> + You are a helpful assistant in retrieving keywords in documents. + You are tasked with analyzing a collection of documents and identifying the most relevant keywords. + For each document, select the top 5 unique keywords that best summarize the main themes, topics, and concepts discussed. + Consider words that appear frequently and capture the essence of the content. Ensure that the keywords are distinct from each other, avoiding repetition or overly similar terms. + Provide the output as a single string of the top 5 unique keywords, separated by commas, without any bullet points or numbering. Make sure the keywords are precise, meaningful, and representative of the document's subject matter. + Make sure you to only return the keywords and say nothing else. 
For example, don't say: "Here are the keywords present in the document" + + Documents: + [DOCUMENTS] + <|eot_id|><|start_header_id|>user<|end_header_id|> + + Answer: <|eot_id|><|start_header_id|>assistant<|end_header_id|> + """ + +class CustomTextGeneration(BaseLLM): + def __init__(self, + model, + prompt: str = None, + verbose: bool = False + ): + super().__init__() + self.model = model + self.prompt = prompt if prompt is not None else DEFAULT_PROMPT + self.verbose = verbose + + def extract_keywords(self, documents: list[str], candidate_keywords: list[list[str]] = None) -> list: + """ Extract topics + + Arguments: + documents: The documents to extract keywords from + candidate_keywords: A list of candidate keywords that the LLM will fine-tune + For example, it will create a nicer representation of + the candidate keywords, remove redundant keywords, or + shorten them depending on the input prompt. + + Returns: + list: All keywords for each document + """ + all_keywords = [] + candidate_keywords = process_candidate_keywords(documents, candidate_keywords) + + for document, candidates in tqdm(zip(documents, candidate_keywords), disable=not self.verbose): + prompt = self.prompt.replace("[DOCUMENTS]", document) + if candidates is not None: + prompt = prompt.replace("[CANDIDATES]", ", ".join(candidates)) + # Extract result from generator and use that as label + keywords = self.model(prompt).replace(prompt, "") + keywords = [keyword.strip() for keyword in keywords.split(",")] + all_keywords.append(keywords) + + return all_keywords \ No newline at end of file diff --git a/router/src/keyword_extractor.py b/router/src/keyword_extractor.py new file mode 100644 index 00000000..da5ba6e1 --- /dev/null +++ b/router/src/keyword_extractor.py @@ -0,0 +1,199 @@ +import sys, os, yaml +from typing import Union +from keybert import KeyBERT, KeyLLM +from keyphrase_vectorizers import KeyphraseTfidfVectorizer +from custom_models import CustomEmbedder, CustomTextGeneration +from 
custom_keyLLM import CustomKeyLLM +from langchain_core.prompts import load_prompt +import torch +import pickle +from dotenv import load_dotenv +current_dir = os.path.dirname(os.path.abspath(__file__)) +kit_dir = os.path.abspath(os.path.join(current_dir, '..')) +repo_dir = os.path.abspath(os.path.join(kit_dir, '..')) +sys.path.append(repo_dir) +sys.path.append(kit_dir) +sys.path.append(current_dir) + +load_dotenv(os.path.join(repo_dir, '.env')) + +from utils.model_wrappers.api_gateway import APIGateway + +class KeywordExtractor: + """ + Extract keywords using KeyBERT https://github.com/MaartenGr/keyBERT + """ + def __init__(self, configs: dict, + docs: list[str], + use_bert: bool=True, + use_llm: bool=False, + use_llm_prompt: bool=False) -> None: + """Initialize the keyword extractor. + + Args: + configs (dict): The config dict. + docs (list[str]): The contents of the documents. + use_bert (bool, optional): Whether to use BERT as the keyword extractor. Defaults to True. + use_llm (bool, optional): Whether to use an LLM as the keyword extractor. Defaults to False. + use_llm_prompt (bool, optional): Whether to use a customized prompt for the LLM. Defaults to False. + Only applied when self.use_llm=True + """ + self.configs = configs + self.docs = docs + self.use_bert = use_bert + self.use_llm = use_llm + self.load_models() + self.create_kw_models(use_llm_prompt) + + def load_models(self) -> None: + """ + Load the embedding model and the LLM. + """ + sambanova_api_key = os.environ.get('SAMBANOVA_API_KEY') + sambastudio_embeddings_base_url = os.environ.get('SAMBASTUDIO_EMBEDDINGS_BASE_URL') + sambastudio_embeddings_base_uri = os.environ.get('SAMBASTUDIO_EMBEDDINGS_BASE_URI') + sambastudio_embeddings_project_id = os.environ.get('SAMBASTUDIO_EMBEDDINGS_PROJECT_ID') + sambastudio_embeddings_endpoint_id = os.environ.get('SAMBASTUDIO_EMBEDDINGS_ENDPOINT_ID') + sambastudio_embeddings_api_key = os.environ.get('SAMBASTUDIO_EMBEDDINGS_API_KEY') + + # 1. 
load embedding model + self.embedding_model = APIGateway.load_embedding_model( + type=self.configs['embedding_model']['type'], + batch_size=self.configs['embedding_model']['batch_size'], + coe=self.configs['embedding_model']['coe'], + select_expert=self.configs['embedding_model']['select_expert'], + sambastudio_embeddings_base_url=sambastudio_embeddings_base_url, + sambastudio_embeddings_base_uri=sambastudio_embeddings_base_uri, + sambastudio_embeddings_project_id=sambastudio_embeddings_project_id, + sambastudio_embeddings_endpoint_id=sambastudio_embeddings_endpoint_id, + sambastudio_embeddings_api_key=sambastudio_embeddings_api_key, + ) + + # 2. load llm model + if self.use_llm: + self.llm = APIGateway.load_llm( + type=self.configs["api"], + streaming=False, + coe=self.configs['router']["coe"], + do_sample=self.configs['router']["do_sample"], + max_tokens_to_generate=self.configs['router']["max_tokens_to_generate"], + temperature=self.configs['router']["temperature"], + select_expert=self.configs['router']["select_expert"], + process_prompt=False, + sambanova_api_key=sambanova_api_key, + ) + + def create_kw_models(self, use_llm_prompt: bool=False) -> None: + """ + Create the keyword extractor using KeyBERT. + + Args: + use_llm_prompt (bool, optional): Whether to use a customized prompt for the LLM. Defaults to False. + Only applied when self.use_llm=True + + Raises: + NotImplementedError: Using both BERT and an LLM as keyword extractors is not supported. 
+ """ + # create kw_model + self.custom_embedder = CustomEmbedder(embedding_model=self.embedding_model) + # pass custom backend to keybert + self.kw_bert_model = KeyBERT(model=self.custom_embedder) + # load it in KeyLLM + if self.use_bert and self.use_llm: + raise NotImplementedError("Using both BERT and a generative LLM is not supported yet.") + elif self.use_bert: + self.kw_llm_model = CustomKeyLLM(self.kw_bert_model) + elif self.use_llm: + llm_prompt = None + if use_llm_prompt: + llm_prompt = load_prompt(repo_dir + '/' + self.configs['prompts']['kw_etr_prompt']).template + self.text_generator = CustomTextGeneration(self.llm, llm_prompt) + self.kw_llm_model = CustomKeyLLM(self.text_generator) + + def docs_embedding(self) -> None: + """ + Embed the documents. + """ + # embed docs + self.docs_embed = self.custom_embedder.embed(documents=self.docs) + + def extract_first_values(self, data: list, return_list: bool=False) -> Union[set, list]: + """ + Extract only the first set of keywords in each cluster, since each file in the same cluster has the same keywords. + + Args: + data (list): The list of keywords in all clusters. + return_list (bool, optional): Format the results as list or set. Defaults to False (format as set). + + Returns: + Union[set, list]: The set/list of keywords in each cluster. + """ + if isinstance(data[0], tuple): + data = [data] + if return_list: + result = [] + for sublist in data: + sublist_result = [] + for s in sublist: + # Take the first element (the keyword) of each tuple + first_value = next(iter(s)) + sublist_result.append(first_value) + result.append(sublist_result) + else: + result = set() + for sublist in data: + for s in sublist: + # Take the first element (the keyword) of each tuple + first_value = next(iter(s)) + result.add(first_value) + return result + + def extract_keywords(self, + use_clusters: bool=True, + use_vectorizer: bool=True, + keyphrase_ngram_range: tuple[int, int] = (1,1)) -> Union[set, list]: + """ + Extract keywords from docs. 
+ + Args: + use_clusters (bool, optional): If enabled, semantically similar files are grouped into the same cluster. + Only the first file in each cluster is used for keyword extraction to minimize latency. Defaults to True. + use_vectorizer (bool, optional): Whether to use keyphrase-vectorizers as the vectorizer. Defaults to True. + If set to True, keyphrase_ngram_range is not used. + For details on keyphrase-vectorizers, see https://pypi.org/project/keyphrase-vectorizers/. + keyphrase_ngram_range (tuple[int, int], optional): Length, in words, of the extracted keywords/keyphrases. Defaults to (1,1). + NOTE: This is not used when `use_vectorizer` is True. + + Returns: + Union[set, list]: The top n keywords for the documents + """ + # retrieve keywords + vectorizer = None + if use_vectorizer: + vectorizer = KeyphraseTfidfVectorizer() + keyphrase_ngram_range = None + + if use_clusters and self.use_bert: + _, keywords = self.kw_llm_model.extract_keywords(docs=self.docs, embeddings=torch.as_tensor(self.docs_embed), threshold=0.9, use_maxsum=True, nr_candidates=20, top_n=5, vectorizer=vectorizer, keyphrase_ngram_range=keyphrase_ngram_range) + self.keywords = self.extract_first_values(keywords, return_list=False) + elif use_clusters and self.use_llm: + _, keywords = self.kw_llm_model.extract_keywords(docs=self.docs, embeddings=torch.as_tensor(self.docs_embed), threshold=0.9) + self.keywords = list(set([item for sublist in keywords for item in sublist])) + elif not use_clusters and self.use_bert: + keywords = self.kw_bert_model.extract_keywords(docs=self.docs, keyphrase_ngram_range=keyphrase_ngram_range, use_maxsum=True, nr_candidates=20, top_n=5, vectorizer=vectorizer) + self.keywords = self.extract_first_values(keywords, return_list=False) + elif not use_clusters and self.use_llm: + keywords = self.text_generator.extract_keywords(self.docs) + self.keywords = list(set([item for sublist in keywords for item in sublist])) + return self.keywords + + def save_keywords(self, save_filepath: str) -> 
None: + """ + Save keywords to local path. + + Args: + save_filepath (str): The file path to save keywords. + """ + # save keywords + with open(save_filepath, "wb") as file: + pickle.dump(self.keywords, file) diff --git a/router/src/router.py b/router/src/router.py new file mode 100644 index 00000000..f7bf3b14 --- /dev/null +++ b/router/src/router.py @@ -0,0 +1,199 @@ +### Router +import os, sys, pickle, yaml +from dotenv import load_dotenv +from langchain_core.prompts import PromptTemplate, load_prompt +from langchain.output_parsers import ResponseSchema, StructuredOutputParser +from typing import Union +current_dir = os.path.dirname(os.path.abspath(__file__)) +kit_dir = os.path.abspath(os.path.join(current_dir, '..')) +repo_dir = os.path.abspath(os.path.join(kit_dir, '..')) +sys.path.append(kit_dir) +sys.path.append(current_dir) +sys.path.append(repo_dir) +from utils.model_wrappers.api_gateway import APIGateway +from keyword_extractor import KeywordExtractor +load_dotenv(os.path.join(kit_dir, '.env')) + +def read_keywords(filepath: str) -> Union[set, list]: + """ + Read keywords from local file path. + + Args: + filepath (str): The path of the keyword file. + + Returns: + set | list: the set/list of keywords. + """ + with open(filepath, "rb") as file: + keywords = pickle.load(file) + return keywords + +def read_files(directory: str, extension: str=".txt") -> list: + """ + Read files from directory. + + Args: + directory (str): The directory path that contains files. + extension (str, optional): The extension of the files. Defaults to ".txt". + + Raises: + NotADirectoryError: If the directory does not exist. + + Returns: + list: the list of file contents. 
+ """ + if not os.path.isdir(directory): + raise NotADirectoryError(f"The directory {directory} doesn't exist!") + file_contents = [] + for filename in os.listdir(directory): + if filename.endswith(extension): + file_path = os.path.join(directory, filename) + with open(file_path, 'r', encoding='utf-8') as file: + file_contents.append(file.read()) + return file_contents + +class Router: + def __init__(self, configs: str) -> None: + """ + Initializes the router. + + Args: + configs: The configuration file path. + + Returns: + None + """ + self.configs = self.load_config(configs) + self.init_llm() + self.keywords = None + self.init_router() + + def load_config(self, filename: str) -> dict: + """ + Loads a YAML configuration file and returns its contents as a dictionary. + + Args: + filename: The path to the YAML configuration file. + + Returns: + A dictionary containing the configuration file's contents. + """ + + try: + with open(filename, 'r') as file: + return yaml.safe_load(file) + except FileNotFoundError: + raise FileNotFoundError(f'The YAML configuration file {filename} was not found.') + except yaml.YAMLError as e: + raise RuntimeError(f'Error parsing YAML file: {e}') + + def init_llm(self) -> None: + """ + Initializes the Large Language Model (LLM) based on the specified API. + + Args: + self: The instance of the class. + + Returns: + None + """ + # 1. load models + sambanova_api_key = os.environ.get('SAMBANOVA_API_KEY') + + self.llm = APIGateway.load_llm( + type=self.configs["router"]["type"], + streaming=False, + coe=self.configs['router']["coe"], + do_sample=self.configs['router']["do_sample"], + max_tokens_to_generate=self.configs['router']["max_tokens_to_generate"], + temperature=self.configs['router']["temperature"], + select_expert=self.configs['router']["select_expert"], + process_prompt=False, + sambanova_api_key=sambanova_api_key, + ) + + def init_router(self) -> None: + """ + Initializes the router. 
+ + This method loads the router prompt and keywords, then combines them with the language + model and a JSON output parser. + + Args: + None + + Returns: + None + """ + # create prompt + route_prompt = load_prompt(repo_dir + '/' + self.configs['prompts']['router_prompt']) + + # load/extract keywords for docs + keyword_filepath = os.path.join(repo_dir, self.configs['router']['keyword_path']) + if os.path.isfile(keyword_filepath): + self.keywords = read_keywords(keyword_filepath) + else: + document_path = os.path.join(repo_dir, self.configs['router']['document_folder']) + self.extract_keyword(document_path, save_filepath=keyword_filepath) + + # create output parser + response_schemas = [ + ResponseSchema(name="datasource", description="choose vectorstore or llm"), + ResponseSchema( + name="explanation", + description="explain the reason to choose this datasource.", + ), + ] + output_parser = StructuredOutputParser.from_response_schemas(response_schemas) + + # format prompt + format_instructions = output_parser.get_format_instructions() + prompt = PromptTemplate( + template=route_prompt.template, + input_variables=["query"], + partial_variables={"format_instructions": format_instructions, "keywords": self.keywords}, + ) + + # create LCEL + self.router = prompt | self.llm | output_parser + + def routing(self, query: str) -> str: + """ + Route the user query to either vectorstore or llm. + + Args: + query (str): the user query + + Returns: + str: "vectorstore" or "llm" + """ + results = self.router.invoke({'query': query}) + return results["datasource"] + + def extract_keyword(self, file_folder: str, extension: str=".txt", save_filepath: str = None) -> None: + """ + Extract keywords from documents. + + Args: + file_folder (str): The folder that contains the documents + extension (str, optional): The extension of the files. Defaults to ".txt". + save_filepath (str, optional): The file path to save the keywords. Defaults to None. 
+ """ + # load docs + if os.path.isdir(file_folder): + docs = read_files(file_folder, extension=extension) + else: + raise NotADirectoryError(f'{file_folder} is not a directory.') + + # extract keywords + kw_etr = KeywordExtractor(configs=self.configs, + docs=docs, + use_bert=self.configs['router']['use_bert'], + use_llm=self.configs['router']['use_llm']) + kw_etr.docs_embedding() + self.keywords = kw_etr.extract_keywords(self.configs['router']['use_clusters']) + + # save keywords to local + if save_filepath: + kw_etr.save_keywords(save_filepath) + \ No newline at end of file
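The keyword caching done in `Router.init_router` (read the pickled keywords when the file exists, otherwise extract them and save the result) can be sketched on its own. This is a minimal sketch under stated assumptions, not the kit's code: `load_or_extract_keywords` and `fake_extract` are hypothetical names introduced here, and `fake_extract` merely stands in for `KeywordExtractor.extract_keywords()` so that the example runs without any model or endpoint.

```python
import os
import pickle
import tempfile
from typing import Callable, Union

Keywords = Union[set, list]


def load_or_extract_keywords(cache_path: str, extract: Callable[[], Keywords]) -> Keywords:
    """Return pickled keywords if cache_path exists; otherwise extract and cache them."""
    if os.path.isfile(cache_path):
        with open(cache_path, "rb") as file:
            return pickle.load(file)
    keywords = extract()
    with open(cache_path, "wb") as file:
        pickle.dump(keywords, file)
    return keywords


# Hypothetical stand-in for the real extractor; it records how often it is called.
calls = []


def fake_extract() -> Keywords:
    calls.append(1)
    return {"routing", "keywords"}


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "keywords.pkl")
    first = load_or_extract_keywords(path, fake_extract)   # extracts, then writes the cache
    second = load_or_extract_keywords(path, fake_extract)  # served from the pickle file
```

Because the cache is keyed only on the file path, a stale pickle keeps being used after the documents change; deleting the file forces a fresh extraction.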