Commit
* Minor backend changes: added a PSSCOC docs downloader script, redirected / to /docs for the Swagger docs (a minimal sketch follows this list), updated poetry dependencies, and made the torch-cuda dependency optional, defaulting to torch-cpu.
* Updated README.md
* Added an autofill prompt component: a base component that autofills the prompt message selected by the user.
* Updated Dockerfile & workflow in line with the updated poetry
* Moved login-buttons into ui components
* Updated transitions for the autofill dialog
* Added the autofill dialog to the query page
* Usability improvements to the search function
* Added a few TODOs
* Added a loading spinner to the status icon
* Updated Search to use an interface
* Added autofill prompt for Search
* Updated package version
* Updated pyproject.toml, README, Dockerfile & test
* Updated poetry.lock
* Updated poetry lock and pyproject.toml
* Commented out cpu torch, to install the cuda version only
* Updated the Python test workflow to not install the torch package
* Updated README
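The root-to-Swagger redirect mentioned in the first item can be done with a small FastAPI route. A minimal sketch, assuming a FastAPI app object; the handler name and app setup below are illustrative assumptions, not the repository's actual code, which is not shown in this commit view:

from fastapi import FastAPI
from fastapi.responses import RedirectResponse

app = FastAPI()

@app.get("/", include_in_schema=False)
async def root():
    # Send requests for the bare root URL to the auto-generated Swagger UI at /docs.
    return RedirectResponse(url="/docs")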
Showing 27 changed files with 2,099 additions and 1,338 deletions.
import json
import os

import requests
from bs4 import BeautifulSoup
from doc2docx import convert as convert_doc2docx

"""
A web scraping script to download the Public Sector Standard Conditions of Contract (PSSCOC) documents from the BCA website.
The script creates a folder for each category and downloads the documents into the respective category folder.
It also converts .doc files to .docx, and retrieves the About PSSCOC page info and saves it to a JSON file.
"""

# Website URL
base_url = "https://www1.bca.gov.sg"
docs_endpoint = "/procurement/post-tender-stage/public-sector-standard-conditions-of-contract-psscoc"

# Source Documents Folder
source_doc_dir = "data"
# PSSCOC Page Info Folder
psscoc_page_info_dir = "About PSSCOC"

# Get the PSSCOC documents
def get_psscoc_docs():
    """
    Downloads the PSSCOC documents.
    """
    # Send a GET request to the website URL
    response = requests.get(base_url + docs_endpoint, timeout=10)
    response.raise_for_status()  # Check for HTTP errors

    # Parse the HTML content of the response
    soup = BeautifulSoup(response.content, "html.parser")

    # Find the table element that contains the Conditions of Contract & Downloads links
    table = soup.find("table")

    # Loop through each row of the table element, skipping the header row
    for row in table.find_all("tr")[1:]:
        # Find the first cell of the row, which contains the category name
        first_cell = row.find("td", attrs={"scope": "row"})
        category_name = first_cell.find("strong").text.strip()
        print("Category:", category_name)

        # Create a folder for each category if it doesn't exist
        category_folder = os.path.join(source_doc_dir, category_name)
        if not os.path.exists(category_folder):
            os.makedirs(category_folder)

        # Find the second cell of the row, which contains the href links
        second_cell = row.find("ol")
        # Loop through each list item in the second cell
        for li in second_cell.find_all("li"):
            # Find the href link
            href_link = li.find("a")["href"]
            # If it starts with /docs, it is a relative link
            if href_link.startswith("/docs"):
                href_link = base_url + href_link

            # Get the filename from the href link
            filename = os.path.basename(href_link).split("?")[0]
            print("Downloading:", filename)
            # Send a GET request to the href link
            response = requests.get(href_link, timeout=10)
            response.raise_for_status()  # Check for HTTP errors before saving
            # Write the response content to a file
            with open(os.path.join(category_folder, filename), "wb") as f:
                f.write(response.content)
            print("Saved to:", os.path.join(category_folder, filename))
            # Convert doc to docx
            if filename.endswith(".doc"):
                print("Converting to docx...")
                convert_doc2docx(os.path.join(category_folder, filename))
                print(
                    "Converted to:",
                    os.path.join(category_folder, filename + "x"),
                )
                # Remove the original doc file
                os.remove(os.path.join(category_folder, filename))
        # Line break between categories
        print("-" * 100)

def get_psscoc_page_info():
    """
    Gets the About PSSCOC page info and saves it to a JSON file.
    """
    print("Getting PSSCOC Page Info...")
    # Send a GET request to the website URL
    response = requests.get(base_url + docs_endpoint, timeout=10)
    response.raise_for_status()  # Check for HTTP errors

    # Parse the HTML content of the response
    soup = BeautifulSoup(response.content, "html.parser")

    # Extract the necessary HTML elements
    mid_body = soup.find("div", attrs={"class": "mid"})

    cleaned_results = {}

    # Extract the title from the mid_body
    title = mid_body.find("div", attrs={"class": "title"}).text.strip()
    cleaned_results["Title"] = title

    # Extract the flow content from the mid_body
    flow_content = mid_body.find("ul", attrs={"class": "rsmFlow rsmLevel rsmOneLevel"})
    flow = [li.text.strip() for li in flow_content.find_all("li")]
    cleaned_results["Stage of PSSCOC"] = " > ".join(flow[1:])

    # Extract the sfContentBlock from the mid_body
    sf_content_block = mid_body.find("div", attrs={"class": "sfContentBlock"})

    # Extract all the paragraphs from the sfContentBlock, but not nested paragraphs
    paragraphs = sf_content_block.find_all("p", recursive=False)
    paragraphs_text = [p.text.strip() for p in paragraphs]
    cleaned_results["About"] = paragraphs_text[0]
    more_about = {}
    more_about[paragraphs_text[1]] = (
        paragraphs_text[2]
        .replace("\n", " ")
        .replace("\r", " ")
        .replace("\u00a0", " ")
        .strip()
    )

    # Extract the ul from the sfContentBlock
    ul_content = sf_content_block.find("ul")
    ul_li = [li.text.strip() for li in ul_content.find_all("li")]
    more_about[paragraphs_text[3]] = ul_li  # title for the 3rd paragraph

    cleaned_results["More About"] = more_about

    # Extract the table from the sfContentBlock
    table = sf_content_block.find("table")
    table_rows = table.find_all("tr")
    header_row = [td.text.strip() for td in table_rows[0].find_all("td")]
    # Override the scraped header with fixed column names
    header_row = ["Category Name", "Category File Names"]
    table_data_list = []
    for row in table_rows[1:]:
        row_data = [td.text.strip() for td in row.find_all("td")]
        row_data[-1] = row_data[-1].split("\n")
        # Remove empty strings
        row_data[-1] = [x.strip() for x in row_data[-1] if x]
        table_data = dict(zip(header_row, row_data))
        table_data_list.append(table_data)
    cleaned_results["Categories of PSSCOC"] = table_data_list

    # Extract all the divs from the sfContentBlock, but not nested divs
    # divs = sf_content_block.find_all("div", recursive=False)

    # Save the json results content to a file
    page_info_folder = os.path.join(source_doc_dir, psscoc_page_info_dir)
    file_name = "About PSSCOC.json"
    # Create a folder for page info if it doesn't exist
    if not os.path.exists(page_info_folder):
        os.makedirs(page_info_folder)
    # Save the results content to a file
    with open(os.path.join(page_info_folder, file_name), "wb") as f:
        f.write(json.dumps(cleaned_results, indent=4).encode("utf-8"))
    print(
        "Saved to:",
        os.path.join(page_info_folder, file_name),
    )


# Main function
def main():
    """
    Main function.
    """
    try:
        # Get the PSSCOC documents
        get_psscoc_docs()
        # Get the About PSSCOC page info
        get_psscoc_page_info()
    except requests.exceptions.RequestException as e:
        print("Error: Failed to make a request to the website.")
        print(e)
    except Exception as e:
        print("An unexpected error occurred:")
        print(e)


if __name__ == "__main__":
    main()
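The two entry points can also be invoked separately. A minimal usage sketch, assuming the script above is saved as psscoc_downloader.py (the actual filename is not shown in this view) and that requests, beautifulsoup4 and doc2docx are installed:

# psscoc_downloader is an assumed module name for the script above.
from psscoc_downloader import get_psscoc_docs, get_psscoc_page_info

get_psscoc_docs()        # downloads each category's files into data/<Category Name>/
get_psscoc_page_info()   # writes data/About PSSCOC/About PSSCOC.json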