Autofill Prompt (#16)
* Minor Backend Changes

Added PSSCOC docs downloader script.
Redirect / to /docs for swagger docs.
Updated poetry dependencies.
Updated torch-cuda dependency as optional, defaulting to torch-cpu.

* Updated README.md

* Added Autofill Prompt Component

Added a base component to autofill the prompt message selected by the user.

* Update Dockerfile & workflow in line with updated poetry

* Moved login-buttons into ui components

* Updated transitions for autofill dialog

* Added autofill dialog to query page

* Usability improvements to search function

* Added a few TODOs

* Added loading spinner to status icon

* Update Search to use interface

* Added Autofill Prompt for Search

* Updated Package Ver

* Updated pyproject.toml, README, Dockerfile & test

* Updated poetry.lock

* Updated poetry lock and pyproject.toml

* Commented out cpu torch, to install cuda version only.

* Updated python test workflow to not install torch package

* Updated README
xKhronoz authored Feb 27, 2024
1 parent 7091242 commit 7d9d30d
Showing 27 changed files with 2,099 additions and 1,338 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/python-tests.yml
@@ -33,7 +33,7 @@ jobs:
run: |
# python -m pip install --upgrade pip setuptools wheel
# python -m pip install poetry
poetry install --without torch-cuda
poetry install --with dev
- name: Lint with flake8
working-directory: ./backend
run: |
5 changes: 3 additions & 2 deletions Dockerfile
@@ -40,8 +40,9 @@ ENV CUDA_DOCKER_ARCH=all \
# Set the uvicorn env
ENVIRONMENT=prod \
##########################################################
# Build llama-cpp-python with cuda support
# # Build llama-cpp-python with cuda support
# CMAKE_ARGS="-DLLAMA_CUBLAS=on"
##########################################################
# Build llama-cpp-python with openblas support on CPU
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
##########################################################
@@ -62,7 +63,7 @@ COPY --chown=user ./backend/pyproject.toml ./backend/poetry.lock $HOME/app/
COPY --chown=user ./backend $HOME/app

# Install the dependencies
RUN poetry install --without dev,torch-cpu && \
RUN poetry install --with torch-cuda && \
rm -rf /tmp/poetry_cache

# Change to the package directory
3 changes: 1 addition & 2 deletions README.md
@@ -50,8 +50,6 @@ pinned: false
Smart Retrieval is a platform for efficient and streamlined information retrieval, especially in the realm of legal and compliance documents.
With the power of Open-Source Large Language Models (LLM) and Retrieval Augmented Generation (RAG), it aims to enhance user experiences at JTC by addressing key challenges such as manual search inefficiencies and rigid file naming conventions, revolutionizing the way JTC employees access and comprehend crucial documents.

Project files bootstrapped with [`create-llama`](https://github.com/run-llama/LlamaIndexTS/tree/main/packages/create-llama).

## 🏁 Getting Started <a name = "getting_started"></a>

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See [deployment](#deployment) for notes on how to deploy the project on a live system.
@@ -76,6 +74,7 @@ For more information, see the [DEPLOYMENT.md](./DEPLOYMENT.md).
- [Python](https://python.org/) - Backend Server Environment
- [FastAPI](https://fastapi.tiangolo.com/) - Backend API Web Framework
- [LlamaIndex](https://www.llamaindex.ai/) - Data Framework for LLM
- [`create-llama`](https://github.com/run-llama/LlamaIndexTS/tree/main/packages/create-llama) - LlamaIndex Application Bootstrap Tool

## 📑 Contributing <a name = "contributing"></a>

24 changes: 18 additions & 6 deletions backend/README.md
@@ -2,6 +2,8 @@

The backend is built using Python & [FastAPI](https://fastapi.tiangolo.com/) bootstrapped with [`create-llama`](https://github.com/run-llama/LlamaIndexTS/tree/main/packages/create-llama).

To get started, first install the required dependencies listed in the `Requirements` section below, then follow the `Getting Started` section.

## Requirements

1. Python >= 3.11
@@ -21,9 +23,14 @@ The backend is built using Python & [FastAPI](https://fastapi.tiangolo.com/) boo

## Getting Started

First, if you want to use the CUDA version of PyTorch, ensure you have the correct version of CUDA (`cuda > 12.1`) installed. You can check this by running `nvcc --version` or `nvidia-smi` in your terminal. If you do not have CUDA installed, you can install it from [here](https://developer.nvidia.com/cuda-downloads).
First, if you want to use the CUDA version of PyTorch, ensure you have the correct version of CUDA (`cuda > 12.1`) installed. You can check this by running `nvcc --version` or `nvidia-smi` in your terminal; `nvcc --version` should show whether CUDA has been installed properly. If you do not have CUDA installed, you can install it from [here](https://developer.nvidia.com/cuda-downloads).

- You may need to add CUDA to your PATH; instructions for this can be found online.
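
The PATH check described above can also be scripted. The following is a minimal sketch using only the Python standard library; the function names `find_cuda_tools` and `nvcc_version_output` are hypothetical and are not part of this repository. It only reports whether the tools are reachable and what `nvcc --version` prints; it does not try to parse or validate the CUDA version.

```python
import shutil
import subprocess


def find_cuda_tools():
    """Map each CUDA command-line tool to its location on PATH (None if missing)."""
    return {tool: shutil.which(tool) for tool in ("nvcc", "nvidia-smi")}


def nvcc_version_output():
    """Return the text printed by `nvcc --version`, or None if nvcc cannot be run."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    try:
        result = subprocess.run(
            [nvcc, "--version"],
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.strip() or None


if __name__ == "__main__":
    for tool, location in find_cuda_tools().items():
        print(f"{tool}: {location or 'not found on PATH'}")
    print(nvcc_version_output() or "nvcc did not report a version")
```

If `nvcc` is reported as missing while `nvidia-smi` is found, the driver is likely installed but the CUDA toolkit is either absent or not on your PATH.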

Ensure you have followed the steps in the `requirements` section above.

Ensure you have followed the steps in the requirements section above.
- If on Windows, make sure you run the commands in PowerShell.
- Add conda to your PATH; instructions can be found [here](https://stackoverflow.com/questions/64149680/how-can-i-activate-a-conda-environment-from-powershell).

Then activate the conda environment:

@@ -33,24 +40,29 @@ conda activate SmartRetrieval

Second, setup the environment:

```bash
# Only run one of the following commands:
```powershell
# Choose only one of the options below, depending on whether you have a CUDA-enabled GPU.
# If running on Windows, make sure you are running the commands in PowerShell.
# -----------------------------------------------
# Install dependencies and torch (cpu version)
# Go to the backend directory and edit the pyproject.toml file to uncomment the `torch-cpu` poetry section
# -----------------------------------------------
# Windows: Set env for llama-cpp-python with openblas support on cpu
$env:CMAKE_ARGS = "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
# Linux: Set env for llama-cpp-python with openblas support on cpu (export so that poetry sees it)
export CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
# Then:
poetry install --without torch-cuda
poetry install --with torch-cpu
# -----------------------------------------------
# Install dependencies and torch (cuda version)
# Installing torch with cuda support on a system without cuda support is also possible.
# -----------------------------------------------
# Windows: Set env for llama-cpp-python with cuda support on gpu
$env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
# Linux: Set env for llama-cpp-python with cuda support on gpu (export so that poetry sees it)
export CMAKE_ARGS="-DLLAMA_CUBLAS=on"
# Then:
poetry install --without torch-cpu
poetry install --with torch-cuda
```

```bash
3 changes: 2 additions & 1 deletion backend/backend/app/utils/contants.py
@@ -7,7 +7,7 @@

# Model Constants
MAX_NEW_TOKENS = 4096
CONTEXT_SIZE = MAX_NEW_TOKENS
CONTEXT_SIZE = 3900 # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
DEVICE_TYPE = "cuda" if is_cuda_available() else "cpu"

# Get the current directory
@@ -18,6 +18,7 @@

# LLM Model Constants
LLM_MODEL_URL = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf"
LLM_TEMPERATURE = 0.1
# Model Kwargs
# set to at least 1 to use GPU, adjust according to your GPU memory, but must be able to fit the model
MODEL_KWARGS = {"n_gpu_layers": 100} if DEVICE_TYPE == "cuda" else {}
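
A note on the two size constants in this file: if the context window is read as covering both the prompt and the generated tokens, then `MAX_NEW_TOKENS = 4096` asks for more output than `CONTEXT_SIZE = 3900` can hold. The sketch below only illustrates that arithmetic; the helper names `prompt_token_budget` and `clamp_new_tokens` are hypothetical, and how `LlamaCPP` actually treats these values is not shown in this diff.

```python
MAX_NEW_TOKENS = 4096  # value from the diff
CONTEXT_SIZE = 3900  # value from the diff


def prompt_token_budget(context_size: int, max_new_tokens: int) -> int:
    """Tokens left for the prompt after reserving the requested output length."""
    return context_size - max_new_tokens


def clamp_new_tokens(context_size: int, max_new_tokens: int, prompt_tokens: int) -> int:
    """Largest output length that still fits beside a prompt of `prompt_tokens` tokens."""
    return max(0, min(max_new_tokens, context_size - prompt_tokens))


if __name__ == "__main__":
    # A negative budget means the requested output alone exceeds the window.
    print("Prompt budget:", prompt_token_budget(CONTEXT_SIZE, MAX_NEW_TOKENS))
    # With a hypothetical 1000-token prompt, this many output tokens would fit.
    print("Clamped output:", clamp_new_tokens(CONTEXT_SIZE, MAX_NEW_TOKENS, 1000))
```

Under that reading, a smaller `MAX_NEW_TOKENS` (or a per-request clamp like the one above) would keep the two settings consistent.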
14 changes: 3 additions & 11 deletions backend/backend/app/utils/index.py
@@ -28,6 +28,7 @@
EMBED_MODEL_NAME,
EMBED_POOLING,
LLM_MODEL_URL,
LLM_TEMPERATURE,
MAX_NEW_TOKENS,
MODEL_KWARGS,
NUM_OUTPUT,
@@ -36,12 +37,11 @@

llm = LlamaCPP(
    model_url=LLM_MODEL_URL,
    temperature=0.1,
    temperature=LLM_TEMPERATURE,
    max_new_tokens=MAX_NEW_TOKENS,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=CONTEXT_SIZE,
    # kwargs to pass to __call__()
    # generate_kwargs={},
    generate_kwargs={},
    # kwargs to pass to __init__()
    model_kwargs=MODEL_KWARGS,
    # transform inputs into Llama2 format
@@ -50,14 +50,6 @@
    verbose=True,
)

# define prompt helper
# set maximum input size
max_input_size = 4096
# set number of output tokens
num_output = 256
# set maximum chunk overlap
max_chunk_overlap = 0.2

embed_model = HuggingFaceEmbedding(
model_name=EMBED_MODEL_NAME,
pooling=EMBED_POOLING,
186 changes: 186 additions & 0 deletions backend/backend/get_PSSCOC_docs.py
@@ -0,0 +1,186 @@
import json
import os

import requests
from bs4 import BeautifulSoup
from doc2docx import convert as convert_doc2docx

"""
A web scraping script to download the Public Sector Standard Conditions of Contract (PSSCOC) documents from the BCA website.
The script will create a folder for each category and download the documents into the respective category folder.
The script will also convert doc files to docx files.
The script will also get the About PSSCOC page info and save it to a JSON file.
"""

# Website URL
base_url = "https://www1.bca.gov.sg"
docs_endpoint = "/procurement/post-tender-stage/public-sector-standard-conditions-of-contract-psscoc"

# Source Documents Folder
source_doc_dir = "data"
# PSSCOC Page Info Folder
psscoc_page_info_dir = "About PSSCOC"


# Get the PSSCOC documents
def get_psscoc_docs():
    """
    Downloads the PSSCOC documents.
    """
    # Send a GET request to the website URL
    response = requests.get(base_url + docs_endpoint, timeout=10)
    response.raise_for_status()  # Check for HTTP errors

    # Parse the HTML content of the response
    soup = BeautifulSoup(response.content, "html.parser")

    # Find the table element that contains the Conditions of Contract & Downloads link
    table = soup.find("table")

    # Loop through each row of the table element, skipping the first row
    for row in table.find_all("tr")[1:]:
        # Find the first cell of the row that contains the category name
        first_cell = row.find("td", attrs={"scope": "row"})
        category_name = first_cell.find("strong").text.strip()
        print("Category:", category_name)

        # Create a folder for each category if it doesn't exist
        category_folder = os.path.join(source_doc_dir, category_name)
        if not os.path.exists(category_folder):
            os.makedirs(category_folder)

        # Find the second cell of the row that contains the href links
        second_cell = row.find("ol")
        # Loop through each list item in the second cell
        for li in second_cell.find_all("li"):
            # Find the href link
            href_link = li.find("a")["href"]
            # if starts with /docs, then it is a relative link
            if href_link.startswith("/docs"):
                href_link = base_url + href_link

            # Get the filename from the href link
            filename = os.path.basename(href_link).split("?")[0]
            print("Downloading:", filename)
            # Send a GET request to the href link
            response = requests.get(href_link, timeout=10)
            # Write the response content to a file
            with open(os.path.join(category_folder, filename), "wb") as f:
                f.write(response.content)
            print("Saved to:", os.path.join(category_folder, filename))
            # convert doc to docx
            if filename.endswith(".doc"):
                print("Converting to docx...")
                convert_doc2docx(os.path.join(category_folder, filename))
                print(
                    "Converted to:",
                    os.path.join(category_folder, filename + "x"),
                )
                # remove the original doc file
                os.remove(os.path.join(category_folder, filename))
        # line break
        print("-" * 100)
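
The download loop above builds URLs and filenames with string operations (`startswith("/docs")`, `split("?")`, `filename + "x"`). As a possible alternative, here is a minimal sketch that does the same three steps with `urllib.parse` and `pathlib` from the standard library. The helper names `resolve_href`, `filename_from_url`, and `docx_name` are hypothetical, and the sample URLs in the usage lines are made up for illustration; they are not taken from the BCA site.

```python
import posixpath
from pathlib import PurePosixPath
from urllib.parse import unquote, urljoin, urlsplit

BASE_URL = "https://www1.bca.gov.sg"  # same value as base_url in the script


def resolve_href(href: str, base_url: str = BASE_URL) -> str:
    """Turn a relative href into an absolute URL; absolute URLs pass through unchanged."""
    return urljoin(base_url + "/", href)


def filename_from_url(url: str) -> str:
    """Take the last path segment, ignoring the query string and decoding %-escapes."""
    path = urlsplit(url).path
    return unquote(posixpath.basename(path))


def docx_name(filename: str) -> str:
    """Replace the file extension with .docx."""
    return str(PurePosixPath(filename).with_suffix(".docx"))


if __name__ == "__main__":
    # Hypothetical href, shaped like the relative links the script expects.
    url = resolve_href("/docs/default-source/sample/my%20file.doc?sfvrsn=1")
    name = filename_from_url(url)
    print(url)
    print(name)
    print(docx_name(name))
```

One difference from the committed code is that `filename_from_url` decodes percent-escapes, so saved filenames would contain spaces rather than `%20`; whether that is desirable depends on how the files are consumed downstream.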


def get_psscoc_page_info():
    """
    Gets the About PSSCOC page info and saves it to a JSON file.
    """
    print("Getting PSSCOC Page Info...")
    # Send a GET request to the website URL
    response = requests.get(base_url + docs_endpoint, timeout=10)
    response.raise_for_status()  # Check for HTTP errors

    # Parse the HTML content of the response
    soup = BeautifulSoup(response.content, "html.parser")

    # Extract the necessary HTML elements
    mid_body = soup.find("div", attrs={"class": "mid"})

    cleaned_results = {}

    # Extract title from the mid_body
    title = mid_body.find("div", attrs={"class": "title"}).text.strip()
    cleaned_results["Title"] = title

    # Extract the flow content from the mid_body
    flow_content = mid_body.find("ul", attrs={"class": "rsmFlow rsmLevel rsmOneLevel"})
    flow = [li.text.strip() for li in flow_content.find_all("li")]
    cleaned_results["Stage of PSSCOC"] = " > ".join(flow[1:])

    # Extract sfContentBlock from the mid_body
    sf_content_block = mid_body.find("div", attrs={"class": "sfContentBlock"})

    # Extract all the paragraphs from the sfContentBlock but not nested paragraphs
    paragraphs = sf_content_block.find_all("p", recursive=False)
    paragraphs_text = [p.text.strip() for p in paragraphs]
    cleaned_results["About"] = paragraphs_text[0]
    more_about = {}
    more_about[paragraphs_text[1]] = (
        paragraphs_text[2]
        .replace("\n", " ")
        .replace("\r", " ")
        .replace("\u00a0", " ")
        .strip()
    )

    # Extract ul from the sfContentBlock
    ul_content = sf_content_block.find("ul")
    ul_li = [li.text.strip() for li in ul_content.find_all("li")]
    more_about[paragraphs_text[3]] = ul_li  # title for 3rd paragraph

    cleaned_results["More About"] = more_about

    # Extract the table from the sfContentBlock
    table = sf_content_block.find("table")
    table_rows = table.find_all("tr")
    header_row = [td.text.strip() for td in table_rows[0].find_all("td")]
    # Override the scraped headers with fixed, descriptive key names
    header_row = ["Category Name", "Category File Names"]
    table_data_list = []
    for row in table_rows[1:]:
        row_data = [td.text.strip() for td in row.find_all("td")]
        row_data[-1] = row_data[-1].split("\n")
        # remove empty strings
        row_data[-1] = [x.strip() for x in row_data[-1] if x]
        table_data = dict(zip(header_row, row_data))
        table_data_list.append(table_data)
    cleaned_results["Categories of PSSCOC"] = table_data_list

    # Extract all the divs from the sfContentBlock but not nested divs
    # divs = sf_content_block.find_all("div", recursive=False)

    # Save the json results content to a file
    page_info_folder = os.path.join(source_doc_dir, psscoc_page_info_dir)
    file_name = "About PSSCOC.json"
    # Create a folder for page info if it doesn't exist
    if not os.path.exists(page_info_folder):
        os.makedirs(page_info_folder)
    # Save the results content to a file
    with open(os.path.join(page_info_folder, file_name), "wb") as f:
        f.write(json.dumps(cleaned_results, indent=4).encode("utf-8"))
    print(
        "Saved to:",
        os.path.join(page_info_folder, file_name),
    )
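
The save step at the end of `get_psscoc_page_info` could also be written in text mode. The sketch below is one way to do it, assuming it is acceptable to keep non-ASCII characters readable in the output (`ensure_ascii=False`), which differs from the committed code's default escaping. The helper name `save_json` is hypothetical.

```python
import json
import os
import tempfile


def save_json(data: dict, folder: str, file_name: str) -> str:
    """Write `data` as UTF-8 JSON into `folder`, creating the folder if needed."""
    # exist_ok=True replaces the separate os.path.exists() check
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, file_name)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)
    return path


if __name__ == "__main__":
    # Write into a temporary directory so the demo leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        folder = os.path.join(tmp, "About PSSCOC")
        saved = save_json({"Title": "About PSSCOC"}, folder, "About PSSCOC.json")
        print("Saved to:", saved)
```

Returning the path keeps the "Saved to:" message in the caller, which makes the helper easier to reuse and test.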


# Main function
def main():
    """
    Main function.
    """
    try:
        # Get the PSSCOC documents
        get_psscoc_docs()
        # Get the About PSSCOC page info
        get_psscoc_page_info()
    except requests.exceptions.RequestException as e:
        print("Error: Failed to make a request to the website.")
        print(e)
    except Exception as e:
        print("An unexpected error occurred:")
        print(e)


if __name__ == "__main__":
    main()
7 changes: 7 additions & 0 deletions backend/backend/main.py
@@ -4,6 +4,7 @@
from dotenv import load_dotenv
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import RedirectResponse
from torch.cuda import is_available as is_cuda_available

from backend.app.api.routers.chat import chat_router
@@ -55,3 +56,9 @@

# Try to create the index first on startup
create_index()


# Redirect to the /docs endpoint
@app.get("/")
async def docs_redirect():
    return RedirectResponse(url="/docs")