Current state of app and repo #4

Merged: 75 commits, merged on Dec 16, 2024

Commits
e0aa2eb
First App Commit Concurrent Test
Nov 27, 2024
587368a
Tweaked workflow
Nov 27, 2024
4ffa301
Trying again
Nov 27, 2024
848af7c
test
Nov 27, 2024
bfd18d6
Initial sample chainlit
Nov 27, 2024
59d29cb
formatted README
Nov 27, 2024
74eb75e
Added sample knowledgebase
Nov 27, 2024
91b7015
Sample Chainlit app placeholder
Nov 27, 2024
cb13d1e
Updated Dockerfile with same port as chainlit app
Nov 27, 2024
6c1e5e3
Exposing different port
Nov 27, 2024
69ec05a
updating app.py; author: Enrico Collautti [email protected]
b3v Dec 3, 2024
0283da1
adding testing script
b3v Dec 4, 2024
a0b734e
changed embedding model to HuggingFace mpnet
b3v Dec 4, 2024
540424a
changing from chainlit to streamlit
b3v Dec 4, 2024
09ac7fa
adjust streamlit rag app and dependencies files
b3v Dec 5, 2024
9b667a3
test change to footballplayers.json
Dec 5, 2024
08c6b0b
test hf concurrence
Dec 5, 2024
b296661
loading scripts and app stuff
Dephoh Dec 9, 2024
d914bb1
Merge branch 'app_dev' of https://github.com/BU-Spark/ml-bpl-rag into…
Dephoh Dec 9, 2024
bae6517
Our current state of writeup including instructions on how to run a p…
Dephoh Dec 9, 2024
ecb3a4c
oops, fixed the reqs
Dephoh Dec 9, 2024
89e7a42
oops x 2, fixed reqs
Dephoh Dec 9, 2024
26a5c64
Added query alignment to retrieval
Dephoh Dec 9, 2024
1ba93a0
Added query alignment
Dephoh Dec 9, 2024
fea960e
Updated xml parse call for query alignment
Dephoh Dec 9, 2024
b609620
passed original query to reranker
Dephoh Dec 9, 2024
bfbadc3
Added timers, trying to figure out slowness
Dephoh Dec 10, 2024
16f0715
moved vectorstore initialization outside of RAG
Dephoh Dec 10, 2024
fa714bc
transferring initializations to st.session_state
Dephoh Dec 10, 2024
83b3fa4
fixed input errors... hopefully
Dephoh Dec 10, 2024
148d8c2
fixed input errors pt 2
Dephoh Dec 10, 2024
ea5cbee
added time logging to RAG.py
Dephoh Dec 10, 2024
3e602d9
more robust time analysis for reranking
Dephoh Dec 10, 2024
d1935d1
removed prompt alignment
Dephoh Dec 10, 2024
bedac8d
Usage Monitoring
Dephoh Dec 10, 2024
30a4d62
Updated aesthetics on README
Dephoh Dec 10, 2024
ca4996e
updated monitoring
Dephoh Dec 10, 2024
ae94c4e
Create Product_Testing.md
EnricoCollautti Dec 10, 2024
8274e73
Changed some text
Dephoh Dec 10, 2024
2b87ca1
Merge branch 'app_dev' of https://github.com/BU-Spark/ml-bpl-rag into…
Dephoh Dec 10, 2024
cf15415
added image feature
b3v Dec 10, 2024
29b3738
added image display capabilities
b3v Dec 10, 2024
09fe857
updated correct file
b3v Dec 10, 2024
f78d694
upload correct file
b3v Dec 10, 2024
7a5560a
update requirements
b3v Dec 10, 2024
1d172f1
Solution for too large input
Dephoh Dec 11, 2024
07f86c6
moved length checking upstream to retrieval
Dephoh Dec 11, 2024
3eedce1
added ongoing challenges/future considerations to writeup.md
Dephoh Dec 11, 2024
4ef2f39
Updated default retrieval text
Dephoh Dec 11, 2024
2329708
added Digital Commonwealth context
Dephoh Dec 11, 2024
42a5a2d
testing new fields for reranking and scaling up initial retrieved docs
Dephoh Dec 11, 2024
ae94932
returned retrieved back to 100, attempt at more structured queries
Dephoh Dec 11, 2024
48ae0fa
added n workers parallelism to metadata fetching
Dephoh Dec 11, 2024
8547e22
accidentally removed extract_text_fromo_json
Dephoh Dec 11, 2024
706b16d
returning ranked docs only
Dephoh Dec 11, 2024
0716283
1.5 workers for parallel. Avoiding rate limits
Dephoh Dec 11, 2024
5c86e88
back to n workers set to 1. rate limiting
Dephoh Dec 11, 2024
2b516d5
reverted back to pre-parallel version
Dephoh Dec 11, 2024
a9e136f
Finishing touches and cleanup
Dephoh Dec 12, 2024
364893a
Forgot to include everyone on last commit + old scripts
Dephoh Dec 12, 2024
533edbd
Delete app.py
Dephoh Dec 13, 2024
d094b68
Delete app1.1.py
Dephoh Dec 13, 2024
3ac9c32
Delete EVALUATION.md
Dephoh Dec 13, 2024
1e0a968
Rename Product_Testing.md to EVALUATION.md
Dephoh Dec 13, 2024
b842841
Delete bpl_scraper.py
Dephoh Dec 13, 2024
a0a747d
Delete knowledge.txt
Dephoh Dec 13, 2024
388d2d5
Delete football_players.json
Dephoh Dec 13, 2024
1e9dc21
Delete new_streamlit.py
Dephoh Dec 13, 2024
db7bb3d
Delete streamlit-rag-app.py
Dephoh Dec 13, 2024
8e53b79
Delete test.txt
Dephoh Dec 13, 2024
62e78cf
Update faiss_migrate.ipynb
Dephoh Dec 13, 2024
4664331
Copied contents of WRITEUP into README
Dephoh Dec 16, 2024
2a7ca6c
Final reorganizing and adding some of our previous work
Dec 16, 2024
559abae
Delete PoC directory
Dephoh Dec 16, 2024
c6230d9
Merge branch 'dev' into app_dev
Dephoh Dec 16, 2024
41 changes: 41 additions & 0 deletions .github/workflows/hf_concurrent.yml
name: Push Production to HuggingFace

on:
  push:
    branches: [app_dev]

  # run this workflow manually from the Actions tab
  workflow_dispatch:

jobs:
  sync-to-hub:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
          lfs: true

      - name: Deploy Production (app_dev) to HuggingFace
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: git push --force https://DF38:[email protected]/spaces/spark-ds549/LibRAG app_dev:main
38 changes: 38 additions & 0 deletions Dockerfile
FROM python:3.12.4

# Create a non-root user
RUN useradd -m -u 1000 user
USER user

# Set PATH to include user's local bin
ENV PATH="/home/user/.local/bin:$PATH"

# Set working directory
WORKDIR /app

# Copy requirements file with appropriate ownership
COPY --chown=user ./requirements.txt requirements.txt

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install rank_bm25

# Copy application files with appropriate ownership
COPY --chown=user . /app

# Set environment variables for Streamlit
ENV HOST=0.0.0.0
ENV PORT=7860
ENV STREAMLIT_SERVER_PORT=7860
ENV STREAMLIT_SERVER_ADDRESS=0.0.0.0

# Run the Streamlit app on port 7860
CMD ["streamlit", "run", "streamlit_app.py", "--server.port", "7860", "--server.address", "0.0.0.0"]
98 changes: 98 additions & 0 deletions EVALUATION.md
# Analysis of RAG Responses

This document analyzes the performance of a Retrieval-Augmented Generation (RAG) system tested on queries grouped into three typologies: specific queries, vague queries, and gibberish or unknown facts, plus a brief test of a query in another language (Italian). Below are detailed insights into each category, based on the test results and observations.

## Specific Queries

### Examples

- **Query:** Where is boston?

**Response:** Boston is located in the state of Massachusetts, in Suffolk County, United States, which is part of North and Central America.

**Source:** While the response mentions Suffolk County, the sources only contained this as a keyword and did not explicitly state the connection.

- **Query:** Who was Hitler?

**Response:** Sorry, I was unable to find any documents relevant to your query.

  **Source:** The sources contained information about World War II but did not specifically address Hitler, even though the BPL does include material about him.

- **Query:** Do you know Ken Follet?

**Response:** Based on the provided context, there is no information available about Ken Follet.

**Sources:** The sources retrieved were not related to Ken Follett. It appears that the Boston Public Library (BPL) does not have any material about him.

- **Query:** Do you know the book 1984?

**Response:** Sorry, I was unable to find any documents relevant to your query.

**Sources:** The retrieved sources were not related to 1984. The BPL database lacks content specific to this book.

### Notes
The RAG system can easily respond with, "Sorry, I was unable to find any documents relevant to your query," when it lacks specific information to generate a meaningful answer. However, when it does provide answers, they are typically well-constructed and accurate. For the RAG to function effectively, the documents it accesses must also be highly specific. For example, while there may be numerous images of Hitler, the absence of descriptive text accompanying them makes it impossible for the RAG to provide a comprehensive answer. Additionally, since we are working with the Digital Commonwealth rather than a traditional library catalog, it is understandable that there might not be sufficient material on topics like Ken Follett or the book 1984.

## Vague Queries

### Examples
- **Query:** Tell me about neurology.

**Response:** Sorry, I was unable to find any documents relevant to your query.

**Sources:** While the response indicates no relevant documents, the sources reportedly contained meaningful information about neurology.

- **Query:** Boston.

**Response:** The context provides a comprehensive overview of Boston, highlighting its historical and geographical aspects. It includes a south-east view of the city, a view of the State Capitol, and a depiction of the regional open space system. It also provides historical maps of Boston, including one from the 18th century that shows the city's maritime economy and settlement patterns, and a re-issue of this map from 1835. The context also mentions various landmarks, such as the Ether Monument Public Garden, the site of Anne Hutchinson’s house, and the John Hancock house. The city's history is also touched upon, with references to its settlement by Europeans in 1630 and its previous habitation by Native Americans.

**Sources:** The sources are highly relevant and align well with the query, providing detailed historical and geographical information about Boston.

- **Query:** What happened in 1984?

**Response:** Sorry, I was unable to find any documents relevant to your query.

**Sources:** All retrieved sources were editions of the same journal from 1984. While temporally related, the content did not address the question directly.

### Notes
Vague queries pose an interesting challenge for the RAG system. The RAG system often responds with, "Sorry, I was unable to find any documents relevant to your query," even when it retrieves some interesting and relevant sources. It appears that only sufficiently descriptive sources enable the system to generate proper answers. When it does provide an answer, it is typically accurate and on point. However, the database is structured in a way that allows multiple editions of the same journal to exist, leading the RAG to retrieve all editions as meaningful sources, which can be somewhat redundant. To improve the user experience, it would be helpful to limit these cases by reporting only one edition of a journal.
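
One way to act on the single-edition suggestion above would be to deduplicate the reranked documents before they are shown to the user. The sketch below is illustrative only: it assumes each document's metadata carries a `title` field (the current `RAG.py` stores only `source`, `field`, and `URL`), and `dedupe_by_title` is a hypothetical helper rather than code from this PR.

```python
from typing import List
from langchain_core.documents import Document

def dedupe_by_title(docs: List[Document]) -> List[Document]:
    """Keep only the first-ranked document for each title (hypothetical helper)."""
    seen = set()
    unique: List[Document] = []
    for doc in docs:
        # Fall back to a normalized content prefix if no title metadata is available.
        title = str(doc.metadata.get("title", doc.page_content[:80])).strip().lower()
        if title not in seen:
            seen.add(title)
            unique.append(doc)
    return unique
```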

## Gibberish and Unknown Facts

### Examples
- **Query:** Do you know De Bruyne?

**Response:** Sorry, I was unable to find any documents relevant to your query.

**Sources:** As expected, the response indicates no knowledge of De Bruyne. The retrieved sources were about someone with a similar name.

- **Query:** Qwghkjhgf.

**Response:** Sorry, I was unable to find any documents relevant to your query.

**Sources:** The retrieved sources were random, as expected for a nonsensical query.

- **Query:** Hello, I really like dogs, can you give me some dogs please? Thank you!

**Response:** Sorry, I was unable to find any documents relevant to your query.

**Sources:** The retrieved sources were about dogs.

### Notes
The system is behaving as expected: when no relevant answer can be generated, the sources retrieved are either random or associated based on similarity in names or words. For instance, a query about "de Bruyne" results in "Les femmes de Bruges," as these are the most similar terms in the database. (Since de Bruyne is a contemporary football player, it is entirely reasonable that the Digital Commonwealth does not contain any information about him.)
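
This name-similarity effect can be checked directly against the embedding model. The sketch below is a standalone illustration, not code from this PR; the model name `sentence-transformers/all-mpnet-base-v2` is an assumption based on the "HuggingFace mpnet" commit, and the candidate titles are invented for the example.

```python
import numpy as np
from langchain_huggingface import HuggingFaceEmbeddings

# Assumed model; the commit history only says the embeddings were switched to "HuggingFace mpnet".
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

query_vec = np.array(embeddings.embed_query("Do you know De Bruyne?"))
titles = ["Les femmes de Bruges", "Plan of Boston, 1835"]
title_vecs = np.array(embeddings.embed_documents(titles))

# Cosine similarity of the query against each candidate title; when the corpus has
# nothing on the actual subject, a lexically similar title can end up ranked first.
scores = title_vecs @ query_vec / (np.linalg.norm(title_vecs, axis=1) * np.linalg.norm(query_vec))
print(dict(zip(titles, scores.round(3))))
```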

## Query in Different Language (Italian)

### Example

- **Query:** Ciao, dove si trova boston? ("Hello, where is Boston?")

  **Response:** Boston si trova negli Stati Uniti, nello stato del Massachusetts ("Boston is located in the United States, in the state of Massachusetts.") / Sorry, I was unable to find any documents relevant to your query.

  **Source:** The sources are about Boston, but not the same ones returned for the equivalent English query / The sources are about Italy but not related to Boston itself (e.g., Milan or Rome).

### Notes
Working with another language makes it difficult to receive the same answer consistently. Sometimes the system provides the correct response (identical to the English version but translated into Italian), and sometimes it returns the default message: "Sorry, I was unable to find any documents relevant to your query." Additionally, the sources retrieved vary from case to case, and the accuracy of the answer seems to depend on the quality and relevance of those sources. It is notable that an Italian query can surface sources about Italy rather than about the subject of the query itself.

## Final Disclaimer
This test was conducted on a partial database. The inability of the RAG system to find specific information may be due to the absence of relevant data in the current product configuration, even though such information might exist in the complete database.
211 changes: 211 additions & 0 deletions RAG.py
import getpass
import os
import time
from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore
from langchain_huggingface import HuggingFaceEmbeddings
from dotenv import load_dotenv
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
import re
from langchain_core.documents import Document
from langchain_community.retrievers import BM25Retriever
import requests
from typing import Dict, Any, Optional, List, Tuple
import json
import logging

def retrieve(query: str, vectorstore: PineconeVectorStore, k: int = 1000) -> Tuple[List[Document], List[float]]:
    """Similarity-search the Pinecone vector store and return the top-k documents with their scores."""
    start = time.time()
    # pinecone_api_key = os.getenv("PINECONE_API_KEY")
    # pc = Pinecone(api_key=pinecone_api_key)

    # index = pc.Index(index_name)
    # vector_store = PineconeVectorStore(index=index, embedding=embeddings)
    results = vectorstore.similarity_search_with_score(
        query,
        k=k,
    )
    documents = []
    scores = []
    for res, score in results:
        # check to make sure response isn't too long for context window of 4o-mini
        if len(res.page_content) > 4000:
            res.page_content = res.page_content[:4000]
        documents.append(res)
        scores.append(score)
    logging.info(f"Finished Retrieval: {time.time() - start}")
    return documents, scores

def safe_get_json(url: str) -> Optional[Dict]:
    """Safely fetch and parse JSON from a URL."""
    print("Fetching JSON")
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.json()
    except Exception as e:
        logging.error(f"Error fetching from {url}: {str(e)}")
        return None

def extract_text_from_json(json_data: Dict) -> str:
    """Extract text content from JSON response."""
    if not json_data:
        return ""

    text_parts = []

    # Handle direct text fields
    text_fields = ["title_info_primary_tsi","abstract_tsi","subject_geographic_sim","genre_basic_ssim","genre_specific_ssim","date_tsim"]
    for field in text_fields:
        if field in json_data['data']['attributes'] and json_data['data']['attributes'][field]:
            # print(json_data[field])
            text_parts.append(str(json_data['data']['attributes'][field]))

    return " ".join(text_parts) if text_parts else "No content available"

def rerank(documents: List[Document], query: str) -> List[Document]:
    """Fetch fuller metadata for each document from Digital Commonwealth, then rerank the results with BM25."""
    start = time.time()
    if not documents:
        return []

    full_docs = []
    meta_start = time.time()
    for doc in documents:
        if not doc.metadata.get('source'):
            continue

        url = f"https://www.digitalcommonwealth.org/search/{doc.metadata['source']}"
        json_data = safe_get_json(f"{url}.json")

        if json_data:
            text_content = extract_text_from_json(json_data)
            if text_content:  # Only add documents with actual content
                full_docs.append(Document(page_content=text_content, metadata={"source":doc.metadata['source'],"field":doc.metadata['field'],"URL":url}))
    logging.info(f"Took {time.time()-meta_start} seconds to retrieve all metadata")
    # If no valid documents were processed, return empty list
    if not full_docs:
        return []

    # Create BM25 retriever with the processed documents
    reranker = BM25Retriever.from_documents(full_docs, k=min(10, len(full_docs)))
    reranked_docs = reranker.invoke(query)
    logging.info(f"Finished reranking: {time.time()-start}")
    return reranked_docs

def parse_xml_and_query(query: str, xml_string: str) -> str:
    """parse xml and return rephrased query"""
    if not xml_string:
        return "No response generated."

    pattern = r"<(\w+)>(.*?)</\1>"
    matches = re.findall(pattern, xml_string, re.DOTALL)
    parsed_response = dict(matches)
    if parsed_response.get('VALID') == 'NO':
        return query
    return parsed_response.get('STATEMENT', query)


def parse_xml_and_check(xml_string: str) -> str:
    """Parse XML-style tags and handle validation."""
    if not xml_string:
        return "No response generated."

    pattern = r"<(\w+)>(.*?)</\1>"
    matches = re.findall(pattern, xml_string, re.DOTALL)
    parsed_response = dict(matches)

    if parsed_response.get('VALID') == 'NO':
        return "Sorry, I was unable to find any documents for your query.\n\n Here are some documents I found that might be relevant."

    return parsed_response.get('RESPONSE', "No response found in the output")

def RAG(llm: Any, query: str, vectorstore: PineconeVectorStore, top: int = 10, k: int = 100) -> Tuple[str, List[Document]]:
    """Main RAG function with improved error handling and validation."""
    start = time.time()
    try:

        # Query alignment is commented out; however, I have decided to leave it in for potential future use.

        # Retrieve initial documents using a rephrased query -- not working as intended currently, maybe would be better for data with more words.
        # query_template = PromptTemplate.from_template(
        #     """
        #     Your job is to think about a query and then generate a statement that only includes information from the query that would answer the query.
        #     You will be provided with a query in <QUERY></QUERY> tags.
        #     Then you will think about what kind of information the query is looking for between <REASONING></REASONING> tags.
        #     Then, based on the reasoning, you will generate a sample response to the query that only includes information from the query between <STATEMENT></STATEMENT> tags.
        #     Afterwards, you will determine and reason about whether or not the statement you generated only includes information from the original query and would answer the query between <DETERMINATION></DETERMINATION> tags.
        #     Finally, you will return a YES, or NO response between <VALID></VALID> tags based on whether or not you determined the statement to be valid.
        #     Let me provide you with an example:

        #     <QUERY>I would really like to learn more about Bermudan geography</QUERY>

        #     <REASONING>This query is interested in geography as it relates to Bermuda. Some things they might be interested in are Bermudan climate, towns, cities, and geography</REASONING>

        #     <STATEMENT>Bermuda's Climate is [blank]. Some of Bermuda's cities and towns are [blank]. Other points of interest about Bermuda's geography are [blank].</STATEMENT>

        #     <DETERMINATION>The query originally only mentions bermuda and geography. The answers do not provide any false information, instead replacing meaningful responses with a placeholder [blank]. If it had hallucinated, it would not be valid. Because the statements do not hallucinate anything, this is a valid statement.</DETERMINATION>

        #     <VALID>YES</VALID>

        #     Now it's your turn! Remember not to hallucinate:

        #     <QUERY>{query}</QUERY>
        #     """
        # )
        # query_prompt = query_template.invoke({"query": query})
        # query_response = llm.invoke(query_prompt)
        # new_query = parse_xml_and_query(query=query, xml_string=query_response.content)
        logging.info(f"\n---\nQUERY: {query}")

        retrieved, _ = retrieve(query=query, vectorstore=vectorstore, k=k)
        if not retrieved:
            return "No documents found for your query.", []

        # Rerank documents
        reranked = rerank(documents=retrieved, query=query)
        if not reranked:
            return "Unable to process the retrieved documents.", []

        # Prepare context from reranked documents
        context = "\n\n".join(doc.page_content for doc in reranked[:top] if doc.page_content)
        if not context.strip():
            return "No relevant content found in the documents.", []
        # change for the sake of another commit
        # Prepare prompt
        answer_template = PromptTemplate.from_template(
            """Pretend you are a professional librarian. Please Summarize The Following Context as though you had retrieved it for a patron:
Context:{context}
Make sure to answer in the following format
First, reason about the answer between <REASONING></REASONING> headers,
based on the context determine if there is sufficient material for answering the exact question,
return either <VALID>YES</VALID> or <VALID>NO</VALID>
then return a response between <RESPONSE></RESPONSE> headers:
Here is an example
<EXAMPLE>
<QUERY>Are pineapples a good fuel for cars?</QUERY>
<CONTEXT>Cars use gasoline for fuel. Some cars use electricity for fuel.Tesla stock has increased by 10 percent over the last quarter.</CONTEXT>
<REASONING>Based on the context pineapples have not been explored as a fuel for cars. The context discusses gasoline, electricity, and tesla stock, therefore it is not relevant to the query about pineapples for fuel</REASONING>
<VALID>NO</VALID>
<RESPONSE>Pineapples are not a good fuel for cars, however with further research they might be</RESPONSE>
</EXAMPLE>
Now it's your turn
<QUERY>
{query}
</QUERY>"""
        )

        # Generate response
        ans_prompt = answer_template.invoke({"context": context, "query": query})
        response = llm.invoke(ans_prompt)

        # Parse and return response
        parsed = parse_xml_and_check(response.content)
        logging.info(f"RESPONSE: {parsed}\nRETRIEVED: {reranked}")
        logging.info(f"RAG Finished: {time.time()-start}\n---\n")
        return parsed, reranked

    except Exception as e:
        logging.error(f"Error in RAG function: {str(e)}")
        return f"An error occurred while processing your query: {str(e)}", []