
[Usage]: RAG system #5502

Closed
FaDavid98 opened this issue Jun 13, 2024 · 9 comments
Labels
usage How to use vllm

Comments

@FaDavid98

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

I want to run a RAG system using vLLM. Is this supported or not? I want to use vLLM to serve an LLM, pass the relevant documents to it, and get the answer back. I can't figure out how to define a prompt template with the context and the question. Can someone help me with this?

@FaDavid98 FaDavid98 added the usage How to use vllm label Jun 13, 2024
@DarkLight1337
Member

DarkLight1337 commented Jun 13, 2024

Assuming that the retrieval step has been done externally, you can apply your own template to the result and pass the formatted string to the model via LLM.generate().
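
For reference, a minimal sketch of that flow outside any framework (the model name, documents, and template below are illustrative placeholders, not something this thread prescribes):

```python
from vllm import LLM, SamplingParams

# Placeholder model and retrieved documents -- substitute your own.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

retrieved_docs = [
    "vLLM uses PagedAttention to manage the KV cache.",
    "Continuous batching improves serving throughput.",
]
question = "How does vLLM manage the KV cache?"

# Apply your own template to the retrieved context, then pass the final string.
prompt = (
    "Answer the question using only the context below.\n\n"
    "CONTEXT:\n" + "\n".join(retrieved_docs) + "\n\n"
    f"QUESTION:\n{question}\n\nANSWER:"
)

outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```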

@FaDavid98
Author

FaDavid98 commented Jun 13, 2024

So I should get the relevant context from the retriever and pass it to LLM.generate()? Then where should I pass the question? And how should I define a prompt and pass it to the LLM?

@DarkLight1337
Member

> So I should get the relevant context from the retriever and pass it to LLM.generate()? Then where should I pass the question?

You should use a library that's dedicated to RAG to perform the retrieval based on the question. vLLM can only handle the model generation part.
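
As a sketch of that split, here is the retrieval half only, using LangChain with FAISS as one possible retrieval stack (the texts and embedding model are illustrative); vLLM plays no part in this step:

```python
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_community.vectorstores import FAISS

# Tiny illustrative corpus; in practice this comes from your own documents.
texts = [
    "vLLM serves LLMs with PagedAttention and continuous batching.",
    "FAISS performs similarity search over dense embedding vectors.",
]
embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-large-en-v1.5")
store = FAISS.from_texts(texts, embeddings)
retriever = store.as_retriever(search_kwargs={"k": 1})

docs = retriever.get_relevant_documents("What does vLLM use for attention?")
context = "\n".join(d.page_content for d in docs)  # this string feeds your prompt template
```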

@FaDavid98
Author

FaDavid98 commented Jun 13, 2024

```python
from langchain.llms import VLLM
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from dotenv import load_dotenv
import time
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from fastapi import HTTPException
from pydantic import BaseModel


def create_query_engine(vectorstore_path):
    embeddings = FastEmbedEmbeddings(model_name='BAAI/bge-large-en-v1.5')
    new_vectorstore = FAISS.load_local(vectorstore_path, embeddings, allow_dangerous_deserialization=True)

    llm = VLLM(
        model="mistralai/Mistral-7B-v0.1",
        gpu_memory_utilization=0.95,
        tensor_parallel_size=1,  # number of GPUs to use
        trust_remote_code=True,
        max_new_tokens=50,
        top_k=10,
        top_p=0.95,
        temperature=0.8,
        vllm_kwargs={
            "swap_space": 1,
            "gpu_memory_utilization": 0.95,
            "max_model_len": 16384,
            "enforce_eager": True,
        },
    )

    system_prompt = """As a support engineer, your role is to leverage the information
in the context provided. Your task is to respond to queries based strictly
on the information available in the provided context. Do not create new
information under any circumstances. Refrain from repeating yourself.
Extract your response solely from the context mentioned above.
If the context does not contain relevant information for the question,
respond with 'I don't know because it is not mentioned in the context.'
Answer for the question shortly in 1 sentence.

CONTEXT:
{context}

QUESTION:
{question}"""

    PROMPT = PromptTemplate(template=system_prompt, input_variables=["context", "question"])

    chain_type_kwargs = {"prompt": PROMPT}
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=new_vectorstore.as_retriever(search_kwargs={"k": 3}),
        chain_type_kwargs=chain_type_kwargs
    )

    return qa, new_vectorstore


def get_relevant_docs(question, retriever):
    relevant_docs = retriever.get_relevant_documents(question)
    return relevant_docs


def get_query_response(query: str, query_engine):
    try:
        response = query_engine.run(query)
        return {"response": response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


def chat():
    # Load and preprocess documents
    vectorstore_path = "lcal_fifa"

    # Create query engine
    query_engine, new_vectorstore = create_query_engine(vectorstore_path)

    # Chat loop
    print("Start chatting with the AI. Type 'exit' to stop.")
    while True:
        query = input("You: ")
        if query.lower() in ["exit", "quit"]:
            print("Exiting chat. Goodbye!")
            break

        relevant_docs = get_relevant_docs(query, new_vectorstore.as_retriever(search_kwargs={"k": 5}))
        print("\nRelevant documents:")
        for doc in relevant_docs:
            print(f"- {doc.page_content[:200]}...")  # Print the first 200 characters of each relevant document

        response = get_query_response(query, query_engine)
        print("\nAI:", response["response"])


if __name__ == '__main__':
    chat()
```
This is the code I tried to use, but when I tested it, it seems the model doesn't understand my prompt. How should I format it?

@DarkLight1337
Member

DarkLight1337 commented Jun 13, 2024

It appears that you're using the LangChain integration. In that case, you should probably ask over at their repo since you're asking about how to use the integration itself rather than how to use vLLM directly.

(The LangChain integration is not part of this repo)

@FaDavid98
Author

I know it is not part of this repo, but if I could replace that LangChain part with this repo and it would work, that would be totally perfect.

@FaDavid98
Author

So can it be done with vLLM or not?

@DarkLight1337
Member

> So can it be done with vLLM or not?

As mentioned before, vLLM can only handle the model generation part. If you're not using the LangChain integration, then you have to write your own code to link together the different components of RAG.
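
A hedged sketch of what "linking the components yourself" might look like without LangChain, combining the two pieces sketched earlier (the embedding model, corpus, and template are placeholders):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from vllm import LLM, SamplingParams

# Illustrative corpus and embedding model -- replace with your own.
corpus = [
    "vLLM uses PagedAttention to manage the KV cache.",
    "FAISS performs similarity search over dense embedding vectors.",
]
encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
doc_vecs = encoder.encode(corpus, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product on normalized vectors = cosine
index.add(np.asarray(doc_vecs, dtype="float32"))

llm = LLM(model="mistralai/Mistral-7B-v0.1")
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

def answer(question: str, k: int = 2) -> str:
    # Retrieval: handled by the embedding model and FAISS, not by vLLM.
    q_vec = encoder.encode([question], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q_vec, dtype="float32"), k)
    context = "\n\n".join(corpus[i] for i in ids[0])
    # Generation: handled by vLLM.
    prompt = f"CONTEXT:\n{context}\n\nQUESTION:\n{question}\n\nANSWER:"
    return llm.generate([prompt], sampling)[0].outputs[0].text

print(answer("How does vLLM manage the KV cache?"))
```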

@DarkLight1337 closed this as not planned Jun 19, 2024
@devjarvis-coder

> _[quotes the code and question from @FaDavid98's comment above]_

I also want this. If you have an answer, please share it.
