
[Usage]: RAG system #5502

Closed
FaDavid98 opened this issue Jun 13, 2024 · 9 comments
Labels
usage How to use vllm

Comments

@FaDavid98

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

I want to run a RAG system using vLLM. Is this supported or not? I want to use vLLM to serve an LLM, pass the relevant documents to it, and get the answer back. I can't figure out how to define a prompt template with the context and the question. Can someone help me with this?

@FaDavid98 FaDavid98 added the usage How to use vllm label Jun 13, 2024
@DarkLight1337
Member

DarkLight1337 commented Jun 13, 2024

Assuming that the retrieval step has been done externally, you can apply your own template to the result and pass the formatted string to the model via LLM.generate().
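
For reference, a minimal sketch of that flow outside any framework (the model name, documents, and template below are illustrative placeholders, not something this thread prescribes):

```python
from vllm import LLM, SamplingParams

# Placeholder model and retrieved documents -- substitute your own.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

retrieved_docs = [
    "vLLM uses PagedAttention to manage the KV cache.",
    "Continuous batching improves serving throughput.",
]
question = "How does vLLM manage the KV cache?"

# Apply your own template to the retrieved context, then pass the final string.
prompt = (
    "Answer the question using only the context below.\n\n"
    "CONTEXT:\n" + "\n".join(retrieved_docs) + "\n\n"
    f"QUESTION:\n{question}\n\nANSWER:"
)

outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```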

@FaDavid98
Author

FaDavid98 commented Jun 13, 2024

So I should get the relevant context from the retriever and pass it to LLM.generate()? Then where should I pass the question? And how should I define a prompt and pass it to the LLM?

@DarkLight1337
Member

> So I should get the relevant context from the retriever and pass it to LLM.generate()? Then where should I pass the question?

You should use a library that's dedicated to RAG to perform the retrieval based on the question. vLLM can only handle the model generation part.
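
As a sketch of that split, here is the retrieval half only, using LangChain with FAISS as one possible retrieval stack (the texts and embedding model are illustrative); vLLM plays no part in this step:

```python
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_community.vectorstores import FAISS

# Tiny illustrative corpus; in practice this comes from your own documents.
texts = [
    "vLLM serves LLMs with PagedAttention and continuous batching.",
    "FAISS performs similarity search over dense embedding vectors.",
]
embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-large-en-v1.5")
store = FAISS.from_texts(texts, embeddings)
retriever = store.as_retriever(search_kwargs={"k": 1})

docs = retriever.get_relevant_documents("What does vLLM use for attention?")
context = "\n".join(d.page_content for d in docs)  # this string feeds your prompt template
```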

@FaDavid98
Author

FaDavid98 commented Jun 13, 2024

```python
from langchain.llms import VLLM
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from dotenv import load_dotenv
import time
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from fastapi import HTTPException
from pydantic import BaseModel


def create_query_engine(vectorstore_path):
    embeddings = FastEmbedEmbeddings(model_name='BAAI/bge-large-en-v1.5')
    new_vectorstore = FAISS.load_local(vectorstore_path, embeddings, allow_dangerous_deserialization=True)

    llm = VLLM(
        model="mistralai/Mistral-7B-v0.1",
        gpu_memory_utilization=0.95,
        tensor_parallel_size=1,  # number of GPUs to use
        trust_remote_code=True,
        max_new_tokens=50,
        top_k=10,
        top_p=0.95,
        temperature=0.8,
        vllm_kwargs={
            "swap_space": 1,
            "gpu_memory_utilization": 0.95,
            "max_model_len": 16384,
            "enforce_eager": True,
        },
    )

    system_prompt = """As a support engineer, your role is to leverage the information
in the context provided. Your task is to respond to queries based strictly
on the information available in the provided context. Do not create new
information under any circumstances. Refrain from repeating yourself.
Extract your response solely from the context mentioned above.
If the context does not contain relevant information for the question,
respond with 'I don't know because it is not mentioned in the context.'
Answer for the question shortly in 1 sentence.

CONTEXT:
{context}

QUESTION:
{question}"""

    PROMPT = PromptTemplate(template=system_prompt, input_variables=["context", "question"])

    chain_type_kwargs = {"prompt": PROMPT}
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=new_vectorstore.as_retriever(search_kwargs={"k": 3}),
        chain_type_kwargs=chain_type_kwargs
    )

    return qa, new_vectorstore


def get_relevant_docs(question, retriever):
    relevant_docs = retriever.get_relevant_documents(question)
    return relevant_docs


def get_query_response(query: str, query_engine):
    try:
        response = query_engine.run(query)
        return {"response": response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


def chat():
    # Load and preprocess documents
    vectorstore_path = "lcal_fifa"

    # Create query engine
    query_engine, new_vectorstore = create_query_engine(vectorstore_path)

    # Chat loop
    print("Start chatting with the AI. Type 'exit' to stop.")
    while True:
        query = input("You: ")
        if query.lower() in ["exit", "quit"]:
            print("Exiting chat. Goodbye!")
            break

        relevant_docs = get_relevant_docs(query, new_vectorstore.as_retriever(search_kwargs={"k": 5}))
        print("\nRelevant documents:")
        for doc in relevant_docs:
            print(f"- {doc.page_content[:200]}...")  # Print the first 200 characters of each relevant document

        response = get_query_response(query, query_engine)
        print("\nAI:", response["response"])


if __name__ == '__main__':
    chat()
```
This is the code I tried to use, but when I tested it, it seems the model doesn't understand my prompt. How should I format it?

@DarkLight1337
Member

DarkLight1337 commented Jun 13, 2024

It appears that you're using the LangChain integration. In that case, you should probably ask over at their repo since you're asking about how to use the integration itself rather than how to use vLLM directly.

(The LangChain integration is not part of this repo)

@FaDavid98
Author

I know it is not part of this repo, but if I could replace that LangChain part with this repo and it would work, that would be totally perfect.

@FaDavid98
Author

So can it be done with vLLM or not?

@DarkLight1337
Member

> So can it be done with vLLM or not?

As mentioned before, vLLM can only handle the model generation part. If you're not using the LangChain integration, then you have to write your own code to link together the different components of RAG.
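
A hedged sketch of what "linking the components yourself" might look like without LangChain, combining the two pieces sketched earlier (the embedding model, corpus, and template are placeholders):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from vllm import LLM, SamplingParams

# Illustrative corpus and embedding model -- replace with your own.
corpus = [
    "vLLM uses PagedAttention to manage the KV cache.",
    "FAISS performs similarity search over dense embedding vectors.",
]
encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
doc_vecs = encoder.encode(corpus, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product on normalized vectors = cosine
index.add(np.asarray(doc_vecs, dtype="float32"))

llm = LLM(model="mistralai/Mistral-7B-v0.1")
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

def answer(question: str, k: int = 2) -> str:
    # Retrieval: handled by the embedding model and FAISS, not by vLLM.
    q_vec = encoder.encode([question], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q_vec, dtype="float32"), k)
    context = "\n\n".join(corpus[i] for i in ids[0])
    # Generation: handled by vLLM.
    prompt = f"CONTEXT:\n{context}\n\nQUESTION:\n{question}\n\nANSWER:"
    return llm.generate([prompt], sampling)[0].outputs[0].text

print(answer("How does vLLM manage the KV cache?"))
```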

@DarkLight1337 closed this as not planned Jun 19, 2024
@devjarvis-coder

> _[quotes the code and question from @FaDavid98's comment above]_

I also want this. If you have an answer, please share it.
