LangChain

LangChain is a framework designed to streamline the development of applications powered by large language models (LLMs). It provides the tools and abstractions to build applications for tasks such as natural language processing, question answering, information retrieval, and generative AI. LangChain emphasizes modularity, making it easier to integrate the various components like models, prompts, tools, and memory to create robust applications.

Installation

Further installation instructions can be found in the LangChain documentation.

Create a requirements.txt file to list all project dependencies

Create a requirements.txt file in your project directory that lists every library the project needs, including LangChain, FAISS, and the Hugging Face integrations.
Save the following content in it

streamlit
jupyter
langchain
langchain-core
langchain-community
langchain-huggingface
sentence-transformers
langchain-text-splitters
langchain-mistralai
faiss-cpu
mistralai
pymilvus
pydantic==2.5.2
yake
pandas
numpy

Copy the requirements.txt into the Docker container

To make the requirements.txt file available inside the Docker container, add the following instruction to your Dockerfile

COPY requirements.txt /app/requirements.txt

Install Python dependencies listed in the requirements.txt

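After copying the requirements.txt file into the container, add the following instruction to the Dockerfile to install the listed dependencies. The path matches the COPY destination, so the instruction does not depend on the working directory

RUN mamba install --yes --file /app/requirements.txt && mamba clean --all -f -y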

LangChain installation using pip

Install LangChain using pip

pip install langchain

Install additional dependencies using pip

pip install transformers
pip install torch
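
After installing, it can help to confirm that the packages are importable before running the application. The following is a minimal sketch using only the Python standard library. The helper name missing_packages is hypothetical, and the names passed to it are import names (for example langchain_core), which can differ from the pip package names.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Import names assumed from the install commands; adjust to your project
    required = ["langchain", "langchain_core", "transformers", "torch"]
    print("Missing:", missing_packages(required))
```

An empty list means every module was found. Any name that is printed needs to be installed in the active environment.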

Configuration

1. Import the required libraries

import os  # needed for os.environ in the environment setup

from dotenv import load_dotenv
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.documents import Document  # current location of Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.retrievers import BaseRetriever  # base class of the custom retriever
from langchain_mistralai.chat_models import ChatMistralAI
from langchain_milvus import Milvus
from langchain_community.document_loaders import WebBaseLoader, RecursiveUrlLoader
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings

2. Set up and load the environment

load_dotenv()
MISTRAL_API_KEY = os.environ.get("MISTRAL_API_KEY")
MILVUS_URI = "/app/milvus/milvus_vector.db"
MODEL_NAME = "sentence-transformers/all-MiniLM-L12-v2"
MAX_TEXT_LENGTH = 5000
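
Because os.environ.get returns None when a variable is missing, a misconfigured .env file can go unnoticed until the first API call. The sketch below fails early instead. The helper name require_env is hypothetical and not part of the project code.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise ValueError(f"Environment variable {name} is not set")
    return value
```

Assuming load_dotenv() has already run, the key could then be read with MISTRAL_API_KEY = require_env("MISTRAL_API_KEY").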

3. Document chaining

create_stuff_documents_chain combines the retrieved documents into a single context that is passed to the chat model together with the prompt

document_chain = create_stuff_documents_chain(chat_model, prompt)
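
Conceptually, "stuffing" means joining the text of every retrieved document into one context string and placing it in the prompt. The following sketch illustrates that idea in plain Python. It is not LangChain's implementation, and the Doc class, stuff_documents, and build_prompt are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str


def stuff_documents(docs, separator="\n\n"):
    """Join the text of every document into one context string."""
    return separator.join(doc.page_content for doc in docs)


def build_prompt(question, docs):
    """Place the question and the stuffed context into a single prompt string."""
    context = stuff_documents(docs)
    return f"<question>{question}</question>\n\n<context>{context}</context>"
```

Because every document ends up in one prompt, the total length of the retrieved chunks has to fit within the model's context window.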

4. Text splitting

LangChain's RecursiveCharacterTextSplitter splits large documents into smaller, overlapping chunks for processing

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # maximum characters per chunk
    chunk_overlap=200,    # characters shared between neighboring chunks
    is_separator_regex=False,
)
docs = text_splitter.split_documents(documents)
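
To see how chunk_size and chunk_overlap interact, the sketch below uses a simplified fixed-window splitter. This is an assumption made for illustration: RecursiveCharacterTextSplitter prefers to break on separators such as paragraphs and lines, so its chunk boundaries will differ. The function name chunk_text is hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows where neighbors share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between chunk start positions
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# prints ['abcd', 'cdef', 'efgh', 'ghij']
```

In this simplified model, chunk_size=500 with chunk_overlap=200 starts a new chunk every 300 characters. A larger overlap preserves more context across boundaries at the cost of storing more chunks.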

Implementation

Constructing the Chat Prompt

def create_prompt():
    """
    Create the chat prompt template for the RAG model

    Returns:
        ChatPromptTemplate: The chat prompt template for the RAG model
    """
    # Define the prompt template
    PROMPT_TEMPLATE = """
    You are an AI assistant that provides answers strictly based on the provided context. Adhere to these guidelines:
     - Only answer questions based on the content within the <context> tags.
     - If the <context> does not contain information related to the question, respond only with: "I don't have enough information to answer this question."
     - For unclear questions or questions that lack specific context, request clarification from the user.
     - Provide specific, concise answers. Where relevant information includes statistics or numbers, include them in the response.
     - Avoid adding any information, assumption, or external knowledge. Answer accurately within the scope of the given context and do not guess.
     - If information is missing, respond only with: "I don't have enough information to answer this question."
    """

    prompt = ChatPromptTemplate.from_messages([
        ("system", PROMPT_TEMPLATE),
        ("human", "<question>{input}</question>\n\n<context>{context}</context>"),
    ])
    print("Prompt Created")

    return prompt
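
The prompt is a list of (role, template) pairs whose placeholders, {input} and {context}, are filled in when the chain runs. The sketch below imitates that substitution with plain string formatting. It is a rough illustration rather than the ChatPromptTemplate implementation, and render_messages and the sample values are hypothetical.

```python
def render_messages(message_templates, **variables):
    """Fill each (role, template) pair with the given variables."""
    return [(role, template.format(**variables)) for role, template in message_templates]


templates = [
    ("system", "Answer only from the text inside the <context> tags."),
    ("human", "<question>{input}</question>\n\n<context>{context}</context>"),
]

messages = render_messages(
    templates,
    input="What is FAISS?",
    context="FAISS is a similarity search library.",
)
for role, text in messages:
    print(f"{role}: {text}")
```

As with str.format, literal curly braces inside a template need to be doubled so they are not treated as placeholders.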

Document Processing

Documents are preprocessed with LangChain's RecursiveCharacterTextSplitter so that each chunk is small enough for retrieval

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=200)
docs = text_splitter.split_documents(documents)

Retriever Logic

A custom retriever, ScoreThresholdRetriever, subclasses LangChain's BaseRetriever to extend document retrieval with filtering by similarity score

class ScoreThresholdRetriever(BaseRetriever):
    ...  # class body not shown here
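
The body of ScoreThresholdRetriever is not shown here, so the following is only a sketch of the filtering step such a retriever might perform, written in plain Python. The function name filter_by_score, the default k, and the assumption that a higher score means a closer match are all hypothetical.

```python
def filter_by_score(scored_docs, threshold, k=4):
    """Keep the k best documents whose similarity score is at least `threshold`.

    `scored_docs` is a list of (document, score) pairs where a higher score
    means a closer match (an assumption; distance metrics work the other way).
    """
    kept = [(doc, score) for doc, score in scored_docs if score >= threshold]
    # Best matches first
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in kept[:k]]
```

Returning an empty list when nothing passes the threshold lets the prompt's "I don't have enough information" rule apply instead of answering from weak matches.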

Document chain execution

Retrieved documents are passed into the document chain to generate context-aware responses

response = document_chain.invoke({
    "input": query,
    "context": retrieved_documents,
})

Usage

Splitting text

LangChain's RecursiveCharacterTextSplitter automatically handles document splitting

split_docs = split_documents(documents)

Retriever with Custom Logic

Use the ScoreThresholdRetriever to retrieve documents based on similarity scores

retrieved_docs = retriever.get_related_documents(query_embedding, collection)

Troubleshooting

Schema Mismatch: Make sure that the vector dimensionality in the Milvus collection matches the vectors being inserted.
MistralAI API Key Errors: Verify that the MistralAI API key is set correctly and has the necessary permissions.
Document loading issues:

Document loading and embedding can fail if the file path is incorrect or inaccessible, for example with document_path = "data/textbook".
To troubleshoot, verify that the data/textbook directory exists and is readable, and that its files are in a format the loader supports. Listing the directory is a quick first check.

import os

document_path = "data/textbook"
print(os.listdir(document_path))

Environment variables not loaded correctly:

The Mistral API key may fail to load if the .env file is missing or misconfigured. In that case os.getenv("MISTRAL_API_KEY") returns None, which leads to a ValueError when the key is used.
To troubleshoot, verify that the .env file exists and contains the correct API key, confirm its location relative to the script's working directory, and check the file permissions. For debugging, check whether the key is loaded without printing the secret itself.

print(f"API key loaded: {os.getenv('MISTRAL_API_KEY') is not None}")

General Error Logging: To handle any unforeseen errors in the workflow, wrap all critical sections with a try-except block.
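
One way to apply this advice is to run each critical step through a small wrapper that logs failures with their traceback. This is a minimal sketch using the standard logging module. The names run_step and rag_pipeline are hypothetical, and returning None on failure is a design choice made for this example.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag_pipeline")  # hypothetical logger name


def run_step(name, func, *args, **kwargs):
    """Run one pipeline step, logging the full traceback if it fails."""
    try:
        result = func(*args, **kwargs)
    except Exception:
        # logger.exception records the message together with the traceback
        logger.exception("Step %r failed", name)
        return None
    logger.info("Step %r succeeded", name)
    return result
```

A step such as splitting could then be called as run_step("split", text_splitter.split_documents, documents). If a failure should stop the workflow instead of being swallowed, re-raise the exception after logging it.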
