diff --git a/graph_rag/evaluation/README.MD b/graph_rag/evaluation/README.MD
new file mode 100644
index 0000000..32a4553
--- /dev/null
+++ b/graph_rag/evaluation/README.MD
@@ -0,0 +1,196 @@
+
+# Knowledge Graph Evaluation
+
+This module provides methods to evaluate the performance of GraphRag. The following integrations are available for evaluation:
+
+- **Llama-Index Evaluation Pack**
+- **Ragas Evaluation Pack**
+
+Additionally, this module includes scripts for creating custom test datasets to benchmark and evaluate GraphRag.
+
+## Getting Started
+This section demonstrates how to use the functions provided in the module.
+
+---
+
+### 1. QA Generation and Critique
+
+This module offers tools to generate question-answer (QA) pairs from input documents using a language model and to critique them on criteria such as groundedness, relevance, and standalone quality.
+
+> #### Generate and Critique QA Pairs
+
+To use this module, follow these steps:
+
+#### 1. Generate QA Pairs
+
+First, we prepare our dataset for generating QA pairs. In this example, we use the Keras-IO documentation and Llama-Index's `SimpleDirectoryReader` to obtain `Document` objects.
+
+```python
+!git clone https://github.com/keras-team/keras-io.git
+
+from llama_index.core import Document, SimpleDirectoryReader
+from llama_index.core.node_parser import SentenceSplitter
+
+def get_data(input_dir="path/to/keras-io/templates"):
+    reader = SimpleDirectoryReader(
+        input_dir,
+        recursive=True,
+        exclude=["path/to/keras-io/templates/examples"]
+    )
+    docs = reader.load_data()
+
+    splitter = SentenceSplitter(
+        chunk_size=300,
+        chunk_overlap=20,
+    )
+    nodes = splitter.get_nodes_from_documents(docs)
+    documents = [Document(text=node.text, metadata=node.metadata) for node in nodes]
+
+    return documents
+
+# Load the documents
+documents = get_data()
+```
+
+Use the `qa_generator` function to generate QA pairs from your input documents.
+
+```python
+from evaluation.ragas_evaluation.QA_graphrag_testdataset import qa_generator
+
+N_GENERATIONS = 20
+
+# Generate the QA pairs
+qa_pairs = qa_generator(documents, N_GENERATIONS)
+```
+
+#### 2. Critique the Generated QA Pairs
+
+Once you have generated the QA pairs, critique them using the `critique_qa` function.
+
+```python
+from evaluation.ragas_evaluation.QA_graphrag_testdataset import critique_qa
+
+# Critique the generated QA pairs
+critiqued_qa_pairs = critique_qa(qa_pairs)
+
+# The critiqued pairs include scores and rationales for groundedness, relevance, and standalone quality
+```
+
+---
+### 2. Evaluating Your Knowledge Graph with Llama-Index Evaluator Pack
+
+This section demonstrates how to evaluate the performance of your query engine using the Llama-Index RAG evaluator pack.
+
+> #### Evaluate Your Knowledge Graph with llama-index
+
+To evaluate your query engine, first download a labeled dataset:
+```shell
+llamaindex-cli download-llamadataset PaulGrahamEssayDataset --download-dir ./data
+```
+
+Then run the evaluation:
+
+```python
+import asyncio
+
+from evaluation.evaluation_llama_index import evaluate
+
+# Path to your labeled RAG dataset
+RAG_DATASET = "./data/rag_dataset.json"
+
+# Your query engine instance
+from graph_rag.graph_retrieval.graph_retrieval import get_index_from_pickle, get_query_engine
+
+index = get_index_from_pickle("path/to/graphIndex.pkl")
+query_engine = get_query_engine(index)
+
+# Evaluate the dataset; the judge LLM and embedding model are passed by name
+# (in a notebook, `await evaluate(...)` directly instead of using asyncio.run)
+evaluation_results = asyncio.run(
+    evaluate(
+        RAG_DATASET,
+        query_engine,
+        ollama_model="llama2",
+        embedd_model="microsoft/codebert-base",
+    )
+)
+
+# Review the results
+print(evaluation_results)
+```
+| Metrics | RAG | Base RAG |
+|------------------------------|------------|-----------|
+| **Mean Correctness Score** | 3.340909 | 0.934 |
+| **Mean Relevancy Score** | 0.750000 | 4.239 |
+| **Mean Faithfulness Score** | 0.386364 | 0.977 |
+| **Mean Context Similarity Score** | 0.948765 | 0.977 |
+
+This example shows how to quickly evaluate your query engine's performance using the Llama-Index RAG evaluator pack.
+
+---
+### 3. Evaluating Your Knowledge Graph with Ragas backend
+
+You can also evaluate your query engine with the Ragas backend provided by this module.
+
+> #### Load and Evaluate Your Dataset with ragas
+
+Use the `load_test_dataset` function to load your dataset, then evaluate it with the `evaluate` function. This handles all necessary steps, including batching the data.
+
+```python
+from evaluation.ragas_evaluation.evaluation_ragas import load_test_dataset, evaluate
+
+# Step 1: Load the dataset from a pickle file
+dataset_path = "/content/keras_docs_embedded.pkl"
+test_dataset = load_test_dataset(dataset_path)
+```
+
+> **Note:** `test_dataset` is a list of test records (dictionaries with `context`, `question`, and `answer` fields), such as those produced by the QA generation step above.
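+
+As a quick sanity check (purely illustrative, assuming the records follow the QA-generation schema above), you can inspect one record before running the evaluation:
+
+```python
+# Peek at the first record to confirm the expected fields are present
+sample = test_dataset[0]
+print(sample["question"])
+print(sample["answer"])
+print(sample["context"][:200])  # first 200 characters of the source context
+```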
+
+```python
+# Step 2: Define the language model and embedding
+from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+from llama_index.llms.ollama import Ollama
+
+llm = Ollama(base_url="http://localhost:11434", model="codellama")
+embedding = HuggingFaceEmbedding(model_name="microsoft/codebert-base")
+
+# Step 3: Specify the metrics for evaluation
+from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
+
+metrics = [faithfulness, answer_relevancy, context_precision, context_recall]
+
+# Step 4: Load the query engine (Llama-Index)
+from graph_rag.graph_retrieval.graph_retrieval import get_index_from_pickle, get_query_engine
+
+index = get_index_from_pickle("path/to/graphIndex.pkl")
+query_engine = get_query_engine(index)
+
+# Step 5: Evaluate the dataset
+evaluation_results = evaluate(
+    query_engine=query_engine,
+    dataset=test_dataset,
+    llm=llm,
+    embeddings=embedding,
+    metrics=metrics,
+    # Default batch size is 4
+)
+```
+
+**Output:**
+```python
+{'faithfulness': 0.0333, 'answer_relevancy': 0.9834, 'context_precision': 0.2000, 'context_recall': 0.8048}
+```
+
+```python
+# `evaluate` returns a pandas DataFrame with per-question results
+evaluation_results.to_csv("results.csv", index=False)
+```
+---
+**Detailed Result:**
+
+| question | contexts | answer | ground_truth | faithfulness | answer_relevancy | context_precision | context_recall |
+|----------|----------|--------|--------------|--------------|------------------|-------------------|----------------|
+| What is mixed precision in computing? | [Examples GPT-2 text generation Parameter…] | Mixed precision is a technique used to improve… | A combination of different numerical precision… | 0.166667 | 0.981859 | 0.0 | 0.666667 |
+| What is the title of the guide discussed in th... | [Available guides… Hyperparameter T…] | The title of the guide discussed in the given… | How to distribute training | 0.000000 | 1.000000 | 0.0 | 1.000000 |
+| What is Keras 3? | [No relationships found.] | Keras 3 is a new version of the popular deep l… | A deep learning framework that works with Tensor… | 0.000000 | 0.974711 | 0.0 | 0.500000 |
+| What was the percentage boost in StableDiffusion... | [A first example: A MNIST convnet…] | The percentage boost in StableDiffusion traini… | Over 150% | 0.000000 | 0.970565 | 1.0 | 1.000000 |
+| What are some examples of pretrained models av... | [No relationships found.] | Some examples of pre-trained models available… | BERT, OPT, Whisper, T5, StableDiffusion, YOLOv8… | 0.000000 | 0.989769 | 0.0 | 0.857143 |
diff --git a/graph_rag/evaluation/evaluation_llama_index.py b/graph_rag/evaluation/evaluation_llama_index.py
new file mode 100644
index 0000000..31e4e54
--- /dev/null
+++ b/graph_rag/evaluation/evaluation_llama_index.py
@@ -0,0 +1,49 @@
+"""
+This script evaluates a labeled RAG dataset with the Llama-Index RagEvaluatorPack, which benchmarks a query engine
+against labeled data using LLMs and embeddings.
+
+Functions:
+- evaluate: Evaluates the query engine using a labeled RAG dataset and specified models for both the LLM and embeddings.
+"""
+
+from llama_index.core.llama_dataset import LabelledRagDataset
+from llama_index.core.llama_pack import download_llama_pack
+from llama_index.llms.ollama import Ollama
+from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+
+
+async def evaluate(
+    RAG_DATASET: str,
+    query_engine: object,
+    ollama_model: str = "llama3",
+    embedd_model: str = "microsoft/codebert-base",
+):
+    """
+    Evaluates a query engine against a labeled RAG dataset, using an LLM judge and an embedding model.
+
+    Note: this is a coroutine; call it with `await` or wrap it in `asyncio.run(...)`.
+
+    Args:
+        RAG_DATASET: Path to the JSON file containing the labeled RAG dataset.
+        query_engine: The query engine to evaluate.
+        ollama_model: The LLM model to use for evaluation (default: "llama3").
+        embedd_model: The Hugging Face embedding model to use for evaluation (default: "microsoft/codebert-base").
+
+    Returns:
+        A DataFrame containing the benchmarking results, including LLM calls and evaluations.
+    """
+
+    RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./rag_evaluator_pack")
+    rag_dataset = LabelledRagDataset.from_json(RAG_DATASET)
+    rag_evaluator_pack = RagEvaluatorPack(
+        rag_dataset=rag_dataset,
+        query_engine=query_engine,
+        judge_llm=Ollama(base_url="http://localhost:11434", model=ollama_model),
+        embed_model=HuggingFaceEmbedding(model_name=embedd_model),
+    )
+    benchmark_df = await rag_evaluator_pack.arun(
+        batch_size=5,  # batches the number of llm calls to make
+        sleep_time_in_seconds=1,  # seconds to sleep before making an api call
+    )
+    return benchmark_df
diff --git a/graph_rag/evaluation/ragas_evaluation/QA_graphrag_testdataset.py b/graph_rag/evaluation/ragas_evaluation/QA_graphrag_testdataset.py
new file mode 100644
index 0000000..f162e5c
--- /dev/null
+++ b/graph_rag/evaluation/ragas_evaluation/QA_graphrag_testdataset.py
@@ -0,0 +1,130 @@
+"""
+This script contains functions to generate question-answer pairs from input documents using a language model,
+and to critique them on criteria such as groundedness, relevance, and standalone quality.
+
+Functions:
+- get_response: Sends a request to a language model API to generate responses based on a provided prompt.
+- qa_generator: Generates a specified number of question-answer pairs from input documents.
+- critique_qa: Critiques the generated QA pairs based on groundedness, relevance, and standalone quality.
+"""
+
+import random
+
+import pandas as pd
+import requests
+from tqdm.auto import tqdm
+
+from prompts import (
+    QA_generation_prompt,
+    question_groundedness_critique_prompt,
+    question_relevance_critique_prompt,
+    question_standalone_critique_prompt,
+)
+
+
+def get_response(
+    prompt: str, url: str = "http://localhost:11434/api/generate", model: str = "llama3"
+):
+    """
+    Sends a prompt to the Ollama API and retrieves the generated response.
+
+    Args:
+        prompt: The text input that the model will use to generate a response.
+        url: The API endpoint for the model (default: "http://localhost:11434/api/generate").
+        model: The model to be used for generation (default: "llama3").
+
+    Returns:
+        The generated response from the language model as a string.
+    """
+
+    payload = {"model": model, "prompt": prompt, "stream": False}
+    response = requests.post(url, json=payload)
+    resp = response.json()
+    return resp["response"]
+
+
+def qa_generator(
+    documents: list,
+    N_GENERATIONS: int = 20,
+):
+    """
+    Generates a specified number of question-answer pairs from the provided documents.
+
+    Args:
+        documents: A list of document objects to generate QA pairs from.
+        N_GENERATIONS: The number of question-answer pairs to generate (default: 20).
+
+    Returns:
+        A list of dictionaries, each containing the generated context, question, answer, and source document metadata.
+    """
+    print(f"Generating {N_GENERATIONS} QA couples...")
+
+    outputs = []
+    for sampled_context in tqdm(random.sample(documents, N_GENERATIONS)):
+        # Generate a QA couple
+        output_QA_couple = get_response(
+            QA_generation_prompt.format(context=sampled_context.text)
+        )
+        try:
+            question = output_QA_couple.split("Factoid question: ")[-1].split(
+                "Answer: "
+            )[0]
+            answer = output_QA_couple.split("Answer: ")[-1]
+            assert len(answer) < 300, "Answer is too long"
+            outputs.append(
+                {
+                    "context": sampled_context.text,
+                    "question": question,
+                    "answer": answer,
+                    "source_doc": sampled_context.metadata,
+                }
+            )
+        except Exception:
+            # Skip generations that do not follow the expected output format
+            continue
+    df = pd.DataFrame(outputs)
+    df.to_csv("QA.csv")
+    return outputs
+
+
+def critique_qa(
+    outputs: list,
+):
+    """
+    Critiques the generated question-answer pairs based on groundedness, relevance, and standalone quality.
+
+    Args:
+        outputs: A list of dictionaries containing generated QA pairs to be critiqued.
+
+    Returns:
+        The critiqued QA pairs with additional fields for groundedness, relevance, and standalone quality scores and evaluations.
+    """
+    print("Generating critique for each QA couple...")
+    for output in tqdm(outputs):
+        evaluations = {
+            "groundedness": get_response(
+                question_groundedness_critique_prompt.format(
+                    context=output["context"], question=output["question"]
+                ),
+            ),
+            "relevance": get_response(
+                question_relevance_critique_prompt.format(question=output["question"]),
+            ),
+            "standalone": get_response(
+                question_standalone_critique_prompt.format(question=output["question"]),
+            ),
+        }
+        try:
+            for criterion, evaluation in evaluations.items():
+                score, rationale = (
+                    int(evaluation.split("Total rating: ")[-1].strip()),
+                    evaluation.split("Total rating: ")[-2].split("Evaluation: ")[1],
+                )
+                output.update(
+                    {
+                        f"{criterion}_score": score,
+                        f"{criterion}_eval": rationale,
+                    }
+                )
+        except Exception:
+            # Skip critiques that do not follow the expected output format
+            continue
+    generated_questions = pd.DataFrame.from_dict(outputs)
+    # Keep only QA pairs that score well on all three criteria
+    generated_questions = generated_questions.loc[
+        (generated_questions["groundedness_score"] >= 4)
+        & (generated_questions["relevance_score"] >= 4)
+        & (generated_questions["standalone_score"] >= 4)
+    ]
+    generated_questions.to_csv("generated_questions.csv")
+    return outputs
diff --git a/graph_rag/evaluation/ragas_evaluation/evaluation_ragas.py b/graph_rag/evaluation/ragas_evaluation/evaluation_ragas.py
new file mode 100644
index 0000000..568562b
--- /dev/null
+++ b/graph_rag/evaluation/ragas_evaluation/evaluation_ragas.py
@@ -0,0 +1,118 @@
+"""
+This script loads a pre-processed test dataset, slices it into batches, and runs a series of metrics to evaluate the
+performance of a query engine using a language model and embeddings.
+
+Functions:
+- load_test_dataset: Loads a test dataset from a pickle file.
+- slice_data: Slices the dataset into batches for evaluation.
+- evaluate: Runs evaluation on the sliced dataset using specified metrics, LLMs, and embeddings.
+"""
+
+import pickle
+
+import pandas as pd
+from datasets import Dataset
+from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+from llama_index.llms.ollama import Ollama
+from ragas.integrations.llama_index import evaluate as ragas_evaluate
+from ragas.metrics import (
+    faithfulness,
+    answer_relevancy,
+    context_precision,
+    context_recall,
+)
+
+
+def load_test_dataset(
+    data: str,
+):
+    """
+    Loads a test dataset from a pickle file.
+
+    Args:
+        data: The path to the dataset file in pickle format.
+
+    Returns:
+        The loaded dataset as a list of records, or an empty list if loading fails due to an EOFError.
+    """
+    try:
+        with open(data, "rb") as f:
+            dataset = pickle.load(f)
+    except EOFError:
+        print("EOFError: the file may be corrupted or incomplete; returning an empty list.")
+        dataset = []
+    return dataset
+
+
+def slice_data(i: int, k: int, dataset: list):
+    """
+    Slices the dataset into smaller chunks for batch processing.
+
+    Args:
+        i: The starting index for the slice.
+        k: The size of the slice (number of records to include in each batch).
+        dataset: The list of records to be sliced.
+
+    Returns:
+        A dictionary containing the sliced dataset, with columns renamed for consistency with the ragas evaluation schema.
+    """
+
+    hf_dataset = Dataset.from_list(dataset[i : i + k])
+    hf_dataset = hf_dataset.rename_column("context", "contexts")
+    hf_dataset = hf_dataset.rename_column("answer", "ground_truth")
+    ds_dict = hf_dataset.to_dict()
+    return ds_dict
+
+
+def evaluate(
+    query_engine: object,
+    dataset: list,
+    batch: int = 4,
+    metrics: list = [
+        faithfulness,
+        answer_relevancy,
+        context_precision,
+        context_recall,
+    ],
+    llm: object = Ollama(base_url="http://localhost:11434", model="codellama"),
+    embeddings=HuggingFaceEmbedding(model_name="microsoft/codebert-base"),
+):
+    """
+    Evaluates the performance of a query engine on a dataset using various metrics and a language model.
+
+    Args:
+        query_engine: The query engine to be evaluated.
+        dataset: The list of test records to evaluate against.
+        batch: The number of records to process in each batch (default: 4).
+        metrics: A list of metrics to be used for evaluation (default: faithfulness, answer relevancy, context precision, and context recall).
+        llm: The language model to be used for evaluation (default: Ollama with model 'codellama').
+        embeddings: The embedding model to be used (default: HuggingFaceEmbedding with 'microsoft/codebert-base').
+
+    Returns:
+        A pandas DataFrame containing the evaluation results for all batches.
+    """
+
+    rows_count = len(dataset)
+
+    results_df = pd.DataFrame()
+
+    for i in range(0, rows_count, batch):
+        batch_data = slice_data(i, batch, dataset=dataset)
+
+        result = ragas_evaluate(
+            query_engine=query_engine,
+            metrics=metrics,
+            dataset=batch_data,
+            llm=llm,
+            embeddings=embeddings,
+        )
+
+        rdf = result.to_pandas()
+        results_df = pd.concat([results_df, rdf], ignore_index=True)
+        print(f"Processed batch {i // batch + 1}:")
+        print(rdf)
+
+    print(results_df)
+    results_df.to_csv("results.csv", index=False)
+    return results_df
diff --git a/graph_rag/evaluation/ragas_evaluation/prompts.py b/graph_rag/evaluation/ragas_evaluation/prompts.py
new file mode 100644
index 0000000..343e4db
--- /dev/null
+++ b/graph_rag/evaluation/ragas_evaluation/prompts.py
@@ -0,0 +1,79 @@
+"""
+This file contains the prompts that are passed to LLMs to generate and critique the test dataset for GraphRag.
+"""
+
+QA_generation_prompt = """
+Your task is to write a factoid question and an answer given a context.
+Your factoid question should be answerable with a specific, concise piece of factual information from the context.
+Your factoid question should be formulated in the same style as questions users could ask in a search engine.
+This means that your factoid question MUST NOT mention something like "according to the passage" or "context".
+
+Provide your answer as follows:
+
+Output:::
+Factoid question: (your factoid question)
+Answer: (your answer to the factoid question)
+
+Now here is the context.
+
+Context: {context}\n
+Output:::"""
+
+question_groundedness_critique_prompt = """
+You will be given a context and a question.
+Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
+Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.
+
+Provide your answer as follows:
+
+Answer:::
+Evaluation: (your rationale for the rating, as a text)
+Total rating: (your rating, as a number between 1 and 5)
+
+You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.
+
+Now here are the question and context.
+
+Question: {question}\n
+Context: {context}\n
+Answer::: """
+
+question_relevance_critique_prompt = """
+You will be given a question.
+Your task is to provide a 'total rating' representing how useful this question can be to machine learning developers building NLP applications with the Hugging Face ecosystem.
+Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.
+
+Provide your answer as follows:
+
+Answer:::
+Evaluation: (your rationale for the rating, as a text)
+Total rating: (your rating, as a number between 1 and 5)
+
+You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.
+
+Now here is the question.
+
+Question: {question}\n
+Answer::: """
+
+question_standalone_critique_prompt = """
+You will be given a question.
+Your task is to provide a 'total rating' representing how context-independent this question is.
+Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
+For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
+The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.
+
+For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independent from the context.
+
+Provide your answer as follows:
+
+Answer:::
+Evaluation: (your rationale for the rating, as a text)
+Total rating: (your rating, as a number between 1 and 5)
+
+You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.
+
+Now here is the question.
+ +Question: {question}\n +Answer::: """ diff --git a/graph_rag/evaluation/random/dataset_200_llama3.pkl b/graph_rag/evaluation/random/dataset_200_llama3.pkl new file mode 100644 index 0000000..c9600cb Binary files /dev/null and b/graph_rag/evaluation/random/dataset_200_llama3.pkl differ diff --git a/graph_rag/evaluation/random/keras_docs_embedded.pkl b/graph_rag/evaluation/random/keras_docs_embedded.pkl new file mode 100644 index 0000000..5f925c2 Binary files /dev/null and b/graph_rag/evaluation/random/keras_docs_embedded.pkl differ diff --git a/graph_rag/evaluation/random/results_5.csv b/graph_rag/evaluation/random/results_5.csv new file mode 100644 index 0000000..593b8a2 --- /dev/null +++ b/graph_rag/evaluation/random/results_5.csv @@ -0,0 +1,16 @@ +question,contexts,answer,ground_truth,faithfulness,answer_relevancy,context_precision,context_recall +"What is mixed precision in computing? +","['Examples\n\n* GPT-2 text generation\n* Parameter-efficient fine-tuning of GPT-2 with LoRA\n* Semantic Similarity\n* Sentence embeddings using Siamese RoBERTa-networks\n* Data Parallel Training with tf.distribute\n* English-to-Spanish translation\n* GPT text generation from scratch\n* Text Classification using FNet\n\n---' + ""Supported hardware\n\nWhile mixed precision will run on most hardware, it will only speed up models on recent NVIDIA GPUs and Google TPUs.\nNVIDIA GPUs support using a mix of float16 and float32, while TPUs support a mix of bfloat16 and float32.\n\nAmong NVIDIA GPUs, those with compute capability 7.0 or higher will see the greatest performance benefit\nfrom mixed precision because they have special hardware units, called Tensor Cores,\nto accelerate float16 matrix multiplications and convolutions. Older GPUs offer no math\nperformance benefit for using mixed precision, however memory and bandwidth savings can enable some speedups.\nYou can look up the compute capability for your GPU at NVIDIA's CUDA GPU web page.\nExamples of GPUs that will benefit most from mixed precision include RTX GPUs, the V100, and the A100.\n\nEven on CPUs and older GPUs, where no speedup is expected, mixed precision APIs can still be used for unit testing,\ndebugging, or just to try out the API. On CPUs, mixed precision will run significantly slower, however.\n\n\nYou can check your GPU type with the following command:\n\n```\nnvidia-smi -L\n```"" + ""The following are knowledge sequence in max depth 2 in the form of directed graph like:\n`subject -[predicate]->, object, <-[predicate_next_hop]-, object_next_hop ...`\n['Mixed precision', 'Will run on most hardware', 'But will only speed up models on recent nvidia gpus and google tpus']""]","Mixed precision is a technique used to improve the efficiency of computer computations by using lower-precision data types (e.g., float16 or bfloat16) for some parts of the computation while still maintaining accuracy. It is commonly used in deep learning and other numerical compute-intensive applications, where memory bandwidth and computational resources are limited. + +While mixed precision will run on most hardware, it will only speed up models on recent NVIDIA GPUs and Google TPUs. Older GPUs offer no math performance benefit for using mixed precision, however memory and bandwidth savings can enable some speedups. You can look up the compute capability for your GPU at NVIDIA's CUDA GPU web page to determine whether it will benefit from mixed precision. 
Examples of GPUs that will benefit most from mixed precision include RTX GPUs, the V100, and the A100.",A combination of different numerical precisions used to optimize memory usage and computation speed.,0.16666666666666666,0.9818589787934606,0.0,0.6666666666666666 +"What is the title of the guide discussed in the given context? +","['Available guides\n\n{{toc}}' + 'Hyperparameter Tuning\n\nThese guides cover KerasTuner best practices.']","The title of the guide discussed in the given context is ""Hyperparameter Tuning""",How to distribute training,0.0,0.9999999999999996,0.0,1.0 +"What is Keras 3? +",['No relationships found.'],"Keras 3 is a new version of the popular deep learning framework Keras that was released in October 2020. It introduces several new features and improvements over the previous versions, including support for more complex architectures, improved performance on certain tasks, and easier integration with other libraries and frameworks.","A deep learning framework that works with TensorFlow, JAX, and PyTorch interchangeably.",0.0,0.9747114512069875,0.0,0.5 +"What was the percentage boost in StableDiffusion training throughput with Keras 3? +","['A first example: A MNIST convnet\n\nLet\'s start with the Hello World of ML: training a convnet\nto classify MNIST digits.\n\nHere\'s the data:\n\n\n```python\n# Load the data and split it between train and test sets\n(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()\n\n# Scale images to the [0, 1] range\nx_train = x_train.astype(""float32"") / 255\nx_test = x_test.astype(""float32"") / 255\n# Make sure images have shape (28, 28, 1)\nx_train = np.expand_dims(x_train, -1)\nx_test = np.expand_dims(x_test, -1)\nprint(""x_train shape:"", x_train.shape)\nprint(""y_train shape:"", y_train.shape)\nprint(x_train.shape[0], ""train samples"")\nprint(x_test.shape[0], ""test samples"")\n```\n\n\n```\nx_train shape: (60000, 28, 28, 1)\ny_train shape: (60000,)\n60000 train samples\n10000 test samples\n\n```\n\nHere\'s our model.\n\nDifferent model-building options that Keras offers include:\n\n- The Sequential API (what we use below)\n- The Functional API (most typical)\n- Writing your own models yourself via subclassing (for advanced use cases)\n\n\n```python\n# Model parameters\nnum_classes = 10\ninput_shape = (28, 28, 1)\n\nmodel = keras.Sequential(\n [\n keras.layers.Input(shape=input_shape),\n keras.layers.Conv2D(64, kernel_size=(3, 3), activation=""relu""),\n keras.layers.Conv2D(64, kernel_size=(3, 3), activation=""relu""),\n keras.layers.MaxPooling2D(pool_size=(2, 2)),\n keras.layers.Conv2D(128, kernel_size=(3, 3), activation=""relu""),\n keras.layers.Conv2D(128, kernel_size=(3, 3), activation=""relu""),\n keras.layers.GlobalAveragePooling2D(),\n keras.layers.Dropout(0.5),\n keras.layers.Dense(num_classes, activation=""softmax""),\n ]\n)\n```\n\nHere\'s our model summary:\n\n\n```python\nmodel.summary()\n```\n\n\nModel: ""sequential""']","The percentage boost in StableDiffusion training throughput with Keras 3 is not mentioned in the provided context information. Therefore, I cannot answer this question.",Over 150%,0.0,0.9705647972395562,0.9999999999,1.0 +"What are some examples of pretrained models available in KerasCV and KerasNLP? +",['No relationships found.'],"Some examples of pre-trained models available in KerasCV include the InceptionV3 model, the ResNet50 model, and the VGG16 model. 
Some examples of pre-trained models available in KerasNLP include the BERT model, the RoBERTa model, and the DistilBERT model. These models can be used for a variety of natural language processing tasks such as text classification, sentiment analysis, named entity recognition, and question answering.","BERT, OPT, Whisper, T5, StableDiffusion, YOLOv8, SegmentAnything, etc.",0.0,0.9897694771234743,0.0,0.8571428571428571 diff --git a/graph_rag/graph_builder/requirements.txt b/graph_rag/graph_builder/requirements.txt index 7bf1028..fd4ba3d 100644 --- a/graph_rag/graph_builder/requirements.txt +++ b/graph_rag/graph_builder/requirements.txt @@ -3,4 +3,8 @@ llama-index-llms-ollama llama-index pyvis tree-sitter==0.21.3 -tree-sitter-languages \ No newline at end of file +tree-sitter-languages +tqdm +ragas +datasets +pandas \ No newline at end of file