Add CRAG benchmark (#88)
* add crag eval first pass code

* add first pass llm eval code

* fix answer correctness code

Signed-off-by: minmin-intel <[email protected]>

* docker container for crag eval

* sample data for testing

* docker compose for tgi gaudi

Signed-off-by: minmin-intel <[email protected]>

* fix tgi gaudi docker compose for llama3 70b

* update llm eval code

Signed-off-by: minmin-intel <[email protected]>

* allow per sample grading

Signed-off-by: minmin-intel <[email protected]>

* save graded scores

Signed-off-by: minmin-intel <[email protected]>

* update readme

Signed-off-by: minmin-intel <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update readme and test all commands

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* mv crag_eval to agent_eval

Signed-off-by: minmin-intel <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update test case col names in grade_answer.py

Signed-off-by: minmin-intel <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: minmin-intel <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
minmin-intel and pre-commit-ci[bot] authored Sep 10, 2024
1 parent 484b69a commit a9b087f
Showing 19 changed files with 649 additions and 18 deletions.
123 changes: 123 additions & 0 deletions evals/evaluation/agent_eval/crag_eval/README.md
@@ -0,0 +1,123 @@
# CRAG Benchmark for Agent QnA systems
## Overview
The [Comprehensive RAG (CRAG) benchmark](https://www.aicrowd.com/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024) was introduced by Meta in 2024 as a KDD Cup challenge. The CRAG benchmark has questions across five domains and eight question types, and provides a practical setup for evaluating RAG systems. In particular, CRAG includes questions whose answers change on time scales ranging from seconds to years; it considers entity popularity and covers not only head facts but also torso and tail facts; and it contains simple-fact questions as well as 7 types of complex questions, such as comparison, aggregation and set questions, to test the reasoning and synthesis capabilities of RAG solutions. Additionally, CRAG provides mock APIs for querying mock knowledge graphs, so developers can also benchmark the API-calling capabilities of agents. Moreover, golden answers are provided in the dataset, which makes auto-evaluation with LLMs more robust. Together these properties make CRAG a realistic and comprehensive benchmark for agents.

## Getting started
1. Set up a work directory and clone this repo into it.
```
export WORKDIR=<your-work-directory>
cd $WORKDIR
git clone https://github.com/opea-project/GenAIEval.git
```
2. Build docker image
```
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/docker/
bash build_image.sh
```
3. Set environment variables for downloading models from Hugging Face
```
mkdir $WORKDIR/hf_cache
export HF_CACHE_DIR=$WORKDIR/hf_cache
export HF_HOME=$HF_CACHE_DIR
export HUGGINGFACEHUB_API_TOKEN=<your-hf-api-token>
```
4. Start the docker container
This container will be used to preprocess the dataset and run the benchmark scripts.
```
bash launch_eval_container.sh
```

## CRAG dataset
1. Download the original data and process it with the commands below.
You need to create an account on the Meta CRAG challenge [website](https://www.aicrowd.com/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024). After logging in, go to this [link](https://www.aicrowd.com/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024/problems/meta-kdd-cup-24-crag-end-to-end-retrieval-augmented-generation/dataset_files) and download the `crag_task_3_dev_v4.tar.bz2` file. Then make a `datasets` directory in your work directory using the commands below.
```
cd $WORKDIR
mkdir datasets
```
Then put the `crag_task_3_dev_v4.tar.bz2` file in the `datasets` directory, and decompress it by running the command below.
```
cd $WORKDIR/datasets
tar -xf crag_task_3_dev_v4.tar.bz2
```
2. Preprocess the CRAG data
Data preprocessing directly affects the quality of the retrieval corpus and thus can have a significant impact on the agent QnA system. Here, we provide one way of preprocessing the data: we simply extract all the web search snippets as-is from the dataset, per domain. We also extract all the query-answer pairs along with other metadata, per domain. You can run the command below to use our method; the data processing will take some time to finish. (An illustrative sketch of the resulting record format is shown after step 3 below.)
```
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/preprocess_data
bash run_data_preprocess.sh
```
**Note**: This is an example of data processing. You can develop and optimize your own data processing for this benchmark.
3. Sample queries for the benchmark
The CRAG dataset has more than 4000 queries, and running all of them can be very expensive and time-consuming. You can sample a subset for the benchmark. Here we provide a script that samples up to 5 queries per `question_type` per dynamism category (`static_or_dynamic`) in each domain. For example, we were able to get 92 queries from the music domain using the script.
```
bash run_sample_data.sh
```
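
For reference, here is a minimal sketch of the records produced by the preprocessing and sampling steps above. The field names follow `process_data.py` and `sample_data.py` in this commit; the values are made up for illustration.
```
# Illustrative records only -- field names follow process_data.py / sample_data.py, values are invented.

# One line of crag_docs_<domain>.jsonl (retrieval corpus, one web-search snippet per line):
doc_record = {
    "query": "who sang thank you for being a friend?",
    "domain": "music",
    "doc": "Thank You for Being a Friend is a 1978 song written and recorded by Andrew Gold ...",
}

# One line of crag_qa_<domain>.jsonl (query-answer pair plus metadata, search results stripped).
# sample_data.py groups these records by question_type and static_or_dynamic and keeps up to 5 per group.
qa_record = {
    "query": "who sang thank you for being a friend?",
    "answer": "Andrew Gold",
    "domain": "music",
    "question_type": "simple",
    "static_or_dynamic": "static",
}
```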

## Launch agent QnA system
Here we showcase a RAG agent from the GenAIExamples repo. Please refer to the README in the [AgentQnA example](https://github.com/opea-project/GenAIExamples/tree/main/AgentQnA) for more details. <br/>
**Please note**: This is an example. You can build your own agent systems using OPEA components, then expose your own system as an endpoint for this benchmark.<br/>
To launch the agent in our AgentQnA example, open another terminal, then build the images and launch the agent system there.
1. Build images
```
export WORKDIR=<your-work-directory>
cd $WORKDIR
git clone https://github.com/opea-project/GenAIExamples.git
cd GenAIExamples/AgentQnA/tests/
bash 1_build_images.sh
```
2. Start retrieval tool
```
bash 2_start_retrieval_tool.sh
```
3. Ingest data into vector database and validate retrieval tool
```
# As an example, we will use the index_data.py script in the AgentQnA example
# to ingest the docs of the music domain. You can write your own script to ingest data.
# We will use the crag-eval docker container to run the index_data.py script.
# index_data.py is a client script: it sends data-indexing requests to the dataprep server
# that is part of the retrieval tool.
# So you need to switch back to the terminal where the crag-eval container is running.
cd $WORKDIR/GenAIExamples/AgentQnA/retrieval_tool/
python3 index_data.py --host_ip $host_ip --filedir ${WORKDIR}/datasets/crag_docs/ --filename crag_docs_music.jsonl
```
4. Launch and validate agent endpoint
```
# Go to the terminal where you launched the AgentQnA example
cd $WORKDIR/GenAIExamples/AgentQnA/tests/
bash 4_launch_and_validate_agent.sh
```

## Run CRAG benchmark
Once you have your agent system up and running, the next step is to generate answers with the agent. Change the variables in the script below and run it. By default, it will run a sampled set of queries in the music domain.
```
# Come back to the interactive crag-eval docker container
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/run_benchmark
bash run_generate_answer.sh
```
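
If you want to send a single ad-hoc query to your agent system before running the full benchmark, a smoke test along the following lines can help. The endpoint URL, port, and payload shape below are placeholders, not the actual AgentQnA interface; check `run_generate_answer.sh` for the exact request format the benchmark uses.
```
# Hypothetical smoke test for an agent endpoint -- adjust URL and payload to your deployment.
import requests

AGENT_URL = "http://localhost:9090/v1/agent"  # placeholder endpoint, not the actual AgentQnA URL

resp = requests.post(
    AGENT_URL,
    json={"query": "how many grammy awards did taylor swift win?"},  # example CRAG-style query
    timeout=300,
)
resp.raise_for_status()
print(resp.json())
```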

## Use LLM-as-judge to grade the answers
1. Launch the LLM endpoint with HuggingFace TGI: in another terminal, run the command below. By default, `meta-llama/Meta-Llama-3-70B-Instruct` is used as the LLM judge.
```
cd llm_judge
bash launch_llm_judge_endpoint.sh
```
2. Validate that the LLM endpoint is working properly.
```
export host_ip=$(hostname -I | awk '{print $1}')
curl ${host_ip}:8085/generate_stream \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
```
Then go back to the interactive crag-eval docker container and run the command below.
```
# Inside the crag-eval container
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/run_benchmark/llm_judge/
python3 test_llm_endpoint.py
```
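
`test_llm_endpoint.py` checks that the judge LLM can be reached from the evaluation code. If you want to reproduce that check by hand, a minimal sketch is shown below; it assumes the `langchain-huggingface` package from `requirements.txt`, and the actual script may differ.
```
# Minimal sketch of querying the TGI judge endpoint via LangChain (assumed approach, may differ from test_llm_endpoint.py).
import os

from langchain_huggingface import HuggingFaceEndpoint

host_ip = os.environ.get("host_ip", "localhost")
llm = HuggingFaceEndpoint(
    endpoint_url=f"http://{host_ip}:8085",  # TGI endpoint launched by launch_llm_judge_endpoint.sh
    max_new_tokens=100,
    temperature=0.01,
)
print(llm.invoke("What is Deep Learning?"))
```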
3. Grade answer correctness using the LLM judge. We use the `answer_correctness` metric from [ragas](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_correctness.py).
```
# Inside the crag-eval container
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/run_benchmark/
bash run_grading.sh
```
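
For context, `run_grading.sh` scores each (query, agent answer, golden answer) triple with the ragas `answer_correctness` metric. A minimal sketch of that kind of grading call is shown below; it assumes the judge LLM and an embedding model are wrapped as LangChain objects, and the model names and column values are illustrative rather than the exact ones used by the script.
```
# Illustrative use of ragas answer_correctness with a custom judge LLM (assumed setup).
from datasets import Dataset
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFaceEndpoint
from ragas import evaluate
from ragas.metrics import answer_correctness

judge_llm = HuggingFaceEndpoint(endpoint_url="http://localhost:8085", max_new_tokens=512)
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")  # illustrative embedding model

data = Dataset.from_dict(
    {
        "question": ["who sang thank you for being a friend?"],
        "answer": ["Andrew Gold sang it in 1978."],  # agent's generated answer
        "ground_truth": ["Andrew Gold"],             # golden answer from CRAG
    }
)
scores = evaluate(data, metrics=[answer_correctness], llm=judge_llm, embeddings=embeddings)
print(scores)
```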
24 changes: 24 additions & 0 deletions evals/evaluation/agent_eval/crag_eval/docker/Dockerfile
@@ -0,0 +1,24 @@
FROM ubuntu:22.04

WORKDIR /home/user

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
python3.11 \
python3-pip \
libpoppler-cpp-dev \
wget \
git \
poppler-utils \
libmkl-dev \
curl

COPY requirements.txt /home/user/requirements.txt

RUN pip install -r requirements.txt

RUN cd /home/user/ && \
git clone https://github.com/opea-project/GenAIEval.git

ENV PYTHONPATH=$PYTHONPATH:/home/user/GenAIEval/

WORKDIR /home/user
12 changes: 12 additions & 0 deletions evals/evaluation/agent_eval/crag_eval/docker/build_image.sh
@@ -0,0 +1,12 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

dockerfile=Dockerfile

docker build \
-f ${dockerfile} . \
-t crag-eval:latest \
--network=host \
--build-arg http_proxy=${http_proxy} \
--build-arg https_proxy=${https_proxy} \
--build-arg no_proxy=${no_proxy}
7 changes: 7 additions & 0 deletions evals/evaluation/agent_eval/crag_eval/docker/launch_eval_container.sh
@@ -0,0 +1,7 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

volume=$WORKDIR
host_ip=$(hostname -I | awk '{print $1}')

docker run -it -v $volume:/home/user/ -e WORKDIR=/home/user -e HF_HOME=/home/user/hf_cache -e host_ip=$host_ip -e http_proxy=$http_proxy -e https_proxy=$https_proxy crag-eval:latest
8 changes: 8 additions & 0 deletions evals/evaluation/agent_eval/crag_eval/docker/requirements.txt
@@ -0,0 +1,8 @@
datasets
evaluate
jieba
langchain-community
langchain-huggingface
pandas
ragas
sentence_transformers
120 changes: 120 additions & 0 deletions evals/evaluation/agent_eval/crag_eval/preprocess_data/process_data.py
@@ -0,0 +1,120 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import argparse
import json
import os

import tqdm


# NOTE: not called in this script -- the chunking calls in preprocess_data() are commented out.
def split_text(text, chunk_size=2000, chunk_overlap=400):
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
# Set a really small chunk size, just to show.
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
length_function=len,
is_separator_regex=False,
separators=["\n\n", "\n", ".", "!"],
)
return text_splitter.split_text(text)


def process_html_string(text):
from bs4 import BeautifulSoup

# print(text)
soup = BeautifulSoup(text, features="html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
script.extract() # rip it out

# get text
text_content = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text_content.splitlines())
    # break multi-headlines into a line each (split on double spaces)
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
final_text = "\n".join(chunk for chunk in chunks if chunk)
# print(final_text)
return final_text


def preprocess_data(input_file):
snippet = []
return_data = []
n = 0
with open(input_file, "r") as f:
for line in f:
data = json.loads(line)

# search results snippets --> retrieval corpus docs
docs = data["search_results"]

for doc in docs:
# chunks = split_text(doc['page_snippet'])
# for chunk in chunks:
# snippet.append({
# "query": data['query'],
# "domain": data['domain'],
# "doc":chunk})
snippet.append({"query": data["query"], "domain": data["domain"], "doc": doc["page_snippet"]})

# qa pairs without search results
output = {}
for k, v in data.items():
if k != "search_results":
output[k] = v
return_data.append(output)

            n += 1
            # NOTE: only the first 10 records of each input file are processed.
            if n == 10:
                break

return snippet, return_data


if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--filedir", type=str, default=None)
parser.add_argument("--docout", type=str, default=None)
parser.add_argument("--qaout", type=str, default=None)
# parser.add_argument('--chunk_size', type=int, default=10000)
# parser.add_argument('--chunk_overlap', type=int, default=0)

args = parser.parse_args()

if not os.path.exists(args.docout):
os.makedirs(args.docout)

if not os.path.exists(args.qaout):
os.makedirs(args.qaout)

data_files = os.listdir(args.filedir)

qa_pairs = []
docs = []
for file in tqdm.tqdm(data_files):
file = os.path.join(args.filedir, file)
doc, data = preprocess_data(file)
docs.extend(doc)
qa_pairs.extend(data)

# group by domain
domains = ["finance", "music", "movie", "sports", "open"]

for domain in domains:
with open(os.path.join(args.docout, "crag_docs_" + domain + ".jsonl"), "w") as f:
for doc in docs:
if doc["doc"] != "" and doc["domain"] == domain:
f.write(json.dumps(doc) + "\n")

with open(os.path.join(args.qaout, "crag_qa_" + domain + ".jsonl"), "w") as f:
for d in qa_pairs:
if d["domain"] == domain:
f.write(json.dumps(d) + "\n")
8 changes: 8 additions & 0 deletions evals/evaluation/agent_eval/crag_eval/preprocess_data/run_data_preprocess.sh
@@ -0,0 +1,8 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FILEDIR=$WORKDIR/datasets/crag_task_3_dev_v4
DOCOUT=$WORKDIR/datasets/crag_docs/
QAOUT=$WORKDIR/datasets/crag_qas/

python3 process_data.py --filedir $FILEDIR --docout $DOCOUT --qaout $QAOUT
6 changes: 6 additions & 0 deletions evals/evaluation/agent_eval/crag_eval/preprocess_data/run_sample_data.sh
@@ -0,0 +1,6 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FILEDIR=$WORKDIR/datasets/crag_qas

python3 sample_data.py --filedir $FILEDIR
33 changes: 33 additions & 0 deletions evals/evaluation/agent_eval/crag_eval/preprocess_data/sample_data.py
@@ -0,0 +1,33 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import argparse
import json
import os

import pandas as pd
import tqdm


def sample_data(input_file, output_file):
df = pd.read_json(input_file, lines=True, convert_dates=False)
# group by `question_type` and `static_or_dynamic`
df_grouped = df.groupby(["question_type", "static_or_dynamic"])
# sample 5 rows from each group if there are more than 5 rows else return all rows
df_sampled = df_grouped.apply(lambda x: x.sample(5) if len(x) > 5 else x)
# save sampled data to output file
df_sampled.to_json(output_file, orient="records", lines=True)


if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--filedir", type=str, default=None)

args = parser.parse_args()

data_files = os.listdir(args.filedir)
for file in tqdm.tqdm(data_files):
print(file)
file = os.path.join(args.filedir, file)
output_file = file.replace(".jsonl", "_sampled.jsonl")
sample_data(file, output_file)
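
To sanity-check the sampling output (for example, the roughly 92 music-domain queries mentioned in the README), a quick check along these lines can be run inside the eval container; the path is illustrative.
```
# Quick sanity check of a sampled file (illustrative path).
import pandas as pd

df = pd.read_json("datasets/crag_qas/crag_qa_music_sampled.jsonl", lines=True, convert_dates=False)
print(len(df))  # total sampled queries for the domain
print(df.groupby(["question_type", "static_or_dynamic"]).size())  # at most 5 rows per group
```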