diff --git a/evals/evaluation/agent_eval/crag_eval/README.md b/evals/evaluation/agent_eval/crag_eval/README.md
new file mode 100644
index 00000000..7b66f8a0
--- /dev/null
+++ b/evals/evaluation/agent_eval/crag_eval/README.md
@@ -0,0 +1,123 @@
+# CRAG Benchmark for Agent QnA systems
+## Overview
+The [Comprehensive RAG (CRAG) benchmark](https://www.aicrowd.com/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024) was introduced by Meta in 2024 as a KDD Cup challenge. The CRAG benchmark has questions across five domains and eight question types, and provides a practical set-up to evaluate RAG systems. In particular, CRAG includes questions whose answers change over time spans ranging from seconds to years; it considers entity popularity and covers not only head, but also torso and tail facts; and it contains simple-fact questions as well as 7 types of complex questions, such as comparison, aggregation and set questions, to test the reasoning and synthesis capabilities of RAG solutions. Additionally, CRAG provides mock APIs to query mock knowledge graphs, so that developers can also benchmark the API-calling capabilities of agents. Moreover, golden answers are provided in the dataset, which makes auto-evaluation with LLMs more robust. The CRAG benchmark is therefore a realistic and comprehensive benchmark for agents.
+
+## Getting started
+1. Set up a work directory and download this repo into it.
+```
+export WORKDIR=
+cd $WORKDIR
+git clone https://github.com/opea-project/GenAIEval.git
+```
+2. Build the docker image
+```
+cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/docker/
+bash build_image.sh
+```
+3. Set environment variables for downloading models from Hugging Face
+```
+mkdir $WORKDIR/hf_cache
+export HF_CACHE_DIR=$WORKDIR/hf_cache
+export HF_HOME=$HF_CACHE_DIR
+export HUGGINGFACEHUB_API_TOKEN=
+```
+4. Start the docker container
+This container will be used to preprocess the dataset and run the benchmark scripts.
+```
+bash launch_eval_container.sh
+```
+
+## CRAG dataset
+1. Download the original data and process it with the commands below.
+You need to create an account on the Meta CRAG challenge [website](https://www.aicrowd.com/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024). After logging in, go to this [link](https://www.aicrowd.com/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024/problems/meta-kdd-cup-24-crag-end-to-end-retrieval-augmented-generation/dataset_files) and download the `crag_task_3_dev_v4.tar.bz2` file. Then make a `datasets` directory in your work directory using the commands below.
+```
+cd $WORKDIR
+mkdir datasets
+```
+Then put the `crag_task_3_dev_v4.tar.bz2` file in the `datasets` directory and decompress it by running the command below.
+```
+cd $WORKDIR/datasets
+tar -xf crag_task_3_dev_v4.tar.bz2
+```
+2. Preprocess the CRAG data
+Data preprocessing directly affects the quality of the retrieval corpus and thus can have a significant impact on the agent QnA system. Here, we provide one way of preprocessing the data: we simply extract all the web search snippets as-is from the dataset, per domain. We also extract all the query-answer pairs along with other metadata, per domain. You can run the command below to use our method; the output format is sketched after the note below. The data processing will take some time to finish.
+```
+cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/preprocess_data
+bash run_data_preprocess.sh
+```
+**Note**: This is an example of data processing. You can develop and optimize your own data processing for this benchmark.
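+For reference, with the default paths in `run_data_preprocess.sh` the processed corpus goes to `$WORKDIR/datasets/crag_docs/` and the query-answer pairs go to `$WORKDIR/datasets/crag_qas/`. Each line of `crag_docs_<domain>.jsonl` is a JSON record with `query`, `domain` and `doc` (web search snippet) fields, and each line of `crag_qa_<domain>.jsonl` keeps the original CRAG fields (query, answer, question_type, static_or_dynamic, etc.) without the search results. A quick way to sanity-check the output:
+```
+# Inside the crag-eval container: peek at the processed files (assumes the default output paths above)
+head -n 1 $WORKDIR/datasets/crag_docs/crag_docs_music.jsonl
+head -n 1 $WORKDIR/datasets/crag_qas/crag_qa_music.jsonl
+```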
+3. Sample queries for the benchmark
+The CRAG dataset has more than 4000 queries, and running all of them can be very expensive and time-consuming. You can sample a subset for the benchmark. Here we provide a script to sample up to 5 queries per question type per dynamism category in each domain. For example, we were able to get 92 queries from the music domain using this script.
+```
+bash run_sample_data.sh
+```
+
+## Launch agent QnA system
+Here we showcase the RAG agent in the GenAIExamples repo. Please refer to the README in the [AgentQnA example](https://github.com/opea-project/GenAIExamples/tree/main/AgentQnA) for more details.
+**Please note**: This is an example. You can build your own agent system using OPEA components, then expose it as an endpoint for this benchmark.
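+Whatever agent system you use, its endpoint should be compatible with how the benchmark calls it: the `generate_answers.py` script in this folder sends a POST request whose JSON body has a single `query` field (the query text combined with its query time) and expects a JSON response with a `text` field containing the agent's answer. Below is a minimal smoke-test sketch of that contract; the port (9095) and route (`/v1/chat/completions`) are the AgentQnA defaults used in `run_generate_answer.sh`, and the sample question is only a placeholder, so adjust both for your own system.
+```
+# Hypothetical smoke test for an agent endpoint that follows the benchmark's request/response contract
+export agent_url="http://${host_ip}:9095/v1/chat/completions"   # host_ip is the IP of the machine running the agent
+curl ${agent_url} \
+  -X POST \
+  -H 'Content-Type: application/json' \
+  -d '{"query": "Question: What is OPEA?\nThe question was asked at: 2024-06-01"}'
+# The JSON response should contain a "text" field with the agent's answer.
+```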
+To launch the agent in our AgentQnA example, open another terminal, then build the images and launch the agent system there.
+1. Build images
+```
+export WORKDIR=
+cd $WORKDIR
+git clone https://github.com/opea-project/GenAIExamples.git
+cd GenAIExamples/AgentQnA/tests/
+bash 1_build_images.sh
+```
+2. Start the retrieval tool
+```
+bash 2_start_retrieval_tool.sh
+```
+3. Ingest data into the vector database and validate the retrieval tool
+```
+# As an example, we will use the index_data.py script in the AgentQnA example.
+# You can write your own script to ingest data.
+# Here we ingest the docs of the music domain.
+# We will use the crag-eval docker container to run the index_data.py script.
+# index_data.py is a client script: it sends data-indexing requests to the dataprep server that is part of the retrieval tool.
+# So you need to switch back to the terminal where the crag-eval container is running.
+cd $WORKDIR/GenAIExamples/AgentQnA/retrieval_tool/
+python3 index_data.py --host_ip $host_ip --filedir ${WORKDIR}/datasets/crag_docs/ --filename crag_docs_music.jsonl
+```
+4. Launch and validate the agent endpoint
+```
+# Go to the terminal where you launched the AgentQnA example
+cd $WORKDIR/GenAIExamples/AgentQnA/tests/
+bash 4_launch_and_validate_agent.sh
+```
+
+## Run CRAG benchmark
+Once you have your agent system up and running, the next step is to generate answers with the agent. Change the variables in the script below and run it. By default, it runs a sampled set of queries in the music domain.
+```
+# Come back to the interactive crag-eval docker container
+cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/run_benchmark
+bash run_generate_answer.sh
+```
+
+## Use LLM-as-judge to grade the answers
+1. Launch the LLM judge endpoint with HuggingFace TGI: in another terminal, run the command below. By default, `meta-llama/Meta-Llama-3-70B-Instruct` is used as the LLM judge.
+```
+cd llm_judge
+bash launch_llm_judge_endpoint.sh
+```
+2. Validate that the LLM endpoint is working properly.
+```
+export host_ip=$(hostname -I | awk '{print $1}')
+curl ${host_ip}:8085/generate_stream \
+  -X POST \
+  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
+  -H 'Content-Type: application/json'
+```
+Then go back to the interactive crag-eval container and run the command below.
+```
+# Inside the crag-eval container
+cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/run_benchmark/llm_judge/
+python3 test_llm_endpoint.py
+```
+3. Grade answer correctness using the LLM judge. We use the `answer_correctness` metric from [ragas](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_correctness.py).
+``` +# Inside the crag-eval container +cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/run_benchmark/ +bash run_grading.sh +``` diff --git a/evals/evaluation/agent_eval/crag_eval/docker/Dockerfile b/evals/evaluation/agent_eval/crag_eval/docker/Dockerfile new file mode 100644 index 00000000..a3a97c5b --- /dev/null +++ b/evals/evaluation/agent_eval/crag_eval/docker/Dockerfile @@ -0,0 +1,24 @@ +FROM ubuntu:22.04 + +WORKDIR /home/user + +RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \ + python3.11 \ + python3-pip \ + libpoppler-cpp-dev \ + wget \ + git \ + poppler-utils \ + libmkl-dev \ + curl + +COPY requirements.txt /home/user/requirements.txt + +RUN pip install -r requirements.txt + +RUN cd /home/user/ && \ + git clone https://github.com/opea-project/GenAIEval.git + +ENV PYTHONPATH=$PYTHONPATH:/home/user/GenAIEval/ + +WORKDIR /home/user \ No newline at end of file diff --git a/evals/evaluation/agent_eval/crag_eval/docker/build_image.sh b/evals/evaluation/agent_eval/crag_eval/docker/build_image.sh new file mode 100644 index 00000000..a743900f --- /dev/null +++ b/evals/evaluation/agent_eval/crag_eval/docker/build_image.sh @@ -0,0 +1,12 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +dockerfile=Dockerfile + +docker build \ + -f ${dockerfile} . \ + -t crag-eval:latest \ + --network=host \ + --build-arg http_proxy=${http_proxy} \ + --build-arg https_proxy=${https_proxy} \ + --build-arg no_proxy=${no_proxy} \ diff --git a/evals/evaluation/agent_eval/crag_eval/docker/launch_eval_container.sh b/evals/evaluation/agent_eval/crag_eval/docker/launch_eval_container.sh new file mode 100644 index 00000000..8698f452 --- /dev/null +++ b/evals/evaluation/agent_eval/crag_eval/docker/launch_eval_container.sh @@ -0,0 +1,7 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +volume=$WORKDIR +host_ip=$(hostname -I | awk '{print $1}') + +docker run -it -v $volume:/home/user/ -e WORKDIR=/home/user -e HF_HOME=/home/user/hf_cache -e host_ip=$host_ip -e http_proxy=$http_proxy -e https_proxy=$https_proxy crag-eval:latest diff --git a/evals/evaluation/agent_eval/crag_eval/docker/requirements.txt b/evals/evaluation/agent_eval/crag_eval/docker/requirements.txt new file mode 100644 index 00000000..b32606b7 --- /dev/null +++ b/evals/evaluation/agent_eval/crag_eval/docker/requirements.txt @@ -0,0 +1,8 @@ +datasets +evaluate +jieba +langchain-community +langchain-huggingface +pandas +ragas +sentence_transformers diff --git a/evals/evaluation/agent_eval/crag_eval/preprocess_data/process_data.py b/evals/evaluation/agent_eval/crag_eval/preprocess_data/process_data.py new file mode 100644 index 00000000..f8f4bb39 --- /dev/null +++ b/evals/evaluation/agent_eval/crag_eval/preprocess_data/process_data.py @@ -0,0 +1,120 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +import argparse +import json +import os + +import tqdm + + +def split_text(text, chunk_size=2000, chunk_overlap=400): + from langchain_text_splitters import RecursiveCharacterTextSplitter + + text_splitter = RecursiveCharacterTextSplitter( + # Set a really small chunk size, just to show. 
+ chunk_size=chunk_size, + chunk_overlap=chunk_overlap, + length_function=len, + is_separator_regex=False, + separators=["\n\n", "\n", ".", "!"], + ) + return text_splitter.split_text(text) + + +def process_html_string(text): + from bs4 import BeautifulSoup + + # print(text) + soup = BeautifulSoup(text, features="html.parser") + + # kill all script and style elements + for script in soup(["script", "style"]): + script.extract() # rip it out + + # get text + text_content = soup.get_text() + + # break into lines and remove leading and trailing space on each + lines = (line.strip() for line in text_content.splitlines()) + # break multi-headlines into a line each + chunks = (phrase.strip() for line in lines for phrase in line.split(" ")) + # drop blank lines + final_text = "\n".join(chunk for chunk in chunks if chunk) + # print(final_text) + return final_text + + +def preprocess_data(input_file): + snippet = [] + return_data = [] + n = 0 + with open(input_file, "r") as f: + for line in f: + data = json.loads(line) + + # search results snippets --> retrieval corpus docs + docs = data["search_results"] + + for doc in docs: + # chunks = split_text(doc['page_snippet']) + # for chunk in chunks: + # snippet.append({ + # "query": data['query'], + # "domain": data['domain'], + # "doc":chunk}) + snippet.append({"query": data["query"], "domain": data["domain"], "doc": doc["page_snippet"]}) + + # qa pairs without search results + output = {} + for k, v in data.items(): + if k != "search_results": + output[k] = v + return_data.append(output) + + n += 1 + if n == 10: + break + + return snippet, return_data + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--filedir", type=str, default=None) + parser.add_argument("--docout", type=str, default=None) + parser.add_argument("--qaout", type=str, default=None) + # parser.add_argument('--chunk_size', type=int, default=10000) + # parser.add_argument('--chunk_overlap', type=int, default=0) + + args = parser.parse_args() + + if not os.path.exists(args.docout): + os.makedirs(args.docout) + + if not os.path.exists(args.qaout): + os.makedirs(args.qaout) + + data_files = os.listdir(args.filedir) + + qa_pairs = [] + docs = [] + for file in tqdm.tqdm(data_files): + file = os.path.join(args.filedir, file) + doc, data = preprocess_data(file) + docs.extend(doc) + qa_pairs.extend(data) + + # group by domain + domains = ["finance", "music", "movie", "sports", "open"] + + for domain in domains: + with open(os.path.join(args.docout, "crag_docs_" + domain + ".jsonl"), "w") as f: + for doc in docs: + if doc["doc"] != "" and doc["domain"] == domain: + f.write(json.dumps(doc) + "\n") + + with open(os.path.join(args.qaout, "crag_qa_" + domain + ".jsonl"), "w") as f: + for d in qa_pairs: + if d["domain"] == domain: + f.write(json.dumps(d) + "\n") diff --git a/evals/evaluation/agent_eval/crag_eval/preprocess_data/run_data_preprocess.sh b/evals/evaluation/agent_eval/crag_eval/preprocess_data/run_data_preprocess.sh new file mode 100644 index 00000000..780f5f29 --- /dev/null +++ b/evals/evaluation/agent_eval/crag_eval/preprocess_data/run_data_preprocess.sh @@ -0,0 +1,8 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +FILEDIR=$WORKDIR/datasets/crag_task_3_dev_v4 +DOCOUT=$WORKDIR/datasets/crag_docs/ +QAOUT=$WORKDIR/datasets/crag_qas/ + +python3 process_data.py --filedir $FILEDIR --docout $DOCOUT --qaout $QAOUT diff --git a/evals/evaluation/agent_eval/crag_eval/preprocess_data/run_sample_data.sh 
b/evals/evaluation/agent_eval/crag_eval/preprocess_data/run_sample_data.sh new file mode 100644 index 00000000..dd104326 --- /dev/null +++ b/evals/evaluation/agent_eval/crag_eval/preprocess_data/run_sample_data.sh @@ -0,0 +1,6 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +FILEDIR=$WORKDIR/datasets/crag_qas + +python3 sample_data.py --filedir $FILEDIR diff --git a/evals/evaluation/agent_eval/crag_eval/preprocess_data/sample_data.py b/evals/evaluation/agent_eval/crag_eval/preprocess_data/sample_data.py new file mode 100644 index 00000000..51621194 --- /dev/null +++ b/evals/evaluation/agent_eval/crag_eval/preprocess_data/sample_data.py @@ -0,0 +1,33 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +import argparse +import json +import os + +import pandas as pd +import tqdm + + +def sample_data(input_file, output_file): + df = pd.read_json(input_file, lines=True, convert_dates=False) + # group by `question_type` and `static_or_dynamic` + df_grouped = df.groupby(["question_type", "static_or_dynamic"]) + # sample 5 rows from each group if there are more than 5 rows else return all rows + df_sampled = df_grouped.apply(lambda x: x.sample(5) if len(x) > 5 else x) + # save sampled data to output file + df_sampled.to_json(output_file, orient="records", lines=True) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--filedir", type=str, default=None) + + args = parser.parse_args() + + data_files = os.listdir(args.filedir) + for file in tqdm.tqdm(data_files): + print(file) + file = os.path.join(args.filedir, file) + output_file = file.replace(".jsonl", "_sampled.jsonl") + sample_data(file, output_file) diff --git a/evals/evaluation/agent_eval/crag_eval/run_benchmark/generate_answers.py b/evals/evaluation/agent_eval/crag_eval/run_benchmark/generate_answers.py new file mode 100644 index 00000000..19f7f747 --- /dev/null +++ b/evals/evaluation/agent_eval/crag_eval/run_benchmark/generate_answers.py @@ -0,0 +1,83 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +import argparse +import json +import os + +import pandas as pd +import requests + + +def get_test_data(args): + if args.query_file.endswith(".jsonl"): + df = pd.read_json(args.query_file, lines=True, convert_dates=False) + elif args.query_file.endswith(".csv"): + df = pd.read_csv(args.query_file) + return df + + +def generate_answer(url, prompt): + proxies = {"http": ""} + payload = { + "query": prompt, + } + response = requests.post(url, json=payload, proxies=proxies) + answer = response.json()["text"] + return answer + + +def save_results(output_file, output_list): + with open(output_file, "w") as f: + for output in output_list: + f.write(json.dumps(output)) + f.write("\n") + + +def save_as_csv(output): + df = pd.read_json(output, lines=True, convert_dates=False) + df.to_csv(output.replace(".jsonl", ".csv"), index=False) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--endpoint_url", type=str, default=None, help="url of the agent QnA system endpoint") + parser.add_argument("--query_file", type=str, default=None, help="query jsonl file") + parser.add_argument("--output_file", type=str, default="output.jsonl", help="output jsonl file") + args = parser.parse_args() + + url = args.endpoint_url + + df = get_test_data(args) + # df = df.head() # for validation purpose + + if not os.path.exists(os.path.dirname(args.output_file)): + 
os.makedirs(os.path.dirname(args.output_file)) + + output_list = [] + n = 0 + for _, row in df.iterrows(): + q = row["query"] + t = row["query_time"] + prompt = "Question: {}\nThe question was asked at: {}".format(q, t) + print("******Query:\n", prompt) + print("******Agent is working on the query") + answer = generate_answer(url, prompt) + print("******Answer from agent:\n", answer) + print("=" * 50) + output_list.append( + { + "query": q, + "query_time": t, + "ref_answer": row["answer"], + "answer": answer, + "question_type": row["question_type"], + "static_or_dynamic": row["static_or_dynamic"], + } + ) + save_results(args.output_file, output_list) + # n += 1 + # if n > 1: + # break + save_results(args.output_file, output_list) + save_as_csv(args.output_file) diff --git a/evals/evaluation/agent_eval/crag_eval/run_benchmark/grade_answers.py b/evals/evaluation/agent_eval/crag_eval/run_benchmark/grade_answers.py new file mode 100644 index 00000000..8f95d497 --- /dev/null +++ b/evals/evaluation/agent_eval/crag_eval/run_benchmark/grade_answers.py @@ -0,0 +1,91 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +import argparse +import os + +import pandas as pd +from ragas.metrics import answer_correctness + +from evals.metrics.ragas import RagasMetric + + +def convert_data_format_for_ragas(data): + output = { + "question": data["query"].tolist(), + "answer": data["answer"].tolist(), + "ground_truth": data["ref_answer"].tolist(), + "contexts": [["dummy_context"] for _ in range(data["query"].shape[0])], + } + return output + + +def make_list_of_test_cases(data): + output = [] + for _, row in data.iterrows(): + output.append( + { + "question": [row["query"]], + "answer": [row["answer"]], + "ground_truth": [row["ref_answer"]], + "contexts": [["dummy_context"]], + } + ) + return output + + +def grade_answers(args, test_case): + from langchain_community.embeddings import HuggingFaceBgeEmbeddings + + print("==============getting embeddings==============") + embeddings = HuggingFaceBgeEmbeddings(model_name=args.embed_model) + print("==============initiating metric==============") + metric = RagasMetric(threshold=0.5, metrics=["answer_correctness"], model=args.llm_endpoint, embeddings=embeddings) + print("==============start grading==============") + + if args.batch_grade: + metric.measure(test_case) + return metric.score["answer_correctness"] + else: + scores = [] + for case in test_case: + metric.measure(case) + scores.append(metric.score["answer_correctness"]) + print(metric.score) + print("-" * 50) + return scores + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--embed_model", type=str, default="BAAI/bge-base-en-v1.5") + parser.add_argument("--llm_endpoint", type=str, default="http://localhost:8008") + parser.add_argument("--filedir", type=str, help="Path to the file containing the data") + parser.add_argument("--filename", type=str, help="Name of the file containing the data") + parser.add_argument( + "--batch_grade", + action="store_true", + help="Grade the answers in batch and get an aggregated score for the entire dataset", + ) + args = parser.parse_args() + + data = pd.read_csv(os.path.join(args.filedir, args.filename)) + + if args.batch_grade: + test_case = convert_data_format_for_ragas(data) + else: + test_case = make_list_of_test_cases(data) + + # print(test_case) + + scores = grade_answers(args, test_case) + + # save the scores + if args.batch_grade: + print("Aggregated answer correctness score: ", scores) + 
else: + data["answer_correctness"] = scores + print("Average answer correctness score: ", data["answer_correctness"].mean()) + output_file = args.filename.split(".")[0] + "_graded.csv" + data.to_csv(os.path.join(args.filedir, output_file), index=False) + print("Scores saved to ", os.path.join(args.filedir, output_file)) diff --git a/evals/evaluation/agent_eval/crag_eval/run_benchmark/llm_judge/docker-compose-llm-judge-gaudi.yaml b/evals/evaluation/agent_eval/crag_eval/run_benchmark/llm_judge/docker-compose-llm-judge-gaudi.yaml new file mode 100644 index 00000000..572011ef --- /dev/null +++ b/evals/evaluation/agent_eval/crag_eval/run_benchmark/llm_judge/docker-compose-llm-judge-gaudi.yaml @@ -0,0 +1,26 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +services: + tgi-service: + image: ghcr.io/huggingface/tgi-gaudi:latest + container_name: tgi-server + ports: + - "8085:80" + volumes: + - ${HF_CACHE_DIR}:/data + environment: + no_proxy: ${no_proxy} + http_proxy: ${http_proxy} + https_proxy: ${https_proxy} + HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN} + HF_HUB_DISABLE_PROGRESS_BARS: 1 + HF_HUB_ENABLE_HF_TRANSFER: 0 + HABANA_VISIBLE_DEVICES: all + OMPI_MCA_btl_vader_single_copy_mechanism: none + PT_HPU_ENABLE_LAZY_COLLECTIVES: true + runtime: habana + cap_add: + - SYS_NICE + ipc: host + command: --model-id ${LLM_MODEL_ID} --max-input-length 4096 --max-total-tokens 8192 --sharded true --num-shard 4 diff --git a/evals/evaluation/agent_eval/crag_eval/run_benchmark/llm_judge/docker-compose-llm-judge.yaml b/evals/evaluation/agent_eval/crag_eval/run_benchmark/llm_judge/docker-compose-llm-judge.yaml new file mode 100644 index 00000000..a954098e --- /dev/null +++ b/evals/evaluation/agent_eval/crag_eval/run_benchmark/llm_judge/docker-compose-llm-judge.yaml @@ -0,0 +1,22 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +version: "3.8" + +services: + tgi_service: + image: ghcr.io/huggingface/text-generation-inference:2.1.0 + container_name: tgi-service + ports: + - "8085:80" + volumes: + - ${HF_CACHE_DIR}:/data + shm_size: 1g + environment: + no_proxy: ${no_proxy} + http_proxy: ${http_proxy} + https_proxy: ${https_proxy} + HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN} + HF_HUB_DISABLE_PROGRESS_BARS: 1 + HF_HUB_ENABLE_HF_TRANSFER: 0 + command: --model-id ${LLM_MODEL_ID} diff --git a/evals/evaluation/agent_eval/crag_eval/run_benchmark/llm_judge/launch_llm_judge_endpoint.sh b/evals/evaluation/agent_eval/crag_eval/run_benchmark/llm_judge/launch_llm_judge_endpoint.sh new file mode 100644 index 00000000..0cb08d8f --- /dev/null +++ b/evals/evaluation/agent_eval/crag_eval/run_benchmark/llm_judge/launch_llm_judge_endpoint.sh @@ -0,0 +1,7 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +export LLM_MODEL_ID="meta-llama/Meta-Llama-3-70B-Instruct" +export HUGGINGFACEHUB_API_TOKEN=${HUGGINGFACEHUB_API_TOKEN} +export HF_CACHE_DIR=${HF_CACHE_DIR} +docker compose -f docker-compose-llm-judge-gaudi.yaml up -d diff --git a/evals/evaluation/agent_eval/crag_eval/run_benchmark/llm_judge/test_llm_endpoint.py b/evals/evaluation/agent_eval/crag_eval/run_benchmark/llm_judge/test_llm_endpoint.py new file mode 100644 index 00000000..c23f6af9 --- /dev/null +++ b/evals/evaluation/agent_eval/crag_eval/run_benchmark/llm_judge/test_llm_endpoint.py @@ -0,0 +1,19 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +import os + +from langchain_huggingface import HuggingFaceEndpoint + +host_ip = os.environ.get("host_ip", 
"localhost") +url = "http://{host_ip}:8085".format(host_ip=host_ip) +print(url) + +model = HuggingFaceEndpoint( + endpoint_url=url, + task="text-generation", + max_new_tokens=10, + do_sample=False, +) + +print(model.invoke("what is deep learing?")) diff --git a/evals/evaluation/agent_eval/crag_eval/run_benchmark/run_generate_answer.sh b/evals/evaluation/agent_eval/crag_eval/run_benchmark/run_generate_answer.sh new file mode 100644 index 00000000..ee863bba --- /dev/null +++ b/evals/evaluation/agent_eval/crag_eval/run_benchmark/run_generate_answer.sh @@ -0,0 +1,16 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +host_ip=$host_ip # change this to the host IP of the agent +port=9095 # change this to the port of the agent +endpoint=${port}/v1/chat/completions # change this to the endpoint of the agent +URL="http://${host_ip}:${endpoint}" +echo "AGENT ENDPOINT URL: ${URL}" + +QUERYFILE=$WORKDIR/datasets/crag_qas/crag_qa_music_sampled.jsonl +OUTPUTFILE=$WORKDIR/datasets/crag_results/crag_music_sampled_results.jsonl + +python3 generate_answers.py \ +--endpoint_url ${URL} \ +--query_file $QUERYFILE \ +--output_file $OUTPUTFILE diff --git a/evals/evaluation/agent_eval/crag_eval/run_benchmark/run_grading.sh b/evals/evaluation/agent_eval/crag_eval/run_benchmark/run_grading.sh new file mode 100644 index 00000000..5431d39b --- /dev/null +++ b/evals/evaluation/agent_eval/crag_eval/run_benchmark/run_grading.sh @@ -0,0 +1,11 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +FILEDIR=$WORKDIR/datasets/crag_results/ +FILENAME=crag_music_sampled_results.csv +LLM_ENDPOINT=http://${host_ip}:8085 # change host_ip to the IP of LLM endpoint + +python3 grade_answers.py \ +--filedir $FILEDIR \ +--filename $FILENAME \ +--llm_endpoint $LLM_ENDPOINT \ diff --git a/evals/metrics/ragas/ragas.py b/evals/metrics/ragas/ragas.py index 9525ce07..35449c08 100644 --- a/evals/metrics/ragas/ragas.py +++ b/evals/metrics/ragas/ragas.py @@ -7,9 +7,9 @@ import os from typing import Dict, Optional, Union -from langchain_community.llms import HuggingFaceEndpoint from langchain_core.embeddings import Embeddings from langchain_core.language_models import BaseLanguageModel +from langchain_huggingface import HuggingFaceEndpoint def format_ragas_metric_name(name: str): @@ -42,14 +42,7 @@ def __init__( "reference_free_rubrics_score", ] - async def a_measure(self, test_case: Dict): - return self.measure(test_case) - - def measure(self, test_case: Dict): - - # sends to server try: - from ragas import evaluate from ragas.metrics import ( answer_correctness, answer_relevancy, @@ -60,7 +53,6 @@ def measure(self, test_case: Dict): faithfulness, reference_free_rubrics_score, ) - except ModuleNotFoundError: raise ModuleNotFoundError("Please install ragas to use this metric. 
`pip install ragas`.") @@ -68,6 +60,7 @@ def measure(self, test_case: Dict): from datasets import Dataset except ModuleNotFoundError: raise ModuleNotFoundError("Please install dataset") + self.metrics_instance = { "answer_correctness": answer_correctness, "answer_relevancy": answer_relevancy, @@ -85,14 +78,17 @@ def measure(self, test_case: Dict): print("OPENAI_API_KEY is provided, ragas initializes the model by OpenAI.") self.model = None if isinstance(self.model, str): - chat_model = HuggingFaceEndpoint( + print("LLM endpoint: ", self.model) + self.chat_model = HuggingFaceEndpoint( endpoint_url=self.model, - timeout=600, + task="text-generation", + max_new_tokens=1024, + do_sample=False, ) else: - chat_model = self.model - # Create a dataset from the test case - # Convert the Dict to a format compatible with Dataset + self.chat_model = self.model + + # initialize metrics if self.metrics is not None: tmp_metrics = [] # check supported list @@ -110,8 +106,10 @@ def measure(self, test_case: Dict): if metric == "answer_relevancy" and self.embeddings is None: raise ValueError("answer_relevancy metric need provide embeddings model.") tmp_metrics.append(self.metrics_instance[metric]) + self.metrics = tmp_metrics - else: + + else: # default metrics self.metrics = [ answer_relevancy, faithfulness, @@ -121,6 +119,19 @@ def measure(self, test_case: Dict): context_recall, ] + async def a_measure(self, test_case: Dict): + return self.measure(test_case) + + def measure(self, test_case: Dict): + from ragas import evaluate + + try: + from datasets import Dataset + except ModuleNotFoundError: + raise ModuleNotFoundError("Please install dataset") + + # Create a dataset from the test case + # Convert the Dict to a format compatible with Dataset data = { "question": test_case["question"], "contexts": test_case["contexts"], @@ -132,7 +143,7 @@ def measure(self, test_case: Dict): self.score = evaluate( dataset, metrics=self.metrics, - llm=chat_model, + llm=self.chat_model, embeddings=self.embeddings, ) return self.score diff --git a/tests/test_ragas.py b/tests/test_ragas.py index e11835ad..3376b0b5 100644 --- a/tests/test_ragas.py +++ b/tests/test_ragas.py @@ -4,14 +4,18 @@ # SPDX-License-Identifier: Apache-2.0 +import os import unittest from evals.metrics.ragas import RagasMetric +host_ip = os.getenv("host_ip", "localhost") +port = os.getenv("port", "8008") + class TestRagasMetric(unittest.TestCase): - @unittest.skip("need pass localhost id") + # @unittest.skip("need pass localhost id") def test_ragas(self): # Replace this with the actual output from your LLM application actual_output = "We offer a 30-day full refund at no extra cost." @@ -24,7 +28,7 @@ def test_ragas(self): from langchain_community.embeddings import HuggingFaceBgeEmbeddings embeddings = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-base-en-v1.5") - metric = RagasMetric(threshold=0.5, model="http://localhost:8008", embeddings=embeddings) + metric = RagasMetric(threshold=0.5, model=f"http://{host_ip}:{port}", embeddings=embeddings) test_case = { "question": ["What if these shoes don't fit?"], "answer": [actual_output],