Commit
Ragaaf - adding new metric 'context relevance' (#185)
* small fix for ragas.py Signed-off-by: aasavari <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fixed error when metrics arg is used Signed-off-by: aasavari <[email protected]>
* updated README Signed-off-by: aasavari <[email protected]>
* added key features Signed-off-by: aasavari <[email protected]>
* edited formatting Signed-off-by: aasavari <[email protected]>
* improved readability Signed-off-by: aasavari <[email protected]>
* improved note in model section Signed-off-by: aasavari <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* spell check Signed-off-by: aasavari <[email protected]>
* adding context relevance metric to RAGAAF Signed-off-by: aasavari <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci

---------

Signed-off-by: aasavari <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 4c8f048, commit f995c9c
Showing 2 changed files with 78 additions and 42 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,66 +1,89 @@ | ||
# RAGAAF (RAG assessment - Annotation Free) | ||
|
||
We introduce RAGAAF, Intel's easy-to-use, flexible, open-source and annotation-free RAG evaluation tool using LLM-as-a-judge while benefitting from Intel's Gaudi2 AI accelerator chips. | ||
Intel's RAGAAF toolkit employs the open-source LLM-as-a-judge technique on Intel's Gaudi2 AI accelerator chips to perform annotation-free evaluation of RAG. | ||
|
||
## Overview | ||
### Data | ||
RAGAAF is best suited for Long Form Question Answering (LFQA) datasets where you want to gauge quality and factualness of the answer via LLM's intelligence. Here, you can use benchmarking datasets or bring your own custom datasets. Please make sure to set `field_map` to map AutoEval fields such as "question" to your dataset's corresponding field like "query". | ||
> Note : To use benchmarking datasets, set argument `data_mode=benchmarking`. Similarly, to use custom datasets, set `data_mode=local`. | ||
### Model | ||
AutoEval can run in 3 evaluation modes - | ||
1. `evaluation_mode="endpoint"` uses HuggingFace endpoint. | ||
- We recommend launching a HuggingFace endpoint on Gaudi AI accelerator machines to ensure maximum usage and performance. | ||
- To launch HF endpoint on Gaudi2, please follow the 2-step instructions here - [tgi-gaudi](https://github.com/huggingface/tgi-gaudi). | ||
- Pass your endpoint url as `model_name` argument. | ||
2. `evaluation_mode="openai"` uses openai backend. | ||
- Please set your `openai_key` and your choice of model as `model_name` argument. | ||
3. `evaluation_mode="local"` uses your local hardware. | ||
- Set `hf_token` argument and set your favourite open-source model in `model_name` argument. | ||
- GPU usage is prioritized when a GPU is available. If no GPU is available, the model will run on CPU. | ||
## Metrics | ||
AutoEval provides 4 metrics - factualness, correctness, relevance and readability. You can also bring your own metrics and grading scales. Don't forget to add your metric to `evaluation_metrics` argument. | ||
## Generation configuration | ||
We provide recommended generation parameters after experimenting with different LLMs. If you'd like to edit them to your requirement, please set generation parameters in `GENERATION_CONFIG` in `run_eval.py`. | ||
## Key features | ||
✨ Annotation Free evaluation (ground truth answers are not required). </br> | ||
🧠 Provides score and reasoning for each metric allowing a deep dive into LLM's thought process. </br> | ||
🤗 Quick access to latest innovations in opensource Large Language Models. </br> | ||
⏩ Seamlessly boost performance using Intel's powerful AI accelerator chips - Gaudi. </br> | ||
✍️ Flexibility to bring your own metrics, grading rubrics and datasets. | ||
|
||
## Run using HF endpoint | ||
```python3 | ||
# step 1 : choose your dataset -- local or benchmarking | ||
dataset = "explodinggradients/ragas-wikiqa" | ||
data_mode = "benchmarking" | ||
field_map = {"question": "question", "answer": "generated_with_rag", "context": "context"} | ||
|
||
# step 2 - choose your favourite LLM and hardware | ||
|
||
# evaluation_mode = "openai" | ||
# model_name = "gpt-4o" | ||
# openai_key = "<add your openai key>" | ||
## Run RAGAAF | ||
|
||
# evaluation_mode = "endpoint" | ||
# model_name = f"http://{host_ip}:{port}" | ||
### 1. Data | ||
We provide 3 modes for data loading - `benchmarking`, `unit` and `local` to support benchmarking datasets, unit test cases and your custom datasets. | ||
|
||
evaluation_mode = "local" | ||
model_name = "meta-llama/Llama-3.2-1B-Instruct" | ||
hf_token = "<add your HF token>" | ||
Let us see how to load a unit test case. | ||
```python3 | ||
# load your dataset | ||
dataset = "unit_data" # name of the dataset | ||
data_mode = "unit" # mode for data loading | ||
field_map = { | ||
"question": "question", | ||
"answer": "actual_output", | ||
"context": "contexts", | ||
} # map your data field such as "actual_output" to RAGAAF field "answer" | ||
|
||
# step 3 - choose metrics of your choice, you can also add custom metrics | ||
# your desired unit test case | ||
question = "What if these shoes don't fit?" | ||
actual_output = "We offer a 30-day full refund at no extra cost." | ||
contexts = [ | ||
"All customers are eligible for a 30 day full refund at no extra cost.", | ||
"We can only process full refund upto 30 day after the purchase.", | ||
] | ||
examples = [{"question": question, "actual_output": actual_output, "contexts": contexts}] | ||
``` | ||
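The `field_map` in this example pairs each RAGAAF field (`question`, `answer`, `context`) with the matching column in your data. As a rough illustration of what that mapping implies, the sketch below renames the columns of one example. The helper `apply_field_map` is hypothetical and not part of RAGAAF; the library's own loading logic may differ.

```python
# Hypothetical helper, NOT part of RAGAAF: renames dataset columns to RAGAAF field names.
def apply_field_map(example, field_map):
    """Return a new dict keyed by RAGAAF field names ("question", "answer", "context")."""
    missing = [column for column in field_map.values() if column not in example]
    if missing:
        raise KeyError(f"example is missing columns: {missing}")
    return {ragaaf_field: example[column] for ragaaf_field, column in field_map.items()}


field_map = {"question": "question", "answer": "actual_output", "context": "contexts"}
example = {
    "question": "What if these shoes don't fit?",
    "actual_output": "We offer a 30-day full refund at no extra cost.",
    "contexts": ["All customers are eligible for a 30 day full refund at no extra cost."],
}

mapped = apply_field_map(example, field_map)
print(sorted(mapped))  # ['answer', 'context', 'question']
```

A column named in `field_map` but absent from the data raises a `KeyError` here, which surfaces a mismatched mapping before any model call is made.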
### 2. Launch endpoint on Gaudi | ||
Please launch an endpoint on Gaudi2 using the most popular LLMs such as `mistralai/Mixtral-8x7B-Instruct-v0.1` by following the 2 step instructions here - [tgi-gaudi](https://github.com/huggingface/tgi-gaudi). | ||
### 3. Model | ||
We provide 3 evaluation modes - `endpoint`, `local` (supports CPU and GPU), `openai`. | ||
```python3 | ||
import os  # needed for os.getenv below | ||
# choose your favourite LLM and hardware | ||
host_ip = os.getenv("host_ip", "localhost") | ||
port = os.getenv("port", "<your port where the endpoint is active>") | ||
evaluation_mode = "endpoint" | ||
model_name = f"http://{host_ip}:{port}" | ||
``` | ||
> `local` evaluation mode uses your local hardware (GPU usage is prioritized over CPU when available). Don't forget to set `hf_token` argument and your favourite open-source model in `model_name` argument. </br> | ||
> `openai` evaluation mode uses openai backend. Please set your `openai_key` as argument and your choice of OpenAI model as `model_name` argument. | ||
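For comparison with the `endpoint` settings, the `local` and `openai` modes could be configured along these lines. The model names `meta-llama/Llama-3.2-1B-Instruct` and `gpt-4o` come from the earlier version of this README. Everything else is an assumption made for illustration: `MODE_SETTINGS` and `settings_for` are hypothetical names, not RAGAAF API, and the environment variable names `HF_TOKEN` and `OPENAI_API_KEY` are placeholders of our own choosing.

```python
import os

# Illustrative only: MODE_SETTINGS and settings_for are hypothetical, not RAGAAF API.
# The environment variable names are assumptions; use whatever your setup defines.
MODE_SETTINGS = {
    "local": {
        "model_name": "meta-llama/Llama-3.2-1B-Instruct",
        "hf_token": os.getenv("HF_TOKEN", "<add your HF token>"),
    },
    "openai": {
        "model_name": "gpt-4o",
        "openai_key": os.getenv("OPENAI_API_KEY", "<add your openai key>"),
    },
}


def settings_for(evaluation_mode):
    """Return the keyword arguments that go with the chosen evaluation mode."""
    if evaluation_mode not in MODE_SETTINGS:
        raise ValueError(f"unsupported evaluation_mode: {evaluation_mode!r}")
    return {"evaluation_mode": evaluation_mode, **MODE_SETTINGS[evaluation_mode]}


model_kwargs = settings_for("local")
print(model_kwargs["model_name"])
```

Keeping the per-mode arguments in one dictionary makes it harder to pass an `openai_key` with `local` mode or an `hf_token` with `openai` mode by accident.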
### 4. Metrics | ||
```python3 | ||
# choose metrics of your choice, you can also add custom metrics | ||
evaluation_metrics = ["factualness", "relevance", "correctness", "readability"] | ||
``` | ||
### 5. Evaluation | ||
```python3 | ||
from evals.metrics.ragaaf import AnnotationFreeEvaluate | ||
|
||
# step 4 - run evaluation | ||
evaluator = AnnotationFreeEvaluate( | ||
dataset=dataset, | ||
examples=examples, | ||
data_mode=data_mode, | ||
field_map=field_map, | ||
evaluation_mode=evaluation_mode, | ||
model_name=model_name, | ||
evaluation_metrics=evaluation_metrics, | ||
# openai_key=openai_key, | ||
hf_token=hf_token, | ||
debug_mode=True, | ||
# hf_token=hf_token, | ||
) | ||
|
||
responses = evaluator.measure() | ||
|
||
for response in responses: | ||
print(response) | ||
``` | ||
That's it! For troubleshooting, please submit an issue and we will get right on it. | ||
## Customizations | ||
1. If you'd like to change generation parameters, please edit `GENERATION_CONFIG` in `run_eval.py`. | ||
2. If you'd like to add a new metric, please mimic an existing metric, e.g., `./prompt_templates/correctness.py` | ||
```python3 | ||
class MetricName: | ||
name = "metric_name" | ||
required_columns = ["answer", "context", "question"] # the fields your metric needs | ||
template = """- <metric_name> : <metric_name> measures <note down what you'd like this metric to measure>. | ||
- Score 1: <add your grading rubric for score 1>. | ||
- Score 2: <add your grading rubric for score 2>. | ||
- Score 3: <add your grading rubric for score 3>. | ||
- Score 4: <add your grading rubric for score 4>. | ||
- Score 5: <add your grading rubric for score 5>.""" | ||
``` |
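As a worked instance of the template above, here is a filled-in metric. The metric itself (`Conciseness`) and its rubric wording are invented for illustration and are not shipped with RAGAAF; how a new class is registered with the evaluator is not shown here and should be checked against the existing metrics.

```python
# Hypothetical custom metric that follows the template's three attributes.
# The rubric text is an example only, not a recommended grading scale.
class Conciseness:
    name = "conciseness"
    required_columns = ["answer", "question"]  # this metric does not need the context
    template = """- Conciseness : Conciseness measures whether the answer addresses the question without unnecessary text.
  - Score 1: The answer is dominated by text unrelated to the question.
  - Score 2: The answer contains the key point but most of it is filler or repetition.
  - Score 3: The answer is on topic but includes noticeable repetition or padding.
  - Score 4: The answer is focused, with only minor unnecessary detail.
  - Score 5: The answer is focused and contains no unnecessary text."""


# Sanity check: the rubric should define every score from 1 to 5.
scores_defined = [f"Score {score}:" in Conciseness.template for score in range(1, 6)]
print(all(scores_defined))  # True
```

The earlier version of this README also notes that a custom metric should be added to the `evaluation_metrics` argument, here presumably under its `name`, `"conciseness"`.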
13 changes: 13 additions & 0 deletions
evals/metrics/ragaaf/prompt_templates/context_relevance.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
# Copyright (C) 2024 Intel Corporation | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
|
||
class ContextRelevance: | ||
name = "context_relevance" | ||
required_columns = ["question", "context"] | ||
template = """- Context Relevance: Context Relevance measures how well the context relates to the question. | ||
- Score 1: The context doesn't mention anything about the question or is completely irrelevant to the question. | ||
- Score 2: The context only identifies the domain (e.g. cnvrg) mentioned in the question and provides information from the correct domain. But, the context does not address the question itself and the point of the question is completely missed by it. | ||
- Score 3: The context correctly identifies the domain and essence of the question but the details in the context are not relevant to the focus of the question. | ||
- Score 4: The context correctly identifies the domain mentioned in the question and the essence of the question, and stays consistent with both of them. But there is some part of the context that is not relevant to the question, its topic or its essence. This irrelevant part damages the overall relevance of the context. | ||
- Score 5: The context is completely relevant to the question and the details do not deviate from the essence of the question. There are no parts of the context that are irrelevant or unnecessary for the given question.""" |
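Because `required_columns` lists only `question` and `context`, this metric appears to judge the retrieved context without looking at the generated answer. The sketch below shows one way such a declaration could be checked against a `field_map` before evaluation. It is an assumption-laden illustration: `missing_columns` is a hypothetical helper, the class is a stand-in with a shortened template, and RAGAAF's own validation (if any) may work differently.

```python
# Stand-in that mirrors the name and required_columns of the new class; template shortened.
class ContextRelevance:
    name = "context_relevance"
    required_columns = ["question", "context"]
    template = "- Context Relevance: Context Relevance measures how well the context relates to the question."


# Hypothetical helper, not part of RAGAAF.
def missing_columns(metric, field_map):
    """List the RAGAAF fields a metric needs that field_map does not provide."""
    return [column for column in metric.required_columns if column not in field_map]


full_map = {"question": "question", "answer": "actual_output", "context": "contexts"}
no_context_map = {"question": "question", "answer": "actual_output"}

print(missing_columns(ContextRelevance, full_map))        # []
print(missing_columns(ContextRelevance, no_context_map))  # ['context']
```

Assuming metrics are selected by their `name`, the new metric would be requested with `evaluation_metrics = ["factualness", "relevance", "correctness", "readability", "context_relevance"]`.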