Calculate official hotpot EM and F1 scores #292

Merged (1 commit, Dec 10, 2024)
82 changes: 70 additions & 12 deletions evals/deepeval_metrics.py
@@ -1,14 +1,72 @@
from deepeval.metrics import BaseMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

from evals.official_hotpot_metrics import exact_match_score, f1_score

correctness_metric = GEval(
    name="Correctness",
    model="gpt-4o-mini",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT
    ],
    evaluation_steps=[
        "Determine whether the actual output is factually correct based on the expected output."
    ]
)


class f1_score_metric(BaseMetric):
    """F1 score taken directly from the official hotpot benchmark
    implementation and wrapped into a deepeval metric."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase):
        f1, precision, recall = f1_score(
            prediction=test_case.actual_output,
            ground_truth=test_case.expected_output,
        )
        self.score = f1
        self.success = self.score >= self.threshold
        return self.score

    # Reusing regular measure as async F1 score is not implemented
    async def a_measure(self, test_case: LLMTestCase):
        return self.measure(test_case)

    def is_successful(self):
        return self.success

    @property
    def __name__(self):
        return "Official hotpot F1 score"

Comment on lines +43 to +45 (Contributor)

🛠️ Refactor suggestion

Rename __name__ Property to Avoid Conflicts

Overriding the special __name__ attribute can lead to confusion and unexpected behavior since __name__ is a built-in attribute in Python.

Rename the property to avoid shadowing the built-in attribute:

 @property
-def __name__(self):
+def name(self):
     return "Official hotpot F1 score"

Make sure to update any references to __name__ accordingly.


class em_score_metric(BaseMetric):
Comment on lines +19 to +46 (Contributor)

🛠️ Refactor suggestion

Potential Blocking in Asynchronous Methods

The a_measure methods in both f1_score_metric and em_score_metric classes directly call the synchronous measure method. This could block the event loop if measure involves long-running operations.

Consider making the measure method asynchronous if it performs I/O-bound or time-consuming tasks. Alternatively, offload the synchronous method to a thread with asyncio.to_thread (this requires an import asyncio at module level):

async def a_measure(self, test_case: LLMTestCase):
    return await asyncio.to_thread(self.measure, test_case)


"""Exact Match score taken directly from the official hotpot benchmark
implementation and wrapped into a deepeval metric."""

def __init__(self, threshold: float = 0.5):
self.threshold = threshold

def measure(self, test_case: LLMTestCase):
self.score = exact_match_score(
prediction=test_case.actual_output,
ground_truth=test_case.expected_output,
)
self.success = self.score >= self.threshold
return self.score

Comment on lines +55 to +61 (Contributor)

⚠️ Potential issue

Boolean Comparison with Float Threshold

In em_score_metric, self.score is a boolean value resulting from exact_match_score, but it's being compared to a float threshold. This can lead to unintended behavior.

Convert the boolean to a float for a meaningful comparison:

 def measure(self, test_case: LLMTestCase):
     self.score = exact_match_score(
         prediction=test_case.actual_output,
         ground_truth=test_case.expected_output,
     )
+    self.score = float(self.score)
     self.success = self.score >= self.threshold
     return self.score

Alternatively, set the threshold to 1.0 to reflect that only a perfect match is considered successful.
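
A brief editorial aside on this point (not part of the PR or the review): Python's bool is a subclass of int, so the comparison above already behaves sensibly; the explicit float cast mainly makes the returned score type unambiguous. A minimal check of that assumption:

    # bool subclasses int, so True/False compare and coerce numerically
    assert isinstance(True, int)
    assert (True >= 0.5) is True      # an exact match passes a 0.5 threshold
    assert (False >= 0.5) is False    # a mismatch fails it
    assert float(True) == 1.0         # the explicit cast yields a clean numeric score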


    # Reusing regular measure as async EM score is not implemented
    async def a_measure(self, test_case: LLMTestCase):
        return self.measure(test_case)

    def is_successful(self):
        return self.success

    @property
    def __name__(self):
        return "Official hotpot EM score"
6 changes: 5 additions & 1 deletion evals/llm_as_a_judge.py → evals/eval_on_hotpot.py
@@ -111,7 +111,9 @@ async def eval_on_hotpotQA(answer_provider, num_samples, eval_metric):

parser.add_argument("--with_cognee", action="store_true")
parser.add_argument("--num_samples", type=int, default=500)
parser.add_argument("--metric", type=str, default="correctness_metric")
parser.add_argument("--metric", type=str, default="correctness_metric",
help="Valid options are Deepeval metrics (e.g. AnswerRelevancyMetric) \
and metrics defined in evals/deepeval_metrics.py, e.g. f1_score_metric")

args = parser.parse_args()

@@ -120,6 +122,8 @@ async def eval_on_hotpotQA(answer_provider, num_samples, eval_metric):
    metric = metric_cls()
except AttributeError:
    metric = getattr(evals.deepeval_metrics, args.metric)
    if isinstance(metric, type):
        metric = metric()

if args.with_cognee:
    answer_provider = answer_with_cognee
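
For orientation, an assumed invocation (not taken from the PR): with these changes, the renamed script can presumably be run from the repository root and pointed at one of the new metrics by name, for example:

    python evals/eval_on_hotpot.py --with_cognee --num_samples 100 --metric f1_score_metric

Depending on how the evals package imports resolve, running it as python -m evals.eval_on_hotpot may be needed instead.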
86 changes: 86 additions & 0 deletions evals/official_hotpot_metrics.py
@@ -0,0 +1,86 @@
"""
These are the official evaluation metrics for HotpotQA taken from https://hotpotqa.github.io/
"""

import re
import string
import sys
from collections import Counter

import ujson as json


def normalize_answer(s):

def remove_articles(text):
return re.sub(r'\b(a|an|the)\b', ' ', text)

def white_space_fix(text):
return ' '.join(text.split())

def remove_punc(text):
exclude = set(string.punctuation)
return ''.join(ch for ch in text if ch not in exclude)

def lower(text):
return text.lower()

return white_space_fix(remove_articles(remove_punc(lower(s))))


def f1_score(prediction, ground_truth):
normalized_prediction = normalize_answer(prediction)
normalized_ground_truth = normalize_answer(ground_truth)

ZERO_METRIC = (0, 0, 0)

if normalized_prediction in ['yes', 'no', 'noanswer'] and normalized_prediction != normalized_ground_truth:
return ZERO_METRIC
if normalized_ground_truth in ['yes', 'no', 'noanswer'] and normalized_prediction != normalized_ground_truth:
return ZERO_METRIC

prediction_tokens = normalized_prediction.split()
ground_truth_tokens = normalized_ground_truth.split()
common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
num_same = sum(common.values())
if num_same == 0:
return ZERO_METRIC
precision = 1.0 * num_same / len(prediction_tokens)
recall = 1.0 * num_same / len(ground_truth_tokens)
f1 = (2 * precision * recall) / (precision + recall)
return f1, precision, recall


def exact_match_score(prediction, ground_truth):
return (normalize_answer(prediction) == normalize_answer(ground_truth))

def update_answer(metrics, prediction, gold):
em = exact_match_score(prediction, gold)
f1, prec, recall = f1_score(prediction, gold)
metrics['em'] += float(em)
metrics['f1'] += f1
metrics['prec'] += prec
metrics['recall'] += recall
return em, prec, recall

def update_sp(metrics, prediction, gold):
cur_sp_pred = set(map(tuple, prediction))
gold_sp_pred = set(map(tuple, gold))
tp, fp, fn = 0, 0, 0
for e in cur_sp_pred:
if e in gold_sp_pred:
tp += 1
else:
fp += 1
for e in gold_sp_pred:
if e not in cur_sp_pred:
fn += 1
prec = 1.0 * tp / (tp + fp) if tp + fp > 0 else 0.0
recall = 1.0 * tp / (tp + fn) if tp + fn > 0 else 0.0
f1 = 2 * prec * recall / (prec + recall) if prec + recall > 0 else 0.0
em = 1.0 if fp + fn == 0 else 0.0
metrics['sp_em'] += em
metrics['sp_f1'] += f1
metrics['sp_prec'] += prec
metrics['sp_recall'] += recall
return em, prec, recall
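
A quick editorial sanity check (not part of the PR) of what these functions compute, using made-up strings:

    from evals.official_hotpot_metrics import exact_match_score, f1_score, normalize_answer

    prediction = "The Beatles were formed in Liverpool."
    ground_truth = "Liverpool"

    print(normalize_answer(prediction))                 # "beatles were formed in liverpool"
    print(exact_match_score(prediction, ground_truth))  # False
    f1, precision, recall = f1_score(prediction, ground_truth)
    print(f1, precision, recall)                        # one shared token: P = 1/5, R = 1/1, F1 ≈ 0.33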