
[alpha] Implementation of Semantic metrics and more basic metrics #6

Merged

NISH1001 merged 7 commits into main from feature/add-test-implementation on Mar 2, 2023

Conversation

Collaborator

@NISH1001 commented Feb 27, 2023

Major Changes

  • A metrics.semantics.SemanticMetric base type is added, with two implementations: BertScore and BartScore
  • Two new basic metrics are added:
    • metrics.basics.ExactMatchMetric: flattens all the references (in the multi-reference scenario) and performs an exact match (see the short illustration after this list)
    • metrics.basics.ConfusionMatrix: generates a confusion matrix for the provided set of labels (obtained from both references and predictions) using scikit-learn
  • Initial unit tests (based on pytest) are added at tests/. For now, they only cover the format_to_jury function and are very basic.
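
As a quick illustration of the multi-reference exact-match idea (plain Python, not the library implementation): a prediction counts as a match if it equals any of its references.

# Illustration only (not the library code): a prediction is an exact
# match if it equals any one of its references.
predictions = ["Paris", "London"]
references = [["Paris", "paris"], ["Berlin", "berlin"]]

matches = [int(pred in refs) for pred, refs in zip(predictions, references)]
exact_match = sum(matches) / len(matches)
print(exact_match)  # 0.5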

Minor Changes

  • models.defaults.DefaultQAModelWrapper now accepts a device param to run on CPU or GPU
  • models._base.HFPipelineWrapper has a pipeline property that returns the underlying pipeline (see the sketch after this list)
  • An evalem.misc.datasets.get_squad_v2(...) utility function is added to load the SQuAD v2 dataset
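
A minimal sketch of the new pipeline property (the question-answering pipeline here is only an example; adjust the task and device to your setup):

# Sketch: wrap an arbitrary HF pipeline and read it back via the new property.
from transformers import pipeline as hf_pipeline

from evalem.models import HFPipelineWrapper

wrapped = HFPipelineWrapper(hf_pipeline("question-answering"))
qa_pipeline = wrapped.pipeline  # returns the underlying HF pipeline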

Usage

from evalem.structures import (
    PredictionDTO,
    ReferenceDTO,
    EvaluationDTO,
    PredictionInstance,
    ReferenceInstance
)

from evalem.metrics import (
    Metric,
    AccuracyMetric,
    PrecisionMetric,
    RecallMetric,
    F1Metric,
    ConfusionMatrix,
    ExactMatchMetric,
    BertScore,
    BartScore,
)

from typing import Iterable, Type, Mapping, Union, List

from evalem.models import (
    DefaultQAModelWrapper,
    HFPipelineWrapper,
    ModelWrapper
)

from evalem.misc.datasets import get_squad_v2

# NOTE: Evaluator is used below; the import path is assumed here and may
# need adjusting to wherever Evaluator lives in the package.
from evalem.evaluators import Evaluator

# from transformers import pipeline
# wrapped_model = HFPipelineWrapper(
#     pipeline("question-answering"),
# )

wrapped_model = DefaultQAModelWrapper(device="cpu")

def run_pipeline(
    model: ModelWrapper,
    evaluators: Union[Evaluator, Iterable[Evaluator]],
    inputs,
    references,
) -> Iterable[Mapping[str, dict]]:
    # run inference once, then fan the predictions out to each evaluator
    predictions = model(inputs)
    evaluators = [evaluators] if not isinstance(evaluators, Iterable) else evaluators
    return [e(predictions=predictions, references=references) for e in evaluators]

data = get_squad_v2("validation", nsamples=100)

evaluators = [
    Evaluator(metrics=[
        AccuracyMetric(),
        ConfusionMatrix(),
        ExactMatchMetric(),
    ]),
    Evaluator(metrics=[
        BertScore(device="mps", model_type="distilbert-base-uncased"),
        BartScore(device="mps")
    ])
]

results = run_pipeline(
    wrapped_model,
    evaluators,
    data["inputs"],
    data["references"]
)
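
run_pipeline returns one result per Evaluator; a rough way to inspect the output, without assuming specific key names inside each mapping:

# Each evaluator result behaves like a mapping (per the return type above);
# print whatever keys and score payloads it contains.
for evaluator_result in results:
    for metric_name, score in evaluator_result.items():
        print(metric_name, score)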


def __init__(
    self,
    model_type: str = "roberta-large",

Collaborator

I suggest the default be the ubiquitous bert-base-uncased.

Collaborator

Or even the distilled version, as you pointed out in the example: distilbert-base-uncased.

Collaborator Author

Or even the distilled version, as you pointed out in the example: distilbert-base-uncased.

Ah yeah, good catch. Jury's default was actually roberta-large, so I think I went with that. But I agree on setting BERT as the default (BERT for the win, haha).

def __init__(self, device: str = "cpu", debug: bool = False) -> None:
    super().__init__(debug=debug)
    self.device = device

@abstractmethod
def compute(

Collaborator

Thinking out loud: evaluate() seems a more fitting name than compute().

Collaborator Author

Thinking out loud: evaluate() seems a more fitting name than compute().

Here, I actually wanted to differentiate the Evaluator.evaluate(...) and Metric.compute(...) methods; I didn't want to reuse the same interface name for two different components. Do you think we should have duplicate interface names?
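
Roughly, the split looks like this (a conceptual sketch only; the signatures here are assumed, not the actual implementation):

# Conceptual sketch only: Metric exposes compute(), while Evaluator exposes
# evaluate() (and __call__) and fans the inputs out across its metrics.
from typing import Iterable, Mapping


class Metric:
    def compute(self, predictions, references) -> dict:
        raise NotImplementedError


class Evaluator:
    def __init__(self, metrics: Iterable[Metric]) -> None:
        self.metrics = list(metrics)

    def evaluate(self, predictions, references) -> Mapping[str, dict]:
        return {
            type(m).__name__: m.compute(predictions=predictions, references=references)
            for m in self.metrics
        }

    __call__ = evaluate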

Comment on lines +33 to +40
data = load_dataset("squad_v2")[data_type]
data = data.shuffle(seed=42) if shuffle else data
data = data.select(range(nsamples)) if nsamples > 0 else data

inputs = [dict(question=d["question"], context=d["context"]) for d in data]
references = [d["answers"]["text"] for d in data]

inputs, references = zip(*filter(lambda x: len(x[1]) > 0, zip(inputs, references)))
Collaborator

planning to utilize the plausible flag here in the future?

Collaborator Author

planning to utilize the plausible flag here in the future?

Sorry, I didn't understand the question. plausible as in? The dataset loading right now is just a basic function; maybe in the future we could have our own dataset wrapper?

Collaborator

@muthukumaranR left a comment

lgtm

@NISH1001 merged commit e0bbdd9 into main on Mar 2, 2023

Collaborator Author

@NISH1001 commented Mar 2, 2023

I have merged after the initial review. I will open another PR sometime this week with a slightly changed implementation of model initialization.

@NISH1001 deleted the feature/add-test-implementation branch on March 2, 2023 at 21:28