
[alpha] Implementation of Semantic metrics and more basic metrics #6

Merged

NISH1001 merged 7 commits into main from feature/add-test-implementation on Mar 2, 2023

Conversation

Collaborator

@NISH1001 commented Feb 27, 2023

Major Changes

  • A metrics.semantics.SemanticMetric base type is added, with two implementations: BertScore and BartScore
  • Two new basic metrics are added:
    • metrics.basics.ExactMatchMetric: flattens all the references (in the multi-reference scenario) and performs an exact match (see the short illustration after this list)
    • metrics.basics.ConfusionMatrix: generates a confusion matrix for the provided set of labels (obtained from both references and predictions) using scikit-learn
  • Initial unit tests (based on pytest) are added at tests/. For now, they only cover the format_to_jury function and are very basic.
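
As a quick illustration of the multi-reference exact-match idea (plain Python, not the library implementation): a prediction counts as a match if it equals any of its references.

# Illustration only (not the library code): a prediction is an exact
# match if it equals any one of its references.
predictions = ["Paris", "London"]
references = [["Paris", "paris"], ["Berlin", "berlin"]]

matches = [int(pred in refs) for pred, refs in zip(predictions, references)]
exact_match = sum(matches) / len(matches)
print(exact_match)  # 0.5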

Minor Changes

  • models.defaults.DefaultQAModelWrapper now accepts a device param to run on CPU or GPU
  • models._base.HFPipelineWrapper has a pipeline property that returns the underlying pipeline (see the sketch after this list)
  • An evalem.misc.datasets.get_squad_v2(...) utility function is added to load the SQuAD v2 dataset
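
A minimal sketch of the new pipeline property (the question-answering pipeline here is only an example; adjust the task and device to your setup):

# Sketch: wrap an arbitrary HF pipeline and read it back via the new property.
from transformers import pipeline as hf_pipeline

from evalem.models import HFPipelineWrapper

wrapped = HFPipelineWrapper(hf_pipeline("question-answering"))
qa_pipeline = wrapped.pipeline  # returns the underlying HF pipeline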

Usage

from evalem.structures import (
    PredictionDTO,
    ReferenceDTO,
    EvaluationDTO,
    PredictionInstance,
    ReferenceInstance
)

from evalem.metrics import (
    Metric,
    AccuracyMetric,
    PrecisionMetric,
    RecallMetric,
    F1Metric,
    ConfusionMatrix,
    ExactMatchMetric,
    BertScore,
    BartScore,
)

from typing import Iterable, Type, Mapping, Union, List

from evalem.models import (
    DefaultQAModelWrapper,
    HFPipelineWrapper,
    ModelWrapper
)

from evalem.misc.datasets import get_squad_v2

# NOTE: Evaluator is used below; the import path is assumed here and may
# need adjusting to wherever Evaluator lives in the package.
from evalem.evaluators import Evaluator

# from transformers import pipeline
# wrapped_model = HFPipelineWrapper(
#     pipeline("question-answering"),
# )

wrapped_model = DefaultQAModelWrapper(device="cpu")

def run_pipeline(
    model: ModelWrapper,
    evaluators: Union[Evaluator, Iterable[Evaluator]],
    inputs,
    references,
) -> Iterable[Mapping[str, dict]]:
    # run inference once, then fan the predictions out to each evaluator
    predictions = model(inputs)
    evaluators = [evaluators] if not isinstance(evaluators, Iterable) else evaluators
    return [e(predictions=predictions, references=references) for e in evaluators]

data = get_squad_v2("validation", nsamples=100)

evaluators = [
    Evaluator(metrics=[
        AccuracyMetric(),
        ConfusionMatrix(),
        ExactMatchMetric(),
    ]),
    Evaluator(metrics=[
        BertScore(device="mps", model_type="distilbert-base-uncased"),
        BartScore(device="mps")
    ])
]

results = run_pipeline(
    wrapped_model,
    evaluators,
    data["inputs"],
    data["references"]
)
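
run_pipeline returns one result per Evaluator; a rough way to inspect the output, without assuming specific key names inside each mapping:

# Each evaluator result behaves like a mapping (per the return type above);
# print whatever keys and score payloads it contains.
for evaluator_result in results:
    for metric_name, score in evaluator_result.items():
        print(metric_name, score)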


def __init__(
    self,
    model_type: str = "roberta-large",

Collaborator

I suggest the default be the ubiquitous bert-base-uncased.

Collaborator

Or even the distilled version, as you pointed out in the example: distilbert-base-uncased.

Collaborator Author

Or even the distilled version, as you pointed out in the example: distilbert-base-uncased.

Ah yeah, good catch. Jury's default was actually roberta-large, so I think I went with that. But I agree on setting BERT as the default (BERT for the win, haha).

def __init__(self, device: str = "cpu", debug: bool = False) -> None:
    super().__init__(debug=debug)
    self.device = device

@abstractmethod
def compute(

Collaborator

Thinking out loud: evaluate() seems a more fitting name than compute().

Collaborator Author

Thinking out loud: evaluate() seems a more fitting name than compute().

Here, I actually wanted to differentiate the Evaluator.evaluate(...) and Metric.compute(...) methods; I didn't want to reuse the same interface name for two different components. Do you think we should have duplicate interface names?
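
Roughly, the split looks like this (a conceptual sketch only; the signatures here are assumed, not the actual implementation):

# Conceptual sketch only: Metric exposes compute(), while Evaluator exposes
# evaluate() (and __call__) and fans the inputs out across its metrics.
from typing import Iterable, Mapping


class Metric:
    def compute(self, predictions, references) -> dict:
        raise NotImplementedError


class Evaluator:
    def __init__(self, metrics: Iterable[Metric]) -> None:
        self.metrics = list(metrics)

    def evaluate(self, predictions, references) -> Mapping[str, dict]:
        return {
            type(m).__name__: m.compute(predictions=predictions, references=references)
            for m in self.metrics
        }

    __call__ = evaluate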

Comment on lines +33 to +40
data = load_dataset("squad_v2")[data_type]
data = data.shuffle(seed=42) if shuffle else data
data = data.select(range(nsamples)) if nsamples > 0 else data

inputs = [dict(question=d["question"], context=d["context"]) for d in data]
references = [d["answers"]["text"] for d in data]

inputs, references = zip(*filter(lambda x: len(x[1]) > 0, zip(inputs, references)))
Collaborator

planning to utilize the plausible flag here in the future?

Collaborator Author

planning to utilize the plausible flag here in the future?

Sorry, I didn't understand the question. plausible as in? The dataset loading right now is just a basic function; maybe in the future we could have our own dataset wrapper?

Collaborator

@muthukumaranR left a comment

lgtm

@NISH1001 merged commit e0bbdd9 into main on Mar 2, 2023

Collaborator Author

@NISH1001 commented Mar 2, 2023

I have merged after the initial review. I will open another PR sometime this week with a slightly changed implementation of model initialization.

@NISH1001 deleted the feature/add-test-implementation branch on March 2, 2023 at 21:28