[alpha] Improvements to ModelWrapper and better QA/Classification implementation #8
Conversation
See `evalem.misc.datasets`. We now have a `datasets.get_squad_v2(...)` function.
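For illustration, a minimal sketch of calling it; only the function name comes from this PR, and the keyword arguments are assumptions:

```python
from evalem.misc import datasets

# Hypothetical arguments; adjust to the actual signature.
data = datasets.get_squad_v2(data_type="validation", nsamples=100)
```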
Now we have `metrics.semantics.SemanticMetric`. There are two implementations for now:
- `metrics.semantics.BertScore`
- `metrics.semantics.BartScore`
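A hedged sketch of invoking one of these metrics, assuming a metric instance is callable on parallel lists of predictions and references (the exact interface is not shown in this PR; `BartScore` would be analogous):

```python
from evalem.metrics import BertScore

# Assumption: a metric instance is callable on predictions/references
# and returns a score structure.
bert_score = BertScore()  # per this PR's minor changes, defaults to bert-base-uncased
result = bert_score(
    predictions=["The cat sat on the mat."],
    references=["A cat was sitting on the mat."],
)
```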
We make use of two kwargs to any model wrapper:
- `inputs_preprocessor` (maps inputs to a specific format, defaults to identity)
- `predictions_postprocessor` (maps model outputs to a specific format, defaults to identity)

Also, `models.HFPipelineWrapperForQuestionAnswering` is created and `models.DefaultQAModelWrapper` is deprecated. A sketch of wiring both hooks is shown below.
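The lambdas here are placeholders, and the dict shapes of the inputs and raw outputs are assumptions about the HF QA pipeline format:

```python
from evalem.models import HFPipelineWrapperForQuestionAnswering

wrapper = HFPipelineWrapperForQuestionAnswering(
    # reshape each dataset record into the QA pipeline's expected format
    inputs_preprocessor=lambda inputs: [
        dict(question=i["question"], context=i["context"]) for i in inputs
    ],
    # keep only the answer text from each raw pipeline output (assumed dict shape)
    predictions_postprocessor=lambda preds: [p["answer"] for p in preds],
)
```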
See `models.defaults.TextClassificationHFPipelineWrapper`. This also improves the construction of the HF pipeline object in the existing wrapper. `evaluators.basics.TextClassificationEvaluator` is also added.
This flag is used to return precision/recall/f1 score per prediction instance.
…rics-models Fix merge conflicts
TextClassificationHFPipelineWrapper: previously, the tokenizer was set to some default. However, that is incorrect; we want the tokenizer to be the one the provided model was trained with. So `tokenizer` now defaults to `None`.
""" | ||
nsamples = nsamples or 0 | ||
data = load_dataset("imdb")[data_type] | ||
data = data.shuffle(seed=42) if shuffle else data |
move seed to a config or a constant.
Ah ya. Good call. The framework-level config could be a nice way to manage these seeds.
Can I resolve this in the next PR? It doesn't hamper the behavior of the framework at this point.
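For reference, one hedged sketch of what that follow-up could look like, with a hypothetical module-level constant standing in for a framework-level config (the `get_imdb` signature here is an assumption reconstructed from the excerpt above):

```python
from datasets import load_dataset

# Hypothetical constant; a framework-level config object could hold this instead.
DEFAULT_SEED = 42

def get_imdb(data_type: str = "test", nsamples: int = 0, shuffle: bool = False):
    nsamples = nsamples or 0
    data = load_dataset("imdb")[data_type]
    # seed is no longer hard-coded at the call site
    data = data.shuffle(seed=DEFAULT_SEED) if shuffle else data
    ...
```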
Major Changes
- `evalem.models.QuestionAnsweringHFPipelineWrapper` and `evalem.models.TextClassificationHFPipelineWrapper` are now the main wrappers for QA and Text Classification tasks respectively. An `hf_params` dict is also provided as a parameter that is used for initializing the HF pipeline.
- `evalem.evaluators.TextClassificationEvaluator` has been added with basic metrics for text classification (F1 score, precision, recall, confusion matrix).
- `evalem.models._base.ModelWrapper` now utilizes 2 distinct processing parameters (one for pre-processing and one for post-processing), each of which should be a `Callable` (lambda function, external module that can be called, etc.):
  - `inputs_preprocessor` works on the input dataset and changes inputs to the models. The `ModelWrapper._preprocess_inputs` method is used, which can also be overridden by any downstream sub-class.
  - `predictions_postprocessor` post-processes the model's predictions. The `ModelWrapper._postprocess_predictions` method is used, which can also be overridden by any downstream sub-class.

Minor Changes
- `evalem.models.DefaultQAModelWrapper` has been deprecated. Users will get a `DeprecationWarning` error when trying to initialize the object.
- `evalem.metrics.BertScore` now uses `bert-base-uncased` as the default model instead of `roberta-large`.
- The `evalem.misc.datasets.get_imdb` function has been added to load the IMDB dataset out-of-the-box.

Usage
QA Task
Defaults
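A sketch of the default QA flow, assuming the pieces from this PR compose as shown (the evaluator name, the callable convention, and the dict keys on `data` are all assumptions):

```python
from evalem.misc import datasets
from evalem.models import QuestionAnsweringHFPipelineWrapper
from evalem.evaluators import QAEvaluator  # hypothetical; not named in this PR

data = datasets.get_squad_v2(data_type="validation", nsamples=10)

wrapper = QuestionAnsweringHFPipelineWrapper()  # default HF QA pipeline
predictions = wrapper(data["inputs"])           # assumed callable convention

evaluator = QAEvaluator()
results = evaluator(predictions=predictions, references=data["references"])
```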
Using a custom model and the post-processing functionality
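A sketch of the customized path; `hf_params` and `predictions_postprocessor` come from this PR, while the model checkpoint and the raw output shape are assumptions:

```python
from evalem.models import QuestionAnsweringHFPipelineWrapper

wrapper = QuestionAnsweringHFPipelineWrapper(
    # forwarded to the HF pipeline initializer (per this PR)
    hf_params=dict(model="deepset/roberta-base-squad2"),
    # reduce each raw QA pipeline output (a dict) to the answer string
    predictions_postprocessor=lambda preds: [p["answer"] for p in preds],
)
```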
Text Classification
Defaults
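A sketch mirroring the QA defaults above, using the wrapper and evaluator added in this PR (the `get_imdb` arguments are assumed from the diff excerpt; the callable convention and `data` keys are assumptions):

```python
from evalem.misc import datasets
from evalem.models import TextClassificationHFPipelineWrapper
from evalem.evaluators import TextClassificationEvaluator

data = datasets.get_imdb(data_type="test", nsamples=10)

wrapper = TextClassificationHFPipelineWrapper()  # default HF classification pipeline
predictions = wrapper(data["inputs"])            # assumed callable convention

evaluator = TextClassificationEvaluator()
results = evaluator(predictions=predictions, references=data["references"])
```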
Customized
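A sketch of a customized classification wrapper; the checkpoint name and output shape are assumptions, and note that per this PR the tokenizer defaults to `None` so the model's own tokenizer is used:

```python
from evalem.models import TextClassificationHFPipelineWrapper

wrapper = TextClassificationHFPipelineWrapper(
    # forwarded to the HF pipeline initializer (per this PR)
    hf_params=dict(model="distilbert-base-uncased-finetuned-sst-2-english"),
    # keep only the predicted label from each raw pipeline output (assumed dict shape)
    predictions_postprocessor=lambda preds: [p["label"] for p in preds],
)
```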