[alpha] Implementation of Semantic metrics and more basic metrics #6
Conversation
See `evalem.misc.datasets`. We now have a `datasets.get_squad_v2(...)` function.
We also have `metrics.semantics.SemanticMetric`, with two implementations for now:
- `metrics.semantics.BertScore`
- `metrics.semantics.BartScore`
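For context, a minimal illustrative sketch of loading the dataset with the new helper. The parameter names (`data_type`, `nsamples`, `shuffle`) are taken from the diff reviewed further below, and the `(inputs, references)` return shape is an assumption, not the confirmed API:

```python
from evalem.misc import datasets

# Illustrative call; assumed to return (inputs, references) based on the reviewed diff below.
inputs, references = datasets.get_squad_v2(data_type="validation", nsamples=100, shuffle=True)
```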
```python
def __init__(
    self,
    model_type: str = "roberta-large",
```
I suggest the default be the ubiquitous `bert-base-uncased`.
Or even the distilled version, as you pointed out in the example: `distilbert-base-uncased`.
Ah yeah, good catch. Actually, Jury's default was `roberta-large`, so I think I went with that. But I agree with setting BERT as the default (BERT for the win, haha).
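Whatever default is chosen, callers can still pick a lighter backbone explicitly. A small sketch, assuming only the `model_type` keyword shown in the constructor excerpt above:

```python
from evalem.metrics.semantics import BertScore

default_metric = BertScore()                                    # uses the class default (currently "roberta-large")
light_metric = BertScore(model_type="distilbert-base-uncased")  # distilled alternative discussed above
```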
```python
def __init__(self, device: str = "cpu", debug: bool = False) -> None:
    super().__init__(debug=debug)
    self.device = device
```

```python
@abstractmethod
def compute(
```
Thinking out loud: `evaluate()` seems a more fitting name than `compute()`.
Here, I actually wanted to differentiate the `Evaluator.evaluate(...)` and `Metric.compute(...)` methods. I didn't want to replicate the same interface name across two different components. Do you think we should have duplicate interface names?
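To make the naming split concrete, here is an illustrative sketch (not the actual evalem source) of keeping `evaluate()` on the evaluator and `compute()` on the metric; the signatures are assumptions, since the PR only shows `def compute(`:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List


class Metric(ABC):
    """Illustrative base class; method name mirrors the discussion above."""

    @abstractmethod
    def compute(self, predictions: List[Any], references: List[Any]) -> Dict[str, float]:
        ...


class Evaluator:
    """Top-level component whose evaluate() delegates to each Metric.compute()."""

    def __init__(self, metrics: Iterable[Metric]) -> None:
        self.metrics = list(metrics)

    def evaluate(self, predictions: List[Any], references: List[Any]) -> Dict[str, Dict[str, float]]:
        # The two interface names stay distinct: one per component.
        return {type(m).__name__: m.compute(predictions, references) for m in self.metrics}
```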
```python
data = load_dataset("squad_v2")[data_type]
data = data.shuffle(seed=42) if shuffle else data
data = data.select(range(nsamples)) if nsamples > 0 else data

inputs = [dict(question=d["question"], context=d["context"]) for d in data]
references = [d["answers"]["text"] for d in data]

inputs, references = zip(*filter(lambda x: len(x[1]) > 0, zip(inputs, references)))
```
Planning to utilize the `plausible` flag here in the future?
Sorry, I didn't understand the question. `plausible` as in? The dataset code right now is just basic functions. Maybe in the future we could have our own dataset wrapper?
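For reference, a self-contained sketch that wraps the diff above into a complete helper. This is illustrative only, not the actual evalem source, and the `keep_unanswerable` flag is hypothetical; it just shows one way a "plausible"-style option could be added later, since the current filter drops SQuAD v2 questions with empty references:

```python
from typing import List, Tuple

from datasets import load_dataset


def get_squad_v2(
    data_type: str = "validation",
    nsamples: int = 1000,
    shuffle: bool = False,
    keep_unanswerable: bool = False,  # hypothetical flag, not part of this PR
) -> Tuple[List[dict], List[List[str]]]:
    data = load_dataset("squad_v2")[data_type]
    data = data.shuffle(seed=42) if shuffle else data
    data = data.select(range(nsamples)) if nsamples > 0 else data

    inputs = [dict(question=d["question"], context=d["context"]) for d in data]
    references = [d["answers"]["text"] for d in data]

    if not keep_unanswerable:
        # Drop examples whose reference list is empty (unanswerable SQuAD v2 questions).
        inputs, references = zip(*filter(lambda x: len(x[1]) > 0, zip(inputs, references)))
        inputs, references = list(inputs), list(references)
    return inputs, references
```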
lgtm
I have merged after some initial review. I will do another PR sometime this week with a slightly changed implementation of model initialization.
Major Changes
- `metrics.semantics.SemanticMetric` type is added, from which we have two implementations: `BertScore` and `BartScore`
- `metrics.basics.ExactMatchMetric`: flattens all the references (in the multi-reference scenario) and performs an exact match
- `metrics.basics.ConfusionMatrix`: generates a confusion matrix for the provided set of labels (obtained from both references and predictions) using scikit-learn
- `tests/`: for now, it only tests the `format_to_jury` function (still very basic)

Minor Changes
- `models.defaults.DefaultQAModelWrapper` now accepts the `device` param to run on CPU or GPU
- `models._base.HFPipelineWrapper` has a `pipeline` property that returns the pipeline
- `evalem.misc.datasets.get_squad_v2(...)` utility function is added to load the SQuAD v2 dataset

Usages
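The original usage snippet is not reproduced here; the following is a hedged sketch of how the pieces listed above might fit together. The import paths follow the module names in the description, the `(inputs, references)` return shape and the `compute(predictions=..., references=...)` keyword names are assumptions, and the predictions are placeholders rather than real model outputs:

```python
from evalem.misc import datasets
from evalem.metrics.semantics import BertScore
from evalem.metrics.basics import ExactMatchMetric

# Assumed to return (inputs, references) as built in the reviewed diff.
inputs, references = datasets.get_squad_v2(data_type="validation", nsamples=10, shuffle=True)

# Placeholder predictions; in practice these would come from a model wrapper
# such as models.defaults.DefaultQAModelWrapper(device="cpu").
predictions = [refs[0] for refs in references]

for metric in (BertScore(), ExactMatchMetric()):
    # Keyword names are assumptions; the abstract compute(...) signature isn't shown in this PR.
    print(type(metric).__name__, metric.compute(predictions=predictions, references=references))
```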