
Add support for custom evaluation metrics #190

Merged
brimoor merged 14 commits into main from custom-metrics on Jan 23, 2025

Conversation

@brimoor brimoor (Contributor) commented Dec 16, 2024

Setup

  1. Make sure you are running the custom-metrics branch of your fiftyone install
  2. If you haven't already, install fiftyone-plugins in editable mode:
git clone https://github.com/voxel51/fiftyone-plugins
cd fiftyone-plugins
git checkout --track origin/custom-metrics

ln -s "$(pwd)" "$(fiftyone config plugins_dir)/fiftyone-plugins"

Example usage

example_metric

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

#
# Option 1: App usage
#
# Launch the `evaluate_model` operator from the Operator Browser
# Scroll to the bottom of the modal to choose/configure custom metrics to apply
#
# Use the `delete_evaluation` operator to cleanup an evaluation
#

session = fo.launch_app(dataset)

#
# Option 2: SDK usage
#

# A custom metric to apply
metric = "@voxel51/metric-examples/example_metric"
kwargs = dict(value="spam")

results = dataset.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    eval_key="eval",
    custom_metrics={metric: kwargs},
)

print(dataset.count_values("eval_example_metric"))
results.print_metrics()

dataset.delete_evaluation("eval")

assert not dataset.has_field("eval_example_metric")

absolute_error and squared_error

Image dataset

import random
import numpy as np

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("cifar10", split="test")
dataset.delete_sample_field("ground_truth")

for idx, sample in enumerate(dataset.iter_samples(autosave=True, progress=True), 1):
    ytrue = random.random() * idx
    ypred = ytrue + np.random.randn() * np.sqrt(ytrue)
    confidence = random.random()
    sample["ground_truth"] = fo.Regression(value=ytrue)
    sample["predictions"] = fo.Regression(value=ypred, confidence=confidence)

results = dataset.evaluate_regressions(
    "predictions",
    gt_field="ground_truth",
    eval_key="eval",
    custom_metrics=[
        "@voxel51/metric-examples/absolute_error",
        "@voxel51/metric-examples/squared_error",
    ],
)

print(dataset.bounds("eval_absolute_error"))
print(dataset.bounds("eval_squared_error"))

results.print_metrics()

dataset.delete_evaluation("eval")

assert not dataset.has_field("eval_absolute_error")
assert not dataset.has_field("eval_squared_error")

Video dataset

import random
import numpy as np

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart-video")
dataset.delete_frame_field("detections")

for sample in dataset.iter_samples(autosave=True, progress=True):
    for frame_number, frame in sample.frames.items():
        ytrue = random.random() * frame_number
        ypred = ytrue + np.random.randn() * np.sqrt(ytrue)
        confidence = random.random()
        frame["ground_truth"] = fo.Regression(value=ytrue)
        frame["predictions"] = fo.Regression(
            value=ypred, confidence=confidence
        )

results = dataset.evaluate_regressions(
    "frames.predictions",
    gt_field="frames.ground_truth",
    eval_key="eval",
    custom_metrics=[
        "@voxel51/metric-examples/absolute_error",
        "@voxel51/metric-examples/squared_error",
    ],
)

print(dataset.mean("eval_absolute_error"))
print(dataset.mean("eval_squared_error"))
print(dataset.bounds("frames.eval_absolute_error"))
print(dataset.bounds("frames.eval_squared_error"))

results.print_metrics()

dataset.delete_evaluation("eval")

assert not dataset.has_field("eval_absolute_error")
assert not dataset.has_field("eval_squared_error")
assert not dataset.has_field("frames.eval_absolute_error")
assert not dataset.has_field("frames.eval_squared_error")

@@ -212,6 +230,56 @@ def _get_evaluation_type(view, pred_field):
    return label_type, eval_type, methods


def _add_custom_metrics(ctx, inputs, eval_type, method):
    supported_metrics = []
    for operator in foo.list_operators(type="operator"):
Contributor

This really should just be baked in... having the contract defined here is not maintainable. Something like

foo.list_operators_with_metadata(type="operator", eval_type=eval_type, method=method)

Contributor

Although the logic here should be colocated with the custom metric implementation, so that part is fine for now.

@brimoor brimoor (Contributor Author) Dec 17, 2024

Yes, I agree; I was just getting something working to start.

More generally, we should have a think about the concept of "operator classes" like EvaluationMetric(Operator) introduced here and how we want that to work holistically. For example, should the concept of tags be in fiftyone.yml rather than OperatorConfig so that we can access information about the type of operators without having to instantiate them?

Contributor

I would prefer to keep an EvaluationMetric operator agnostic to eval_type and method here since certain metrics may not fall under any of the eval methods or may be applicable to more than one eval method.

To start with, users should be free to choose whichever custom metrics they want, and if a metric is not applicable to the eval method under consideration, it should fail gracefully (which is already the case).

@brimoor brimoor (Contributor Author) Jan 10, 2025

The implementation here does allow for eval_type and method to be optional.

If the EvaluationMetric defines them, then the model evaluation operator will only show the metric as an option in the dropdown if the user is executing the matching type/method evaluation.

If the EvaluationMetric does not define them, then the metric will always be included as an available metric in the dropdown.
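For illustration only, a minimal sketch of the two cases just described; the import path and the exact attribute names (eval_type, method) and where they are declared are assumptions for this example, not the final API:

import fiftyone.operators as foo
from fiftyone.operators import EvaluationMetric  # assumed import path


class RegressionOnlyMetric(EvaluationMetric):
    """Only offered in the dropdown for matching regression evaluations."""

    @property
    def config(self):
        return foo.OperatorConfig(
            name="regression_only_metric",
            label="Regression-only metric",
            # assumed: declaring these restricts when the metric is listed
            eval_type="regression",
            method="simple",
        )


class AnyEvaluationMetric(EvaluationMetric):
    """Offered in the dropdown for every evaluation."""

    @property
    def config(self):
        # assumed: omitting eval_type/method means the metric is always listed
        return foo.OperatorConfig(
            name="any_evaluation_metric",
            label="Any-evaluation metric",
        )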

@manushreegangwar manushreegangwar (Contributor) Jan 11, 2025

Your design works fine for what we currently have as evaluation methods.

I was thinking about evaluation methods that don't fall under the ones currently available. For example, if I am evaluating a neural rendering model, I may want to reuse SquaredErrorMetric, which is of type "regression", and also "SSIM", which is not of type "regression".

@brimoor brimoor (Contributor Author) Jan 11, 2025

Ah, so you want a way to declare that an evaluation metric should be available to multiple evaluation types, i.e. type=["regression", "neural-rendering"]?

That's certainly possible.

@brimoor brimoor (Contributor Author) Jan 11, 2025

An alternative model would be that authors of metrics encode the semantics of the metric in tags:

class MyEvaluationMetric(EvaluationMetric):
    @property
    def config(self):
        return foo.OperatorConfig(
            ...
            tags=["metric", "regression"],
        )

and then when new evaluation protocols are added, they have a way to declare that they support any custom metrics with certain tag(s):

def evaluate_neural_rendering(...):
    supported_metric_tags = ["regression", ...]
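Purely as an illustration of that filtering idea, a rough sketch of how a protocol might select applicable metrics by tag; the helper name and the assumption that tags passed to OperatorConfig are accessible on operator.config are hypothetical, not part of this PR:

import fiftyone.operators as foo


def _list_supported_metrics(supported_metric_tags):
    # Hypothetical helper: return metric operators whose declared tags
    # overlap with the tags this evaluation protocol supports
    supported = []
    for operator in foo.list_operators(type="operator"):
        # assumed: tags passed to OperatorConfig are exposed on the config
        tags = set(getattr(operator.config, "tags", None) or [])
        if "metric" in tags and tags & set(supported_metric_tags):
            supported.append(operator)

    return supported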

Contributor

The alternative model looks good. I am assuming we will have a lot more custom metrics than evaluation methods, so my preference is to let the evaluation method choose the metrics it supports.

Contributor

A few comments for the final implementation:

  1. We should require metrics to subclass EvaluationMetric and check isinstance(operator, EvaluationMetric)
  2. We should define the EvaluationMetric-specific configuration options in an EvaluationMetricConfig class
  3. Instead of "tags", name the param metric_protocols to keep it specific and simple
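A rough sketch of the shape being proposed above; the class and parameter names follow the comment, but the constructor signature and everything else are assumptions for illustration, not the merged implementation:

import fiftyone.operators as foo


class EvaluationMetricConfig(foo.OperatorConfig):
    # Configuration specific to evaluation metric operators (sketch)
    def __init__(self, metric_protocols=None, **kwargs):
        super().__init__(**kwargs)
        # which evaluation protocols this metric applies to, eg ["regression"]
        self.metric_protocols = metric_protocols


class EvaluationMetric(foo.Operator):
    # Base class that all custom evaluation metrics must subclass (sketch)
    pass


def _is_supported_metric(operator):
    # require proper subclassing rather than relying on a tag convention
    return isinstance(operator, EvaluationMetric)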

@brimoor brimoor marked this pull request as ready for review January 21, 2025 05:08
@brimoor brimoor (Contributor Author) commented Jan 21, 2025

@ritch okay I've implemented what you recommended here in these commits:

@brimoor brimoor requested a review from ritch January 21, 2025 05:09
@@ -21,6 +21,8 @@ def config(self):
            name="example_metric",
            label="Example metric",
            description="An example evaluation metric",
            aggregate_key="example",
            unlisted=True,
Contributor

What's unlisted for?

Contributor Author

Operators that are unlisted=True will not appear in the Operator Browser.

All metrics should be marked as unlisted because they aren't intended for direct execution like that (which requires implementing the resolve_input() and execute() methods).

@brimoor brimoor (Contributor Author) Jan 23, 2025

@ritch and I discussed wanting to have a way to automatically mark all EvaluationMetric operators as unlisted, but that's not implemented yet.

@brimoor brimoor merged commit 057e645 into main Jan 23, 2025
@brimoor brimoor deleted the custom-metrics branch January 23, 2025 17:26
Labels: feature (Work on a feature request)
3 participants