
Add support for custom evaluation metrics #190

Merged
brimoor merged 14 commits into main from custom-metrics on Jan 23, 2025

Conversation

@brimoor brimoor (Contributor) commented Dec 16, 2024

Setup

  1. Make sure you are running the custom-metrics branch of your fiftyone install
  2. If you haven't already, install fiftyone-plugins in editable mode:
git clone https://github.com/voxel51/fiftyone-plugins
cd fiftyone-plugins
git checkout --track origin/custom-metrics

ln -s "$(pwd)" "$(fiftyone config plugins_dir)/fiftyone-plugins"

Example usage

example_metric

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

#
# Option 1: App usage
#
# Launch the `evaluate_model` operator from the Operator Browser
# Scroll to the bottom of the modal to choose/configure custom metrics to apply
#
# Use the `delete_evaluation` operator to cleanup an evaluation
#

session = fo.launch_app(dataset)

#
# Option 2: SDK usage
#

# A custom metric to apply
metric = "@voxel51/metric-examples/example_metric"
kwargs = dict(value="spam")

results = dataset.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    eval_key="eval",
    custom_metrics={metric: kwargs},
)

print(dataset.count_values("eval_example_metric"))
results.print_metrics()

dataset.delete_evaluation("eval")

assert not dataset.has_field("eval_example_metric")

absolute_error and squared_error

Image dataset

import random
import numpy as np

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("cifar10", split="test")
dataset.delete_sample_field("ground_truth")

for idx, sample in enumerate(dataset.iter_samples(autosave=True, progress=True), 1):
    ytrue = random.random() * idx
    ypred = ytrue + np.random.randn() * np.sqrt(ytrue)
    confidence = random.random()
    sample["ground_truth"] = fo.Regression(value=ytrue)
    sample["predictions"] = fo.Regression(value=ypred, confidence=confidence)

results = dataset.evaluate_regressions(
    "predictions",
    gt_field="ground_truth",
    eval_key="eval",
    custom_metrics=[
        "@voxel51/metric-examples/absolute_error",
        "@voxel51/metric-examples/squared_error",
    ],
)

print(dataset.bounds("eval_absolute_error"))
print(dataset.bounds("eval_squared_error"))

results.print_metrics()

dataset.delete_evaluation("eval")

assert not dataset.has_field("eval_absolute_error")
assert not dataset.has_field("eval_squared_error")

Video dataset

import random
import numpy as np

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart-video")
dataset.delete_frame_field("detections")

for sample in dataset.iter_samples(autosave=True, progress=True):
    for frame_number, frame in sample.frames.items():
        ytrue = random.random() * frame_number
        ypred = ytrue + np.random.randn() * np.sqrt(ytrue)
        confidence = random.random()
        frame["ground_truth"] = fo.Regression(value=ytrue)
        frame["predictions"] = fo.Regression(
            value=ypred, confidence=confidence
        )

results = dataset.evaluate_regressions(
    "frames.predictions",
    gt_field="frames.ground_truth",
    eval_key="eval",
    custom_metrics=[
        "@voxel51/metric-examples/absolute_error",
        "@voxel51/metric-examples/squared_error",
    ],
)

print(dataset.mean("eval_absolute_error"))
print(dataset.mean("eval_squared_error"))
print(dataset.bounds("frames.eval_absolute_error"))
print(dataset.bounds("frames.eval_squared_error"))

results.print_metrics()

dataset.delete_evaluation("eval")

assert not dataset.has_field("eval_absolute_error")
assert not dataset.has_field("eval_squared_error")
assert not dataset.has_field("frames.eval_absolute_error")
assert not dataset.has_field("frames.eval_squared_error")

@@ -212,6 +230,56 @@ def _get_evaluation_type(view, pred_field):
    return label_type, eval_type, methods


def _add_custom_metrics(ctx, inputs, eval_type, method):
    supported_metrics = []
    for operator in foo.list_operators(type="operator"):
Contributor

This really should just be baked in... having the contract defined here is not maintainable. Something like

foo.list_operators_with_metadata(type="operator", eval_type=eval_type, method=method)

Contributor

Although the logic here should be colocated with the custom metric implementation, so that part is fine for now.

@brimoor brimoor (Contributor Author) Dec 17, 2024

Yes, I agree; I was just getting something working to start.

More generally, we should have a think about the concept of "operator classes" like EvaluationMetric(Operator) introduced here and how we want that to work holistically. For example, should the concept of tags be in fiftyone.yml rather than OperatorConfig so that we can access information about the type of operators without having to instantiate them?

Contributor

I would prefer to keep an EvaluationMetric operator agnostic to eval_type and method here since certain metrics may not fall under any of the eval methods or may be applicable to more than one eval method.

To start with, users should be free to choose whichever custom metrics they want, and if a metric is not applicable to the eval method under consideration, it should fail gracefully (which is already the case).

@brimoor brimoor (Contributor Author) Jan 10, 2025

The implementation here does allow for eval_type and method to be optional.

If the EvaluationMetric defines them, then the model evaluation operator will only show the metric as an option in the dropdown if the user is executing the matching type/method evaluation.

If the EvaluationMetric does not define them, then the metric will always be included as an available metric in the dropdown.
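For illustration only, a minimal sketch of the two cases just described; the import path and the exact attribute names (eval_type, method) and where they are declared are assumptions for this example, not the final API:

import fiftyone.operators as foo
from fiftyone.operators import EvaluationMetric  # assumed import path


class RegressionOnlyMetric(EvaluationMetric):
    """Only offered in the dropdown for matching regression evaluations."""

    @property
    def config(self):
        return foo.OperatorConfig(
            name="regression_only_metric",
            label="Regression-only metric",
            # assumed: declaring these restricts when the metric is listed
            eval_type="regression",
            method="simple",
        )


class AnyEvaluationMetric(EvaluationMetric):
    """Offered in the dropdown for every evaluation."""

    @property
    def config(self):
        # assumed: omitting eval_type/method means the metric is always listed
        return foo.OperatorConfig(
            name="any_evaluation_metric",
            label="Any-evaluation metric",
        )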

@manushreegangwar manushreegangwar (Contributor) Jan 11, 2025

Your design works fine for what we currently have as evaluation methods.

I was thinking about evaluation methods that don't fall under the ones currently available. For example, if I am evaluating a neural rendering model, I may want to reuse SquaredErrorMetric, which is of type "regression", and also "SSIM", which is not of type "regression".

@brimoor brimoor (Contributor Author) Jan 11, 2025

Ah, so you want a way to declare that an evaluation metric should be available to multiple evaluation types, i.e. type=["regression", "neural-rendering"]?

That's certainly possible.

@brimoor brimoor (Contributor Author) Jan 11, 2025

An alternative model would be that authors of metrics encode the semantics of the metric in tags:

class MyEvaluationMetric(EvaluationMetric):
    @property
    def config(self):
        return foo.OperatorConfig(
            ...
            tags=["metric", "regression"],
        )

and then when new evaluation protocols are added, they have a way to declare that they support any custom metrics with certain tag(s):

def evaluate_neural_rendering(...):
    supported_metric_tags = ["regression", ...]
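Purely as an illustration of that filtering idea, a rough sketch of how a protocol might select applicable metrics by tag; the helper name and the assumption that tags passed to OperatorConfig are accessible on operator.config are hypothetical, not part of this PR:

import fiftyone.operators as foo


def _list_supported_metrics(supported_metric_tags):
    # Hypothetical helper: return metric operators whose declared tags
    # overlap with the tags this evaluation protocol supports
    supported = []
    for operator in foo.list_operators(type="operator"):
        # assumed: tags passed to OperatorConfig are exposed on the config
        tags = set(getattr(operator.config, "tags", None) or [])
        if "metric" in tags and tags & set(supported_metric_tags):
            supported.append(operator)

    return supported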

Contributor

The alternative model looks good. I am assuming we will have a lot more custom metrics than evaluation methods, so my preference is to let the evaluation method choose the metrics it supports.

Contributor

A few comments for the final implementation:

  1. We should require metrics to subclass EvaluationMetric and check isinstance(operator, EvaluationMetric)
  2. We should define the EvaluationMetric-specific configuration options in an EvaluationMetricConfig class
  3. Instead of "tags", name the param metric_protocols to keep it specific and simple
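A rough sketch of the shape being proposed above; the class and parameter names follow the comment, but the constructor signature and everything else are assumptions for illustration, not the merged implementation:

import fiftyone.operators as foo


class EvaluationMetricConfig(foo.OperatorConfig):
    # Configuration specific to evaluation metric operators (sketch)
    def __init__(self, metric_protocols=None, **kwargs):
        super().__init__(**kwargs)
        # which evaluation protocols this metric applies to, eg ["regression"]
        self.metric_protocols = metric_protocols


class EvaluationMetric(foo.Operator):
    # Base class that all custom evaluation metrics must subclass (sketch)
    pass


def _is_supported_metric(operator):
    # require proper subclassing rather than relying on a tag convention
    return isinstance(operator, EvaluationMetric)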

@brimoor brimoor marked this pull request as ready for review January 21, 2025 05:08
@brimoor brimoor (Contributor Author) commented Jan 21, 2025

@ritch okay I've implemented what you recommended here in these commits:

@brimoor brimoor requested a review from ritch January 21, 2025 05:09
@@ -21,6 +21,8 @@ def config(self):
            name="example_metric",
            label="Example metric",
            description="An example evaluation metric",
            aggregate_key="example",
            unlisted=True,
Contributor

What's unlisted for?

Contributor Author

Operators that are unlisted=True will not appear in the Operator Browser.

All metrics should be marked as unlisted because they aren't intended for direct execution like that (which requires implementing the resolve_input() and execute() methods).

@brimoor brimoor (Contributor Author) Jan 23, 2025

@ritch and I discussed wanting to have a way to automatically mark all EvaluationMetric operators as unlisted, but that's not implemented yet.

@brimoor brimoor merged commit 057e645 into main Jan 23, 2025
@brimoor brimoor deleted the custom-metrics branch January 23, 2025 17:26
Labels: feature (Work on a feature request)
3 participants