
Add support for stateful metrics. #9253

Merged: 7 commits merged into master on Feb 8, 2018
Conversation

@fchollet (Collaborator)

cc @Dref360 @brge17: please check that it looks satisfactory (in particular the UX). Check out the BinaryTruePositives example class in metrics_test.py.
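For reviewers who want to try it quickly, a minimal usage sketch might look like the following (the import path for BinaryTruePositives is hypothetical and only for illustration; in this PR the class is defined in tests/keras/metrics_test.py):

import keras
from keras.layers import Dense

# Hypothetical import path, assumed for illustration only.
from metrics_test import BinaryTruePositives

inputs = keras.Input(shape=(2,))
outputs = Dense(1, activation='sigmoid')(inputs)
model = keras.Model(inputs, outputs)

# A stateful metric is passed as a Layer instance alongside regular metric names.
model.compile(optimizer='sgd',
              loss='binary_crossentropy',
              metrics=['acc', BinaryTruePositives()])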

@brge17 (Contributor) commented Jan 31, 2018

  1. The progbar does not correctly display training metrics (it still does averaging under the hood). Not sure if this is in scope for the initial PR. Callbacks that reference the training metrics do receive the correct value, so it is purely cosmetic.

Otherwise looks good to me.

@fchollet (Collaborator, Author) commented Jan 31, 2018

In general I am not happy with the fact that we have a special handling of stateful metrics in callbacks (and would need the same in the progbar as well). Could we come up with a simple and elegant design to support both stateful metrics and samplewise metrics, I wonder?

An obvious solution would be to cast all metrics as stateful, but that would not be backwards compatible with user-written metrics. chin scratch emoji

@brge17 (Contributor) commented Jan 31, 2018

Option 2, which I thought of: if the metric is non-stateful, write a wrapper that makes it stateful.

That way, losses that double as metrics don't need both a loss and a metric implementation.

@fchollet (Collaborator, Author)

That's a good idea:

  • metric is a layer: assume it's stateful
  • it's a function: wrap it into a layer that does the averaging

But is it a good UX?

# Reset stateful metrics
for m in self.metrics:
    if isinstance(m, Layer):
        m.reset_states()
Contributor:

Wouldn't this lead to a cryptic error if someone forgets to implement reset_states?

The only assumption here is that m is a Layer, and there is no way of knowing which metrics are stateful. I would like to see a compromise between your approach and brge17's approach, something like:

class StatefulMetric(Layer):
    def reset_states(self):
        raise NotImplementedError

Contributor:

As a user, what can I do now that I can feed a Layer to the metrics argument? Can I feed any Layer? (No, but some will try.)

Collaborator (Author):

Good point

Contributor:

I prefer the middle ground as @Dref360 mentioned.

But, I'm fine either way.

@brge17 (Contributor) commented Jan 31, 2018

From a UX perspective, if you have a non-stateful metric, nothing changes: it's handled under the hood, and without digging into the code you're none the wiser.

Now, if you are in the other camp and need a stateful metric, this is great UX because you don't have to write a really hacky callback (which is what I had previously done).

That's my 2 cents.

@fchollet (Collaborator, Author)

So to sum up, we could have a world where:

  • callbacks and progbar do no averaging (and are agnostic to how metrics work)
  • there's a Metric class, which is stateful
  • metric functions get wrapped into a SamplewiseMetric subclass which does the averaging (returns sum(val * batch) / samples_seen_so_far, like we do now)
  • there's a unified API for interacting with metrics in training.py

Let's look into it?
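A rough sketch of what such a wrapper might look like (the SamplewiseMetric name and the bookkeeping details below are assumptions for illustration, not what training.py actually does):

from keras import backend as K
from keras.engine.topology import Layer


class SamplewiseMetric(Layer):
    """Hypothetical wrapper: turns a stateless metric fn(y_true, y_pred)
    into a stateful layer that reports the running average over an epoch."""

    def __init__(self, fn, name=None, **kwargs):
        super(SamplewiseMetric, self).__init__(name=name or fn.__name__, **kwargs)
        self.stateful = True
        self.fn = fn
        self.total = K.variable(0.)  # sum of batch_value * batch_size
        self.count = K.variable(0.)  # samples seen so far

    def reset_states(self):
        K.set_value(self.total, 0.)
        K.set_value(self.count, 0.)

    def __call__(self, y_true, y_pred):
        batch_size = K.cast(K.shape(y_true)[0], K.floatx())
        batch_value = K.mean(self.fn(y_true, y_pred))
        self.add_update([K.update_add(self.total, batch_value * batch_size),
                         K.update_add(self.count, batch_size)],
                        inputs=[y_true, y_pred])
        # Return an expression rather than the variables being updated
        # (see the Theano discussion further down).
        return (self.total + batch_value * batch_size) / (self.count + batch_size)

With something along these lines, a plain function passed to metrics= could be wrapped automatically, while Layer instances would be used as-is.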

@fchollet (Collaborator, Author)

For progbar backwards compatibility, averaging should be made an option (with old behavior being the default).

@brge17 (Contributor) commented Jan 31, 2018

A couple of things:

It looks like the CNTK tests are failing.

Is this just for the loss?

> For progbar backwards compatibility, averaging should be made an option (with old behavior being the default).

Otherwise:

> metric functions get wrapped into a SamplewiseMetric subclass which does the averaging (returns sum(val * batch) / samples_seen_so_far, like we do now)

eliminates the need.

I don't have any free cycles left this week. But, I would be happy to work on this next week.

@brge17 (Contributor) commented Jan 31, 2018

The Theano StatefulMetric computing incorrectly is perplexing...

@fchollet (Collaborator, Author) commented Feb 1, 2018

> The Theano StatefulMetric computing incorrectly is perplexing...

Weird, especially since the test passes with Theano for me locally...

@fchollet (Collaborator, Author) commented Feb 1, 2018

The test failure with Theano is non-deterministic (depends on the training data). This does not happen with TF.

@fchollet (Collaborator, Author) commented Feb 1, 2018

It's a graph dependency issue; true_positives may be returned before or after the last update has been run. Does not happen with TF.

@fchollet (Collaborator, Author) commented Feb 1, 2018

Fixed by not directly returning the variable being updated. It's kind of a subtle issue. Not happening with TF because we deliberately set updates in the backend to be run after the outputs (in Function).
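For illustration, the fixed pattern looks roughly like this (a sketch loosely based on the BinaryTruePositives example in metrics_test.py; only the __call__ method is shown, and self.true_positives is assumed to be a backend variable created in __init__ and zeroed in reset_states):

def __call__(self, y_true, y_pred):
    y_true = K.cast(y_true, 'int32')
    y_pred = K.cast(K.round(y_pred), 'int32')
    correct_preds = K.cast(K.equal(y_pred, y_true), 'int32')
    batch_true_pos = K.cast(K.sum(correct_preds * y_true), 'int32')

    # Snapshot the pre-update value as an expression...
    current_true_pos = self.true_positives * 1
    # ...then queue the state update...
    self.add_update(K.update_add(self.true_positives, batch_true_pos),
                    inputs=[y_true, y_pred])
    # ...and return an expression instead of the variable itself, so the
    # result does not depend on whether the update has already run.
    return current_true_pos + batch_true_pos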

@Dref360 (Contributor) commented Feb 1, 2018

Is there a way to fix it in theano_backend? Otherwise, users will make the same mistake.

@fchollet (Collaborator, Author) commented Feb 1, 2018

@Dref360 is there a Theano API to specify the order in which ops are to be executed? Similar to control_dependencies in TF. If yes, we can do it. If not, never mind. Theano is EOL anyway.

@fchollet (Collaborator, Author) commented Feb 1, 2018

A complication is that current callbacks expect to be passed batch-wise metrics in on_batch_end, rather than current averages. In that regard, the plan above would be a breaking behavior change.

@fchollet (Collaborator, Author) commented Feb 1, 2018

This behavior change isn't necessarily a huge deal; for instance no built-in callback meaningfully leverages batch metrics today. Any user of that would be an advanced user that could deal with the change.

I think we will merge this PR (which doesn't affect any current behavior or API, other than enabling the workflow described in the unit test) then investigate the plan described above.

@fchollet (Collaborator, Author) commented Feb 1, 2018

There's a complex interaction with losses that makes it impossible to make all metrics stateful (losses, including the total loss, are metrics too).

The only way forward as I see it is to have a formal distinction between samplewise metrics and stateful metrics, and to hand down this information to the BaseLogger and Progbar in a clean way.

@fchollet (Collaborator, Author) commented Feb 1, 2018

One more reason is that going all-stateful would break the output of train_on_batch, test_on_batch (which are supposed to return batch-wise metrics). This is an API that quite a few people use.

@brge17 (Contributor) commented Feb 2, 2018

Sounds good to me.

The only people who are going to use stateful metrics in the short term also know that they don't work with the progbar yet. But they do work with the other callbacks (which is more important).

@fchollet (Collaborator, Author) commented Feb 2, 2018

Added clean support for logging stateful metrics. Now properly handled by the progbar. Also did some refactoring while I was at it. PTAL.

@brge17 (Contributor) commented Feb 2, 2018

Very nice. The updated progbar is a nice touch.

I double checked a few things locally:

  1. Train metrics in the progress bar at the end of the epoch match the logs and values passed to other callbacks.

  2. Validation metrics in the progress bar at the end of the epoch match the logs and values passed to other callbacks.

  3. Multiple stateful metrics of the same name work.

  4. Adding params works.

  5. Multiple stateful metrics of the same name with different params work.

This is very exciting :)

@ahundt (Contributor) commented Feb 5, 2018

This discussion more or less took place in: #9200 #8657

@brge17 Thanks for the issue links, I hadn't seen them!

@fchollet two stateful metrics API UX questions:

  1. Can tf.metrics and tf.contrib.metrics mentioned in the linked discussion easily be adapted to the proposed API, particularly the streaming_ versions?
  2. Can we design the stateful metrics API UX to look like a streaming statistical API such as boltons.statsutils?

Item 2 is what I was really thinking of and it is a much clearer example than when I posted linking tqdm.

@briannemsick (Contributor) commented Feb 5, 2018

  1. That's the whole point of this PR: to support arbitrary metrics.

With stateful metrics, you can compute any function of y_pred and y_true.

That's why I'm so desperately trying to get it through. (@brge17 is my work GitHub account.)

  2. We can't discuss specific stateful metrics until we have support for the general case (we can't run before we can walk).

Step 1 is to support stateful metrics (this PR); step 2 is to write the stateful metrics from tf.metrics that users commonly want (follow-on PRs). At the very least, users can write their own in the meantime.

This API is functionally identical to how TensorFlow metrics work under the hood. We are replicating that functionality.

@briannemsick (Contributor) commented Feb 5, 2018

Here's an example to stress point 1.

Say you want to have a metric like True Positive Rate:

Option 1: (Use states wisely) Save the number of True Positives and the number of Positives in the states. Each batch return True Positives/Positives.

Option 2: (Naive brute force) cache all predictions and recompute the metric every batch.

The tf.metrics you referenced always do option 1 (or option 3: an approximate metric that is more compute/memory efficient; see the TF AUC implementation). That is how we will implement AUC, precision, recall, the confusion matrix, etc. The user always has the option to do 2 (although it is super inefficient and a huge waste of memory).

As it stands without this PR, you have no option to use metrics like this at all, because metrics are only batch-wise averages. The workaround is inefficient/hacky code in a custom callback.

As the PR currently stands, nothing changes if you are using non-stateful metrics. It just enables the future development of stateful metrics.
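To make option 1 concrete, a stateful true-positive-rate metric might be sketched like this (written in the style of the BinaryTruePositives test class; the class name, variable names, and epsilon guard are assumptions for illustration):

from keras import backend as K
from keras.engine.topology import Layer


class BinaryTruePositiveRate(Layer):
    """Option 1 sketch: keep true-positive and positive counts as state
    and return their ratio each batch."""

    def __init__(self, name='tpr', **kwargs):
        super(BinaryTruePositiveRate, self).__init__(name=name, **kwargs)
        self.stateful = True
        self.true_positives = K.variable(0.)
        self.positives = K.variable(0.)

    def reset_states(self):
        K.set_value(self.true_positives, 0.)
        K.set_value(self.positives, 0.)

    def __call__(self, y_true, y_pred):
        y_true = K.cast(y_true, K.floatx())
        y_pred = K.cast(K.round(y_pred), K.floatx())
        batch_tp = K.sum(y_true * y_pred)
        batch_pos = K.sum(y_true)
        current_tp = self.true_positives * 1
        current_pos = self.positives * 1
        self.add_update([K.update_add(self.true_positives, batch_tp),
                         K.update_add(self.positives, batch_pos)],
                        inputs=[y_true, y_pred])
        # Running ratio over the epoch; no per-sample caching required.
        return (current_tp + batch_tp) / (current_pos + batch_pos + K.epsilon())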

@ahundt (Contributor) commented Feb 5, 2018

> Here's an example to stress point 1.
> Say you want to have a metric like True Positive Rate:
> Option 1: (Use states wisely) Save the number of True Positives and the number of Positives in the states. Each batch return True Positives/Positives.

tf.contrib.metrics.streaming_true_positives does Option 1; see the functions prefixed with streaming_. We're on the same page here: I program robots, which are basically real-time streaming data sources. :-)

> That's why I'm so desperately trying to get it through.

@briannemsick I certainly don't want to put you in a bind; could you check out this PR locally now and then update to the final API when it is released?

François did specifically ask for UX review:

> please check that it looks satisfactory (in particular the UX).

@fchollet (Collaborator, Author) commented Feb 6, 2018

> ask for UX review

The progbar is a purely internal API and thus not part of the UX of this feature. The UX is basically just the experience of writing stateful metrics, and the consistency of what we log on screen.

@ahundt (Contributor) commented Feb 6, 2018

> The progbar is a purely internal API and thus not part of the UX of this feature. The UX is basically just the experience of writing stateful metrics, and the consistency of what we log on screen.

@fchollet I promise I'm not trying to waste time discussing the progbar! I'm very sorry I originally mixed in another topic.

I'm actually trying to propose a composable stateful metric design based on statistical streaming APIs; it should only be a slight tweak. Please see #9253 (comment), which is three comments up.

@dfridovi left a comment:

I really like this idea! Supporting stateful metrics like this is going to be really useful for applications where pure classification errors like cross entropy aren't enough. I also think the interface is pretty clean.

@brge17 (Contributor) commented Feb 6, 2018

@ahundt

  1. The API we are proposing is equivalent and consistent with how these Stateful Metrics are implemented in TensorFlow.

  2. The API you are proposing is un-Keras-like. Besides the property additions (which should not be required), the proposed change is a rename of core functions that are shared between metrics, losses, and layers: update is the same as __call__, clear is the same as reset_states. It does not make sense to make the API inconsistent with the rest of Keras.

Can we get a review from @Kritikalcoder, @ozabluda, or @hasded?

or can we...:

> I think we will merge this PR (which doesn't affect any current behavior or API, other than enabling the workflow described in the unit test) then investigate the plan described above.

@ozabluda (Contributor) commented Feb 6, 2018

Sorry for the monotonous drone, and sorry for not being able to really follow this (and related) discussions until I can understand the following:

  1. Would we/users be able to just plug in arbitrary TF and sklearn metrics (see comments below), almost all of which have the signature blah(labels, predictions)? Or would we/users have to reimplement them all forever?
  2. ... for example confusion matrix, which is my first UX test case to make sure it all works OK.

#8657 (comment)
#8657 (comment)

@brge17 (Contributor) commented Feb 6, 2018

@ozabluda

The answer is no, but that's also the same answer in TensorFlow.

The reason is that to wrap any* arbitrary function def my_func(y_true, y_pred), you would have to cache the tensors of all predictions, which is memory-inefficient and poor design.

This will get you the confusion matrix, because I will personally be implementing these metrics (and I will post them publicly). But they will be carefully written to store a state and update it accurately without caching every prediction.

It may be unsatisfying, but the simple, easy solution is compute- and memory-inefficient and not production grade, as @fchollet highlighted in #8657.

This is a step forward :/

The benchmark in my mind is that currently you can't implement them even if you wanted to...

@ahundt (Contributor) commented Feb 6, 2018

What do you think of the following?

tp = BinaryTruePositive()
fp = BinaryFalsePositive()
tpfp = Add()(tp, fp)
precision = Divide()(tp, tpfp)
recall = Lambda(lambda x, y: tf.contrib.metrics.streaming_recall(x, y))

# Test on simple model
inputs = keras.Input(shape=(2,))
outputs = keras.layers.Dense(1, activation='sigmoid')(inputs)
model = keras.Model(inputs, outputs)
model.compile(optimizer='sgd',
              loss='binary_crossentropy',
              metrics=['acc', precision, recall])

> The reason is, to wrap an arbitrary function def my_func(y_true, y_pred) you would have to cache the tensors of all predictions.

This assertion isn't correct for streaming algorithms.

@brge17 (Contributor) commented Feb 6, 2018

> Say you want to have a metric like True Positive Rate:
>
> Option 1: (Use states wisely) Save the number of True Positives and the number of Positives in the states. Each batch return True Positives/Positives.
>
> Option 2: (Naive brute force) cache all predictions and recompute the metric every batch.

It should say any, not an.

@ozabluda (Contributor) commented Feb 6, 2018

> The reason is, to wrap an arbitrary function def my_func(y_true, y_pred) you would have to cache the tensors of all predictions which is memory inefficient and poor design.

How about a simpler question for starters: TF metrics only. If TF can do it, so can Keras, right? Also, TF can optionally put them into CPU RAM, right?

Now sklearn metrics. Those have to be accumulated on CPU anyway. Again, if sklearn can do it, so can Keras, right?

If a user runs out of memory for either one of those, it's no different than running out of mem due to minibatch being too large or whatever. Then you solve/optimize it, but not prematurely.

Some metrics can be computed incrementally, batch by batch. The confusion matrix is one of those, as are the myriad metrics that follow from it. So is ROC/AUC. The TF metrics docs do often mention "estimation of the metric over a stream of data ...", which should be accommodated.

@ahundt (Contributor) commented Feb 6, 2018

> It's like we designed the API to replicate what they did under the hood?

Yes, that's a good idea. We should do it all under the hood sticking as closely as possible to existing Keras conventions, assuming that's viable.

@brge17 (Contributor) commented Feb 6, 2018

@ozabluda

TensorFlow has stateful metrics because the framework supports state: they have reset ops, they have updates.

TensorFlow does not take any arbitrary metric of the form my_metric(y_true, y_pred). They have pre-built, polished metrics with efficient stateful representations, and require the user to write their own otherwise.

Sklearn is not a deep learning library and does not compute in batches, so that's an unfair comparison.

This PR enables all the TensorFlow streaming metrics; they just are not all implemented yet. In the tests, there is BinaryTruePositives. With that as a template you can implement the confusion matrix, precision, recall, AUC, etc., or wait for the follow-on PR to this one where we implement the TF streaming metrics.

@brge17 (Contributor) commented Feb 6, 2018

@ozabluda

> How about a simpler question for starters: TF metrics only. If TF can do it, so can Keras, right?

See below: what you are claiming TF implements, it can implement only if the user writes tensor operations for that specific metric.

> TF metrics docs do often mention "estimation of the metric over a stream of data ...", which should be accommodated.

That's verbatim what the PR allows you to do.

The metric receives the data from the batch, the state is updated, and each epoch it is auto-reset.

TensorFlow can support metrics of that form. It does not implement arbitrary metrics of that form. Subtle difference. In TensorFlow, the metric has to be a pure tensor operation (not numpy arrays), and it has to come from the library (or be implemented individually by hand).

@brge17 (Contributor) commented Feb 6, 2018

And if that's still not enough...

Someone can always write a stateful metric that caches the entire history of (y_true, y_pred) and recomputes whatever you want every batch.

@thejihuijin

Adds some nice functionality that enables some key metrics. I support this PR

@fchollet (Collaborator, Author) commented Feb 8, 2018

Merging. These are internal APIs and they may be changed later, so it's not like we're making any momentous decision.

@fchollet merged commit e6c3f77 into master on Feb 8, 2018
@pasky (Contributor) commented Feb 14, 2018

@fchollet I was eager to try this out (wrapping tf.metrics.auc), but it seems like the support for stateful metrics (calling reset_states()) was not included in fit_generator - is that intentional?

@pasky (Contributor) commented Feb 14, 2018

Also, it seems a bit confusing to me that the Layer of a stateful metric doesn't need to have the stateful attribute set: the assumption is that all Layer metrics are stateful and behave statefully even without this attribute. Is that a good assumption to make for the future?

@pasky (Contributor) commented Feb 15, 2018

Two other pieces of feedback:

  • In the progress bar (verbose=1), stateful metric values aren't formatted as %.4f but as %s. This is messy with many metrics, but also sometimes one \b too many is printed (I didn't find out why) and the progress bar jumps a line upwards (overwriting earlier content). An easy fix would be to set self._values[k] = [v] instead of self._values[k] = v.

  • When logging normal metrics, they are added in _*_loop() to 0., making them Python floats. This doesn't happen with stateful metrics (they are assigned directly), so they end up as np.float32. This is annoying if you e.g. want to serialize the history object to JSON after training, which used to work fine before (a possible user-side workaround is sketched below).
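Until that is fixed upstream, one possible user-side workaround is to cast NumPy scalars when dumping the history (just a sketch using the standard json and numpy APIs; the encoder class name is made up):

import json

import numpy as np


class NumpyScalarEncoder(json.JSONEncoder):
    """Cast NumPy scalar types to plain Python numbers for JSON."""

    def default(self, obj):
        if isinstance(obj, (np.floating, np.integer)):
            return obj.item()
        return super(NumpyScalarEncoder, self).default(obj)


# history = model.fit(...)
# with open('history.json', 'w') as f:
#     json.dump(history.history, f, cls=NumpyScalarEncoder)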

ahundt added a commit to ahundt/keras that referenced this pull request on Feb 16, 2018.
@Dref360 (Contributor) commented Feb 16, 2018

#9394

This issue requires our attention: we now need to compile models before doing anything with them, which wasn't required before.

@pasky (Contributor) commented Feb 20, 2018

We will start working on the issues I mentioned.
