
Add support for stateful metrics. #9253

Merged: 7 commits merged into master on Feb 8, 2018
Conversation

@fchollet (Collaborator)

cc @Dref360 @brge17: please check that it looks satisfactory (in particular the UX). Check out the BinaryTruePositives example class in metrics_test.py.
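For reviewers who want to try it quickly, a minimal usage sketch might look like the following (the import path for BinaryTruePositives is hypothetical and only for illustration; in this PR the class is defined in tests/keras/metrics_test.py):

import keras
from keras.layers import Dense

# Hypothetical import path, assumed for illustration only.
from metrics_test import BinaryTruePositives

inputs = keras.Input(shape=(2,))
outputs = Dense(1, activation='sigmoid')(inputs)
model = keras.Model(inputs, outputs)

# A stateful metric is passed as a Layer instance alongside regular metric names.
model.compile(optimizer='sgd',
              loss='binary_crossentropy',
              metrics=['acc', BinaryTruePositives()])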

@brge17 (Contributor) commented Jan 31, 2018

  1. The progbar does not correctly display training metrics (it still does averaging under the hood). Not sure if this is in scope for the initial PR. Callbacks that reference the training metrics do receive the correct value, so it is purely cosmetic.

Otherwise looks good to me.

@fchollet (Collaborator, Author) commented Jan 31, 2018

In general I am not happy with the fact that we have a special handling of stateful metrics in callbacks (and would need the same in the progbar as well). Could we come up with a simple and elegant design to support both stateful metrics and samplewise metrics, I wonder?

An obvious solution would be to cast all metrics as stateful, but that would not be backwards compatible with user-written metrics. chin scratch emoji

@brge17 (Contributor) commented Jan 31, 2018

Option 2, which I thought of: if the metric is non-stateful, write a wrapper that makes it stateful.

That way, losses that double as metrics don't need both a loss and a metric implementation.

@fchollet (Collaborator, Author)

That's a good idea:

  • metric is a layer: assume it's stateful
  • it's a function: wrap it into a layer that does the averaging

But is it a good UX?

# Reset stateful metrics
for m in self.metrics:
    if isinstance(m, Layer):
        m.reset_states()
Contributor:

Wouldn't this lead to a cryptic error if someone forgets to implement reset_states?

The only assumption here is that m is a Layer, and there is no way of knowing which metrics are stateful. I would like to see a compromise between your approach and brge17's approach, something like:

class StatefulMetric(Layer):
    def reset_states(self):
        raise NotImplementedError

Contributor:

As a user, what can I do now that I can feed a Layer to the metrics argument? Can I feed any Layer? (No, but some will try.)

Collaborator (Author):

Good point

Contributor:

I prefer the middle ground as @Dref360 mentioned.

But, I'm fine either way.

@brge17 (Contributor) commented Jan 31, 2018

From a UX perspective, if you have a non-stateful metric, nothing changes: it's handled under the hood, and without digging into the code you're none the wiser.

Now, if you are in the other camp and need a stateful metric, this is great UX because you don't have to write a really hacky callback (which is what I had previously done).

That's my 2 cents.

@fchollet (Collaborator, Author)

So to sum up, we could have a world where:

  • callbacks and progbar do no averaging (and are agnostic to how metrics work)
  • there's a Metric class, which is stateful
  • metric functions get wrapped into a SamplewiseMetric subclass which does the averaging (returns sum(val * batch) / samples_seen_so_far, like we do now)
  • there's a unified API for interacting with metrics in training.py

Let's look into it?
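A rough sketch of what such a wrapper might look like (the SamplewiseMetric name and the bookkeeping details below are assumptions for illustration, not what training.py actually does):

from keras import backend as K
from keras.engine.topology import Layer


class SamplewiseMetric(Layer):
    """Hypothetical wrapper: turns a stateless metric fn(y_true, y_pred)
    into a stateful layer that reports the running average over an epoch."""

    def __init__(self, fn, name=None, **kwargs):
        super(SamplewiseMetric, self).__init__(name=name or fn.__name__, **kwargs)
        self.stateful = True
        self.fn = fn
        self.total = K.variable(0.)  # sum of batch_value * batch_size
        self.count = K.variable(0.)  # samples seen so far

    def reset_states(self):
        K.set_value(self.total, 0.)
        K.set_value(self.count, 0.)

    def __call__(self, y_true, y_pred):
        batch_size = K.cast(K.shape(y_true)[0], K.floatx())
        batch_value = K.mean(self.fn(y_true, y_pred))
        self.add_update([K.update_add(self.total, batch_value * batch_size),
                         K.update_add(self.count, batch_size)],
                        inputs=[y_true, y_pred])
        # Return an expression rather than the variables being updated
        # (see the Theano discussion further down).
        return (self.total + batch_value * batch_size) / (self.count + batch_size)

With something along these lines, a plain function passed to metrics= could be wrapped automatically, while Layer instances would be used as-is.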

@fchollet (Collaborator, Author)

For progbar backwards compatibility, averaging should be made an option (with old behavior being the default).

@brge17 (Contributor) commented Jan 31, 2018

A couple of things:

It looks like the CNTK tests are failing.

Is this just for the loss?

> For progbar backwards compatibility, averaging should be made an option (with old behavior being the default).

Otherwise:

> metric functions get wrapped into a SamplewiseMetric subclass which does the averaging (returns sum(val * batch) / samples_seen_so_far, like we do now)

eliminates the need.

I don't have any free cycles left this week. But, I would be happy to work on this next week.

@brge17 (Contributor) commented Jan 31, 2018

The Theano StatefulMetric computing incorrectly is perplexing...

@fchollet (Collaborator, Author) commented Feb 1, 2018

> The Theano StatefulMetric computing incorrectly is perplexing...

Weird, especially since the test passes with Theano for me locally...

@fchollet (Collaborator, Author) commented Feb 1, 2018

The test failure with Theano is non-deterministic (depends on the training data). This does not happen with TF.

@fchollet (Collaborator, Author) commented Feb 1, 2018

It's a graph dependency issue; true_positives may be returned before or after the last update has been run. Does not happen with TF.

@fchollet (Collaborator, Author) commented Feb 1, 2018

Fixed by not directly returning the variable being updated. It's kind of a subtle issue. Not happening with TF because we deliberately set updates in the backend to be run after the outputs (in Function).
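For illustration, the fixed pattern looks roughly like this (a sketch loosely based on the BinaryTruePositives example in metrics_test.py; only the __call__ method is shown, and self.true_positives is assumed to be a backend variable created in __init__ and zeroed in reset_states):

def __call__(self, y_true, y_pred):
    y_true = K.cast(y_true, 'int32')
    y_pred = K.cast(K.round(y_pred), 'int32')
    correct_preds = K.cast(K.equal(y_pred, y_true), 'int32')
    batch_true_pos = K.cast(K.sum(correct_preds * y_true), 'int32')

    # Snapshot the pre-update value as an expression...
    current_true_pos = self.true_positives * 1
    # ...then queue the state update...
    self.add_update(K.update_add(self.true_positives, batch_true_pos),
                    inputs=[y_true, y_pred])
    # ...and return an expression instead of the variable itself, so the
    # result does not depend on whether the update has already run.
    return current_true_pos + batch_true_pos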

@Dref360 (Contributor) commented Feb 1, 2018

Is there a way to fix it in theano_backend? Otherwise, users will make the same mistake.

@fchollet (Collaborator, Author) commented Feb 1, 2018

@Dref360 is there a Theano API to specify the order in which ops are to be executed? Similar to control_dependencies in TF. If yes, we can do it. If not, never mind. Theano is EOL anyway.

@fchollet (Collaborator, Author) commented Feb 1, 2018

A complication is that current callbacks expect to be passed batch-wise metrics in on_batch_end, rather than current averages. In that regard, the plan above would be a breaking behavior change.

@fchollet (Collaborator, Author) commented Feb 1, 2018

This behavior change isn't necessarily a huge deal; for instance no built-in callback meaningfully leverages batch metrics today. Any user of that would be an advanced user that could deal with the change.

I think we will merge this PR (which doesn't affect any current behavior or API, other than enabling the workflow described in the unit test) then investigate the plan described above.

@fchollet (Collaborator, Author) commented Feb 1, 2018

There's a complex interaction with losses that makes it impossible to make all metrics stateful (losses, including the total loss, are metrics too).

The only way forward as I see it is to have a formal distinction between samplewise metrics and stateful metrics, and to hand down this information to the BaseLogger and Progbar in a clean way.

@fchollet (Collaborator, Author) commented Feb 1, 2018

One more reason is that going all-stateful would break the output of train_on_batch, test_on_batch (which are supposed to return batch-wise metrics). This is an API that quite a few people use.

@brge17 (Contributor) commented Feb 2, 2018

Sounds good to me.

The only people who are going to use stateful metrics in the short term also know that they don't work with the progbar yet. But they do work with the other callbacks (which is more important).

@fchollet (Collaborator, Author) commented Feb 2, 2018

Added clean support for logging stateful metrics. Now properly handled by the progbar. Also did some refactoring while I was at it. PTAL.

@brge17 (Contributor) commented Feb 2, 2018

Very nice. The updated progbar is a nice touch.

I double checked a few things locally:

  1. Train metrics in the progress bar at the end of the epoch match the logs and values passed to other callbacks.

  2. Validation metrics in the progress bar at the end of the epoch match the logs and values passed to other callbacks.

  3. Multiple stateful metrics of the same name work.

  4. Adding params works.

  5. Multiple stateful metrics of the same name with different params work.

This is very exciting :)

@ahundt (Contributor) commented Feb 5, 2018

This discussion more or less took place in: #9200 #8657

@brge17 Thanks for the issue links, I hadn't seen them!

@fchollet two stateful metrics API UX questions:

  1. Can tf.metrics and tf.contrib.metrics mentioned in the linked discussion easily be adapted to the proposed API, particularly the streaming_ versions?
  2. Can we design the stateful metrics API UX to look like a streaming statistical API such as boltons.statsutils?

Item 2 is what I was really thinking of and it is a much clearer example than when I posted linking tqdm.

@briannemsick (Contributor) commented Feb 5, 2018

  1. That's the whole point of this PR: to support arbitrary metrics.

With stateful metrics, you can compute any function of y_pred and y_true.

That's why I'm so desperately trying to get it through. (@brge17 is my work GitHub account.)

  2. We can't discuss specific stateful metrics until we have support for the general case (we can't run before we can walk).

Step 1 is to support stateful metrics (this PR); step 2 is to write the stateful metrics from tf.metrics that users commonly want (follow-on PRs). At the very least, users can write their own in the meantime.

This API is functionally identical to how TensorFlow metrics work under the hood. We are replicating that functionality.

@briannemsick (Contributor) commented Feb 5, 2018

Here's an example to stress point 1.

Say you want to have a metric like True Positive Rate:

Option 1: (Use states wisely) Save the number of True Positives and the number of Positives in the states. Each batch return True Positives/Positives.

Option 2: (Naive brute force) cache all predictions and recompute the metric every batch.

The tf.metrics you referenced always do option 1 (or option 3: an approximate metric that is more compute/memory efficient; see the TF AUC implementation). That is how we will implement AUC, precision, recall, the confusion matrix, etc. The user always has the option to do 2 (although it is super inefficient and a huge waste of memory).

As it stands without this PR, you have no option to use metrics like this at all, because metrics are only batch-wise averages. The workaround is inefficient/hacky code in a custom callback.

As the PR currently stands, nothing changes if you are using non-stateful metrics. It just enables the future development of stateful metrics.
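To make option 1 concrete, a stateful true-positive-rate metric might be sketched like this (written in the style of the BinaryTruePositives test class; the class name, variable names, and epsilon guard are assumptions for illustration):

from keras import backend as K
from keras.engine.topology import Layer


class BinaryTruePositiveRate(Layer):
    """Option 1 sketch: keep true-positive and positive counts as state
    and return their ratio each batch."""

    def __init__(self, name='tpr', **kwargs):
        super(BinaryTruePositiveRate, self).__init__(name=name, **kwargs)
        self.stateful = True
        self.true_positives = K.variable(0.)
        self.positives = K.variable(0.)

    def reset_states(self):
        K.set_value(self.true_positives, 0.)
        K.set_value(self.positives, 0.)

    def __call__(self, y_true, y_pred):
        y_true = K.cast(y_true, K.floatx())
        y_pred = K.cast(K.round(y_pred), K.floatx())
        batch_tp = K.sum(y_true * y_pred)
        batch_pos = K.sum(y_true)
        current_tp = self.true_positives * 1
        current_pos = self.positives * 1
        self.add_update([K.update_add(self.true_positives, batch_tp),
                         K.update_add(self.positives, batch_pos)],
                        inputs=[y_true, y_pred])
        # Running ratio over the epoch; no per-sample caching required.
        return (current_tp + batch_tp) / (current_pos + batch_pos + K.epsilon())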

@ahundt (Contributor) commented Feb 5, 2018

> Here's an example to stress point 1.
> Say you want to have a metric like True Positive Rate:
> Option 1: (Use states wisely) Save the number of True Positives and the number of Positives in the states. Each batch return True Positives/Positives.

tf.contrib.metrics.streaming_true_positives does Option 1; see the functions prefixed with streaming_. We're on the same page here: I program robots, which are basically real-time streaming data sources. :-)

> That's why I'm so desperately trying to get it through.

@briannemsick I certainly don't want to put you in a bind; could you check out this PR locally now and then update to the final API when it is released?

François did specifically ask for UX review:

> please check that it looks satisfactory (in particular the UX).

@fchollet (Collaborator, Author) commented Feb 6, 2018

> ask for UX review

The progbar is a purely internal API and thus not part of the UX of this feature. The UX is basically just the experience of writing stateful metrics, and the consistency of what we log on screen.

@ahundt (Contributor) commented Feb 6, 2018

> The progbar is a purely internal API and thus not part of the UX of this feature. The UX is basically just the experience of writing stateful metrics, and the consistency of what we log on screen.

@fchollet I promise I'm not trying to waste time discussing the progbar! I'm very sorry I originally mixed in another topic.

I'm actually trying to propose a composable stateful metric design based on statistical streaming APIs; it should only be a slight tweak. Please see #9253 (comment), which is three comments up.

@dfridovi left a comment:

I really like this idea! Supporting stateful metrics like this is going to be really useful for applications where pure classification errors like cross entropy aren't enough. I also think the interface is pretty clean.

@brge17 (Contributor) commented Feb 6, 2018

@ahundt

  1. The API we are proposing is equivalent and consistent with how these Stateful Metrics are implemented in TensorFlow.

  2. The API you are proposing is un-Keras-like. Besides the property additions (which should not be required), the proposed change is a rename of core functions that are shared between metrics, losses, and layers: update is the same as __call__, clear is the same as reset_states. It does not make sense to make the API inconsistent with the rest of Keras.

Can we get a review from @Kritikalcoder, @ozabluda, or @hasded?

or can we...:

> I think we will merge this PR (which doesn't affect any current behavior or API, other than enabling the workflow described in the unit test) then investigate the plan described above.

@ozabluda (Contributor) commented Feb 6, 2018

Sorry for the monotonous drone, and sorry for not being able to really follow this (and related) discussions until I can understand the following:

  1. Would we/users be able to just plug in arbitrary TF and sklearn metrics (see comments below), almost all of which have the signature blah(labels, predictions)? Or would we/users have to reimplement them all forever?
  2. ... for example confusion matrix, which is my first UX test case to make sure it all works OK.

#8657 (comment)
#8657 (comment)

@brge17 (Contributor) commented Feb 6, 2018

@ozabluda

The answer is no, but that's also the same answer in TensorFlow.

The reason is that to wrap any* arbitrary function def my_func(y_true, y_pred), you would have to cache the tensors of all predictions, which is memory-inefficient and poor design.

This will get you the confusion matrix, because I will personally be implementing these metrics (and I will post them publicly). But they will be carefully written to store a state and update it accurately without caching every prediction.

It may be unsatisfying, but the simple, easy solution is compute- and memory-inefficient and not production grade, as @fchollet highlighted in #8657.

This is a step forward :/

The benchmark in my mind is that currently you can't implement them even if you wanted to...

@ahundt (Contributor) commented Feb 6, 2018

What do you think of the following?

tp = BinaryTruePositive()
fp = BinaryFalsePositive()
tpfp = Add()(tp, fp)
precision = Divide()(tp, tpfp)
recall = Lambda(lambda x, y: tf.contrib.metrics.streaming_recall(x, y))

# Test on simple model
inputs = keras.Input(shape=(2,))
outputs = keras.layers.Dense(1, activation='sigmoid')(inputs)
model = keras.Model(inputs, outputs)
model.compile(optimizer='sgd',
              loss='binary_crossentropy',
              metrics=['acc', precision, recall])

> The reason is, to wrap an arbitrary function def my_func(y_true, y_pred) you would have to cache the tensors of all predictions.

This assertion isn't correct for streaming algorithms.

@brge17 (Contributor) commented Feb 6, 2018

> Say you want to have a metric like True Positive Rate:
>
> Option 1: (Use states wisely) Save the number of True Positives and the number of Positives in the states. Each batch return True Positives/Positives.
>
> Option 2: (Naive brute force) cache all predictions and recompute the metric every batch.

It should say any, not an.

@ozabluda (Contributor) commented Feb 6, 2018

> The reason is, to wrap an arbitrary function def my_func(y_true, y_pred) you would have to cache the tensors of all predictions which is memory inefficient and poor design.

How about a simpler question for starters: TF metrics only. If TF can do it, so can Keras, right? Also, TF can optionally put them into CPU RAM, right?

Now sklearn metrics. Those have to be accumulated on CPU anyway. Again, if sklearn can do it, so can Keras, right?

If a user runs out of memory for either one of those, it's no different than running out of mem due to minibatch being too large or whatever. Then you solve/optimize it, but not prematurely.

Some metrics can be computed incrementally, batch by batch. The confusion matrix is one of those, as are the myriad metrics that follow from it. So is ROC/AUC. The TF metrics docs do often mention "estimation of the metric over a stream of data ...", which should be accommodated.

@ahundt (Contributor) commented Feb 6, 2018

> It's like we designed the API to replicate what they did under the hood?

Yes, that's a good idea. We should do it all under the hood sticking as closely as possible to existing Keras conventions, assuming that's viable.

@brge17 (Contributor) commented Feb 6, 2018

@ozabluda

TensorFlow has stateful metrics because the framework supports state: they have reset ops, they have updates.

TensorFlow does not take any arbitrary metric of the form my_metric(y_true, y_pred). They have pre-built, polished metrics with efficient stateful representations, and require the user to write their own otherwise.

Sklearn is not a deep learning library and does not compute in batches, so that's an unfair comparison.

This PR enables all the TensorFlow streaming metrics; they just are not all implemented yet. In the tests, there is BinaryTruePositives. With that as a template you can implement the confusion matrix, precision, recall, AUC, etc., or wait for the follow-on PR to this one where we implement the TF streaming metrics.

@brge17 (Contributor) commented Feb 6, 2018

@ozabluda

> How about a simpler question for starters: TF metrics only. If TF can do it, so can Keras, right?

See below: what you are claiming TF implements, it can implement only if the user writes tensor operations for that specific metric.

> TF metrics docs do often mention "estimation of the metric over a stream of data ...", which should be accommodated.

That's verbatim what the PR allows you to do.

The metric receives the data from the batch, the state is updated, and each epoch it is auto-reset.

TensorFlow can support metrics of that form. It does not implement arbitrary metrics of that form. Subtle difference. In TensorFlow, the metric has to be a pure tensor operation (not numpy arrays), and it has to come from the library (or be implemented individually by hand).

@brge17 (Contributor) commented Feb 6, 2018

And if that's still not enough...

Someone can always write a stateful metric that caches the entire history of (y_true, y_pred) and recomputes whatever you want every batch.

@thejihuijin

Adds some nice functionality that enables some key metrics. I support this PR

@fchollet (Collaborator, Author) commented Feb 8, 2018

Merging. These are internal APIs and they may be changed later, so it's not like we're making any momentous decision.

@fchollet merged commit e6c3f77 into master on Feb 8, 2018
@pasky (Contributor) commented Feb 14, 2018

@fchollet I was eager to try this out (wrapping tf.metrics.auc), but it seems like the support for stateful metrics (calling reset_states()) was not included in fit_generator - is that intentional?

@pasky (Contributor) commented Feb 14, 2018

Also, it seems a bit confusing to me that the Layer of a stateful metric doesn't need to have the stateful attribute set: the assumption is that all Layer metrics are stateful and behave statefully even without this attribute. Is that a good assumption to make for the future?

@pasky (Contributor) commented Feb 15, 2018

Two other pieces of feedback:

  • In the progress bar (verbose=1), stateful metric values aren't formatted as %.4f but as %s. This is messy with many metrics, but also sometimes one \b too many is printed (I didn't find out why) and the progress bar jumps a line upwards (overwriting earlier content). An easy fix would be to set self._values[k] = [v] instead of self._values[k] = v.

  • When logging normal metrics, they are added in _*_loop() to 0., making them Python floats. This doesn't happen with stateful metrics (they are assigned directly), so they end up as np.float32. This is annoying if you e.g. want to serialize the history object to JSON after training, which used to work fine before (a possible user-side workaround is sketched below).
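Until that is fixed upstream, one possible user-side workaround is to cast NumPy scalars when dumping the history (just a sketch using the standard json and numpy APIs; the encoder class name is made up):

import json

import numpy as np


class NumpyScalarEncoder(json.JSONEncoder):
    """Cast NumPy scalar types to plain Python numbers for JSON."""

    def default(self, obj):
        if isinstance(obj, (np.floating, np.integer)):
            return obj.item()
        return super(NumpyScalarEncoder, self).default(obj)


# history = model.fit(...)
# with open('history.json', 'w') as f:
#     json.dump(history.history, f, cls=NumpyScalarEncoder)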

ahundt added a commit to ahundt/keras that referenced this pull request on Feb 16, 2018.
@Dref360 (Contributor) commented Feb 16, 2018

#9394

This issue requires our attention: we now need to compile models before doing anything with them, which wasn't required before.

@pasky (Contributor) commented Feb 20, 2018

We will start working on the issues I mentioned.
