Add support for stateful metrics. #9253
Conversation
Otherwise looks good to me. |
In general I am not happy with the fact that we have special handling of stateful metrics in callbacks (and would need the same in the progbar as well). Could we come up with a simple and elegant design to support both stateful metrics and samplewise metrics, I wonder? An obvious solution would be to cast all metrics as stateful, but that would not be backwards compatible with user-written metrics. chin scratch emoji |
Option 2, which I thought of, was: if the metric is non-stateful, write a wrapper that makes it stateful. That way the losses that double as metrics don't need both a loss and a metric implementation. |
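(For illustration only: a rough sketch of what such a wrapper could look like under the Layer-with-reset_states design discussed below. The class name StatefulWrapper and the running-mean behavior are made up for the example, not part of this PR.)
from keras import backend as K
from keras.layers import Layer

class StatefulWrapper(Layer):
    """Illustrative wrapper: turns a batchwise metric fn(y_true, y_pred)
    into a stateful metric tracking a running mean across batches."""

    def __init__(self, fn, name=None, **kwargs):
        super(StatefulWrapper, self).__init__(name=name or fn.__name__, **kwargs)
        self.stateful = True
        self.fn = fn
        self.total = K.variable(0.)
        self.count = K.variable(0.)

    def reset_states(self):
        # Called between epochs / evaluation runs.
        K.set_value(self.total, 0.)
        K.set_value(self.count, 0.)

    def __call__(self, y_true, y_pred):
        batch_value = K.mean(self.fn(y_true, y_pred))
        self.add_update([K.update_add(self.total, batch_value),
                         K.update_add(self.count, K.constant(1.))],
                        inputs=[y_true, y_pred])
        # Return a tensor derived from the state, not the state variables themselves.
        return (self.total + batch_value) / (self.count + 1.)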
That's a good idea:
But is it a good UX? |
# Reset stateful metrics
for m in self.metrics:
    if isinstance(m, Layer):
        m.reset_states()
Wouldn't this lead to a cryptic error if someone forgets to implement reset_states?
The only assumption here is that m is a Layer, and there is no way of knowing what is a stateful metric. I would like to see a compromise between your approach and @brge17's approach.
class StatefulMetric(Layer):
    def reset_states(self):
        raise NotImplementedError
As a user, what can I do, knowing that I can now feed a Layer to the metrics argument? Can I feed any Layer? (No, but some will try.)
Good point
I prefer the middle ground, as @Dref360 mentioned.
But I'm fine either way.
From a UX perspective, if you have a non-stateful metric, nothing changes: it's handled under the hood, and without digging into the code you're none the wiser. Now, if you are in the other camp and need a stateful metric, this is great UX because you don't have to write a really hacky callback (which is what I had previously done). That's my 2 cents. |
So to sum up, we could have a world where:
Let's look into it? |
For progbar backwards compatibility, averaging should be made an option (with old behavior being the default). |
A couple of things: the CNTK tests are failing, it looks like. Is this just for the loss?
Otherwise:
eliminates the need. I don't have any free cycles left this week. But I would be happy to work on this next week. |
The Theano StatefulMetric computing incorrectly is perplexing... |
Weird, especially since the test passes with Theano for me locally... |
The test failure with Theano is non-deterministic (depends on the training data). This does not happen with TF. |
It's a graph dependency issue; |
Fixed by not directly returning the variable being updated. It's kind of a subtle issue. Not happening with TF because we deliberately set updates in the backend to be run after the outputs (in |
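(For illustration, not the PR's actual diff: the pattern at the backend level looks roughly like this; the names v, y, and metric_fn are made up for the example.)
from keras import backend as K

v = K.variable(0.)
y = K.placeholder(shape=(None,))
batch_sum = K.sum(y)

# Returning `v` itself as an output while also updating it leaves the
# read/update ordering to the graph scheduler, which is where the
# non-determinism came from with Theano. Reading the current value into a
# derived tensor and returning that instead removes the ambiguity.
current = v * 1  # derived tensor carrying the pre-update value
metric_fn = K.function([y], [current + batch_sum],
                       updates=[K.update_add(v, batch_sum)])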
Is there a way to fix it in theano_backend? Otherwise, users will make the same error. |
@Dref360 is there a Theano API to specify the order in which ops are to be executed? Similar to |
A complication is that current callbacks expect to be passed batch-wise metrics in |
This behavior change isn't necessarily a huge deal; for instance, no built-in callback meaningfully leverages batch metrics today. Any user of that would be an advanced user who could deal with the change. I think we will merge this PR (which doesn't affect any current behavior or API, other than enabling the workflow described in the unit test) and then investigate the plan described above. |
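(Illustrative sketch of what such an advanced user's callback would see after this change: for a stateful metric, the per-batch logs carry the running value rather than a per-batch sample. The metric name 'true_positives' is an assumption here.)
from keras.callbacks import Callback

class RunningTruePositivesLogger(Callback):
    """Prints the running value of a hypothetical 'true_positives' stateful
    metric after each batch, instead of a per-batch average."""

    def on_batch_end(self, batch, logs=None):
        logs = logs or {}
        tp = logs.get('true_positives')
        if tp is not None:
            print('batch %d - running true_positives: %s' % (batch, tp))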
There's a complex interaction with losses, that makes it impossible to make all metrics stateful (losses, including the total loss, being metrics too). The only way forward as I see it is to have a formal distinction between samplewise metrics and stateful metrics, and to hand down this information to the BaseLogger and Progbar in a clean way. |
One more reason is that going all-stateful would break the output of |
Sounds good to me. The only people who are going to use stateful metrics in the short term also know that it doesn't work with the progbar yet. But they do work with the other callbacks (which is more important). |
Added clean support for logging stateful metrics. Now properly handled by the progbar. Also did some refactoring while I was at it. PTAL. |
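(A rough usage sketch of the updated progbar behavior; the stateful_metrics argument name is an assumption on my part: values listed there are displayed as-is, i.e. the current running value, instead of being averaged over the batches seen so far.)
from keras.utils.generic_utils import Progbar

progbar = Progbar(target=100, stateful_metrics=['true_positives'])
progbar.update(10, values=[('loss', 0.32), ('true_positives', 41)])
progbar.update(20, values=[('loss', 0.30), ('true_positives', 87)])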
Very nice. The updated progbar is a nice touch. I double checked a few things locally:
This is very exciting :) |
@brge17 Thanks for the issue links, I hadn't seen them! @fchollet two stateful metrics API UX questions:
Item 2 is what I was really thinking of, and it is a much clearer example than the one I posted linking tqdm. |
With stateful metrics, you can compute any function of y_pred, y_true. That's why I'm so desperately trying to get it through. (@brge17 is my work github.)
Step 1 is to support stateful metrics (this PR); step 2 is to write stateful versions of tf.metrics / the common metrics users want (follow-on PRs). At the very least, users can write their own in the meantime. This API is functionally identical to how TensorFlow metrics work under the hood. We are replicating that functionality. |
Here's an example to stress point 1. Say you want a metric like True Positive Rate:
Option 1 (use state wisely): save the number of true positives and the number of positives in the state. Each batch, return true positives / positives.
Option 2 (naive brute force): cache all predictions and recompute the metric every batch.
The tf.metrics you referenced always do 1 (or, option 3, an approximate metric that is more compute/memory-efficient; see the tf AUC implementation). That is how we will implement AUC/Precision/Recall/Confusion Matrix, etc. The user always has the option to do 2 (although it is super inefficient and a huge waste of memory). As it sits without this PR, you have no option to use such metrics at all, because metrics are only batchwise averages; the way around it is inefficient/hacky code in a custom callback. As the PR currently sits, nothing changes if you are using non-stateful metrics. It just enables the future development of stateful metrics. |
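(A sketch of Option 1 under the API this PR enables; the class name and the details are illustrative, not code from the PR.)
from keras import backend as K
from keras.layers import Layer

class BinaryTruePositiveRate(Layer):
    """Option 1 sketch: state holds true-positive and positive counts;
    each batch returns the running ratio true_positives / positives."""

    def __init__(self, name='tpr', **kwargs):
        super(BinaryTruePositiveRate, self).__init__(name=name, **kwargs)
        self.stateful = True
        self.true_positives = K.variable(0.)
        self.positives = K.variable(0.)

    def reset_states(self):
        K.set_value(self.true_positives, 0.)
        K.set_value(self.positives, 0.)

    def __call__(self, y_true, y_pred):
        y_true = K.cast(y_true, 'float32')
        y_pred = K.round(K.cast(y_pred, 'float32'))
        batch_tp = K.sum(y_true * y_pred)
        batch_pos = K.sum(y_true)
        self.add_update([K.update_add(self.true_positives, batch_tp),
                         K.update_add(self.positives, batch_pos)],
                        inputs=[y_true, y_pred])
        # Read the state through derived tensors (see the Theano note above).
        tp = self.true_positives * 1 + batch_tp
        pos = self.positives * 1 + batch_pos
        return tp / (pos + K.epsilon())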
tf.contrib.metrics.streaming_true_positives does Option 1, see the functions prefixed with
@briannemsick I certainly don't want to put you in a bind, could you check out this PR now locally then update to the final API when it is released? François did specifically ask for UX review:
|
The progbar is a purely internal API and thus not part of the UX of this feature. The UX is basically just the experience of writing stateful metrics, and the consistency of what we log on screen. |
@fchollet I promise I'm not trying to waste time discussing the progbar! I'm very sorry I originally mixed in another topic. I'm actually trying to propose a composable stateful metric design based on statistical streaming APIs; it should only be a slight tweak. Please see #9253 (comment), which is three comments up. |
I really like this idea! Supporting stateful metrics like this is going to be really useful for applications where pure classification errors like cross entropy aren't enough. I also think the interface is pretty clean.
Can we get a review from @Kritikalcoder, @ozabluda, or @hasded or can we...:
|
Sorry for the monotonous drone, and sorry for not being able to really follow this discussion (and related ones) until I can understand the following:
|
The answer is no, but that's also the same answer in TensorFlow. The reason is, to wrap any* arbitrary function. This will get you the confusion matrix, because I will be personally implementing them (and I will post them publicly). But they will carefully be written to store a state and update the function accurately without caching every prediction. It may be unsatisfying, but the simple, easy solution is compute- and memory-inefficient / not production grade, as @fchollet highlighted in #8657. This is a step forward :/ The benchmark in my mind is that currently you can't implement them even if you wanted to... |
What do you think of the following?
tp = BinaryTruePositive()
fp = BinaryFalsePositive()
tpfp = Add()(tp, fp)
precision = Divide()(tp, tpfp)
recall = Lambda(lambda x, y: tf.contrib.metrics.streaming_recall(x, y))
# Test on simple model
inputs = keras.Input(shape=(2,))
outputs = keras.layers.Dense(1, activation='sigmoid')(inputs)
model = keras.Model(inputs, outputs)
model.compile(optimizer='sgd',
loss='binary_crossentropy',
metrics=['acc', precision, recall])
This assertion isn't correct for streaming algorithms. |
It should say any, not an. |
How about a simpler question for starters: TF metrics only. If TF can do it, so can Keras, right? Also, TF can optionally put them into CPU RAM, right? Now sklearn metrics. Those have to be accumulated on CPU anyway. Again, if sklearn can do it, so can Keras, right? If a user runs out of memory for either one of those, it's no different than running out of memory due to the minibatch being too large or whatever. Then you solve/optimize it, but not prematurely. Some metrics can be computed incrementally, batch-by-batch. The confusion matrix is one of those, as is the myriad of metrics that follow from it. So is ROC/AUC. The TF metrics docs do often mention "estimation of the metric over a stream of data ...", which should be accommodated. |
Yes, that's a good idea. We should do it all under the hood sticking as closely as possible to existing Keras conventions, assuming that's viable. |
TensorFlow has stateful metrics because they support stateful metrics. They have reset_ops, they have updates. TensorFlow does not take any arbitrary metric of that form. Sklearn is neither a deep learning library nor does it compute in batches, so that's an unfair comparison. This PR enables all the TensorFlow streaming metrics; they just are not all implemented yet. In the tests, there is a BinaryTruePositives example. With that as a template you can implement the confusion matrix, precision, recall, auc, etc., or wait for the follow-on PR to this one where we implement the TF streaming metrics. |
See below: what you are claiming it does implement, it can only implement if the user writes tensor operations for a specific metric.
That's verbatim what the PR allows you to do. The metric receives the data from the batch, the state is updated, and each epoch it is auto-reset. TensorFlow can support metrics of that form. It does not implement arbitrary metrics of that form. Subtle difference. In TensorFlow, the metric has to be a pure tensor operation (not numpy arrays) and it has to be from the library (or implemented individually by hand). |
And if that's still not enough... someone can always write a stateful metric that caches the entire history of (y_true, y_pred) and recomputes whatever you want every batch. |
Adds some nice functionality that enables some key metrics. I support this PR |
Merging. These are internal APIs and they may be changed later, so it's not like we're making any momentous decision. |
@fchollet I was eager to try this out (wrapping tf.metrics.auc), but it seems like the support for stateful metrics (calling |
Also, it seems a bit confusing to me that the Layer of a stateful metric doesn't need to have the |
Two other pieces of feedback:
|
This issue requires our attention. We need to compile models before doing anything with them now. It wasn't required before. |
We will start working on the issues I mentioned. |
cc @Dref360 @brge17: please check that it looks satisfactory (in particular the UX). Check out the BinaryTruePositives example class in metrics_test.py.
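(For readers without the test file handy, a stateful metric in this style looks roughly like the following; this is a sketch in the spirit of that test class, not a verbatim copy of it.)
from keras import backend as K
from keras.layers import Layer

class BinaryTruePositives(Layer):
    """Counts true positives accumulated over all batches of an epoch."""

    def __init__(self, name='true_positives', **kwargs):
        super(BinaryTruePositives, self).__init__(name=name, **kwargs)
        self.stateful = True
        self.true_positives = K.variable(0, dtype='int32')

    def reset_states(self):
        K.set_value(self.true_positives, 0)

    def __call__(self, y_true, y_pred):
        y_true = K.cast(y_true, 'int32')
        y_pred = K.cast(K.round(y_pred), 'int32')
        batch_tp = K.sum(y_true * y_pred)
        current_tp = self.true_positives * 1  # read the old value before updating
        self.add_update(K.update_add(self.true_positives, batch_tp),
                        inputs=[y_true, y_pred])
        return current_tp + batch_tp

# Passed to compile like any other metric:
# model.compile(optimizer='sgd', loss='binary_crossentropy',
#               metrics=['acc', BinaryTruePositives()])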