Criteria of training and test are mixed up #3
Comments
Thank you for pointing it out. I have rerun the experiments after fixing this bug and found that the performance is slightly improved.
Could you push your updated code to this repository? I did not get better performance after I fixed the bug and reran the experiments.
Hi, Please refer to the latest commit. The scripts should produce slightly (if noticeably) better results than the reported ones.
Hi,
As the table shows, the selective error rate at 95% coverage is 3.72%, which is far from the reported (3.37±0.05)%. Could you help me resolve this discrepancy?
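As an aside, here is a minimal sketch (my own, not from this repository) of how a selective error rate at a fixed coverage is typically computed: keep only the most confident fraction of test samples and measure the error rate among the kept ones. The tensors `confidence`, `predictions`, and `labels` are assumed placeholders for the test set.

```python
import torch

def selective_error_at_coverage(confidence, predictions, labels, coverage=0.95):
    # keep the most confident `coverage` fraction of the test samples
    n_keep = int(round(coverage * labels.numel()))
    keep = confidence.argsort(descending=True)[:n_keep]
    # error rate among the kept (accepted) samples
    return (predictions[keep] != labels[keep]).float().mean().item()
```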
I am sorry for not explaining
Hi, It seems that most entries are pretty close to or better than the reported ones in the paper, except the case of 95% coverage. I have checked the experiment logs and found that some of the CIFAR10 experiments (but none of the experiments on the other datasets) are based on an earlier implementation of SAT, which differs slightly from the current implementation in this line:

# current implementation
soft_label[torch.arange(y.shape[0]), y] = prob[torch.arange(y.shape[0]), y]
# earlier implementation
soft_label[torch.arange(y.shape[0]), y] = 1

You can try the earlier variant to see how it performs.
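For reference, a small runnable sketch of the two updates quoted above; `prob` is assumed to be the model's softmax output for a batch, `y` the ground-truth labels, and `soft_label` stands in for the per-sample targets that SAT maintains.

```python
import torch

prob = torch.softmax(torch.randn(4, 10), dim=1)  # dummy batch: 4 samples, 10 classes
y = torch.tensor([0, 3, 3, 7])
soft_label = prob.clone()
idx = torch.arange(y.shape[0])

# current implementation: the true-class target tracks the model's own confidence
soft_label[idx, y] = prob[idx, y]

# earlier implementation: the true-class target is pinned to 1
# soft_label[idx, y] = 1.0
```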
Hi,
The performance is better than that of the current implementation of SAT, but the selective error rate at 95% coverage, 3.603%, is still not as good as the reported (3.37±0.05)% in your paper. Perhaps there is a clerical mistake in the paper?
Interesting reproduction analysis. Did this eventually get resolved?
No, I gave up. This repository does not provide the random seed.
Might I ask you if you know of any other selective classification methods that 'actually work'?
As far as I know, Deep Ensemble [1] really works and might be the most powerful method. However, considering the heavy computational overhead of ensemble models, recent work in selective classification focuses on individual models. These models (e.g., [2][3]) exhibit only marginal improvement over Softmax Response [4]. The advances in this line of work seem neither significant nor exciting. Nevertheless, my survey might not be comprehensive; a more thorough survey can be found in [5][6].

[1] Lakshminarayanan et al. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In NIPS, 2017.
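For concreteness, a hedged sketch of the Deep Ensemble idea from [1]: average the softmax outputs of several independently trained models and use the maximum averaged probability as the confidence score for selective prediction. `models` is an assumed list of trained classifiers; nothing here comes from this repository.

```python
import torch

@torch.no_grad()
def ensemble_confidence(models, x):
    # average the softmax outputs of all ensemble members
    probs = torch.stack([torch.softmax(m(x), dim=1) for m in models]).mean(dim=0)
    # use the maximum averaged probability as the confidence for selection
    confidence, prediction = probs.max(dim=1)
    return confidence, prediction
```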
SAT-selective-cls/train.py
Lines 200 to 201 in dc55593
It might be a mistake to use the same criterion in function train and function test, which mixes up the history of the model's predictions on the training set and that on the test set.
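To illustrate the concern, here is a hedged sketch (not the repository's code) of why sharing one criterion between train() and test() is problematic: a SAT-style criterion is stateful and keeps a running soft label per training-sample index, so feeding it test batches corrupts the history accumulated on the training set. The class and the numbers below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StatefulSATLikeLoss(nn.Module):
    def __init__(self, num_samples, num_classes, momentum=0.9):
        super().__init__()
        # running per-sample targets, indexed by each sample's position in the training set
        self.register_buffer("soft_label", torch.zeros(num_samples, num_classes))
        self.momentum = momentum

    def forward(self, logits, targets, indices):
        prob = torch.softmax(logits.detach(), dim=1)
        # update the stored history for exactly these (training) samples
        self.soft_label[indices] = (
            self.momentum * self.soft_label[indices] + (1 - self.momentum) * prob
        )
        return F.cross_entropy(logits, targets)

# The fix the issue points at: keep the stateful loss for training only and
# evaluate with a plain, stateless criterion so the two histories never mix.
train_criterion = StatefulSATLikeLoss(num_samples=50000, num_classes=10)
test_criterion = nn.CrossEntropyLoss()
```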