
More on Reproducing CIFAR10 supervised results #10

Open · drcdr opened this issue Oct 28, 2019 · 10 comments


drcdr commented Oct 28, 2019

[This is similar to #5, but with the current code base and more networks.]

I am trying to recreate the Manifold Mixup CIFAR10 results; Manifold Mixup seems like a very promising development! I'm using the command lines from the project's README.md. My setup is Windows 10, a Titan XP, Python 3.7, PyTorch nightly (1.2, 7/6/2019), torchvision 0.3, and other packages the same or (mostly) slightly newer. My manifold_mixup version is from 10/16/2019.

I only had to make one slight change, for torchvision 0.3: get_sampler(train_data.targets, ...) instead of get_sampler(train_data.train_labels, ...).

Below, I show the test results from your paper alongside the results that I got. "End" is the final test error; "Best" is the best test error during the run. The column "z" is a z-score based on the mean μ and stdev σ from the arXiv paper and my result. A negative z-score indicates that my result had a lower test error; a positive z-score, a higher one. CLFR == "Command Line From README.md".
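For concreteness, the z column is just the standardized gap between my error and the paper's reported mean, using the paper's σ; a minimal sketch (the function name is mine):

```python
def z_score(my_err, paper_mu, paper_sigma):
    """Standardized gap between my test error and the paper's reported mean.

    Negative => my error is below the paper's mean; positive => above it.
    """
    return (my_err - paper_mu) / paper_sigma

# Example: PreActResNet18 / No Mixup, "Best" column
print(z_score(4.4, 4.83, 0.066))  # ~ -6.5
```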

The results are mixed, and I'm not sure why; I thought you might have some thoughts. I'm seeing:

  • PreActResNet18: much better NoMixup and Input Mixup, about the same Manifold Mixup
  • PreActResNet34: somewhat worse Input Mixup, similar Manifold Mixup
  • WRN28-10: about the same NoMixup and Input Mixup, much worse Manifold Mixup

I accidentally tried Manifold Mixup without mixup_hidden for WRN28-10 (i.e., '--train mixup' with alpha=2.0), and actually got the mean result reported in the paper.

Any ideas? Some questions:

  • Are the results in the arXiv paper the "Best" value, or the "End" value?
  • I assume the results in the paper use {mixup_hidden, alpha=2} for mixup?
  • Is the current github software different from that used in the paper, in any substantial way?
  • Curious, are my run-times in the same ballpark as yours?

| CIFAR 10 | Err μ | Err σ | Tm [hrs] | End Err | End z | Best Iter | Best Err | Best z | CLFR |
|---|---|---|---|---|---|---|---|---|---|
| **PreActResNet18** | | | | | | | | | |
| No Mixup | 4.83 | .066 | 28.5 | 4.59 | -3.6 | 642 | 4.4 | -6.5 | Y |
| AdaMix (Guo) | 3.52 | | | | | | | | |
| Input Mixup (Zhang) | 4.2 | | | | | | | | |
| Input Mixup (α = 1) | 3.82 | 0.048 | 30 | 3.43 | -8.1 | 1687 | 3.15 | -14.0 | Y |
| Manifold Mixup (α = 2) | 2.95 | 0.046 | 32 | 3.18 | 5.0 | 1640 | 3.01 | 1.3 | Y |
| **PreActResNet34** | | | | | | | | | |
| No Mixup | 4.64 | .072 | | | | | | | |
| Input Mixup (α = 1) | 2.88 | 0.043 | 44 | 3.21 | 7.7 | 1159 | 2.99 | 2.6 | Y |
| Manifold Mixup (α = 2) | 2.54 | 0.047 | 45 | 2.7 | 3.4 | 1230 | 2.47 | -1.5 | Y |
| **Wide-Resnet-28-10** | | | | | | | | | |
| No Mixup | 3.99 | .118 | 19 | 4.12 | 1.1 | 299 | 3.89 | -0.8 | Y |
| Input Mixup (α = 1) | 2.92 | .088 | 20.5 | 2.79 | -1.5 | 367 | 2.76 | -1.8 | Y |
| Manifold Mixup (α = 2) | 2.55 | .024 | 19 | 2.97 | 17.5 | 353 | 2.82 | 11.3 | Y |
| Manifold Mixup (α = 2), but not mixup_hidden | 2.55 | .024 | 18.5 | 2.73 | 7.5 | 391 | 2.55 | 0.0 | N |

Also, here is a plot of the test error, for each of the scenarios above. (The pink wrn28_10_mixup_alpha=0 is shortened / offset to the left, because it's from a restart.) Notably:

  • the 'best' error (marked by the bold 'x') is often a momentary low spike during the training run, and is often not close to the final test error.
  • the blow-up behavior of the green plot (preactresnet18, vanilla) at iterations 701 and 919 is strange

[figure: ManifoldMixupTestErr]


drcdr commented Nov 4, 2019

@vikasverma1077 or @alexmlamb - any thoughts?


vikasverma1077 commented Nov 4, 2019

Hi @drcdr, thanks for your interest. Unfortunately, I do not have time to go through the details of your experiments at the moment. I would recommend using the same packages as in the README and reproducing the results first. Several people have reproduced the results, so I am pretty sure it will work for you as well.

Answers to your questions:

  • Are the results in the arXiv paper the "Best" value, or the "End" value?
    Best value.
  • I assume the results in the paper use {mixup_hidden, alpha=2} for mixup?
    {mixup_hidden, alpha=2} is Manifold Mixup.
  • Is the current github software different from that used in the paper, in any substantial way?
    No.
  • Curious, are my run-times in the same ballpark as yours?
    Yes.

alexmlamb (Collaborator) commented

Thanks for taking the time to look into it. It's good that you got similar results for manifold mixup on the preactresnet architectures.

Also, if you fixed the data loader for a newer PyTorch version, can you open a pull request for that? I think other users would benefit from that change.

> WRN28-10: about the same NoMixup and Input Mixup, much worse Manifold Mixup

I'd have to check, but I wonder if the choice of layers to mix in could be set incorrectly for WRN?

The paper says:

"When using Manifold Mixup, we selected the layer to perform mixing uniformly at random from a set of eligible layers. In all our experiments, for the PreActResNets architectures, the eligible layers for mixing in Manifold Mixup were : the input layer, the output from the first resblock, and the output from the second resblock. For Wide-ResNet-20-10 architecture, the eligible layers for mixing in Manifold Mixup were: the input layer and the output from the first resblock."

So maybe the code is mixing in too many layers for WRN? I haven't investigated closely.
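To make that concrete, here is a minimal sketch of the layer-selection step the paper describes; the names (model_blocks, eligible_layers) are mine, and this is not the repo's actual code, just an illustration of where a mismatch in the eligible set would matter:

```python
import random
import numpy as np
import torch

def manifold_mixup_forward(model_blocks, x, y, alpha=2.0, eligible_layers=(0, 1, 2)):
    """Pick one eligible layer uniformly at random and mix activations there.

    eligible_layers indexes into model_blocks (0 = mix at the input).
    Per the paper: PreActResNets mix at {input, after block 1, after block 2};
    WRN-28-10 mixes only at {input, after block 1}. If the code's eligible set
    for WRN is wider than that, mixing would happen deeper than intended.
    """
    mix_at = random.choice(eligible_layers)
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0), device=x.device)
    h = x
    for k, block in enumerate(model_blocks):
        if k == mix_at:
            h = lam * h + (1 - lam) * h[perm]  # mix the current activations
        h = block(h)
    # loss would then be lam * CE(out, y) + (1 - lam) * CE(out, y[perm])
    return h, y, y[perm], lam
```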


drcdr commented Nov 5, 2019

Thanks, guys; I'm trying to figure out where to go next. Trying a model or two with torchvision 0.2.1 seems like a good idea, given what you both have said; I would just need some time. I could also try to diff these 3 models between the two torchvision versions, but I suppose that's not 100% conclusive either.

I'm also trying to think through the relatively high variability between Best and End, and what it means.
Since the Best results look like outliers to me (they are the minimum over hundreds of iterations; todo: histogram the test error), I'm not sure what comparing them to the paper's numbers really tells us.

But since the primary goal here was reproducibility, I suppose I should focus on that first. I'll try and repost in a few days. Thanks!
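For the record, the Best-vs-End comparison I have in mind is roughly the following (test_err is the per-epoch test-error series from the log; a sketch, not code from the repo):

```python
import numpy as np

def summarize_run(test_err, tail=20):
    """Contrast the single best epoch with the end-of-training behavior."""
    e = np.asarray(test_err, dtype=float)
    return {
        "best": e.min(),                 # the "Best" column: a one-epoch minimum
        "best_iter": int(e.argmin()),
        "end": e[-1],                    # the "End" column: final-epoch error
        "tail_mean": e[-tail:].mean(),   # less sensitive to single-epoch spikes
    }
```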

alexmlamb (Collaborator) commented

Can you clarify which of the results in the table you posted are from your experiments and which are taken from the paper?


drcdr commented Nov 5, 2019

Yes. The first three columns (Header, Err μ, Err σ) are taken from the first two columns of Table 1(a) in the paper. The rest of the columns refer to my experiments.

alexmlamb (Collaborator) commented

What is the difference between "Manifold Mixup (α = 2)" and "Manifold Mixup (α = 2) , but not mixup_hidden" for the WRN results?


drcdr commented Nov 5, 2019

"Manifold Mixup (α = 2)": I ran the command line as given on README.md, for "Manifold mixup WRN-28-10"

"Manifold Mixup (α = 2) , but not mixup_hidden": I accidentally used '--train mixup' instead of '--train mixup_hidden', but otherwise the same as "Manifold Mixup (α = 2)"


drcdr commented Nov 5, 2019

> I would recommend using the same packages as in the README and reproduce the results first...

I've run the first two experiments on WRN28_10 using the same packages as in the README. Results for Best Error:

  • Input Mixup (α = 1): paper = 2.92 ± .088; result from above = 2.76; new result = 2.67
  • Manifold Mixup (α = 2): paper = 2.55 ± .024; result from above = 2.82; new result = 2.77

Also, I compared the printouts of the WRN model from both torchvision 0.2.1 and 0.3; they are identical.


drcdr commented Nov 8, 2019

Update

Here is a table of Test Error results, with updates from using the packages on the README (columns K-O).

  • You can focus on columns B, I, and N for the Best Errors from each of the runs (paper, current PyTorch, and old PyTorch, respectively).
  • Colored highlights in the 'Best z' columns (J,O) give a relative indication of results compared to the paper: the greener, the better the paper; the redder, the better these results.
  • Blackout indicates a trial that was not run.

[image: updated Test Error results table]

Here's the plot of TestError vs. Epoch:
[figure: ManifoldMixupTestErr-2]

Summary

  1. On Manifold Mixup repeatability (rows 9, 14, 18): I get roughly repeatable results for the ResNets, but worse results for WRN28.
  2. I don't think there is a significant difference between PyTorch versions (columns I and N). Where one is better or worse, the difference doesn't seem statistically meaningful. A notable difference might be row 19, but that's not a 'README case'.
  3. The test-error divergence anomaly for (PreActResNet18, Vanilla) was repeated. The first time, it blew up at epochs (701, 919); the second time, at (698, 889). Strange.
  4. Regarding plain Input Mixup, I am getting somewhat better results for WRN28-10 (row 17: 2.76, 2.67 vs 2.92) and substantially better for PreActResNet18 (row 8: 3.15, 3.06 vs 3.82).

I think this issue could be kept open to track (1) [WRN28 MM worse], and possibly (3) [PARN18-vanilla test-error divergence] and (4) [e.g., what results do you get for row 8]. If there is anything you can think of that I can do for (1) or (3), please let me know.

Possible PR

@alexmlamb - re the pull request: would you want me to test with the latest pytorch/torchvision first (torchvision is now 0.5.0!)? For anyone who wants to run CIFAR10 with torchvision 0.3.0, the change is one line in load_data.py:

#train_sampler, valid_sampler, unlabelled_sampler = get_sampler(train_data.train_labels, labels_per_class, valid_labels_per_class)  # older torchvision
train_sampler, valid_sampler, unlabelled_sampler = get_sampler(train_data.targets, labels_per_class, valid_labels_per_class)  # newer torchvision 

Other changes may be needed for other datasets, but I don't have the time/GPU cards to test all of these. Also, in torchvision, there is actually a warning for MNIST (but not for CIFAR10) - see:
https://github.com/pytorch/vision/blob/master/torchvision/datasets/mnist.py#L45
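If it would make the PR more broadly useful, one option is a small version-tolerant helper instead of swapping the attribute name outright; a sketch (the helper name is mine, and I have only tested the two torchvision versions mentioned above):

```python
def dataset_labels(dataset):
    """Return per-sample labels across torchvision versions."""
    if hasattr(dataset, "targets"):    # torchvision >= 0.3
        return dataset.targets
    return dataset.train_labels        # older torchvision (e.g. 0.2.1)

train_sampler, valid_sampler, unlabelled_sampler = get_sampler(
    dataset_labels(train_data), labels_per_class, valid_labels_per_class)
```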
