Code for reproducing cifar-10 examples in "Deep Residual Learning for Image Recognition" #38
Conversation
Thanks! Look forward to trying it out, I've been meaning to read that paper. About how long does it take to train? When you say it learns unstably do you mean the loss is noisy, or that on some runs it diverges/fails to learn? If the second, might be good to seed the RNG with a known good value. You might be able to fix the RuntimeError with something like
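(Presumably something along these lines — a minimal sketch of raising Python's recursion limit, which later comments refer to as "adjusting the recursion limit"; the exact value is illustrative:)

import sys

# building/compiling the Theano graph for very deep networks can exceed
# Python's default recursion limit (1000), so raise it before constructing the model
sys.setrecursionlimit(10000)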
I'm running it now. Let me try it.
Training a 32-layer network takes 8-9 hours on an EC2 g2.2 instance with CuDNN3. By unstable, I meant that the loss is noisier than it looks in figure 6 in the paper. The training always seems to converge nicely, though I've only done a handful of runs so far. I've started training a 56-layer network by adjusting the recursion limit as you mention, so I'm going to run it overnight and will report the result.
Great, I'll wait to merge this then in case you want to add that? It would be nice to have pretrained weights available. If the file is small (<10M), you can add it directly to the repo. Otherwise I can send you access credentials for the Recipes S3 bucket.
Yes, merging can wait. I also don't know whether this merge should wait until PR #467 is merged? The pretrained weights for the 32-layer network are only 1.9MB, so I can add them directly to the repo. Where should I put them? Any specific format you want them in?
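(For reference, one possible way to store and restore the weights with Lasagne — a sketch only; the filename and the variable network are illustrative:)

import numpy as np
import lasagne

# save all parameter values of the trained network to a single .npz file
np.savez('resnet32_cifar10.npz', *lasagne.layers.get_all_param_values(network))

# load them back into a freshly built network with the same architecture
with np.load('resnet32_cifar10.npz') as f:
    param_values = [f['arr_%d' % i] for i in range(len(f.files))]
lasagne.layers.set_all_param_values(network, param_values)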
Running result on our computer:
@bobchennan were these results for a 56-layer network? I seemed to get similar results, with 92.17% test accuracy, i.e. a bit poorer than the paper.
No, it is the result given by 32 layers (without any parameters specified). I slightly changed the structure and got a better result. I think you'd better specify the random seed; it helps to get stable results.
Yes, the final accuracy seems to vary a bit, but I'm not sure if it's within what is expected; in the paper they report a standard error of 0.16 on the 110-layer model. It's probably a good idea to specify the random seed, so I'll look into that. I've also noticed I have a batch-normalization layer both before and after summing shortcuts, which is probably unnecessary, so I'll remove it. How did you change the structure to get better results?
Used 10 filter maps before the GlobalPool (mentioned in the NIN paper):
Removed the shortcuts when the dimension increased:
# identity shortcut, as option A in paper
# we use a pooling layer to get identity with strides, since identity layers with stride don't exist in Lasagne
identity = Pool2DLayer(l, pool_size=1, stride=(2,2), mode='average_exc_pad')
padding = PadLayer(identity, [out_num_filters/4,0,0], batch_ndim=1)
Should probably be out_num_filters // 4, so the result is always an int.
Ok, I think this is ready for merge if it looks good. After removing the superfluous batch-normalizations, the network seems to learn just as fast and stably as in the paper, and with similar final accuracy. For the 56-layer network I didn't manage to reach an error of 6.97, only 7.23, but there is some variance in the final accuracy (as reported in the paper for the 110-layer network), so I might reach it if I run it more times. The model still seems to have slightly more parameters than reported in the paper (19.5M for the 1202-layer network, versus 19.4M reported in the paper), but I can't quite figure out where the difference is. I can upload the trained 32-layer and 56-layer models, but I'm not sure where I should put them? Also, I tried setting up seeding, but didn't manage to get consistent results even though I disabled cuDNN and set the seed for cropping, shuffling and in Lasagne, so I'm not sure what the issue is.
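(A minimal sketch of the kind of seeding being attempted here — the seed value is arbitrary, and some GPU/cuDNN ops may remain non-deterministic regardless:)

import numpy as np
import lasagne

# seed numpy, which drives the data shuffling and random cropping
np.random.seed(1234)
# seed Lasagne's own RNG, which drives the weight initialization
lasagne.random.set_rng(np.random.RandomState(1234))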
    padding = PadLayer(identity, [out_num_filters//4,0,0], batch_ndim=1)
    block = NonlinearityLayer(ElemwiseSumLayer([stack_2, padding]), nonlinearity=rectify)
else:
    block = NonlinearityLayer(ElemwiseSumLayer([stack_2, l]), nonlinearity=rectify)
Hello! I would be extremely interested to see whether performance improves if you remove the ReLU layer just after each block.
In an experiment on my own Torch implementation of residual networks, removing the ReLU layer after the building block noticeably improves initial convergence. I think this is because the ReLU layer at the end of each block mutates the input, making identity connections no longer possible.
I'm curious to see whether this has an effect on this project as well. Or perhaps there's some other bug in my own implementation.
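(In terms of the Lasagne code in this PR, the suggested change would look roughly like this — a sketch, with stack_2 and l as in the diff above:)

from lasagne.layers import ElemwiseSumLayer, NonlinearityLayer
from lasagne.nonlinearities import rectify

# current block: ReLU applied after the elementwise sum
block = NonlinearityLayer(ElemwiseSumLayer([stack_2, l]), nonlinearity=rectify)

# suggested variant: drop the trailing ReLU, so the identity path passes through unchanged
block = ElemwiseSumLayer([stack_2, l])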
Hi, I tried to delete the ReLU layer at the end of the block, and indeed it seems to converge faster in the beginning, though it settles into about the same convergence speed after a while. I didn't fully train the model though, so I don't know if the final accuracy is better or the same as the one described in the paper.
Thank you for rerunning your experiments! Good to know it might have some effect.
Can you make a new subdirectory under |
It would be good to further scrutinize this, to ensure the model is identical to the one in the paper. I don't have time to check the full paper right now; can you give us some pointers to the pages/sections where the model you're trying to replicate is specified, so we can compare them with your code? Thanks!
The examples I'm trying to reproduce are from section 4.2 in the paper. However, I wonder if the culprit for the increased number of parameters is the way the parameters are counted. The batch-normalization layers in Lasagne count four parameters (mean, std, beta and gamma) for each feature, i.e. for the 1202-layer model (n=200) we introduce 89600 parameters ((16+32+64) * 4 * 200) from the BNs in the residual blocks. I was wondering, though, if it's common to count the mean and std parameters of the BN as parameters of the model, as they're not "learned", but rather "given" by the dataset. If they did not count these in the paper, then the parameter count would be the same, since we lose 89600 parameters for the 1202-layer model. @ebenolson : It turns out I'd copied the wrong model files from my EC2 instance, so I'll have to retrain the models. I'll upload them once I've got them trained.
That's a good explanation for the difference. Note that you can do |
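(Presumably something along the lines of Lasagne's count_params with the trainable tag — a sketch for illustration, assuming network is the output layer:)

import lasagne

# all parameters, including the BN mean/std statistics
n_total = lasagne.layers.count_params(network)
# only parameters tagged as trainable, i.e. excluding the BN statistics
n_trainable = lasagne.layers.count_params(network, trainable=True)
print("non-trainable BN statistics account for", n_total - n_trainable, "parameters")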
Hi,
What is the problem? It's weird since I have been using the |
I believe this may be a Theano issue: Theano/Theano#3845 |
@benanne but I'm using |
Not exactly. The issue is about the "half" mode in Theano's cuDNN convolution wrapper. This is not only used when you explicitly use Lasagne's |
Oh I see. As you said, I first set the Theano flag to exclude dnn. However, it complained about the cuDNN again. Afterwards I manually replaced the |
I now understand that the memory used isn't completely correlated with the number of parameters. After profiling the program I found out that the |
Surely then it would be more sensible to just halve the batch size and do twice as many updates :) |
Currently I can't use cuDNN, that's why I have to use
I haven't even been able to run the code with a minibatch size of 1! However, there might be errors in my code. Specifically, I intended to implement a model inspired by ResNet, with residual blocks of 3 convolutional layers. I have recently found out about a (seemingly reference) implementation of the 152-layer ResNet. I will try to implement a residual network based on this implementation by this weekend, and I may be able to run it on a bigger GPU. I will be using cuDNN as well. I have now managed to run a very small network similar to the current one, with an initial filter size of 16 and 4 blocks of increasing dimensionality (a total of 9 conv layers) and a very small number of parameters (~457k). Another observation from the output of the profiling was that
Is there no C implementation for this layer, or does it switch back to the Python implementation for some reason?
This is not the
This can lead to worse results, though, and if the idea is to perfectly follow the paper, the batch size should probably stay the same. But yeah, it would definitely be easier to implement and could work as well.
Well, in that case there's not a lot you can do, I fear... I don't see anything wrong with your Theano profile either (except that I wouldn't leave profiling on by default). |
Yes you're right. It's the
Yeah, I think I shall move to a bigger GPU and get along with shallower networks. BTW, profiling isn't usually on. That was my "debugging" config file 😄. Update: I replaced that line with your code, and it significantly reduced the minibatch time. Thank you @f0k! |
To avoid this, you can just accumulate the gradients over several batches and then only update the weights when you're ready. This is similar to what Caffe offers. In theory, if you can fit a single image on your GPU, you'll be able to train by setting an appropriate iteration size. Here's a code sketch:

local loss_val = 0
local N = 2 -- How many "sub-batches" to accumulate per batch
local inputs, labels
gradients:zero() -- Only done here, at the beginning
for i=1,N do
    inputs, labels = dataTrain:getBatch()
    local y = model:forward(inputs)
    loss_val = loss_val + loss:forward(y, labels)
    local df_dw = loss:backward(y, labels) -- Accumulates gradients
    model:backward(inputs, df_dw)
end
loss_val = loss_val / N
gradients:mul( 1.0 / N )
optim.sgd(...) -- NOW update weights!

Important caveat: I'm not sure how this interacts with batch normalization layers. You may have to adjust your batch norm momentum to get equivalent results.
@erfannoury : I replaced the pooling layer with the ExpressionLayer as suggested by f0k now. The pooling layer was really just the first thing I could think of to implement an identity layer with strides; I wasn't aware of ExpressionLayer. :) @f0k : I noticed that there have been some changes in the final implementation of batch normalization. I trained the models using this batch-normalization code; is there any easy way to convert the models to work with the new batch normalization?
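(For reference, a strided identity shortcut via ExpressionLayer might look roughly like this — a sketch, not necessarily the exact code that was committed:)

from lasagne.layers import ExpressionLayer

# take every other spatial position of l, halving the feature map size
# without introducing any parameters or a pooling op
identity = ExpressionLayer(l, lambda X: X[:, :, ::2, ::2],
                           lambda s: (s[0], s[1], s[2] // 2, s[3] // 2))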
Yep, that's what I suggested at first: #38 (comment). The implementation will look a bit different in Theano since it's non-imperative. If you're lucky, it's enough to just specify your loss as a sum of the losses of two half batches so that Theano propagates the two halves separately.
Note that this is only relevant for what they term option (A), while they use option (B) for most experiments (p.6 right column and following). But if it's about reproducing all of the paper, it's good to have everything as efficient as possible!
Yes, I've sketched it here: Lasagne/Lasagne#467 (comment) |
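(A rough sketch of the half-batch idea mentioned above in Theano/Lasagne — hypothetical; it assumes network is the output layer and a fixed batch of 128 split into two halves:)

import theano
import theano.tensor as T
import lasagne

X = T.tensor4('X')
y = T.ivector('y')

# build the forward pass separately for each half of the batch and sum the losses,
# hoping Theano will schedule the two halves one after the other
pred_a = lasagne.layers.get_output(network, X[:64])
pred_b = lasagne.layers.get_output(network, X[64:])
loss = (lasagne.objectives.categorical_crossentropy(pred_a, y[:64]).sum() +
        lasagne.objectives.categorical_crossentropy(pred_b, y[64:]).sum()) / 128.

params = lasagne.layers.get_all_params(network, trainable=True)
updates = lasagne.updates.momentum(loss, params, learning_rate=0.1, momentum=0.9)
train_fn = theano.function([X, y], loss, updates=updates)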
Can the backward and forward passes be performed on different GPUs? I think the memory requirement for the forward pass is lower than the memory requirement for the backward pass.
The backward pass usually relies on information from the forward pass (e.g., computing the gradients wrt. the weights of a dense or convolutional layer requires its original inputs from the forward pass). You could transfer that information over, but then you won't have any memory benefits over computing everything on a single GPU. You could have one GPU access the memory of the other, but that will be slow. In either case, Theano currently doesn't support using multiple GPUs from the same process. |
@auduno |
Hi @okvol, I don't have the training and validation errors per epoch anymore, but I can do a run overnight and give you a plot tomorrow. |
@auduno |
Here are the training and validation errors for a 56-layer network; training error is the dashed line and validation error is the solid line:

The validation error looks quite a lot noisier than what can be seen in the paper (note that before commit 550c781 I incorrectly calculated the validation error using the mini-batch mean and std in batch-normalization, so it looked much more similar to the paper). I'm curious if this means there is something wrong in my code, or if this is because the batch-normalization layers use a moving-average mean/std instead of a properly calculated mean/std over the entire dataset. I'd calculate the validation/training error using a properly calculated mean/std for the BN as well, but I couldn't find any recipes for doing this quickly, and I don't have time to put together code for this myself right now. Does anyone know if this exists anywhere?
Is there anything I need to do to get this PR merged? |
Good question! @ebenolson, any work left to do? This is worth a read by the way, some interesting new results! |
Looks good to me. I will merge this evening if there are no more comments. |
Merging, thanks again! |
Awesome! Yes, I had a look at the results from @gcr, very interesting! I also noticed his test error curve looks pretty similar to what I'm seeing (i.e. a bit dissimilar from the figure in the paper), so at least there's nothing wrong with my code. I'm guessing there is some non-documented detail of how the test error is evaluated which might account for the difference. I'll update the code if anyone figures out the cause! |
If you're talking about instability in the test error (i.e. if it looks too jittery compared to the version in the paper), I have some notes about that. In my case, I was doing a few things wrong that caused artificial noise to seep into the testing error curve at first (the graphs in the current readme should have the following two problems corrected):
Kaiming He from MSRA sent me the following email regarding instability, which I'm pasting with permission (thanks, Kaiming!)
Torch uses an exponentially reweighted batch normalization that uses a running average, so setting a lower momentum should have a similar effect to computing over a larger fraction of the training set. Kaiming has some comments about this (emphasis mine):
I'm not sure how Lasagne does it, but it could be relevant if you want to get an exact reproduction (I did not bother to use this strategy) |
That's very interesting @gcr, thanks for sharing! I have been suspecting that the way the BN statistics are calculated might have to do with the jittery test error. I also had the same issue with relatively large error variation between runs (around 0.5% as you report), the BN statistics might explain this as well. I'll try calculating the BN mean/var over the entire set and see if that results in more stable results! |
We're also doing an exponential moving average, using the same default momentum as Torch (and some other libraries).
You can use the trick of setting the momentum term to |
Aha! Thanks for the hint, @f0k. I had tried setting BN momentum to … If I understand correctly, if your model has …
The idea in the original BN paper was to do this for the final model only, so there wouldn't be any impact on training, just a one-time cost afterwards. The BN authors advocated computing the exponential moving average during training to have something for validation (so you can do early stopping somewhat reliably). Since we're now discussing making the validation error more robust, using a large training batch once will be a better idea than passing all the training data |
@f0k : So if I understand you correctly, when evaluating the validation error every epoch, it is sufficient to do a training pass with a large batch size, updating only the batch norm parameters (with momentum 1), then evaluate the validation error as usual? Or in our case something like this:

# set learning rate to 0 to not update parameters in training pass
old_lr = sh_lr.get_value()
sh_lr.set_value(lasagne.utils.floatX(0.))
# set momentum to 1 in all BN layers
for l in lasagne.layers.get_all_layers(network):
    if l.__class__.__name__ == "BatchNormLayer":
        l.alpha = 1.
# do training pass over a large batch of 5000 samples or so
indices = np.arange(100000)
np.random.shuffle(indices)
train_fn(X_train[indices[0:5000],:,:,:], Y_train[indices[0:5000]])
# revert learning rate and BN momentum
sh_lr.set_value(lasagne.utils.floatX(old_lr))
for l in lasagne.layers.get_all_layers(network):
    if l.__class__.__name__ == "BatchNormLayer":
        l.alpha = 1e-4
That's how I interpret @gcr's quotation of Kaiming's email, yes, and it sounds plausible to me!
No, this won't work.
There's a follow-up paper now: http://arxiv.org/abs/1603.05027 |
Here's the code for reproducing the cifar-10 examples in "Deep Residual Learning for Image Recognition". The code is based on the MNIST example, feel free to reformat it. Note that it also depends on batch-normalization code in PR #467.
Training a 32-layer network (n=5) with learning parameters similar to the descriptions in the paper, I got validation error 6.88%, which actually is slightly better than the error in the paper. I wanted to try a 56-layer network as well, but currently this fails with error "RuntimeError: maximum recursion depth exceeded while calling a Python object".
There still seem to be some differences between this model and the one used in the paper, as my model seems to have slightly more parameters than what is described in the paper, e.g. the 1202-layer version has 19.6M parameters versus 19.4M in the paper. This model also seems to learn a bit more slowly and unstably than what I see in figure 6 in the paper, though the final accuracy is similar. It's actually not entirely clear to me what an iteration is in the paper, since they say "trained with a minibatch size of 128 on two GPUs", so I don't know if an iteration is equal to one or two minibatches of size 128. I've assumed an iteration is a single minibatch of 128, but two minibatches of 128 would make the learning speed more similar.
Let me know if I should upload the weights for the trained 32-layer model as well, and of course if you discover any discrepancies between this model and what is described in the paper.