I finished making a rough version of the bypass model using tf.rnn, and would appreciate your advice on how to proceed. On GitHub (bypass repo): the model is in bypass_rnn.py, and the convolutional RNNCells it uses, as well as an FC layer cell, are defined in ConvRNN.py.
I've attached the TensorBoard visualization of the graph for reference. It appears small, but you can zoom in quite a bit. The model doesn't follow any existing base architecture like VGG; it consists of several convolutional/pooling cells and an FC layer cell, followed by a fully connected softmax layer. Different nodes in the graph represent the same cell at different time points, but I checked that the weight variables are indeed being shared.
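(For anyone wanting to check the same thing: the sharing is the standard tf.variable_scope / tf.get_variable mechanism. A minimal sketch, not the actual ConvRNN.py code — the cell name, kernel shape, and initializer here are illustrative:)

```python
import tensorflow as tf

def conv_step(inputs, scope="conv_cell", reuse=None):
    """One application of a conv cell; reuse=True shares the kernel across calls."""
    with tf.variable_scope(scope, reuse=reuse):
        kernel = tf.get_variable(
            "kernel", shape=[3, 3, 3, 64],
            initializer=tf.truncated_normal_initializer(stddev=0.01))
        return tf.nn.conv2d(inputs, kernel, strides=[1, 1, 1, 1], padding="SAME")

images_t0 = tf.placeholder(tf.float32, [32, 256, 256, 3])
images_t1 = tf.placeholder(tf.float32, [32, 256, 256, 3])
out_t0 = conv_step(images_t0)               # creates conv_cell/kernel
out_t1 = conv_step(images_t1, reuse=True)   # reuses the same kernel at the next time step

# Only one trainable kernel should exist in the graph:
print([v.name for v in tf.trainable_variables()])  # -> ['conv_cell/kernel:0']
```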
Some issues:
The training time is faster than that of the unrolled version of the network I made earlier, but it is still quite slow: a single training iteration on a batch of 32 ImageNet images (256x256x3) takes ~3 seconds, and increasing the total number of time steps T increases this further.
If I set the RNN to run for too many time steps, e.g. T > 6 (the exact threshold depends on the specific architecture), I quickly get an OOM error.
I hypothesize that this is because:
1. Even though the RNN model has a single cell per layer, the activations of every cell at every time step must be kept in TensorFlow's memory for backpropagation through time.
2. TensorFlow is somewhat slower than other frameworks on most models (see the benchmark data in the Jan 5 comment on soumith/convnet-benchmarks#66).
3. It is much slower than a regular RNN because we are using convolutional layers instead of 1-d hidden units.
What do you think?
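A quick back-of-envelope calculation for hypothesis 1 (the layer shapes below are made up, not the actual architecture; the point is only the linear scaling with T):

```python
# Rough activation memory kept for BPTT, assuming float32 (4 bytes per value)
# and a hypothetical stack of conv feature maps; numbers are illustrative only.
batch = 32
layer_shapes = [
    (256, 256, 64),   # assumed conv1 output (H, W, channels)
    (128, 128, 128),  # assumed conv2 output after pooling
    (64, 64, 256),    # assumed conv3 output
]
bytes_per_step = sum(batch * h * w * c * 4 for h, w, c in layer_shapes)
for T in (3, 6, 9):
    print("T=%d: ~%.1f GB of activations held for backprop" % (T, T * bytes_per_step / 1e9))
```

Even with made-up shapes, the activations held for backprop grow linearly in T, which makes an OOM at T > 6 on a single GPU plausible.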
Also, all our different runs (adding or removing an extra FC layer at the end, adding or removing the bypass, etc.) settle at a loss of ~6.9 within 15 steps, which is what random guessing gives. Posts I found online suggest that, with correct initialization, the loss should start to drop within the first ~8K iterations.
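(For reference, ~6.9 is exactly the cross-entropy of a uniform guess, assuming the standard 1000-way ImageNet softmax:)

```python
import math
print(math.log(1000))  # 6.9077... = loss of predicting 1/1000 for every class
```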
How do you suggest we proceed? The training time seems too slow right now. What should we check performance-wise in the model, and how could we speed up training? For example, Jonas and I were thinking about using pre-trained weights for at least the first layer, and about using distributed GPUs with TensorFlow.
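For the pre-trained-weights idea, a minimal sketch of one way to plug a saved kernel into the first layer — the file name and shape are placeholders, not anything that exists in the repo:

```python
import numpy as np
import tensorflow as tf

# Hypothetical kernel exported from another model, e.g. with np.save.
pretrained = np.load("conv1_kernel.npy")  # assumed shape: [7, 7, 3, 64]

with tf.variable_scope("conv1"):
    kernel = tf.get_variable(
        "kernel",
        shape=pretrained.shape,
        initializer=tf.constant_initializer(pretrained),
        trainable=False)  # freeze it for now; set trainable=True to fine-tune
```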
For starters, I will try to:
1. Time how long a single pass through a given layer at a given time step takes, as well as how long it takes to load a batch of training images, to find the bottleneck (see the timing sketch below).
2. Vary initialization parameters such as the learning rate (currently 0.05).
3. Run a base case with the decay variable fixed at 100% (non-trainable), so that no memory of the previous state is carried through.
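Here is roughly what I have in mind for the timing in step 1 — a self-contained sketch with a stand-in graph and random data, since the point is just to separate load time from step time (graph-mode API of the TF version we're on; the real run would use the bypass_rnn ops instead):

```python
import time
import numpy as np
import tensorflow as tf

# Stand-in graph; in practice images_ph/loss/train_op would be the real bypass_rnn ops.
images_ph = tf.placeholder(tf.float32, [32, 256, 256, 3])
kernel = tf.get_variable("kernel", shape=[7, 7, 3, 64])
loss = tf.reduce_mean(tf.nn.conv2d(images_ph, kernel, [1, 2, 2, 1], "SAME"))
train_op = tf.train.GradientDescentOptimizer(0.05).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    t0 = time.time()
    batch = np.random.rand(32, 256, 256, 3).astype(np.float32)  # stand-in for image loading
    t1 = time.time()
    sess.run(train_op, feed_dict={images_ph: batch})
    t2 = time.time()
    print("load: %.3fs  train step: %.3fs" % (t1 - t0, t2 - t1))
```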
I agree the key thing is to speed stuff up. The things you suggested, especially starting from a pre-trained network, seem reasonable. But first, can you get away with a smaller model? E.g.:
I notice that you're assuming the stride in the conv operation is always 1 (e.g. line 45 of _conv in ConvRNN.py). This leads to huge models you probably don't need. A stride of at least 2, and possibly even 3 or 4, in the conv operation of the first layer is probably fine and will give a much smaller model that still works. A conv stride of 1 after layer 1 is probably a good idea.
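To make the size difference concrete, a quick sketch (the kernel shape is illustrative, not the repo's actual first layer):

```python
import tensorflow as tf

images = tf.placeholder(tf.float32, [32, 256, 256, 3])
kernel = tf.get_variable("conv1_kernel", shape=[7, 7, 3, 64])

stride1 = tf.nn.conv2d(images, kernel, strides=[1, 1, 1, 1], padding="SAME")
stride2 = tf.nn.conv2d(images, kernel, strides=[1, 2, 2, 1], padding="SAME")

print(stride1.get_shape())  # (32, 256, 256, 64)
print(stride2.get_shape())  # (32, 128, 128, 64) -- 4x fewer activations to store
```

Since later layers inherit the smaller spatial size, that factor of 4 carries through the rest of the network and through every time step that has to be held for BPTT.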
Why not start testing your training algorithms on a model with smaller filterbanks, e.g. a max of 64 filters instead of 256? This may not matter that much, but it's potentially worth trying.
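Rough numbers for that, per layer (again with illustrative shapes, not the repo's real sizes):

```python
# Parameter and activation counts for one 3x3 conv layer with matching
# input/output filter counts, at batch 32 on a 64x64 feature map (illustrative).
batch, h, w = 32, 64, 64
for filters in (64, 256):
    params = 3 * 3 * filters * filters
    act_mb = batch * h * w * filters * 4 / 1e6  # float32 activations
    print("%3d filters: %8d params, ~%.0f MB activations per time step"
          % (filters, params, act_mb))
```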