Freeze layers for transfer learning #3247

Open

DanBmh wants to merge 2 commits into master from freezed_transfer

Conversation

@DanBmh (Contributor) commented Aug 13, 2020

Currently, when doing transfer learning, we reinitialize the uppermost layers randomly and then continue training as normal. The problem is that gradients are also propagated through the lower layers we would like to keep, so those layers change as well.

These changes allow training only the reinitialized uppermost layers. Afterwards you can start a new training run that further optimizes the complete network.
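For illustration, a minimal sketch of the idea (names are placeholders, not the exact PR diff): only the variables of the unfrozen layers are handed to the optimizer, so gradients never reach the transferred layers we want to keep.

import tensorflow.compat.v1 as tfv1

def non_frozen_train_vars(frozen_layer_prefixes):
    """Return all trainable variables except those belonging to frozen layers."""
    return [
        v for v in tfv1.trainable_variables()
        if not any(prefix in v.name for prefix in frozen_layer_prefixes)
    ]

# train_vars = non_frozen_train_vars(["layer_1", "layer_2", "layer_3"])
# gradients = optimizer.compute_gradients(avg_loss, var_list=train_vars)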

@community-tc-integration

No Taskcluster jobs started for this pull request
The `allowPullRequests` configuration for this repository (in `.taskcluster.yml` on the default branch) does not allow starting tasks for this pull request.

@lissyx requested a review from JRMeyer on August 13, 2020
@ftyers (Collaborator) commented Aug 13, 2020

@DanBmh do you have any indication that this works better than tuning the whole network? In our experiments (see earlier versions of the Common Voice paper and Josh's thesis -- Chapter 8) we found that tuning all layers works quite a bit better than freezing any of them; see e.g. this figure:
[Attached screenshot: Captura de 2020-08-13 15-12-46]

@DanBmh (Contributor Author) commented Aug 13, 2020

@ftyers I'm working on it right now :)

But the approach I'm suggesting is a bit different from yours: it uses both steps.

My transfer-learning workflow would look like this:

  1. training with frozen layers
  2. training with all layers (you have to start a new training for this)

Just a side note on the topic: I don't reinitialize the last layer if possible, because in my experiments with German this gave better results than random initialization of the last layer.

@ftyers (Collaborator) commented Aug 13, 2020

@DanBmh So, you freeze all layers apart from the last one, and then train the last layer. Then when that has trained, you train all the layers?

I'd be interested in seeing the results when you have them!

@DanBmh (Contributor Author) commented Aug 13, 2020

So, you freeze all layers apart from the last one, and then train the last layer. Then when that has trained, you train all the layers?

Exactly


First test was not as good as planned:
(using my other pr #3245; es_epochs=7; reduce_lr_plateau_epochs=3; es_min_delta=0.9; no augmentation)

| Language | Dataset | Additional Infos | Losses | Training epochs of best model / Total duration |
|---|---|---|---|---|
| DE | Voxforge | frozen transfer-learning, then training all layers | Test: 37.707958, Validation: 41.832220 | 12+3; Time: 42min |
| DE | Voxforge | without frozen transfer-learning | Test: 36.630890, Validation: 41.208125 | 7; Time: 28min |

Not sure why; maybe some training randomness, because I don't think this should lead to worse results.

@DanBmh (Contributor Author) commented Aug 14, 2020

I ran another test; this time I tried your approach of dropping and reinitializing the last layer.
(As noted above, I normally don't drop the layer when training German; I just train over the English weights.)

| Language | Dataset | Additional Infos | Losses | Training epochs of best model / Total duration |
|---|---|---|---|---|
| DE | Voxforge | dropped last layer | Test: 42.516270, Validation: 47.105518 | 8; Time: 28min |
| DE | Voxforge | with frozen transfer-learning in two steps | Test: 36.600590, Validation: 40.640134 | 14+8; Time: 42min |

Here you can see an improvement when using the frozen transfer-learning approach.
(A note on the dataset: Voxforge has 31h; I'm using about 5h each for dev and test and the rest for training, so it's quite small.)


So I would say that if the network architecture did not change, it's faster to train directly with the English weights (no dropping, no freezing), but if the network had to be changed (different alphabet), it's better to train in two steps with the frozen network.

Of course we would need to do some more tests before we can say this for sure.

@lissyx (Collaborator) commented Aug 14, 2020

Not being judgmental, but I think we'll at least wait until after 1.0 to merge that.

@JRMeyer (Contributor) commented Aug 14, 2020

@DanBmh -- even though my previous research with deepspeech seems to point to frozen-transfer not working, I still think this feature should be integrated. Your two-step approach makes perfect intuitive sense, and there's work from computer vision and NLP that shows frozen transfer works very well.

So, I think the feature would be useful, but @lissyx is right, this is a post-1.0 feature.

@JRMeyer (Contributor) commented Aug 14, 2020

@DanBmh -- I don't see why you need a new flag for load_frozen_graph. For reference, this is how I implemented transfer-learning + freezing layers before: https://github.com/mozilla/STT/blob/transfer-learning2/DeepSpeech.py#L264-L278

@DanBmh (Contributor Author) commented Aug 14, 2020

I don't see why you need a new flag for `load_frozen_graph`

The problem is that when loading the checkpoint for the second training, the frozen layers have no Adam variables. They were not loaded/saved in the first training because they were not used, so we have to reinitialize them.
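A minimal sketch of the idea behind load_frozen_graph (the helper name and error handling are illustrative, not the PR's exact code): Adam slot variables for the previously frozen layers are missing from the checkpoint, so they are re-initialized instead of restored, while everything else is loaded as usual.

import tensorflow.compat.v1 as tfv1

def split_vars_for_restore(load_vars, vars_in_ckpt):
    """Split variables into those restored from the checkpoint and the
    missing Adam slot variables that have to be re-initialized."""
    to_restore, to_init = [], []
    for v in load_vars:
        if v.op.name in vars_in_ckpt:
            to_restore.append(v)
        elif "Adam" in v.op.name:
            to_init.append(v)  # slot was never created during the frozen run
        else:
            raise ValueError("Variable missing from checkpoint: %s" % v.op.name)
    return to_restore, to_init

# Usage (sketch): restore `to_restore` with a tfv1.train.Saver and run
# session.run(tfv1.variables_initializer(to_init)) for the rest.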

@JRMeyer (Contributor) left a comment

@DanBmh -- Firstly, this is a great addition and I'd like to see it merged!

Here are some issues I have with the current implementation:

(1) I don't foresee any scenario where the frozen layers are not identical to the transferred layers. For example, if I'm transferring four layers from model A and re-initializing the last two layers for model B, then if I freeze any layers from model A, I would freeze all four of them, not just a subset.

In this case, we can change the freeze_source_layers flag to a boolean, and just use the drop_source_layers flag to deduce which layers to freeze. Otherwise, we are basically duplicating the drop_source_layers flag when we don't have to.

(2) I also really don't like the load_frozen_graph flag, because it requires you to know the training history of some checkpoints. Is there a way to get around this? Can we maybe re-initialize the Adam tensors before the model is saved to disk, after a training run? I think this flag will trip a lot of people up in use. If someone gives me a checkpoint to use as a starting point for further transfer-learning, I don't expect to have to know anything about how those checkpoints were trained.
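A rough sketch of what suggestion (2) might look like (hypothetical helper, not code from this PR), under the assumption that the unused Adam slot variables still exist in the graph at save time: initialize them once before writing the final checkpoint, so later runs can restore it without knowing its training history.

import tensorflow.compat.v1 as tfv1

def init_unused_adam_slots(session):
    """Initialize any Adam slot variables that were never touched during training."""
    adam_vars = [v for v in tfv1.global_variables() if "Adam" in v.op.name]
    # report_uninitialized_variables returns the op names of uninitialized variables
    uninit_names = {n.decode() for n in session.run(
        tfv1.report_uninitialized_variables(adam_vars))}
    to_init = [v for v in adam_vars if v.op.name in uninit_names]
    if to_init:
        session.run(tfv1.variables_initializer(to_init))
    return to_init

# Called once right before the final saver.save(...), this would make the saved
# checkpoint look like an ordinary (non-frozen) one.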

@DanBmh (Contributor Author) commented Aug 18, 2020

  1. I currently use this for my German training (0 dropped, 1 frozen). I think this makes sense for all languages with the same alphabet as English and similar pronunciation. It could also be used for transfer learning of dialects: instead of randomly initialized weights you would get somewhat matching weights at the beginning of training.

  2. You're right about this one, and I think reinitialization after training is a good idea. I hope I can find some time in the next few days to test it.

@lissyx (Collaborator) commented Aug 28, 2020

@DanBmh Please don't let this sink, if you have some time to rebase :)

@lissyx (Collaborator) commented Aug 28, 2020

Not being judgmental, but I think we'll at least wait until after 1.0 to merge that.

We might change our opinion here, right @reuben?

@DanBmh (Contributor Author) commented Aug 29, 2020

@JRMeyer what do you think about reinitializing tensors named "Adam" by default if they are missing, with an additional message telling users that they were reinitialized because they are not in the checkpoint?

At the moment I can't think of an elegant way to reinitialize the Adam tensors before checkpoint saving, because we would need to reinitialize them every time before we save a checkpoint (some people might want to load intermediate checkpoints for various reasons).

@reuben (Contributor) commented Aug 29, 2020

Not being judgmental, but I think we'll at least wait until after 1.0 to merge that.

We might change our opinion here, right @reuben?

Yeah, I think it should be fine to land this once the comments here have been addressed. It's also a good opportunity to make the training functionality more fine-grained, instead of the huge train.py which can do a billion different things depending on the combination of flags that's passed. It's really hard to reason about what a training call is going to do unless you're deeply familiar with the code.

@DanBmh force-pushed the freezed_transfer branch 2 times, most recently from 67082eb to 4b20d71, on October 1, 2020
@DanBmh requested a review from JRMeyer on October 1, 2020
@reuben (Contributor) commented Oct 19, 2020

@JRMeyer ping

@ftyers (Collaborator) left a comment

I've gone over it and asked some questions. Some of these might be things that really need to be changed, and others are just suggestions. Some of them might also not be valid, e.g. there might be reasons why you've done it the way you have that I am not aware of.

import tensorflow.compat.v1 as tfv1

from .flags import FLAGS
from .logging import log_info, log_error, log_warn
from .logging import log_error, log_info, log_warn
Collaborator:

Why change the order here?

Contributor Author:

Autosort?

Collaborator:

I don't know what the DS policy is for that, I'd have to ask.

print_msg = True
if print_msg:
    msg = "Some Adam tensors are missing, they will be initialized automatically."
    log_info(msg)
Collaborator:

Can you not do just log_info("...") ?

Collaborator:

Maybe something like

missing = []
if ...:
    missing.append(v)

if missing:
    for v in missing:
        log_info("Missing... {}".format(v))

Contributor Author:

The split into two parts was an autoformatting issue: black would have split this into three lines.

I didn't print every missing layer because there are messages later that state exactly which layers were reinitialized.

@@ -93,6 +93,7 @@ def create_flags():
    # Transfer Learning

    f.DEFINE_integer('drop_source_layers', 0, 'single integer for how many layers to drop from source model (to drop just output == 1, drop penultimate and output ==2, etc)')
    f.DEFINE_integer('freeze_source_layers', 0, 'use same value as above to freeze the other layers')
Collaborator:

This description is a bit vague, what is it trying to say?

Contributor Author:

Intended use is together with the 'drop_source_layers' flag: if we want to drop the last layer, we can freeze the rest of the net by setting 'freeze_source_layers' to the same value of 1.

Collaborator:

Perhaps you could be more explicit in the help text, e.g. say that you have to use the --drop_source_layers. Maybe say something like "freeze N remaining layers after using --drop_source_layers"? But even that is a bit unclear. Try and describe what the flag does, and then maybe give an example of how it can be used together with --drop_source_layers.

Contributor Author:

Better now?

# Filter out layers if we want to freeze some
if FLAGS.freeze_source_layers > 0:
    filter_vars = drop_freeze_number_to_layers(FLAGS.freeze_source_layers, "freeze")
    new_train_vars = list(train_vars)
Collaborator:

Is there a reason not to build up new_train_vars from empty, something like:

new_train_vars = []
for tv in train_vars:
    if tv.name not in filter_vars:
        new_train_vars.append(tv)
train_vars = new_train_vars

Contributor Author:

It seemed more intuitive: we want to train everything except the filtered layers.
Your example doesn't work, by the way, because filter_vars contains names like layer_1 while train_vars have the full layer name layer_1:dense:0.

Collaborator:

Hence the "something like" :) -- You could of course use find() or something like that. I have no particularly strong feeling about it, but err on the side of simplicity.
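A minimal sketch of that substring-based variant, assuming filter_vars holds short layer prefixes ("layer_1") while train_vars holds the full variable names (names are illustrative, not the PR's final code):

new_train_vars = []
for tv in train_vars:
    # keep the variable only if it does not belong to a frozen layer
    if not any(layer_name in tv.name for layer_name in filter_vars):
        new_train_vars.append(tv)
train_vars = new_train_vars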

# Compute gradients for model parameters using tower's mini-batch
gradients = optimizer.compute_gradients(avg_loss)
gradients = optimizer.compute_gradients(avg_loss, var_list=train_vars)
Collaborator:

I'd like someone else to take a look at this.

@@ -19,32 +19,33 @@ def _load_checkpoint(session, checkpoint_path, allow_drop_layers, allow_lr_init=
# compatibility with older checkpoints.
lr_var = set(v for v in load_vars if v.op.name == 'learning_rate')
if lr_var and ('learning_rate' not in vars_in_ckpt or
(FLAGS.force_initialize_learning_rate and allow_lr_init)):
(FLAGS.force_initialize_learning_rate and allow_lr_init)):
Collaborator:

Spacing only change...

Contributor Author:

Fixing: PEP 8: E127 continuation line over-indented for visual indent

@ahmed82 left a comment

Are we replacing TensorFlow?
