
Weird changes in version 1.4 [bug] #495

Closed
ricsinaruto opened this issue Dec 29, 2017 · 13 comments
@ricsinaruto
Contributor

There are several weird changes that I have observed after switching from tensor2tensor version 1.3.2 to version 1.4.1. (The TensorFlow version is 1.4.1 (GPU) for both.)

Running exactly the same t2t-trainer command results in vastly different training runs, and I will list all the weird and annoying differences that I have observed here. I have no idea what the cause could be; I don't know whether it is a bug or whether I just have to change some parameters to adapt to the new tensor2tensor version.

The command that I run:

t2t-trainer --t2t_usr_dir=t2t_csaky --generate_data=False --data_dir=data_dir/facebook_ricsibot_character --model=transformer --problems=character_chatbot --hparams_set=transformer_dorka_big_dropout --output_dir=train_dir/trf_big_dropout_facebook_ricsibot_character --train_steps=800000 --keep_checkpoint_max=3 --keep_checkpoint_every_n_hours=1

As you can see, I use my own problem and hparams definitions; however, this shouldn't affect anything, since the code in my registration files is exactly the same for both tensor2tensor versions. Running the above command results in the following changes from version 1.3.2 to 1.4.1:

  • In 1.4.1 I can no longer see any training stats in TensorBoard (loss, learning rate, etc.).
    • I can still see eval stats in 1.4.1, but compared to 1.3.2 there are now two eval folders, one named eval and one named eval_one_pass.
  • In 1.4.1 my output_dir no longer contains the flags.txt and hparams.json files that 1.3.2 produced.
  • In 1.4.1 training runs 2000 steps at a time, and when each round finishes the model is reloaded.
    • This results in 2 checkpoints every 2000 steps (e.g. at steps 2001 and 2002).
  • In 1.4.1 the evaluation wants to run for 10000 steps, compared to 10 steps in 1.3.2.
    • After about 70 steps I get a weird error, but the evaluation still prints metrics.
    • In 1.3.2 the evaluation runs for 10 steps and then prints metrics without any errors.
      [screenshot: evaluation output showing the repeated errors after ~70 steps]

Despite these differences, the actual training runs behave the same: the loss goes down in the same way.

@ricsinaruto changed the title from "Weird changes in version 1.4" to "Weird changes in version 1.4 [bug]" on Dec 29, 2017
@martinpopel
Contributor

I have only a partial answer, about one of the issues:

In 1.4.1 the evaluation wants to run for 10000 steps compared to 10 steps in 1.3.2

That change was my idea: I guess most users expect the evaluation to be done on the whole dev set. Once you fix the batch_size (and the dev set), you can set eval_steps to the number of steps just before the warnings start (78 in your case), which will prevent the warnings.
The warnings are harmless, but of course annoying, and it would be better if they could be silenced. Especially if you use multiple GPUs, they can take several screens per evaluation, which clutters the log. The warnings seem to be related to tensorflow/nmt#125 (they were also present in older t2t versions if you set eval_steps higher than the actual size of the dev set).
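
For example, a minimal sketch using the command from the original report (this assumes your t2t-trainer build exposes the --eval_steps flag and that 78 steps really cover the whole dev set; adjust the number to whatever you see right before the warnings start):

# Sketch: the original flags, plus eval_steps pinned to the dev-set size (78 is the value observed above).
t2t-trainer --t2t_usr_dir=t2t_csaky --generate_data=False --data_dir=data_dir/facebook_ricsibot_character --model=transformer --problems=character_chatbot --hparams_set=transformer_dorka_big_dropout --output_dir=train_dir/trf_big_dropout_facebook_ricsibot_character --train_steps=800000 --keep_checkpoint_max=3 --keep_checkpoint_every_n_hours=1 --eval_steps=78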

Another thing that clutters the log is that, due to the change of the default schedule (from train_and_evaluate to continuous_train_and_eval), there are several screens of initialization output for each evaluation (and for the subsequent training).

@ricsinaruto
Contributor Author

Thanks for the partial answer @martinpopel. You are right, they are warnings; I incorrectly called them errors in my post. So if I understand correctly, in my case 78 steps of evaluation covers my whole dev set?

@martinpopel
Contributor

@ricsinaruto: Yes. Your dev set has about 78 * batch_size tokens, where a token can be a character, a subword or a word, depending on your problem definition.
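
To make that concrete with purely illustrative numbers: if batch_size were 4096 (the value used by the standard transformer hparams sets; a custom set may differ), then 78 evaluation steps would cover roughly 78 * 4096 ≈ 320,000 tokens, which would be the approximate size of the dev set.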

@rsepassi added the bug label on Jan 2, 2018
@rolloff commented Jan 17, 2018

I have the same issue with checkpoints (2 copies are being saved every 2000 steps). @ricsinaruto Did you find a fix?

Also, what is the difference between eval and eval_one_pass? For the same run, the loss can be quite different between the two at the beginning, and metrics like the approximate BLEU score are wildly different early on.

[screenshot: loss curves for eval and eval_one_pass]

@martinpopel
Contributor

2 copies are being saved every 2000 steps

With --schedule=train (i.e. no internal evaluation, because I don't trust approx_bleu) I see just one checkpoint every 2000 steps.
I prefer one checkpoint every hour, using --save_checkpoints_secs=3600; this option was broken in 1.4.2, but it is fixed in #521.
Unfortunately, the Travis CI builds seem to have been completely broken for the last few days (it looks like an internal Travis bug), so I am not sure when this PR will be merged (and released in a new T2T version).
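
For reference, a minimal sketch of that setup (the paths, problem and hparams set are placeholders, not taken from this thread, and --save_checkpoints_secs only works as described once the fix from #521 is in the version you run):

# Sketch: no internal evaluation, one checkpoint per hour; $DATA_DIR, $PROBLEM and $TRAIN_DIR are placeholders.
t2t-trainer --data_dir=$DATA_DIR --problems=$PROBLEM --model=transformer --hparams_set=transformer_big --output_dir=$TRAIN_DIR --schedule=train --save_checkpoints_secs=3600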

@ricsinaruto
Contributor Author

@rolloff I haven't really worked on this since then, so no fix. I think that eval is computed on your full validation dataset, while eval_one_pass uses just one part of it; as in my original screenshot, my full validation set consists of 78 passes. This is just my assumption, though.

@rsepassi
Contributor

v1.4.2 should have TensorBoard metrics back, as well as hparams.json and flags.txt.

@martinpopel
Contributor

So I think the only unsolved bug left in this issue is the one about getting 2 checkpoints every 2000 steps (e.g. at steps 2001 and 2002).

@rolloff commented Jan 19, 2018

Yes, to be clearer: I am using the default schedule, continuous_train_and_eval. The TensorFlow documentation (https://www.tensorflow.org/api_docs/python/tf/contrib/learn/Experiment) mentions that more checkpoints will be saved:

Due to the different approach this schedule takes, it leads to two differences in resource control. First, the resources (e.g., memory) used by training will be released before evaluation (train_and_evaluate takes double resources). Second, more checkpoints will be saved as a checkpoint is generated at the end of each training iteration.

I would be willing to switch to --schedule=train and give up all evaluation metrics, except that I really do care about the validation loss being printed at every checkpoint. Also, @martinpopel, you are saving far fewer checkpoints than me: in a run of 250,000 steps I will save 125, while the same run takes me 30 hours on 1 Tesla V100 with batch size 8192, so you would save only 30. Does checkpointing less often save you a lot of time?

@martinpopel
Contributor

@rolloff: Of course, I also care about validation loss, or rather about BLEU on the dev set. For this purpose I use t2t-bleu and t2t-translate-all, so as a byproduct I also get, for each checkpoint, one file with the translated dev set (so I can re-evaluate it with metrics other than BLEU: chrF3, BEER, etc., or even look at the most differing n-grams if needed). I do the evaluation in parallel (on another machine), so it does not slow down the training. I prefer to keep the checkpoints of interesting experiments in case I want to re-evaluate on a different test set in the future, but in general I use --keep_checkpoint_max and --keep_checkpoint_every_n_hours to keep the disk usage reasonable.
Saving a checkpoint of a big model takes about 30 seconds, so it is negligible, but I still think saving one every 10 minutes (or even every 2000 steps) is too often for serious experiments.
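
As a rough sketch of the scoring step only (the file names are placeholders, t2t-translate-all is used beforehand to decode the dev set with every checkpoint, and the exact flags should be checked against t2t-bleu --help for your version):

# Sketch: score one decoded dev set against the reference; both file names are placeholders.
t2t-bleu --translation=dev.decoded.checkpoint-XXXX.txt --reference=dev.reference.txt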

@rolloff commented Jan 19, 2018

Thank you!

@martinpopel
Contributor

@ricsinaruto: I suggest closing this issue in favor of #556.

@ricsinaruto
Contributor Author

Yes, for the majority of the issues the solutions are in the comments above. The remaining problem of saving two checkpoints every 2000 steps is continued in #556.
