
Weird changes in version 1.4 [bug] #495

Closed
ricsinaruto opened this issue Dec 29, 2017 · 13 comments
@ricsinaruto
Contributor

There are several weird changes that I have observed after switching from tensor2tensor version 1.3.2 to version 1.4.1. (The TensorFlow version is 1.4.1 (GPU) for both.)

Running exactly the same t2t-trainer command results in vastly different training runs, and I will list all the weird and annoying differences that I have observed here. I have no idea what the cause could be; I don't know whether it is a bug or whether I just have to change some parameters to adapt to the new tensor2tensor version.

The command that I run:

t2t-trainer --t2t_usr_dir=t2t_csaky --generate_data=False --data_dir=data_dir/facebook_ricsibot_character --model=transformer --problems=character_chatbot --hparams_set=transformer_dorka_big_dropout --output_dir=train_dir/trf_big_dropout_facebook_ricsibot_character --train_steps=800000 --keep_checkpoint_max=3 --keep_checkpoint_every_n_hours=1

As you can see, I use my own problem and hparams definitions; however, this shouldn't affect anything, since the code in my registration files is exactly the same for both tensor2tensor versions. Running the above command results in the following changes from version 1.3.2 to 1.4.1:

  • In 1.4.1 I can no longer see any training stats in TensorBoard (loss, learning rate, etc.).
    • I can still see eval stats in 1.4.1, but compared to 1.3.2 there are now two eval folders, one named eval and one named eval_one_pass.
  • In 1.4.1 my output_dir no longer contains the flags.txt and hparams.json files that 1.3.2 produced.
  • In 1.4.1 training runs 2000 steps at a time, and when each round finishes the model is reloaded.
    • This results in 2 checkpoints every 2000 steps (e.g. at steps 2001 and 2002).
  • In 1.4.1 the evaluation wants to run for 10000 steps, compared to 10 steps in 1.3.2.
    • After about 70 steps I get a weird error, but the evaluation still prints metrics.
    • In 1.3.2 the evaluation runs for 10 steps and then prints metrics without any errors.
      [screenshot: evaluation output showing the repeated errors after ~70 steps]

Despite these differences, the actual training runs behave the same: the loss goes down in the same way.

@ricsinaruto changed the title from "Weird changes in version 1.4" to "Weird changes in version 1.4 [bug]" on Dec 29, 2017
@martinpopel
Contributor

I have only a partial answer, about one of the issues:

In 1.4.1 the evaluation wants to run for 10000 steps compared to 10 steps in 1.3.2

That change was my idea: I guess most users expect the evaluation to be done on the whole dev set. Once you fix the batch_size (and the dev set), you can set eval_steps to the number of steps just before the warnings start (78 in your case), which will prevent the warnings.
The warnings are harmless, but of course annoying, and it would be better if they could be silenced. Especially if you use multiple GPUs, they can take several screens per evaluation, which clutters the log. The warnings seem to be related to tensorflow/nmt#125 (they were also present in older t2t versions if you set eval_steps higher than the actual size of the dev set).
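
For example, a minimal sketch using the command from the original report (this assumes your t2t-trainer build exposes the --eval_steps flag and that 78 steps really cover the whole dev set; adjust the number to whatever you see right before the warnings start):

# Sketch: the original flags, plus eval_steps pinned to the dev-set size (78 is the value observed above).
t2t-trainer --t2t_usr_dir=t2t_csaky --generate_data=False --data_dir=data_dir/facebook_ricsibot_character --model=transformer --problems=character_chatbot --hparams_set=transformer_dorka_big_dropout --output_dir=train_dir/trf_big_dropout_facebook_ricsibot_character --train_steps=800000 --keep_checkpoint_max=3 --keep_checkpoint_every_n_hours=1 --eval_steps=78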

Another thing that clutters the log is that, due to the change of the default schedule (from train_and_evaluate to continuous_train_and_eval), there are several screens of initialization output for each evaluation (and for the subsequent training).

@ricsinaruto
Contributor Author

Thanks for the partial answer @martinpopel. You are right, they are warnings; I incorrectly called them errors in my post. So if I understand correctly, in my case 78 steps of evaluation covers my whole dev set?

@martinpopel
Contributor

@ricsinaruto: Yes. Your dev set has about 78 * batch_size tokens, where a token can be a character, a subword or a word, depending on your problem definition.
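
To make that concrete with purely illustrative numbers: if batch_size were 4096 (the value used by the standard transformer hparams sets; a custom set may differ), then 78 evaluation steps would cover roughly 78 * 4096 ≈ 320,000 tokens, which would be the approximate size of the dev set.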

@rsepassi added the bug label on Jan 2, 2018
@rolloff commented Jan 17, 2018

I have the same issue with checkpoints (2 copies are being saved every 2000 steps). @ricsinaruto Did you find a fix?

Also, what is the difference between eval and eval_one_pass? For the same run, the loss can be quite different between the two at the beginning, and metrics like the approximate BLEU score are wildly different early on.

[screenshot: loss curves for eval and eval_one_pass]

@martinpopel
Contributor

2 copies are being saved every 2000 steps

With --schedule=train (i.e. no internal evaluation, because I don't trust approx_bleu) I see just one checkpoint every 2000 steps.
I prefer one checkpoint every hour, using --save_checkpoints_secs=3600; this option was broken in 1.4.2, but it is fixed in #521.
Unfortunately, the Travis CI builds seem to have been completely broken for the last few days (it looks like an internal Travis bug), so I am not sure when this PR will be merged (and released in a new T2T version).
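
For reference, a minimal sketch of that setup (the paths, problem and hparams set are placeholders, not taken from this thread, and --save_checkpoints_secs only works as described once the fix from #521 is in the version you run):

# Sketch: no internal evaluation, one checkpoint per hour; $DATA_DIR, $PROBLEM and $TRAIN_DIR are placeholders.
t2t-trainer --data_dir=$DATA_DIR --problems=$PROBLEM --model=transformer --hparams_set=transformer_big --output_dir=$TRAIN_DIR --schedule=train --save_checkpoints_secs=3600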

@ricsinaruto
Contributor Author

@rolloff I haven't really worked on this since then, so no fix. I think that eval is computed on your full validation dataset, while eval_one_pass uses just one part of it; as in my original screenshot, my full validation set consists of 78 passes. This is just my assumption, though.

@rsepassi
Contributor

v1.4.2 should have TensorBoard metrics back, as well as hparams.json and flags.txt.

@martinpopel
Contributor

So I think the only unsolved bug left in this issue is the one about getting 2 checkpoints every 2000 steps (e.g. at steps 2001 and 2002).

@rolloff commented Jan 19, 2018

Yes, to be clearer: I am using the default schedule, continuous_train_and_eval. The TensorFlow documentation (https://www.tensorflow.org/api_docs/python/tf/contrib/learn/Experiment) mentions that more checkpoints will be saved:

Due to the different approach this schedule takes, it leads to two differences in resource control. First, the resources (e.g., memory) used by training will be released before evaluation (train_and_evaluate takes double resources). Second, more checkpoints will be saved as a checkpoint is generated at the end of each training iteration.

I would be willing to switch to --schedule=train and give up all evaluation metrics, except that I really do care about the validation loss being printed at every checkpoint. Also, @martinpopel, you are saving far fewer checkpoints than me: in a run of 250,000 steps I will save 125, while the same run takes me 30 hours on 1 Tesla V100 with batch size 8192, so you would save only 30. Does checkpointing less often save you a lot of time?

@martinpopel
Contributor

@rolloff: Of course, I also care about validation loss, or rather about BLEU on the dev set. For this purpose I use t2t-bleu and t2t-translate-all, so as a byproduct I also get, for each checkpoint, one file with the translated dev set (so I can re-evaluate it with metrics other than BLEU: chrF3, BEER, etc., or even look at the most differing n-grams if needed). I do the evaluation in parallel (on another machine), so it does not slow down the training. I prefer to keep the checkpoints of interesting experiments in case I want to re-evaluate on a different test set in the future, but in general I use --keep_checkpoint_max and --keep_checkpoint_every_n_hours to keep the disk usage reasonable.
Saving a checkpoint of a big model takes about 30 seconds, so it is negligible, but I still think saving one every 10 minutes (or even every 2000 steps) is too often for serious experiments.
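
As a rough sketch of the scoring step only (the file names are placeholders, t2t-translate-all is used beforehand to decode the dev set with every checkpoint, and the exact flags should be checked against t2t-bleu --help for your version):

# Sketch: score one decoded dev set against the reference; both file names are placeholders.
t2t-bleu --translation=dev.decoded.checkpoint-XXXX.txt --reference=dev.reference.txt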

@rolloff commented Jan 19, 2018

Thank you!

@martinpopel
Contributor

@ricsinaruto: I suggest closing this issue in favor of #556.

@ricsinaruto
Contributor Author

Yes, for the majority of the issues the solutions are in the comments above. The remaining problem of saving two checkpoints every 2000 steps is continued in #556.
