Weird changes in version 1.4 [bug] #495
I have only a partial answer, about one of the issues:
That was my idea, because I guess most users expect the evaluation to be done on the whole dev set. Once you fix the batch_size (and the dev set), you can set eval_steps to the number just before the warnings appear (78 in your case) and this will prevent the warnings. Another thing that clutters the log is that, due to the change of the default schedule (…
Thanks for the partial answer @martinpopel, and you are right, they are warnings; I mistakenly called them errors in my post. So if I understand correctly, in my case 78 steps of evaluation cover my whole dev set?
@ricsinaruto: Yes. Your dev set has about 78 * batch_size tokens, where tokens can be characters, subwords, or words depending on your problem definition.
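As a back-of-the-envelope illustration of that relationship (the batch size below is a hypothetical value, not one reported in this thread):

```bash
# dev-set size in tokens ≈ eval_steps * batch_size
# e.g. if batch_size were 4096 tokens per eval batch:
echo $((78 * 4096))   # 319488, i.e. roughly 320k tokens in the dev set
```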
I have the same issue with checkpoints (2 copies are being saved every 2000 steps). @ricsinaruto Did you find a fix? Also, what is the difference between eval and eval_one_pass? The loss can be quite different in the beginning for the same run, and metrics like the approximate BLEU score are wildly different in the beginning.
With …
@rolloff I haven't really worked on this since then, so no fix. I think that eval is computed on your full validation dataset and eval_one_pass is just one part of it; as in my original screenshot, my full validation set consists of 78 passes. This is just my assumption, though.
v1.4.2 should have TensorBoard metrics back, as well as hparams.json and flags.txt.
So I think the only unsolved bug in this issue now is "This results in having 2 checkpoints at each 2000 steps (2001 and 2002, for example)".
Yes, to be clearer: I am using the default schedule, continuous_train_and_eval. The TensorFlow documentation (https://www.tensorflow.org/api_docs/python/tf/contrib/learn/Experiment) mentions that more checkpoints will be saved:
I would be willing to switch to --schedule=train and give up all evaluation metrics, except that I really do care that the validation loss is printed at every checkpoint. Also, @martinpopel, you are saving a lot fewer checkpoints than me: in a run of 250,000 steps, I will need to save 125. This run takes me 30 hours on 1 Tesla V100 with batch size 8192, so you would save 30. Does checkpointing less often save you a lot of time?
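The checkpoint counts above follow directly from the save interval; a quick sketch of the arithmetic (the 8000-step interval is just an illustrative alternative, not a value taken from this thread — the flag that controls the interval varies between t2t versions):

```bash
# checkpoints saved ≈ train_steps / save interval (in steps)
echo $((250000 / 2000))   # 125 checkpoints at the default 2000-step interval
echo $((250000 / 8000))   # 31 checkpoints if saving every 8000 steps instead
```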
@rolloff: Of course, I also care about the validation loss, or rather about BLEU on the dev set. For this purpose I use …
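The comment is cut off here. One workflow that tensor2tensor ships for dev-set BLEU, and which may be what is meant, is decoding the dev set and scoring it with the t2t-bleu script; the snippet below is only an assumption, with placeholder file names:

```bash
# Assumption: dev.decoded holds the model's decoded dev set,
# dev.target holds the reference translations.
t2t-bleu --translation=dev.decoded --reference=dev.target
```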
Thank you!
@ricsinaruto: I suggest closing this issue in favor of #556.
Yes, for the majority of the issues the solutions are in the comments. The final problem, of saving two checkpoints every 2000 steps, is continued in #556.
There are several weird changes that I have observed after switching from tensor2tensor version 1.3.2 to version 1.4.1 (the TensorFlow version is 1.4.1, GPU, for both).
Running exactly the same t2t-trainer command results in vastly different training behaviour, and I will list all the weird and annoying differences that I have observed here. I have no idea what the problem could be; I don't know whether it is a bug or whether I just have to change some parameters to adapt to the new tensor2tensor version.
The command that I run:
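The command itself did not survive in this copy of the issue; below is only a hedged sketch of what a t2t-trainer call with user-registered problem and hparams definitions generally looked like in the 1.4.x era (every path, problem name, and hparams set here is a placeholder, not the reporter's actual value, and flag names may differ slightly between versions):

```bash
# Placeholders: $USR_DIR holds the custom problem/hparams registrations,
# my_custom_problem / my_custom_hparams stand in for the reporter's own names.
t2t-trainer \
  --t2t_usr_dir=$USR_DIR \
  --problems=my_custom_problem \
  --model=transformer \
  --hparams_set=my_custom_hparams \
  --data_dir=$DATA_DIR \
  --output_dir=$TRAIN_DIR \
  --train_steps=250000
```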
As you can see, I use my own problem and hparams definitions; however, this shouldn't affect anything. In my registration files the code is exactly the same for both tensor2tensor versions. Running the above command results in the following changes from version 1.3.2 to 1.4.1:
Despite these differences, the actual training runs behave the same, i.e. the loss goes down in the same way.