How to measure the number of training epochs #415
In order to compare with other NMT frameworks, I would like to know how many training epochs (i.e. passes over the whole training data) are done at the moment.
I can see the number of training (global) steps and I guess epochs = steps * batch_size / training_subwords.
So the question boils down to: how to make T2T report (e.g. in the log) the number of subwords in the training data?

Comments
Yeah, this seems like a reasonable thing to want, but unfortunately it is not simple to do currently. The variable batch size, caused by bucketing examples by sequence length, complicates the picture. Counting the number of subwords would need a pass through the data on disk, probably best done by a separate script.
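For illustration, a minimal sketch of what such a separate script could look like, assuming a trained T2T subword vocabulary file and plain-text parallel files (the file names are hypothetical, and this is not the script mentioned later in the thread):

```python
# Minimal sketch: count training subwords with a trained T2T subword vocabulary.
# File names are hypothetical; adjust to your own data.
from tensor2tensor.data_generators import text_encoder

VOCAB_FILE = "vocab.translate_ende_wmt32k.32768.subwords"  # hypothetical
SRC_FILE = "train.src"                                      # hypothetical
TGT_FILE = "train.tgt"                                      # hypothetical

encoder = text_encoder.SubwordTextEncoder(VOCAB_FILE)

total_subwords = 0
with open(SRC_FILE) as src_f, open(TGT_FILE) as tgt_f:
    for src_line, tgt_line in zip(src_f, tgt_f):
        # Count each sentence pair as max(source, target) subwords, since that
        # is how T2T budgets batch_size (see the explanation below).
        total_subwords += max(len(encoder.encode(src_line.strip())),
                              len(encoder.encode(tgt_line.strip())))

print("training subwords:", total_subwords)
```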
@rsepassi hi, is "batch_size" the number of subwords of the source and target sentences in a batch?
@yuimo: it is the maximum of source and target subwords, for each sentence. See tensor2tensor/tensor2tensor/utils/data_reader.py, lines 145 to 153 (commit 92983ea).
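In other words, the length of an example for the purpose of batching by length is the maximum over its features; a simplified illustration (not the actual data_reader.py code):

```python
# Simplified illustration of the rule above (the real logic is in data_reader.py).
def example_length(src_subword_ids, tgt_subword_ids):
    # An example "costs" as many tokens as its longer side, so a batch with
    # batch_size=4096 subwords holds roughly 4096 / example_length examples.
    return max(len(src_subword_ids), len(tgt_subword_ids))
```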
@martinpopel I got it, thanks a lot.
@martinpopel, you meant to write: epochs = steps * batch_size * worker_gpu / training_subwords, right?
Yes, exactly. In other words, the effective batch size is batch_size * worker_gpu. As for counting the subwords, I wrote a simple script t2t_text2subwords.py for computing the number of subwords in train/test data, but I did not have enough time to tidy it up, document it, and send it as a PR.
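To make the units concrete, a worked example with made-up numbers (all values below are hypothetical):

```python
# Hypothetical values, only to illustrate the formula above.
steps = 250000             # global training steps
batch_size = 4096          # subwords per batch, per GPU
worker_gpu = 8             # number of GPUs
training_subwords = 500e6  # subwords in training data (max(src, tgt) per sentence)

epochs = steps * batch_size * worker_gpu / training_subwords
print("approx. epochs:", round(epochs, 1))  # ~16.4
```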
@martinpopel It would be nice if T2T showed in TensorBoard how many epochs were done during training. Do you know of any rules of thumb for how many epochs should be done for NMT tasks? In addition, do you know whether T2T iterates over the training data in a deterministic or a randomized way (i.e., should two training invocations yield the exact same model)?
Yes, that would be nice, but there are two problems:
1. The standard and naive answer is "until converged on the dev set", but this is difficult to measure (how to set the early-stopping parameters) and to achieve. My training data has about half a gigaword, and even 18 epochs (11 days of training on 8 GPUs) were not enough to reach the highest possible BLEU.
2. It should be randomized and deterministic (thanks to the fixed random seed), but I am waiting for the ultimate answer from the T2T authors; see #556 (comment) and the posts below.
@martinpopel, why not simply go with the % of sentences (examples) completed, instead of subwords? If we have completed 100% of the examples in the training data, we have reached 1.0 epochs, and so on.
@NadavB T2T computes the batch size in subwords, not in sentences (see above), so the subword count is what is readily available for this estimate.
I probably don't understand something. Why do we care about subwords when we talk about epochs?
Yes, you can implement such a counter and send a PR. That would be great (and more precise than my subword-based estimates, which are biased because they do not take zero-padding into account).
Where should I modify the code that sets the training steps?
I don't know the exact location for adding the epoch counter (but I have not spent much time searching for it), otherwise I would do it myself. Maybe it is possible to solve it with a hook in …
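For what it is worth, such a counter could roughly take the shape of a generic tf.train.SessionRunHook like the sketch below. This is not existing T2T code: the class name, the subwords-per-step estimate, and where it would be registered in T2T are all assumptions.

```python
import tensorflow as tf  # TF 1.x API, as used by T2T


class EpochEstimateHook(tf.train.SessionRunHook):
    """Logs a rough epoch estimate: steps * subwords_per_step / total_subwords.

    subwords_per_step would be batch_size * worker_gpu, which is an upper
    bound because padding tokens are counted as if they were real subwords.
    """

    def __init__(self, subwords_per_step, total_training_subwords, every_n_steps=100):
        self._subwords_per_step = subwords_per_step
        self._total = float(total_training_subwords)
        self._every_n = every_n_steps

    def begin(self):
        self._global_step = tf.train.get_global_step()

    def before_run(self, run_context):
        # Fetch the global step alongside the normal training fetches.
        return tf.train.SessionRunArgs(self._global_step)

    def after_run(self, run_context, run_values):
        step = run_values.results
        if step and step % self._every_n == 0:
            epochs = step * self._subwords_per_step / self._total
            tf.logging.info("global step %d ~ %.2f epochs", step, epochs)
```

The remaining part, which I have not located, is exactly where T2T builds the list of training hooks passed to its Estimator; that is where something like this would have to be registered.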
@martinpopel By "number of training subwords" do you mean the sum of all source-text subwords plus all target-text subwords used for training?
@DonPex: No. It is the maximum of source and target subwords, for each sentence. See the discussion above.
@martinpopel Thank you. I used your script to compute the number of subwords, but you said that it is only an estimate because of padding tokens. I am using Google Colab, so I would like to know whether it is possible to train a Transformer for at least one epoch in 12 hours (the maximum time allowed on Colab) with a custom dataset, using a specific batch size.
Yes, considering nonpadding_fraction should result in a more precise estimate.
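For example, the per-step subword throughput can be discounted by the observed non-padding fraction. All numbers below are illustrative, and nonpadding_fraction stands for whatever fraction of real (non-padding) tokens you observe in your own runs:

```python
# Padding-corrected variant of the epoch estimate (illustrative numbers only).
steps = 100000
batch_size = 4096
worker_gpu = 1
nonpadding_fraction = 0.85   # observed share of non-padding tokens per batch
training_subwords = 60e6

real_subwords_per_step = batch_size * worker_gpu * nonpadding_fraction
epochs = steps * real_subwords_per_step / training_subwords
print("approx. epochs:", round(epochs, 1))  # ~5.8
```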
My goal is just to feed the model as many of the subwords in the dataset as possible, so if I cannot complete one epoch in less than 12 hours, I will have to start another Colab session and resume training from another random part of the data; this way I may skip some fractions of the dataset due to randomness.
@martinpopel may I ask something regarding the above formula? When training on a single TPU (v2), is the effective_batch_size equal to the batch_size, or to …
@coder1248: I would guess …
Thanks again for your help, @martinpopel!