How to measure the number of training epochs #415
In order to compare with other NMT frameworks, I would like to know how many training epochs (i.e. passes over the whole training data) are done at the moment.
I can see the number of training (global) steps and I guess epochs = steps * batch_size / training_subwords.
So the question boils down to: how to make T2T report (e.g. in the log) the number of subwords in the training data?

Comments
Yeah, this seems like a reasonable thing to want, but unfortunately it is not simple to do currently. The variable batch size, caused by bucketing examples by sequence length, complicates the picture. Counting the number of subwords would need a pass through the data on disk, probably best done by a separate script.
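For illustration, a minimal sketch of what such a separate script could look like, assuming a trained T2T subword vocabulary file and plain-text parallel files (the file names are hypothetical, and this is not the script mentioned later in the thread):

```python
# Minimal sketch: count training subwords with a trained T2T subword vocabulary.
# File names are hypothetical; adjust to your own data.
from tensor2tensor.data_generators import text_encoder

VOCAB_FILE = "vocab.translate_ende_wmt32k.32768.subwords"  # hypothetical
SRC_FILE = "train.src"                                      # hypothetical
TGT_FILE = "train.tgt"                                      # hypothetical

encoder = text_encoder.SubwordTextEncoder(VOCAB_FILE)

total_subwords = 0
with open(SRC_FILE) as src_f, open(TGT_FILE) as tgt_f:
    for src_line, tgt_line in zip(src_f, tgt_f):
        # Count each sentence pair as max(source, target) subwords, since that
        # is how T2T budgets batch_size (see the explanation below).
        total_subwords += max(len(encoder.encode(src_line.strip())),
                              len(encoder.encode(tgt_line.strip())))

print("training subwords:", total_subwords)
```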
@rsepassi hi, is "batch_size" the number of subwords of the source and target sentences in a batch?
@yuimo: it is the maximum of source and target subwords, for each sentence. See tensor2tensor/tensor2tensor/utils/data_reader.py, lines 145 to 153 (commit 92983ea).
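In other words, the length of an example for the purpose of batching by length is the maximum over its features; a simplified illustration (not the actual data_reader.py code):

```python
# Simplified illustration of the rule above (the real logic is in data_reader.py).
def example_length(src_subword_ids, tgt_subword_ids):
    # An example "costs" as many tokens as its longer side, so a batch with
    # batch_size=4096 subwords holds roughly 4096 / example_length examples.
    return max(len(src_subword_ids), len(tgt_subword_ids))
```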
@martinpopel I got it, thanks a lot.
@martinpopel, you meant to write: epochs = steps * batch_size * worker_gpu / training_subwords, right?
Yes, exactly. In other words, the effective batch size is batch_size * worker_gpu. As for counting the subwords, I wrote a simple script t2t_text2subwords.py for computing the number of subwords in train/test data, but I did not have enough time to tidy it up, document it, and send it as a PR.
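To make the units concrete, a worked example with made-up numbers (all values below are hypothetical):

```python
# Hypothetical values, only to illustrate the formula above.
steps = 250000             # global training steps
batch_size = 4096          # subwords per batch, per GPU
worker_gpu = 8             # number of GPUs
training_subwords = 500e6  # subwords in training data (max(src, tgt) per sentence)

epochs = steps * batch_size * worker_gpu / training_subwords
print("approx. epochs:", round(epochs, 1))  # ~16.4
```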
@martinpopel It would be nice if T2T showed in TensorBoard how many epochs were done during training. Do you know of any rules of thumb for how many epochs should be done for NMT tasks? In addition, do you know whether T2T iterates over the training data in a deterministic or a randomized way (i.e., should two training invocations yield the exact same model)?
Yes, that would be nice, but there are two problems:
1. The standard and naive answer is "until converged on the dev set", but this is difficult to measure (how to set the early-stopping parameters) and to achieve. My training data has about half a gigaword, and even 18 epochs (11 days of training on 8 GPUs) were not enough to reach the highest possible BLEU.
2. It should be randomized and deterministic (thanks to the fixed random seed), but I am waiting for the ultimate answer from the T2T authors; see #556 (comment) and the posts below.
@martinpopel, why not simply go with the % of sentences (examples) completed, instead of subwords? If we have completed 100% of the examples in the training data, we have reached 1.0 epochs, and so on.
@NadavB T2T computes the batch size in subwords, not in sentences (see above), so the subword count is what is readily available for this estimate.
I probably don't understand something. Why do we care about subwords when we talk about epochs?
Yes, you can implement such a counter and send a PR. That would be great (and more precise than my subword-based estimates, which are biased because they do not take zero-padding into account).
Where should I modify the code that sets the training steps?
I don't know the exact location for adding the epoch counter (but I have not spent much time searching for it), otherwise I would do it myself. Maybe it is possible to solve it with a hook in …
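For what it is worth, such a counter could roughly take the shape of a generic tf.train.SessionRunHook like the sketch below. This is not existing T2T code: the class name, the subwords-per-step estimate, and where it would be registered in T2T are all assumptions.

```python
import tensorflow as tf  # TF 1.x API, as used by T2T


class EpochEstimateHook(tf.train.SessionRunHook):
    """Logs a rough epoch estimate: steps * subwords_per_step / total_subwords.

    subwords_per_step would be batch_size * worker_gpu, which is an upper
    bound because padding tokens are counted as if they were real subwords.
    """

    def __init__(self, subwords_per_step, total_training_subwords, every_n_steps=100):
        self._subwords_per_step = subwords_per_step
        self._total = float(total_training_subwords)
        self._every_n = every_n_steps

    def begin(self):
        self._global_step = tf.train.get_global_step()

    def before_run(self, run_context):
        # Fetch the global step alongside the normal training fetches.
        return tf.train.SessionRunArgs(self._global_step)

    def after_run(self, run_context, run_values):
        step = run_values.results
        if step and step % self._every_n == 0:
            epochs = step * self._subwords_per_step / self._total
            tf.logging.info("global step %d ~ %.2f epochs", step, epochs)
```

The remaining part, which I have not located, is exactly where T2T builds the list of training hooks passed to its Estimator; that is where something like this would have to be registered.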
@martinpopel By "number of training subwords" do you mean the sum of all source-text subwords plus all target-text subwords used for training?
@DonPex: No. It is the maximum of source and target subwords, for each sentence. See the discussion above.
@martinpopel Thank you. I used your script to compute the number of subwords, but you said that it is only an estimate because of padding tokens. I am using Google Colab, so I would like to know whether it is possible to train a Transformer for at least one epoch in 12 hours (the maximum time allowed on Colab) with a custom dataset, using a specific batch size.
Yes, considering nonpadding_fraction should result in a more precise estimate.
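For example, the per-step subword throughput can be discounted by the observed non-padding fraction. All numbers below are illustrative, and nonpadding_fraction stands for whatever fraction of real (non-padding) tokens you observe in your own runs:

```python
# Padding-corrected variant of the epoch estimate (illustrative numbers only).
steps = 100000
batch_size = 4096
worker_gpu = 1
nonpadding_fraction = 0.85   # observed share of non-padding tokens per batch
training_subwords = 60e6

real_subwords_per_step = batch_size * worker_gpu * nonpadding_fraction
epochs = steps * real_subwords_per_step / training_subwords
print("approx. epochs:", round(epochs, 1))  # ~5.8
```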
My goal is just to feed the model as many of the subwords in the dataset as possible, so if I cannot complete one epoch in less than 12 hours, I will have to start another Colab session and resume training from another random part of the data; this way I may skip some fractions of the dataset due to randomness.
@martinpopel may I ask something regarding the above formula? When training on a single TPU (v2), is the effective_batch_size equal to the batch_size, or to …
@coder1248: I would guess …
Thanks again for your help, @martinpopel!