
Cannot train properly #148

Closed
xz-keg opened this issue Jul 12, 2017 · 12 comments

@xz-keg

xz-keg commented Jul 12, 2017

The example model is supposed to get a good result, but it translates badly even though I followed the exact steps of the walkthrough. In what case would the model fail to reach a reasonable performance?

INFO:tensorflow:Inference results INPUT: Protesters on Black Friday demanded a salary increase and complained that the cost of medical insurance provided by the corporation went from 30 to 100 dollars a month.
INFO:tensorflow:Inference results OUTPUT: Das bedeutet, dass die meisten unserer Mitarbeiter und Mitarbeiter in der Lage sein werden, ihre Aufgaben zu erfüllen.
INFO:tensorflow:Inference results INPUT: Among these projects, he mentioned that five are in Peru and are located in the transverse axes of its territory, between the coast and Brazil, and two focus on increased connection with Ecuador, although he gave no further details.
INFO:tensorflow:Inference results OUTPUT: Das Hotel befindet sich in der Nähe des Hotels, nur wenige Minuten von der U-Bahnstation entfernt.

@lukaszkaiser
Contributor

How is your decoding set up? What are the decoding_beam_size and decoding_alpha? What were the actual eval results from your model, i.e. what do you mean by "good"?

@xz-keg
Author

xz-keg commented Jul 13, 2017

I decode using the parameters from the guide (BEAM_SIZE=4, ALPHA=0.6), but it outputs nothing when decoding "hello world" or "goodbye world".
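
(For reference, a sketch of how the beam size and alpha are usually passed at decode time. This uses the newer t2t-decoder entry point and the $-variables from the walkthrough; older t2t releases passed similar decode flags to t2t-trainer instead, and spellings such as --problem vs. --problems vary between versions, so treat this only as a sketch:)

```
BEAM_SIZE=4
ALPHA=0.6

t2t-decoder \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
  --decode_from_file=$DECODE_FILE
```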

What worries me is that the trained model only outputs random sequences instead of translations. Below are examples produced when the training process finished.

INFO:tensorflow:Inference results INPUT: Protesters on Black Friday demanded a salary increase and complained that the cost of medical insurance provided by the corporation went from 30 to 100 dollars a month.
INFO:tensorflow:Inference results OUTPUT: Das bedeutet, dass die meisten unserer Mitarbeiter und Mitarbeiter in der Lage sein werden, ihre Aufgaben zu erfüllen.
INFO:tensorflow:Inference results INPUT: Among these projects, he mentioned that five are in Peru and are located in the transverse axes of its territory, between the coast and Brazil, and two focus on increased connection with Ecuador, although he gave no further details.
INFO:tensorflow:Inference results OUTPUT: Das Hotel befindet sich in der Nähe des Hotels, nur wenige Minuten von der U-Bahnstation entfernt

@lukaszkaiser
Contributor

What were your eval scores (--eval_steps=10 --train_steps=0)?
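
(That is, re-run t2t-trainer on the trained output_dir with training disabled so it only reports the evaluation metrics. A minimal sketch, assuming the same variables as the training run; flag spellings such as --problems vs. --problem vary between t2t versions:)

```
t2t-trainer \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --train_steps=0 \
  --eval_steps=10
# For a properly trained translation model, metrics such as accuracy and
# neg_log_perplexity should be clearly better than chance.
```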

@xz-keg
Author

xz-keg commented Jul 13, 2017

A little higher than totally random, I think.

@lukaszkaiser
Contributor

Then it's no wonder the inference doesn't work. You first need to train your model and check the evals.

@xz-keg
Author

xz-keg commented Jul 13, 2017

Yes, I just wonder why following the guide exactly leads to such a result. The model seems to have trained for 250k iterations.

@tobyyouup

@lukaszkaiser I also ran into this problem. I had already trained a good model with a BLEU score of 26.x. But when I upgraded the t2t version and trained a transformer_big model, I encountered two problems: (1) the eval score is 0, the same as issue #121; (2) the same problem described in this issue by @aviczhl2: test decoding produces meaningless output, sometimes just empty for some sentences.

I don't think these problems are coincidental, because different people are hitting them. Are there any ideas?

@lukaszkaiser
Contributor

It might be related to the unicode issues we were trying to correct for python3. I think you need to remove your vocab files and data and re-generate them. It should be worth trying with 1.0.14; I'm re-running now and the output looks non-random, but it's too early to be sure about the result.
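
(Concretely, something along these lines; the exact t2t-datagen flags and the decision to wipe the whole data directory are assumptions here, so adjust them to your setup:)

```
# Delete the old vocabulary and generated TFRecords so they are rebuilt
# with the fixed unicode handling (or delete just the vocab.* files and
# the generated shards inside $DATA_DIR).
rm -rf $DATA_DIR

# Upgrade to the release mentioned above and regenerate the data.
pip install -U tensor2tensor==1.0.14
t2t-datagen \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --problem=$PROBLEM
```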

@xz-keg
Author

xz-keg commented Jul 15, 2017

Re-generation makes sense. Thanks very much.

@mainakchain

mainakchain commented Jun 14, 2018

I created a new summarization problem with a dataset of my own. I followed the new-problem training walkthrough, but whenever I start training, all I get in the train directory is:

  • events.out.tfevents.*
  • flags.txt
  • flags_t2t.txt
  • graph.pbtxt
  • hparams.json

After that, no model or checkpoint is saved. In other words, my problem is not training on the data generated with t2t-datagen. Can anyone please guide me here, @lukaszkaiser @aviczhl2?
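
(For reference, the new-problem walkthrough boils down to roughly the following; my_summarization_problem and $USR_DIR are placeholders for the registered problem class and the directory containing its module, and flag spellings may differ between t2t versions:)

```
# Generate data for the user-registered problem.
t2t-datagen \
  --t2t_usr_dir=$USR_DIR \
  --problem=my_summarization_problem \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR

# Train on it. A healthy run writes checkpoint and model.ckpt-* files
# into $TRAIN_DIR alongside graph.pbtxt and the events files.
t2t-trainer \
  --t2t_usr_dir=$USR_DIR \
  --problem=my_summarization_problem \
  --data_dir=$DATA_DIR \
  --model=transformer \
  --hparams_set=transformer_base \
  --output_dir=$TRAIN_DIR \
  --train_steps=250000
```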

@afrozenator
Contributor

afrozenator commented Jun 14, 2018

Hi @mainakchain -- just to understand you correctly: you are not saying that the tutorial doesn't train; you implemented your own problem class and that fails to train?

Also what do the logs of t2t-trainer say -- maybe the training is just slow?

@mainakchain

Hi @afrozenator, I have been at this for some days and I am not able to train my summarization model. At first I created a new problem and tried registering and training with it. Recently, I reshaped my data into the summarize_cnn_dailymail32k problem format and tried to train with the predefined cnn_dailymail problem. But it never outputs more than the 5 files listed above. (When I train other problems, I do get files like checkpoint and model.ckpt.* some time after training starts with t2t-trainer, as expected.)

To my surprise, training on my data (even with the predefined CNN/Daily Mail problem) consumes all the cores of my system and takes up most of my GPU RAM, without producing any information either on screen or as output files. To dig deeper, I pointed TensorBoard at the output_dir. All it showed was the graph of the Transformer architecture; the scalar curves and the projector were not populated at all.
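
(A quick way to check whether any checkpoint has been written at all; tf.train.latest_checkpoint returns None as long as nothing has been saved to the directory:)

```
ls -lt $TRAIN_DIR
python -c "import tensorflow as tf; print(tf.train.latest_checkpoint('$TRAIN_DIR'))"
```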

Any kind of help is highly appreciated. @lukaszkaiser @aviczhl2
