
Cannot train properly #148

Closed
xz-keg opened this issue Jul 12, 2017 · 12 comments

@xz-keg

xz-keg commented Jul 12, 2017

The example model is supposed to get a good result, but it translates badly even though I followed the exact steps of the walkthrough. In what case would the model fail to reach a reasonable performance?

INFO:tensorflow:Inference results INPUT: Protesters on Black Friday demanded a salary increase and complained that the cost of medical insurance provided by the corporation went from 30 to 100 dollars a month.
INFO:tensorflow:Inference results OUTPUT: Das bedeutet, dass die meisten unserer Mitarbeiter und Mitarbeiter in der Lage sein werden, ihre Aufgaben zu erfüllen.
INFO:tensorflow:Inference results INPUT: Among these projects, he mentioned that five are in Peru and are located in the transverse axes of its territory, between the coast and Brazil, and two focus on increased connection with Ecuador, although he gave no further details.
INFO:tensorflow:Inference results OUTPUT: Das Hotel befindet sich in der Nähe des Hotels, nur wenige Minuten von der U-Bahnstation entfernt.

@lukaszkaiser
Contributor

How is your decoding set up? What are the decoding_beam_size and decoding_alpha? What were the actual eval results from your model, i.e. what do you mean by "good"?

@xz-keg
Author

xz-keg commented Jul 13, 2017

I decode using the parameters from the guide (BEAM_SIZE=4, ALPHA=0.6), but it outputs nothing when decoding "hello world" or "goodbye world".
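
(For reference, a sketch of how the beam size and alpha are usually passed at decode time. This uses the newer t2t-decoder entry point and the $-variables from the walkthrough; older t2t releases passed similar decode flags to t2t-trainer instead, and spellings such as --problem vs. --problems vary between versions, so treat this only as a sketch:)

```
BEAM_SIZE=4
ALPHA=0.6

t2t-decoder \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
  --decode_from_file=$DECODE_FILE
```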

What worries me is that the trained model only outputs random sequences instead of translations. Below are examples produced when the training process finished.

INFO:tensorflow:Inference results INPUT: Protesters on Black Friday demanded a salary increase and complained that the cost of medical insurance provided by the corporation went from 30 to 100 dollars a month.
INFO:tensorflow:Inference results OUTPUT: Das bedeutet, dass die meisten unserer Mitarbeiter und Mitarbeiter in der Lage sein werden, ihre Aufgaben zu erfüllen.
INFO:tensorflow:Inference results INPUT: Among these projects, he mentioned that five are in Peru and are located in the transverse axes of its territory, between the coast and Brazil, and two focus on increased connection with Ecuador, although he gave no further details.
INFO:tensorflow:Inference results OUTPUT: Das Hotel befindet sich in der Nähe des Hotels, nur wenige Minuten von der U-Bahnstation entfernt

@lukaszkaiser
Contributor

What were your eval scores (--eval_steps=10 --train_steps=0)?
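
(That is, re-run t2t-trainer on the trained output_dir with training disabled so it only reports the evaluation metrics. A minimal sketch, assuming the same variables as the training run; flag spellings such as --problems vs. --problem vary between t2t versions:)

```
t2t-trainer \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --train_steps=0 \
  --eval_steps=10
# For a properly trained translation model, metrics such as accuracy and
# neg_log_perplexity should be clearly better than chance.
```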

@xz-keg
Author

xz-keg commented Jul 13, 2017

A little higher than totally random, I think.

@lukaszkaiser
Contributor

Then it's no wonder the inference doesn't work. You first need to train your model and check the evals.

@xz-keg
Author

xz-keg commented Jul 13, 2017

Yes, I just wonder why following the guide exactly leads to such a result. The model seems to have trained for 250k iterations.

@tobyyouup

@lukaszkaiser I also ran into this problem. I had already trained a good model with a BLEU score of 26.x. But when I upgraded the t2t version and trained a transformer_big model, I encountered two problems: (1) the eval score is 0, the same as issue #121; (2) the same problem described in this issue by @aviczhl2: test decoding produces meaningless output, sometimes just empty for some sentences.

I don't think these problems are coincidental, because different people are hitting them. Are there any ideas?

@lukaszkaiser
Contributor

It might be related to the unicode issues we were trying to correct for python3. I think you need to remove your vocab files and data and re-generate them. It should be worth trying with 1.0.14; I'm re-running now and the output looks non-random, but it's too early to be sure about the result.
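
(Concretely, something along these lines; the exact t2t-datagen flags and the decision to wipe the whole data directory are assumptions here, so adjust them to your setup:)

```
# Delete the old vocabulary and generated TFRecords so they are rebuilt
# with the fixed unicode handling (or delete just the vocab.* files and
# the generated shards inside $DATA_DIR).
rm -rf $DATA_DIR

# Upgrade to the release mentioned above and regenerate the data.
pip install -U tensor2tensor==1.0.14
t2t-datagen \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --problem=$PROBLEM
```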

@xz-keg
Author

xz-keg commented Jul 15, 2017

Re-generation makes sense. Thanks very much.

@mainakchain

mainakchain commented Jun 14, 2018

I created a new summarization problem with a dataset of my own. I followed the new-problem training walkthrough, but whenever I start training, all I get in the train directory is:

  • events.out.tfevents.*
  • flags.txt
  • flags_t2t.txt
  • graph.pbtxt
  • hparams.json

After that, no model or checkpoint is saved. In other words, my problem is not training on the data generated with t2t-datagen. Can anyone please guide me here, @lukaszkaiser @aviczhl2?
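
(For reference, the new-problem walkthrough boils down to roughly the following; my_summarization_problem and $USR_DIR are placeholders for the registered problem class and the directory containing its module, and flag spellings may differ between t2t versions:)

```
# Generate data for the user-registered problem.
t2t-datagen \
  --t2t_usr_dir=$USR_DIR \
  --problem=my_summarization_problem \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR

# Train on it. A healthy run writes checkpoint and model.ckpt-* files
# into $TRAIN_DIR alongside graph.pbtxt and the events files.
t2t-trainer \
  --t2t_usr_dir=$USR_DIR \
  --problem=my_summarization_problem \
  --data_dir=$DATA_DIR \
  --model=transformer \
  --hparams_set=transformer_base \
  --output_dir=$TRAIN_DIR \
  --train_steps=250000
```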

@afrozenator
Contributor

afrozenator commented Jun 14, 2018

Hi @mainakchain -- just to understand you correctly: you are not saying that the tutorial doesn't train; you implemented your own problem class and that fails to train?

Also what do the logs of t2t-trainer say -- maybe the training is just slow?

@mainakchain

Hi @afrozenator, I have been at this for some days and I am not able to train my summarization model. At first I created a new problem and tried registering and training with it. Recently, I reshaped my data into the summarize_cnn_dailymail32k problem format and tried to train with the predefined cnn_dailymail problem. But it never outputs more than the 5 files listed above. (When I train other problems, I do get files like checkpoint and model.ckpt.* some time after training starts with t2t-trainer, as expected.)

To my surprise, training on my data (even with the predefined CNN/Daily Mail problem) consumes all the cores of my system and takes up most of my GPU RAM, without producing any information either on screen or as output files. To dig deeper, I pointed TensorBoard at the output_dir. All it showed was the graph of the Transformer architecture; the scalar curves and the projector were not populated at all.
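
(A quick way to check whether any checkpoint has been written at all; tf.train.latest_checkpoint returns None as long as nothing has been saved to the directory:)

```
ls -lt $TRAIN_DIR
python -c "import tensorflow as tf; print(tf.train.latest_checkpoint('$TRAIN_DIR'))"
```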

Any kind of help is highly appreciated. @lukaszkaiser @aviczhl2
