Why are the results of the evaluation all zero? #121
Very strange -- it looks like nothing was evaluated. Are you perhaps using a very small batch_size that's preventing any evaluation?
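One way a tiny batch_size can lead to "nothing was evaluated" is length-based batching: if the per-batch token budget is smaller than every dev sentence, every example gets dropped and the eval loop sees zero batches. A minimal sketch of this effect (hypothetical batching logic, not t2t's actual code):

```python
# Hypothetical sketch: bucket-style batching with a token budget per batch.
# Examples longer than the budget can never fit and are silently skipped,
# so a tiny batch_size can leave zero eval batches -> all metrics stay 0.0.

def make_eval_batches(example_lengths, batch_size_in_tokens):
    """Group examples into batches; drop examples that exceed the budget."""
    batches, current, current_tokens = [], [], 0
    for length in example_lengths:
        if length > batch_size_in_tokens:
            continue  # too long to ever fit -- silently dropped
        if current_tokens + length > batch_size_in_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(length)
        current_tokens += length
    if current:
        batches.append(current)
    return batches

dev_lengths = [37, 52, 41, 64, 29]  # token counts of dev sentences
print(len(make_eval_batches(dev_lengths, 1000)))  # normal budget -> 1 batch
print(len(make_eval_batches(dev_lengths, 20)))    # tiny budget -> 0 batches
```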
I think I was able to reproduce this for
The problem I am running is wmt_ende_bpe32k and all of the parameters are set to their defaults. The batch size is not particularly small. I am using tensor2tensor 1.0.9.
Can you try with 1.0.13? If it doesn't work, please let us know the exact commands you run so we can reproduce. Is it exactly the Walkthrough or did you change anything? Single GPU with TF 1.2.0?
@lukaszkaiser OK, I will try 1.0.13 now and report the results later. Thank you.
Thank you, we'll get to the bottom of this!
@lukaszkaiser I have tested 1.0.13 with TensorFlow 1.2.0, running wmt_ende_bpe32k with transformer_base. I still find that the evaluation results are all zero.
Are you running exactly as in the Walkthrough, or did you change anything (hparams)?
@lukaszkaiser I am running exactly as in the Walkthrough and never changed any parameters. I have tried 1.0.9, 1.0.10 and 1.0.13. All of my evaluation results are zero. However, decoding works and I got 26.56 on newstest2014, which is a little lower than the result reported in the paper. Is there anything wrong on my side?
I still can't reproduce this, can you check (with the new
I guess
@lukaszkaiser I am trying inspect.py; however, a bug occurs:
I found that inspect is the name of a built-in Python module, so the filename "inspect" is not suitable. I renamed the file to inspect_my.py and that error disappeared. However, a Unicode error then occurs.
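The renaming fix works because Python resolves `import inspect` via sys.path, and the directory of the running script comes first, shadowing the standard-library inspect module that other libraries depend on. A small self-contained demonstration (using a temp directory to simulate the script's directory):

```python
# Demonstrates stdlib shadowing: a local file named inspect.py wins over
# the standard-library inspect module because its directory is first on
# sys.path, so any library doing `import inspect` gets the wrong module.
import os
import sys
import tempfile

workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "inspect.py"), "w") as f:
    f.write("SHADOWED = True\n")

sys.path.insert(0, workdir)        # simulates running a script from workdir
sys.modules.pop("inspect", None)   # force a fresh import
import inspect

print(hasattr(inspect, "SHADOWED"))       # True  -- the local file won
print(hasattr(inspect, "getsourcefile"))  # False -- the stdlib API is gone

# Clean up so later imports see the real stdlib module again.
sys.path.remove(workdir)
sys.modules.pop("inspect", None)
```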
We corrected the Unicode functions in 1.0.14. Can you try to re-generate the data? I just ran the Walkthrough and it seems to work for me this far:
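As background on the Unicode fix mentioned here: a hedged sketch (not t2t's actual code) of the kind of Python 2/3-safe helpers such a fix involves, which decode bytes to text on read and encode back on write instead of assuming str is already text:

```python
# Hedged sketch of bytes/text normalization helpers; the function names
# are illustrative, not tensor2tensor's exact API.

def to_unicode(s, encoding="utf-8"):
    """Return text; decode if we were handed raw bytes."""
    return s.decode(encoding) if isinstance(s, bytes) else s

def to_bytes(s, encoding="utf-8"):
    """Return raw bytes; encode if we were handed text."""
    return s.encode(encoding) if isinstance(s, str) else s

raw = "Käse und Brot".encode("utf-8")  # bytes as read from a corpus file
text = to_unicode(raw)
print(text)                    # Käse und Brot
print(to_bytes(text) == raw)   # True -- round-trips cleanly
```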
I have tried 1.0.14 and the bug still occurs. As in my previous comments, I still need to rename inspect.py because it conflicts with the Python module of the same name. Why don't you hit this problem? Which Python version are you using? After renaming inspect.py, the bug is as follows:
I have the same validation problem with wmt_ende_bpe32k. I had no problems with wmt_ende_tokens_32k. I can inspect the dev file without problems (after renaming inspect.py to inspect2.py), but no validation results are produced during training.
Now I see this problem with the newest t2t, but only in one of two almost identical experiments (wmt_encs_tokens_32k, transformer_base_single_gpu). Running on one GPU (GeForce GTX 1080 Ti, batch_size=7000) works OK, but running on 4 GPUs (GeForce GTX 1080, batch_size=5000,
Update with the newest t2t v1.1.9: the issue is still here. When training with multiple GPUs, all validation scores are reported as zero. BTW (not related to this issue, just answering my own question from the last post): even with 8 GPUs,
I think you can still check the approx_bleu curve; it is under the metrics-problem-name group, at least it is there for me.
@colmantse: I can see the approx_bleu curve in TensorBoard, but it is constantly zero, in accordance with the stderr printouts. The only meaningful metric I see is the training loss (which I need to smooth heavily in TensorBoard).
Sorry @martinpopel, my bad. I hope the t2t team is able to help you out with this one.
With T2T 1.2.1 the situation is even worse: multi-GPU training with internal evaluation fails completely, see #266. So as far as I am concerned, you can close this issue.
We're aware of the multi-gpu eval problem. |
I ran into a similar problem. I use the tensorflow.estimator.train_and_evaluate API to construct distributed training, and evaluation in the 'evaluator' task reports all zeros:

18/07/25 12:15:34 INFO HboxContainer: INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_session_config': , '_keep_checkpoint_max': 5, '_task_type': u'evaluator', '_global_id_in_cluster': None, '_is_chief': False, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f926f56b890>, '_evaluation_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_master': '', '_num_worker_replicas': 0, '_task_id': 0, '_log_step_count_steps': 100, '_model_dir': 'hdfs://namenode.dfs.shbt.abc.net:9000/home/hdp-abc/yinyajun/model/dcn_1', '_save_summary_steps': 100}

However, when training locally, evaluation works well.
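In a distributed setup like this, one thing worth ruling out is that the evaluator task simply cannot see any checkpoint under model_dir, since an evaluator with nothing to restore has nothing to score. A minimal sanity-check sketch (hypothetical helper, local path shown; a real run would point at the hdfs:// model_dir):

```python
# Hedged sketch: look for the newest TF checkpoint in model_dir. If none
# is visible to the evaluator task, there is nothing to restore, which is
# one plausible way an eval pass ends up reporting a dict of zeros.
import glob
import os

def latest_checkpoint(model_dir):
    """Return the newest checkpoint prefix in model_dir, or None."""
    index_files = glob.glob(os.path.join(model_dir, "model.ckpt-*.index"))
    if not index_files:
        return None
    newest = max(index_files, key=os.path.getmtime)
    return newest[: -len(".index")]  # strip suffix to get the ckpt prefix

print(latest_checkpoint("no_such_model_dir_12345"))  # None -> nothing to eval
```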
@lukaszkaiser Any idea why a super-small batch size can cause accuracy to be 0?
Why are the results of the evaluation all zero?
INFO:tensorflow:Saving dict for global step 7724: global_step = 7724, loss = 0.0, metrics-wmt_ende_bpe32k/accuracy = 0.0, metrics-wmt_ende_bpe32k/accuracy_per_sequence = 0.0, metrics-wmt_ende_bpe32k/accuracy_top5 = 0.0, metrics-wmt_ende_bpe32k/approx_bleu_score = 0.0, metrics-wmt_ende_bpe32k/neg_log_perplexity = 0.0, metrics/accuracy = 0.0, metrics/accuracy_per_sequence = 0.0, metrics/accuracy_top5 = 0.0, metrics/approx_bleu_score = 0.0, metrics/neg_log_perplexity = 0.0
INFO:tensorflow:Validation (step 8000): loss = 0.0, metrics-wmt_ende_bpe32k/accuracy_per_sequence = 0.0, global_step = 7724, metrics/neg_log_perplexity = 0.0, metrics-wmt_ende_bpe32k/accuracy = 0.0, metrics-wmt_ende_bpe32k/accuracy_top5 = 0.0, metrics-wmt_ende_bpe32k/neg_log_perplexity = 0.0, metrics/accuracy = 0.0, metrics/approx_bleu_score = 0.0, metrics-wmt_ende_bpe32k/approx_bleu_score = 0.0, metrics/accuracy_per_sequence = 0.0, metrics/accuracy_top5 = 0.0
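A log like the one above, with exactly 0.0 for every metric including the loss, is consistent with an eval pass that processed no batches at all: streaming metrics are typically accumulated as a (weighted total, weight sum) pair with a divide-by-zero guard. A minimal sketch of that behavior (illustrative class, not TensorFlow's implementation):

```python
# Hedged sketch: a streaming mean metric. If the eval loop never calls
# update() (zero batches), the weight sum stays 0 and the guarded division
# returns 0.0 -- producing an all-zero metrics dict like the log above.

class StreamingMean:
    def __init__(self):
        self.total = 0.0
        self.weight = 0.0

    def update(self, value, weight=1.0):
        self.total += value * weight
        self.weight += weight

    def result(self):
        # divide-by-zero guard: report 0.0 when nothing was accumulated
        return self.total / self.weight if self.weight > 0 else 0.0

accuracy = StreamingMean()
print(accuracy.result())          # 0.0 -- no batches ever reached update()

accuracy.update(0.83, weight=64)  # one batch of 64 tokens
print(accuracy.result())          # 0.83
```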