This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

why the results of the evaluation are all zero? #121

Closed
ZhenYangIACAS opened this issue Jul 10, 2017 · 25 comments

@ZhenYangIACAS

Why are the results of the evaluation all zero?

INFO:tensorflow:Saving dict for global step 7724: global_step = 7724, loss = 0.0, metrics-wmt_ende_bpe32k/accuracy = 0.0, metrics-wmt_ende_bpe32k/accuracy_per_sequence = 0.0, metrics-wmt_ende_bpe32k/accuracy_top5 = 0.0, metrics-wmt_ende_bpe32k/approx_bleu_score = 0.0, metrics-wmt_ende_bpe32k/neg_log_perplexity = 0.0, metrics/accuracy = 0.0, metrics/accuracy_per_sequence = 0.0, metrics/accuracy_top5 = 0.0, metrics/approx_bleu_score = 0.0, metrics/neg_log_perplexity = 0.0
INFO:tensorflow:Validation (step 8000): loss = 0.0, metrics-wmt_ende_bpe32k/accuracy_per_sequence = 0.0, global_step = 7724, metrics/neg_log_perplexity = 0.0, metrics-wmt_ende_bpe32k/accuracy = 0.0, metrics-wmt_ende_bpe32k/accuracy_top5 = 0.0, metrics-wmt_ende_bpe32k/neg_log_perplexity = 0.0, metrics/accuracy = 0.0, metrics/approx_bleu_score = 0.0, metrics-wmt_ende_bpe32k/approx_bleu_score = 0.0, metrics/accuracy_per_sequence = 0.0, metrics/accuracy_top5 = 0.0

@lukaszkaiser
Contributor

Very strange -- looks like nothing was evaluated. Are you having a super-small batch_size that's preventing any evaluation maybe?

@stefan-it
Contributor

stefan-it commented Jul 10, 2017

I think I was able to reproduce this for wmt_ende_tokens_32k on a DGX-1. Then I switched back to 1.0.10 and it worked.

@ZhenYangIACAS
Author

ZhenYangIACAS commented Jul 11, 2017

The problem I am running is wmt_ende_bpe32k and all of the parameters are set to their defaults. The batch size is not super small. I am using tensor2tensor 1.0.9.

@lukaszkaiser
Contributor

Can you try with 1.0.13? If it doesn't work, please let us know the exact commands you run so we can reproduce. Is it exactly the walkthrough or did you change anything? Single GPU with TF 1.2.0?

@ZhenYangIACAS
Author

@lukaszkaiser OK, I will try 1.0.13 now and report the results later. Thank you.

@lukaszkaiser
Contributor

Thank you, we'll get to the bottom of this!

@ZhenYangIACAS
Author

@lukaszkaiser I have tested 1.0.13 with TensorFlow 1.2.0, running wmt_ende_bpe32k with transformer_base. The evaluation results are still all zero.
INFO:tensorflow:Saving dict for global step 3845: global_step = 3845, loss = 0.0, metrics-wmt_ende_bpe32k/accuracy = 0.0, metrics-wmt_ende_bpe32k/accuracy_per_sequence = 0.0, metrics-wmt_ende_bpe32k/accuracy_top5 = 0.0, metrics-wmt_ende_bpe32k/approx_bleu_score = 0.0, metrics-wmt_ende_bpe32k/neg_log_perplexity = 0.0, metrics/accuracy = 0.0, metrics/accuracy_per_sequence = 0.0, metrics/accuracy_top5 = 0.0, metrics/approx_bleu_score = 0.0, metrics/neg_log_perplexity = 0.0
INFO:tensorflow:Validation (step 4000): loss = 0.0, metrics-wmt_ende_bpe32k/accuracy_per_sequence = 0.0, global_step = 3845, metrics/neg_log_perplexity = 0.0, metrics-wmt_ende_bpe32k/accuracy = 0.0, metrics-wmt_ende_bpe32k/accuracy_top5 = 0.0, metrics-wmt_ende_bpe32k/neg_log_perplexity = 0.0, metrics/accuracy = 0.0, metrics/approx_bleu_score = 0.0, metrics-wmt_ende_bpe32k/approx_bleu_score = 0.0, metrics/accuracy_per_sequence = 0.0, metrics/accuracy_top5 = 0.0

@lukaszkaiser
Contributor

Are you running exactly as in the Walkthrough, or did you change anything, e.g. hparams?

@ZhenYangIACAS
Author

@lukaszkaiser I am running exactly as in the Walkthrough and have not changed any parameters. I have tried 1.0.9, 1.0.10 and 1.0.13; all of my evaluation results are zero. However, the decoding process works: I got 26.56 BLEU on newstest2014, which is a little lower than the result reported in the paper. Am I doing something wrong?

@lukaszkaiser
Contributor

I still can't reproduce this. Can you check (with the new inspect.py script in utils) whether your dev data file was generated correctly? 26.56 sounds not too bad for starters!

@stefan-it
Contributor

I guess tensor2tensor/data_generators/inspect.py is meant :)

@ZhenYangIACAS
Author

@lukaszkaiser I am trying inspect.py; however, an error occurs:
Traceback (most recent call last):
  File "inspect.py", line 32, in <module>
    from tensor2tensor.data_generators import text_encoder
  File "/home/user/anaconda2/lib/python2.7/site-packages/tensor2tensor/data_generators/text_encoder.py", line 34, in <module>
    from tensor2tensor.data_generators import tokenizer
  File "/home/user/anaconda2/lib/python2.7/site-packages/tensor2tensor/data_generators/tokenizer.py", line 56, in <module>
    import tensorflow as tf
  File "/home/user/anaconda2/lib/python2.7/site-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import *
  File "/home/user/anaconda2/lib/python2.7/site-packages/tensorflow/python/__init__.py", line 63, in <module>
    from tensorflow.python.framework.framework_lib import *
  File "/home/user/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/framework_lib.py", line 75, in <module>
    from tensorflow.python.framework.ops import Graph
  File "/home/user/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 40, in <module>
    from tensorflow.python.framework import errors
  File "/home/user/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/errors.py", line 22, in <module>
    from tensorflow.python.framework import errors_impl as _impl
  File "/home/user/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 27, in <module>
    from tensorflow.python.util import compat
  File "/home/user/anaconda2/lib/python2.7/site-packages/tensorflow/python/util/compat.py", line 43, in <module>
    from tensorflow.python.util.all_util import remove_undocumented
  File "/home/user/anaconda2/lib/python2.7/site-packages/tensorflow/python/util/all_util.py", line 24, in <module>
    from tensorflow.python.util import tf_inspect as _tf_inspect
  File "/home/user/anaconda2/lib/python2.7/site-packages/tensorflow/python/util/tf_inspect.py", line 20, in <module>
    import inspect as _inspect
  File "/data/zhyang/tfNmt/tensor2tensor/inspect.py", line 32, in <module>
    from tensor2tensor.data_generators import text_encoder
ImportError: cannot import name text_encoder
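For context on this traceback: Python puts the script's own directory at the front of sys.path, so a file named inspect.py shadows the standard-library inspect module that TensorFlow itself imports (via tf_inspect.py), which is why the import chain loops back into tensor2tensor's inspect.py. A minimal illustration of the shadowing, assuming a local file named inspect.py sits next to the script below:

```python
# shadow_demo.py -- run from a directory that also contains a file named inspect.py.
# This only illustrates module shadowing; it is not part of tensor2tensor.
import sys

print(sys.path[0])        # the script's directory is searched before the stdlib

import inspect
print(inspect.__file__)   # resolves to the local inspect.py, not the stdlib module
```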

@ZhenYangIACAS
Author

I find that inspect is the name of a standard Python module, so the name "inspect" is not suitable. I renamed the file to inspect_my.py and that error disappeared. However, a Unicode error now occurs:
Traceback (most recent call last):
  File "/data/zhyang/tfNmt/tensor2tensor/codes/tensor2tensor-1.0.13/tensor2tensor/data_generators/inspect_my.py", line 101, in <module>
    tf.app.run()
  File "/home/user/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/data/zhyang/tfNmt/tensor2tensor/codes/tensor2tensor-1.0.13/tensor2tensor/data_generators/inspect_my.py", line 69, in main
    FLAGS.subword_text_encoder_filename)
  File "/data/zhyang/tfNmt/tensor2tensor/codes/tensor2tensor-1.0.13/tensor2tensor/data_generators/text_encoder.py", line 211, in __init__
    self._load_from_file(filename)
  File "/data/zhyang/tfNmt/tensor2tensor/codes/tensor2tensor-1.0.13/tensor2tensor/data_generators/text_encoder.py", line 456, in _load_from_file
    subtoken_strings.append(native_to_unicode(line.strip()[1:-1]))
  File "/data/zhyang/tfNmt/tensor2tensor/codes/tensor2tensor-1.0.13/tensor2tensor/data_generators/text_encoder.py", line 41, in native_to_unicode
    return s.decode("utf-8") if (PY2 and not isinstance(s, unicode)) else s
  File "/home/user/anaconda2/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbc in position 0: invalid start byte
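The byte 0xbc at position 0 suggests the BPE vocab file contains entries that are not valid UTF-8 (e.g. Latin-1 bytes). A quick sketch for locating the offending lines; the filename below is only a placeholder for whatever is passed as --subword_text_encoder_filename:

```python
# Scan the vocab file in binary mode and report lines that fail UTF-8 decoding;
# these are exactly the lines native_to_unicode trips over.
path = "vocab.bpe.32000"  # placeholder filename
with open(path, "rb") as f:
    for lineno, raw in enumerate(f, 1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            print(lineno, repr(raw), err)
```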

@lukaszkaiser
Contributor

We corrected the unicode functions in 1.0.14. Can you try to regenerate the data? I just ran the Walkthrough and it seems to work for me so far:

...
I0714 18:15:27.857059    5821 basic_session_run_hooks.py:246] step = 1801, loss = 5.87307 (56.110 sec)
I0714 18:16:23.859897    5821 basic_session_run_hooks.py:519] global_step/sec: 1.78561
I0714 18:16:23.860393    5821 basic_session_run_hooks.py:246] step = 1901, loss = 5.99645 (56.003 sec)
I0714 18:17:19.936841    5821 trainer_utils.py:1273] datashard_devices: ['gpu:0']
I0714 18:17:19.936961    5821 trainer_utils.py:1274] caching_devices: None
I0714 18:17:22.201944    5821 t2t_model.py:27] Doing model_fn_body took 2.143 sec.
I0714 18:17:22.266063    5821 t2t_model.py:431] This model_fn took 2.309 sec.
I0714 18:17:23.334647    5821 evaluation.py:165] Starting evaluation at 2017-07-15-01:17:23
I0714 18:17:23.987706    5821 gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
I0714 18:17:23.988263    5821 saver.py:1559] Restoring parameters from /tmp/tensor2tensor/model.ckpt-1076
I0714 18:17:30.158215    5821 evaluation.py:87] Evaluation [1/10]
I0714 18:17:30.605232    5821 evaluation.py:87] Evaluation [2/10]
I0714 18:17:31.076565    5821 evaluation.py:87] Evaluation [3/10]
I0714 18:17:31.552934    5821 evaluation.py:87] Evaluation [4/10]
I0714 18:17:32.130570    5821 evaluation.py:87] Evaluation [5/10]
I0714 18:17:32.576243    5821 evaluation.py:87] Evaluation [6/10]
I0714 18:17:33.009226    5821 evaluation.py:87] Evaluation [7/10]
I0714 18:17:33.491557    5821 evaluation.py:87] Evaluation [8/10]
I0714 18:17:33.938620    5821 evaluation.py:87] Evaluation [9/10]
I0714 18:17:34.401911    5821 evaluation.py:87] Evaluation [10/10]
I0714 18:17:42.867368    5821 evaluation.py:185] Finished evaluation at 2017-07-15-01:17:42
I0714 18:17:42.867530    5821 estimator.py:331] Saving dict for global step 1076: global_step = 1076, loss = 6.42422, metrics-wmt_ende_tokens_32k/accuracy = 0.113703, metrics-wmt_ende_tokens_32k/accuracy_per_sequence = 0.0, metrics-wmt_ende_tokens_32k/accuracy_top5 = 0.217065, metrics-wmt_ende_tokens_32k/approx_bleu_score = 0.0395998, metrics-wmt_ende_tokens_32k/neg_log_perplexity = -7.34322, metrics/accuracy = 0.113703, metrics/accuracy_per_sequence = 0.0, metrics/accuracy_top5 = 0.217065, metrics/approx_bleu_score = 0.0395998, metrics/neg_log_perplexity = -7.34322
I0714 18:17:43.536043    5821 monitors.py:701] Validation (step 2000): metrics/approx_bleu_score = 0.0395998, global_step = 1076, metrics-wmt_ende_tokens_32k/approx_bleu_score = 0.0395998, metrics-wmt_ende_tokens_32k/accuracy_top5 = 0.217065, metrics/accuracy = 0.113703, metrics-wmt_ende_tokens_32k/accuracy = 0.113703, metrics/accuracy_per_sequence = 0.0, metrics/neg_log_perplexity = -7.34322, metrics-wmt_ende_tokens_32k/accuracy_per_sequence = 0.0, metrics-wmt_ende_tokens_32k/neg_log_perplexity = -7.34322, metrics/accuracy_top5 = 0.217065, loss = 6.42422
I0714 18:17:44.087800    5821 basic_session_run_hooks.py:519] global_step/sec: 1.24645
I0714 18:17:44.088317    5821 basic_session_run_hooks.py:246] step = 2001, loss = 5.80115 (80.228 sec)
I0714 18:18:38.266354    5821 basic_session_run_hooks.py:454] Saving checkpoints for 2098 into /tmp/tensor2tensor/model.ckpt.
I0714 18:18:42.507443    5821 basic_session_run_hooks.py:519] global_step/sec: 1.71176
I0714 18:18:42.507977    5821 basic_session_run_hooks.py:246] step = 2101, loss = 5.85146 (58.420 sec)
I0714 18:19:38.842911    5821 basic_session_run_hooks.py:519] global_step/sec: 1.77508
...

@ZhenYangIACAS
Author

ZhenYangIACAS commented Jul 15, 2017

I have tried 1.0.14 and the bug still occurs. As in my previous comments, I still need to rename inspect.py because it conflicts with the standard Python module. Why don't you run into this problem? Which Python version are you using? After renaming inspect.py, the error is as follows:
Traceback (most recent call last):
  File "/data/zhyang/tfNmt/tensor2tensor/codes/tensor2tensor-1.0.14/tensor2tensor/data_generators/inspect_my.py", line 83, in <module>
    tf.app.run()
  File "/home/user/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/data/zhyang/tfNmt/tensor2tensor/codes/tensor2tensor-1.0.14/tensor2tensor/data_generators/inspect_my.py", line 51, in main
    FLAGS.subword_text_encoder_filename)
  File "/data/zhyang/tfNmt/tensor2tensor/codes/tensor2tensor-1.0.14/tensor2tensor/data_generators/text_encoder.py", line 229, in __init__
    self._load_from_file(filename)
  File "/data/zhyang/tfNmt/tensor2tensor/codes/tensor2tensor-1.0.14/tensor2tensor/data_generators/text_encoder.py", line 476, in _load_from_file
    subtoken_strings.append(native_to_unicode(line.strip()[1:-1]))
  File "/data/zhyang/tfNmt/tensor2tensor/codes/tensor2tensor-1.0.14/tensor2tensor/data_generators/text_encoder.py", line 57, in native_to_unicode_py2
    return s.decode("utf-8")
  File "/home/user/anaconda2/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbc in position 0: invalid start byte

@jarfo
Contributor

jarfo commented Jul 19, 2017

I have the same validation problem with wmt_ende_bpe32k. I had no problems with wmt_ende_tokens_32k. I can inspect the dev file without problems (after renaming inspect.py to inspect2.py), but no validation results are produced during training.

@martinpopel
Contributor

Now I see this problem with the newest t2t, but only in one of two almost identical experiments (wmt_encs_tokens_32k transformer_base_single_gpu). Running on one GPU (GeForce GTX 1080 Ti, batch_size=7000) works OK, but running on 4 GPUs (GeForce GTX 1080, batch_size=5000, --worker_gpu=4) causes all validation scores to be reported as zero. The model is fine in both cases and an external evaluation gives reasonable BLEU scores in both cases.
Is it OK (even if suboptimal) to use the transformer_base_single_gpu hparams set even when running on multiple GPUs?

@martinpopel
Contributor

Update with the newest t2t v1.1.9: the issue is still here. When training with multiple GPUs, all validation scores are reported as zero.
I have used exactly the same setup (same batch size 2048, same machine): with --worker_gpu=1 it works, but with 4 or 8 GPUs zeros are reported (although the model is OK; external evaluation gives nice BLEU). It is unfortunate that I cannot check the approx_bleu curve in TensorBoard, and external evaluation is so slow due to #70. Any ideas?

BTW (not related to this issue, just answering my own question from the last post): even with 8 GPUs, transformer_big_single_gpu gives me much better results (on translate_ende_wmt32k) than transformer_big. I guess it is because I used batch_size=2048 instead of the default batch_size=4096, so I benefit from the longer learning_rate_warmup_steps.
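For reference, here is a sketch of the "noam" warmup schedule from the Transformer paper that learning_rate_warmup_steps feeds into; the constants below are illustrative only, not t2t's exact implementation or defaults:

```python
# "Noam" learning-rate schedule from "Attention Is All You Need":
#   lr(step) = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)
def noam_lr(step, d_model=1024, warmup_steps=16000):
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1000, 8000, 16000, 100000):
    print(s, noam_lr(s))
```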

@colmantse

I think you actually can check the approx_bleu curve; it is under metrics-<problem_name>, at least it is there for me.

@martinpopel
Contributor

@colmantse: I can see the approx_bleu curve in TensorBoard, but it is constantly zero, in accordance with the stderr printouts. The only metric I see is the training loss (which I need to smooth greatly in TensorBoard).

...
INFO:tensorflow:step = 121901, loss = 1.67543 (72.656 sec)
INFO:tensorflow:datashard_devices: ['gpu:0', 'gpu:1', 'gpu:2', 'gpu:3', 'gpu:4', 'gpu:5', 'gpu:6', 'gpu:7']
INFO:tensorflow:caching_devices: None
INFO:tensorflow:Doing model_fn_body took 2.692 sec.
INFO:tensorflow:Doing model_fn_body took 1.583 sec.
INFO:tensorflow:Doing model_fn_body took 1.589 sec.
INFO:tensorflow:Doing model_fn_body took 1.594 sec.
INFO:tensorflow:Doing model_fn_body took 1.567 sec.
INFO:tensorflow:Doing model_fn_body took 2.921 sec.
INFO:tensorflow:Doing model_fn_body took 1.566 sec.
INFO:tensorflow:Doing model_fn_body took 1.567 sec.
INFO:tensorflow:This model_fn took 16.102 sec.  
INFO:tensorflow:Starting evaluation at 2017-08-22-11:49:34  
2017-08-22 13:49:35.076361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0)
2017-08-22 13:49:35.076502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0)  
2017-08-22 13:49:35.076522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0)  
2017-08-22 13:49:35.076553: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:3) -> (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:07:00.0)
2017-08-22 13:49:35.076566: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:4) -> (device: 4, name: GeForce GTX 1080 Ti, pci bus id: 0000:08:00.0)
2017-08-22 13:49:35.076592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:5) -> (device: 5, name: GeForce GTX 1080 Ti, pci bus id: 0000:0b:00.0)
2017-08-22 13:49:35.076608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:6) -> (device: 6, name: GeForce GTX 1080 Ti, pci bus id: 0000:0c:00.0)
2017-08-22 13:49:35.076621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:7) -> (device: 7, name: GeForce GTX 1080 Ti, pci bus id: 0000:0d:00.0)
INFO:tensorflow:Restoring parameters from train/translate_ende_wmt32k/transformer-transformer_big/model.ckpt-121644
INFO:tensorflow:Finished evaluation at 2017-08-22-11:49:52
INFO:tensorflow:Saving dict for global step 121644: global_step = 121644, loss = 0.0, metrics-translate_ende_wmt32k/accuracy = 0.0, metrics-translate_ende_wmt32k/accuracy_per_sequence = 0.0, metrics-translate_ende_wmt32k/accuracy_top5 = 0.0, metrics-translate_ende_wmt32k/approx_bleu_score = 0.0, metrics-translate_ende_wmt32k/neg_log_perplexity = 0.0, metrics-translate_ende_wmt32k/rouge_2_fscore = 0.0, metrics-translate_ende_wmt32k/rouge_L_fscore = 0.0
INFO:tensorflow:Validation (step 122000): metrics-translate_ende_wmt32k/neg_log_perplexity = 0.0, metrics-translate_ende_wmt32k/approx_bleu_score = 0.0, loss = 0.0, global_step = 121644, metrics-translate_ende_wmt32k/accuracy_top5 = 0.0, metrics-translate_ende_wmt32k/rouge_2_fscore = 0.0, metrics-translate_ende_wmt32k/rouge_L_fscore = 0.0, metrics-translate_ende_wmt32k/accuracy = 0.0, metrics-translate_ende_wmt32k/accuracy_per_sequence = 0.0
INFO:tensorflow:global_step/sec: 0.920878
INFO:tensorflow:step = 122001, loss = 1.54798 (108.595 sec)
...

@colmantse

Sorry @martinpopel, my bad. I hope the t2t team is able to help you out with this one.

@martinpopel
Contributor

With T2T 1.2.1 the situation is even worse: multi-GPU training with internal evaluation fails completely; see #266. As far as I am concerned, you can close this issue.

@rsepassi rsepassi added the bug label Nov 14, 2017
@rsepassi
Contributor

We're aware of the multi-gpu eval problem.
But we can't reproduce the eval metrics being 0. Closing for now, but please reopen if it's still an issue.

@yinyajun

I ran into a similar problem. I use the tf.estimator.train_and_evaluate API to set up distributed training, and evaluation in the 'evaluator' task reports all zeros.

18/07/25 12:15:34 INFO HboxContainer: INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_session_config': , '_keep_checkpoint_max': 5, '_task_type': u'evaluator', '_global_id_in_cluster': None, '_is_chief': False, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f926f56b890>, '_evaluation_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_master': '', '_num_worker_replicas': 0, '_task_id': 0, '_log_step_count_steps': 100, '_model_dir': 'hdfs://namenode.dfs.shbt.abc.net:9000/home/hdp-abc/yinyajun/model/dcn_1', '_save_summary_steps': 100}
18/07/25 12:15:34 INFO HboxContainer: INFO:tensorflow:Waiting 120.000000 secs before starting eval.
18/07/25 12:17:35 INFO HboxContainer: SLF4J: Class path contains multiple SLF4J bindings.
18/07/25 12:17:35 INFO HboxContainer: SLF4J: Found binding in [jar:file:/home/yarn/software/hadoop-2.7.2U12/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
18/07/25 12:17:35 INFO HboxContainer: SLF4J: Found binding in [jar:file:/home/yarn/software/lib4yarn-1.4.4/parquet/xitong-hadoop-parquet-1.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
18/07/25 12:17:35 INFO HboxContainer: SLF4J: Found binding in [jar:file:/home/yarn/software/lib4yarn-1.4.4/parquet/parquet-tools-1.9.1U1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
18/07/25 12:17:35 INFO HboxContainer: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
18/07/25 12:17:35 INFO HboxContainer: SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/07/25 12:17:46 INFO HboxContainer: Parsing hdfs://namenode.safe.lycc.abc.net:9000/home/hdp-abc/data/dnn_reward_dataset/20180708/validation
18/07/25 12:17:46 INFO HboxContainer: INFO:tensorflow:Calling model_fn.
18/07/25 12:17:46 INFO HboxContainer: **************************************************
18/07/25 12:17:46 INFO HboxContainer: Model InFo: DeepCrossNetwork
18/07/25 12:17:46 INFO HboxContainer: ('Dropout:', None)
18/07/25 12:17:46 INFO HboxContainer: ('BN:', True)
18/07/25 12:17:46 INFO HboxContainer: ('activation:', 'relu')
18/07/25 12:17:46 INFO HboxContainer: ('optimizer:', 'AdamOptimizer')
18/07/25 12:17:46 INFO HboxContainer: ('l2_reg:', None)
18/07/25 12:17:46 INFO HboxContainer: ('hidden_units:', [512, 256, 128])
18/07/25 12:17:46 INFO HboxContainer: **************************************************
18/07/25 12:17:56 INFO HboxContainer: INFO:tensorflow:Done calling model_fn.
18/07/25 12:17:56 INFO HboxContainer: INFO:tensorflow:Starting evaluation at 2018-07-25-04:17:56
18/07/25 12:17:57 INFO HboxContainer: INFO:tensorflow:Graph was finalized.
18/07/25 12:17:57 INFO HboxContainer: 2018-07-25 12:17:57.253288: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
18/07/25 12:17:57 INFO HboxContainer: 2018-07-25 12:17:57.525495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
18/07/25 12:17:57 INFO HboxContainer: name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
18/07/25 12:17:57 INFO HboxContainer: pciBusID: 0000:02:00.0
18/07/25 12:17:57 INFO HboxContainer: totalMemory: 22.38GiB freeMemory: 22.21GiB
18/07/25 12:17:57 INFO HboxContainer: 2018-07-25 12:17:57.525541: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
18/07/25 12:17:57 INFO HboxContainer: 2018-07-25 12:17:57.998633: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
18/07/25 12:17:57 INFO HboxContainer: 2018-07-25 12:17:57.998693: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
18/07/25 12:17:57 INFO HboxContainer: 2018-07-25 12:17:57.998716: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
18/07/25 12:17:57 INFO HboxContainer: 2018-07-25 12:17:57.999315: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21551 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:02:00.0, compute capability: 6.1)
18/07/25 12:17:58 INFO HboxContainer: INFO:tensorflow:Restoring parameters from hdfs://namenode.dfs.shbt.abc.net:9000/home/hdp-abc/yinyajun/model/dcn_1/model.ckpt-1500026
18/07/25 12:18:21 INFO HboxContainer: INFO:tensorflow:Running local_init_op.
18/07/25 12:18:21 INFO HboxContainer: INFO:tensorflow:Done running local_init_op.
18/07/25 12:18:25 INFO HboxContainer: INFO:tensorflow:Finished evaluation at 2018-07-25-04:18:25
18/07/25 12:18:25 INFO HboxContainer: INFO:tensorflow:Saving dict for global step 1500027: accuracy = 0.0, auc = 0.0, global_step = 1500027, loss = 0.0, precision = 0.0, recall = 0.0
18/07/25 12:18:31 INFO HboxContainer: INFO:tensorflow:Performing the final export in the end of training.
18/07/25 12:18:31 INFO HboxContainer: INFO:tensorflow:Calling model_fn.

However, when training locally, evaluation works well.
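For readers unfamiliar with the setup described above, a minimal, self-contained sketch of tf.estimator.train_and_evaluate (TF 1.x) follows. The model, data, and paths are invented for illustration; in the distributed case, the same code runs in the 'evaluator' task selected via TF_CONFIG.

```python
import numpy as np
import tensorflow as tf

def input_fn():
    # Tiny synthetic dataset standing in for the real HDFS input.
    x = np.random.rand(256, 4).astype(np.float32)
    y = (x.sum(axis=1) > 2.0).astype(np.int64)
    return tf.data.Dataset.from_tensor_slices(({"x": x}, y)).batch(32)

def model_fn(features, labels, mode, params):
    logits = tf.layers.dense(features["x"], 2)
    predictions = tf.argmax(logits, axis=-1)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    metrics = {"accuracy": tf.metrics.accuracy(labels, predictions)}
    train_op = tf.train.AdamOptimizer(1e-3).minimize(
        loss, global_step=tf.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(
        mode, loss=loss, train_op=train_op, eval_metric_ops=metrics)

estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir="/tmp/train_eval_sketch")
train_spec = tf.estimator.TrainSpec(input_fn=input_fn, max_steps=200)
eval_spec = tf.estimator.EvalSpec(input_fn=input_fn, steps=None,
                                  start_delay_secs=0, throttle_secs=0)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```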

@soloice

soloice commented Oct 18, 2019

Very strange -- looks like nothing was evaluated. Are you having a super-small batch_size that's preventing any evaluation maybe?

@lukaszkaiser Any idea why a super-small batch size can cause accuracy to be 0?
I think I hit a similar issue with tf.metrics (though not in t2t). I'm doing multi-task learning (>50 tasks in total) with google-bert and use tf.Estimator to compute validation accuracy per task. During validation, while most metrics are normal, occasionally a task shows zero accuracy. I manually checked the model predictions for that task and they looked normal, so probably something is wrong on the tf.metrics side. The affected task id is random, but a common feature of the affected tasks is that they have a relatively small validation set (<40 examples) that fits into 1 or at most 2 mini-batches. Tasks with a large validation set never encounter this issue.
What might cause this issue?
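One possible explanation (a guess, not a confirmed diagnosis): the tf.metrics.* ops accumulate in local variables (total and count for accuracy) and use a zero-safe division, so if an evaluation pass never runs a metric's update op (e.g. a task's few examples never reach that branch), the reported value is exactly 0.0 rather than NaN. A small sketch of that behaviour:

```python
import tensorflow as tf

labels = tf.placeholder(tf.int64, [None])
preds = tf.placeholder(tf.int64, [None])
accuracy, update_op = tf.metrics.accuracy(labels, preds)

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())
    # No update_op has run yet: total == count == 0, so the metric reports 0.0,
    # the same value seen for the affected tasks above.
    print(sess.run(accuracy))   # 0.0
    sess.run(update_op, feed_dict={labels: [1, 0], preds: [1, 1]})
    print(sess.run(accuracy))   # 0.5
```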
