Evaluation failed when training with multiple worker GPUs #266
Comments
I see exactly the same problem, with T2T 1.2.1 and TensorFlow 1.3. With multiple GPUs I can train, but with batch_size 2048, 1024, 512 and 256 it fails just at the beginning of the evaluation (after "Starting evaluation at..." and "Restoring parameters from...").
Also having this issue. Have you made any progress on this?
Could you guys just try …
@lukaszkaiser My dev set has about 110k subwords (BTW: is there an easy way to compute it exactly?). When training on a single GPU with batch_size=2048, I need 55 eval steps to cover it whole. If I use more eval_steps, I see "Out of range: End of sequence" warnings, but the training continues and approx_bleu is computed correctly.
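As an aside on counting dev-set subwords exactly, a minimal sketch using T2T's SubwordTextEncoder (the vocab and dev file names below are placeholders, not taken from this issue):

```python
# Count subwords in a dev-set file with the trained subword vocabulary.
# "vocab.ende.32768" and "dev.en" are placeholder paths.
from tensor2tensor.data_generators import text_encoder

encoder = text_encoder.SubwordTextEncoder("vocab.ende.32768")
total = sum(len(encoder.encode(line.strip())) for line in open("dev.en"))
print(total)  # e.g. ~110k subwords with batch_size=2048 -> roughly 54-55 eval steps
```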
I tried eval_steps=1/5/8/16 with --worker_gpu=2, none of them worked.
I tried eval_steps=5 with worker_gpu=2 and it failed for batch_size 100 and more (128, 256, 512, 1024, 2048). Usually it failed within the first evaluation (at different steps 1-4); just once it failed within the second evaluation. I guess this is the relevant portion of the long stacktrace:
Update: I checked 1.2.2 and the problem is still there. The problem is that here: Is there a way in TF to split a tensor, while possibly keeping the last returned value with fewer items if not divisible?
(but …)
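For reference, a minimal sketch of one way to split along dimension 0 while letting the last shard be smaller: tf.split accepts an explicit size list, and a single -1 entry is inferred from what remains. The names here are illustrative and this is not the T2T code; the comment below tries a similar idea inside the library itself.

```python
import tensorflow as tf

def uneven_split(t, num_shards):
  """Split `t` along dim 0; the last shard absorbs any remainder."""
  size = tf.shape(t)[0]
  base = size // num_shards
  # First num_shards - 1 pieces get `base` items; -1 lets TF infer the rest.
  return tf.split(t, [base] * (num_shards - 1) + [-1], axis=0)
```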
Until the problem is fixed, is it possible to switch off multi-GPU decoding for evaluation without switching it off for training?
@ehasler: With …
I made it possible to execute the above code, and evaluation got halfway through:

```python
# Split each feature across the datashards; when the batch is not evenly
# divisible, give the first shards one extra item and let the last shard
# (-1) take whatever remains.
v_size = tf.shape(v)[0]
condition = tf.constant((v_size % self._num_datashards) == 0)
list_of_v = tf.cond(condition,
                    lambda: tf.split(v, self._num_datashards, 0),
                    lambda: tf.split(v, [(v_size // self._num_datashards) + 1]
                                        * (self._num_datashards - 1) + [-1]))
sharded_features[k] = self._data_parallelism(tf.identity, list_of_v)
```

But I encountered another error.
TL;DR: I still don't know how to work around this problem (except for disabling the internal evaluation completely). @ryonakamura Yes, I did something similar and also got the "Reshape cannot infer" error, and I was not able to work around it. Apparently, the problem is that one of the …
Update: I checked 1.2.3 and 1.2.4 and the problem is still there. @lukaszkaiser, @rsepassi: Can you please comment on this issue? Can you train on multiple GPUs with evaluation? Is a fix planned (e.g. waiting for TF 1.4 as in other issues)? Or should we try to fix this ourselves, and is any of the above-mentioned workarounds promising?
Until we address the underlying issue of getting eval to work on multiple GPUs, the workaround would be to have your training job separate from your eval job. For training, set …
Training a new model with T2T 1.2.5 and TF 1.4 allows me to evaluate using multiple GPUs.
@mehmedes are you sure?
I confirm the error (Number of ways to split...) is still there even with TF 1.4.0rc1 and T2T 1.2.6. Thanks for the workaround with two jobs (one with …)
@vince62s: I was testing on a setup with two 1080s, where it works. Just tried on four 1080 TIs and evaluation still fails.
@mehmedes I failed too with the same configuration. Four GPUs failed and two GPUs worked. Really puzzling.
Please keep reporting your experiences here, it might help us figure this out too. And thanks for checking and pushing it!
OK, to keep the reports coming: the problem is still here with tensorflow-gpu 1.4.0 and tensor2tensor 1.2.9.
@lukaszkaiser I tried it again: when using 2 GPUs and transformer_base_single_gpu it works well, but it failed when using transformer_base. My configuration is T2T 1.3.2 and TF 1.4.0.
@martinpopel Have you solved the multi-device problem recently? I used 4 GPUs and transformer_base_single_gpu, and it fails again with the same error. Really PUZZLING.
@rsepassi Good idea. But one GPU may be wasted for evaluation. Anyway, that is a proper way to solve it.
My solution is #436 (which was merged, but then the …
If you are interested I may provide my scripts (after cleaning them a bit).
@martinpopel Wow, that would be nice! If the script is easy to use, please send it to me ~ Much THANKS.
@martinpopel We don't have a CPU cluster environment ...
Ok, so I think the only difference between training and evaluation, when it comes to the input pipeline, is whether long sequences are skipped. Try setting … If so, then I believe what's happening is that the batch size for those long sequences is not divisible by the number of GPUs. Not entirely sure what the fix would be yet, but I think we could check to see if the batch is small and pad it out if it is.
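Not the actual fix, just a minimal sketch of the padding idea described above, assuming a single feature tensor `t` and a Python int `num_shards` (both names illustrative):

```python
import tensorflow as tf

def pad_batch_for_split(t, num_shards):
  """Pad dim 0 with zeros so tf.split(t, num_shards, 0) divides evenly."""
  batch = tf.shape(t)[0]
  pad = (num_shards - batch % num_shards) % num_shards
  paddings = [[0, pad]] + [[0, 0]] * (t.shape.ndims - 1)
  return tf.pad(t, paddings)

# e.g. an eval batch of 47 examples is padded to 48 before splitting across 2 GPUs:
# shards = tf.split(pad_batch_for_split(features, 2), 2, axis=0)
```

The padded examples would of course have to be masked out of the eval metrics afterwards.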
I am not following, why don't you just eval on a single GPU then?
@rsepassi: Even after adding …
Just tried pull request …
Great. Yes, this is fixed in 1.4.1, which will be merged and pushed to PyPI shortly.
I tried 1.4.1, but it crashed with:
Thanks! Will fix.
I confirm it is fixed (I tested with 2 GPUs so far). Thanks.
Are you sure this is fixed? So to use 8 GPUs instead of 1, we should only add --worker_gpu=8 and it should work out of the box? Because for me the evaluation step failed... I get:
For me the internal evaluation works out of the box on 8 GPUs in T2T v1.4.2. There are 8 times more warnings "W tensorflow/core/framework/op_kernel.cc:1192] Out of range: End of sequence", but these are just warnings (and OK in this case), not errors. I can see approx_bleu, rouge etc. in Tensorboard.
@martinpopel I solved it by upgrading to TF 1.5.0 (CUDA 9.0, cuDNN 7.0). Before that I had TF 1.4.1. Now the regular …
Seeing this again with TF-gpu 1.12: training is fine, but when I want to evaluate, I see: InvalidArgumentError (see above for traceback): Number of ways to split should evenly divide the split dimension, but got split_dim 0 (size = 1) and num_split 2. I'm not using T2T but this was the closest error thread I could find.
I set worker_gpu to 2 and use 2 GPUs in the same node. Training is completely fine, but evaluation fails with this error:

```
tensorflow.python.framework.errors_impl.InvalidArgumentError: Number of ways to split should evenly divide the split dimension, but got split_dim 0 (size = 47) and num_split 2
[[Node: split_2 = Split[T=DT_INT32, num_split=2, _device="/job:localhost/replica:0/task:0/cpu:0"](split_2/split_dim, input_reader/ExpandDims_3/_1823)]]
[[Node: split_2/_1825 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:1", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1113_split_2", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:1"]]
Caused by op u'split_2', defined at:
  File "/raid/skyw/venv/tensorflow-pip-py27/bin/t2t-trainer", line 5, in <module>
    pkg_resources.run_script('tensor2tensor==1.2.1', 't2t-trainer')
```

It wasn't very clear what the error is. But single-GPU training/evaluation is fine; the problem comes with 2 worker GPUs.
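For readers hitting this outside T2T, a minimal stand-alone snippet (assumption: plain TF 1.x, nothing T2T-specific) that reproduces the same message and shows what it means: tf.split with an integer num_split requires the split dimension to be exactly divisible by it.

```python
import tensorflow as tf

x = tf.placeholder(tf.int32, shape=[None])
shards = tf.split(x, 2, axis=0)  # split the batch across 2 "GPUs"

with tf.Session() as sess:
    try:
        sess.run(shards, feed_dict={x: list(range(47))})  # 47 is not divisible by 2
    except tf.errors.InvalidArgumentError as e:
        print(e.message)
        # Number of ways to split should evenly divide the split dimension,
        # but got split_dim 0 (size = 47) and num_split 2
```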