This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

Multi-GPU decoding support #30

Closed
cshanbo opened this issue Jun 23, 2017 · 5 comments

@cshanbo
Contributor

cshanbo commented Jun 23, 2017

Hi all,

Does tensor2tensor currently support multi-GPU decoding (WMT translation task)?

I ask because when I tried to decode a file with multiple GPUs (translation task), the following exception was raised, while decoding on a single GPU raises no exception.

I'm putting the decoding script and full exception trace here. Thank you.

decoding script

t2t-trainer --data_dir=/tensor2tensor/t2t_data --problems=wmt_ende_tokens_32k \
    --model=transformer --hparams_set=transformer_base --worker_gpu=3 \
    --output_dir=/tensor2tensor/exp/8cards/wmt_ende_tokens_32k/transformer-transformer_base \
    --train_steps=0 --eval_steps=0 --decode_beam_size=4 --decode_alpha=0.6 \
    --decode_use_last_position_only --decode_batch_size=128 \
    --decode_from_file=/tensor2tensor/t2t_data/validate.en

exception info

INFO:tensorflow:Restoring parameters from /search/odin/public/experiments/tensor2tensor/exp/8cards/wmt_ende_tokens_32k/transformer-transformer_base/model.ckpt-56426
2017-06-23 11:34:20.978020: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Number of ways to split should evenly divide the split dimension, but got split_dim 0 (size = 128) and num_split 3
	 [[Node: while/split = Split[T=DT_INT32, num_split=3, _device="/job:localhost/replica:0/task:0/cpu:0"](while/split/split_dim, while/split/Enter)]]
2017-06-23 11:34:20.978210: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Number of ways to split should evenly divide the split dimension, but got split_dim 0 (size = 128) and num_split 3
	 [[Node: while/split = Split[T=DT_INT32, num_split=3, _device="/job:localhost/replica:0/task:0/cpu:0"](while/split/split_dim, while/split/Enter)]]
Traceback (most recent call last):
  File "/search/odin/public/anaconda2/bin/t2t-trainer", line 4, in <module>
    __import__('pkg_resources').run_script('tensor2tensor==1.0.4', 't2t-trainer')
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 739, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1507, in run_script
    exec(script_code, namespace, namespace)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensor2tensor-1.0.4-py2.7.egg/EGG-INFO/scripts/t2t-trainer", line 55, in <module>
    
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensor2tensor-1.0.4-py2.7.egg/EGG-INFO/scripts/t2t-trainer", line 51, in main
    
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 240, in run
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 543, in run_locally
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 646, in decode_from_file
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 902, in _predict_generator
    preds = mon_sess.run(predictions, feed_fn() if feed_fn else None)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 505, in run
    run_metadata=run_metadata)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 842, in run
    run_metadata=run_metadata)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 798, in run
    return self._sess.run(*args, **kwargs)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 952, in run
    run_metadata=run_metadata)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 798, in run
    return self._sess.run(*args, **kwargs)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 789, in run
    run_metadata_ptr)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Number of ways to split should evenly divide the split dimension, but got split_dim 0 (size = 128) and num_split 3
	 [[Node: while/split = Split[T=DT_INT32, num_split=3, _device="/job:localhost/replica:0/task:0/cpu:0"](while/split/split_dim, while/split/Enter)]]
	 [[Node: while/GatherNd/_1405 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_4080_while/GatherNd", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](^_cloopwhile/parallel_0/Identity/_1292)]]

Caused by op u'while/split', defined at:
  File "/search/odin/public/anaconda2/bin/t2t-trainer", line 4, in <module>
    __import__('pkg_resources').run_script('tensor2tensor==1.0.4', 't2t-trainer')
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 739, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1507, in run_script
    exec(script_code, namespace, namespace)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensor2tensor-1.0.4-py2.7.egg/EGG-INFO/scripts/t2t-trainer", line 55, in <module>
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensor2tensor-1.0.4-py2.7.egg/EGG-INFO/scripts/t2t-trainer", line 51, in main
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 240, in run
    run_locally(exp_fn(output_dir))
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 543, in run_locally
    decode_from_file(estimator, FLAGS.decode_from_file)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 645, in decode_from_file
    result_iter = estimator.predict(input_fn=input_fn.next, as_iterable=True)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 590, in predict
    as_iterable=as_iterable)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 884, in _infer_model
    infer_ops = self._get_predict_ops(features)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1218, in _get_predict_ops
    return self._call_model_fn(features, labels, model_fn_lib.ModeKeys.INFER)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1133, in _call_model_fn
    model_fn_results = self._model_fn(features, labels, **kwargs)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 423, in model_fn
    len(hparams.problems) - 1)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 748, in _cond_on_index
    return fn(cur_idx)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 396, in nth_model
    decode_length=FLAGS.decode_extra_length)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/t2t_model.py", line 154, in infer
    last_position_only, alpha)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/t2t_model.py", line 211, in _beam_decode
    alpha)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/beam_search.py", line 405, in beam_search
    back_prop=False)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2766, in while_loop
    result = context.BuildLoop(cond, body, loop_vars, shape_invariants)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2595, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2545, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/beam_search.py", line 336, in inner_loop
    i, alive_seq, alive_log_probs)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/beam_search.py", line 240, in grow_topk
    flat_logits = symbols_to_logits_fn(flat_ids)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/t2t_model.py", line 181, in symbols_to_logits_fn
    features, False, last_position_only=last_position_only)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/t2t_model.py", line 352, in model_fn
    sharded_features = self._shard_features(features)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/t2t_model.py", line 332, in _shard_features
    0))
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 1214, in split
    split_dim=axis, num_split=num_or_size_splits, value=value, name=name)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3261, in _split
    num_split=num_split, name=name)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Number of ways to split should evenly divide the split dimension, but got split_dim 0 (size = 128) and num_split 3
	 [[Node: while/split = Split[T=DT_INT32, num_split=3, _device="/job:localhost/replica:0/task:0/cpu:0"](while/split/split_dim, while/split/Enter)]]
	 [[Node: while/GatherNd/_1405 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_4080_while/GatherNd", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](^_cloopwhile/parallel_0/Identity/_1292)]]
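
The message in the trace comes down to arithmetic: the split op requires the size of dimension 0 (128) to be an exact multiple of num_split (3), and 128 = 3 * 42 + 2. The sketch below reproduces that constraint in plain Python so it can be checked without TensorFlow. The helpers `split_evenly` and `pad_to_multiple` are hypothetical names made up for illustration and are not part of tensor2tensor; nothing in this thread confirms that padding, or choosing a batch size divisible by the GPU count, resolves the decoding problem.

```python
def split_evenly(batch, num_shards):
    """Split `batch` into `num_shards` equal parts.

    Hypothetical helper: it mirrors the even-divisibility check that the
    split op in the trace enforces, and raises ValueError otherwise.
    """
    if len(batch) % num_shards != 0:
        raise ValueError(
            "cannot split %d items evenly into %d shards"
            % (len(batch), num_shards))
    shard_size = len(batch) // num_shards
    return [batch[i * shard_size:(i + 1) * shard_size]
            for i in range(num_shards)]


def pad_to_multiple(batch, num_shards, pad_value=0):
    """Append `pad_value` until len(batch) is a multiple of `num_shards`.

    Hypothetical helper, shown only to illustrate one way a batch could be
    made divisible; it is an assumption, not tensor2tensor behaviour.
    """
    remainder = len(batch) % num_shards
    if remainder == 0:
        return list(batch)
    return list(batch) + [pad_value] * (num_shards - remainder)


batch = list(range(128))  # same size as the failing split dimension

try:
    split_evenly(batch, 3)  # 128 % 3 == 2, so this raises
except ValueError as err:
    print(err)

padded = pad_to_multiple(batch, 3)  # 128 -> 129 items
shards = split_evenly(padded, 3)
print(len(padded), [len(s) for s in shards])  # 129 [43, 43, 43]
```

Any padded entries would have to be dropped again after decoding, which this sketch does not attempt.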

@lkfo415579

I have the same problem here.

@cshanbo
Contributor Author

cshanbo commented Jun 29, 2017 via email

@lukaszkaiser
Contributor

Indeed, we need to work on multi-GPU decoding.

@martinpopel
Contributor

This (multi-GPU decoding) is the same problem as discussed in #266 (multi-GPU internal evaluation).

@rsepassi
Contributor

rsepassi commented Oct 1, 2017

Closing in favor of continuing the discussion and resolution in #266.

@rsepassi rsepassi closed this as completed Oct 1, 2017