This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

Error occurs when using multiple gpus #481

Closed
njoe9 opened this issue Dec 21, 2017 · 2 comments


njoe9 commented Dec 21, 2017

Hi, all:

I cannot train a translation model with tensor2tensor on multiple GPUs (worker_gpu=4) on a single server.
The TensorFlow and tensor2tensor versions are 1.4 and 1.3.2, respectively.

The error is as follows:

InvalidArgumentError (see above for traceback): Number of ways to split should evenly divide the split dimension, but got split_dim 0 (size = 1) and num_split 4
[[Node: transformer/split = Split[T=DT_INT32, num_split=4, _device="/job:localhost/replica:0/task:0/device:CPU:0"](transformer/split/split_dim, input_fn/ExpandDims_1)]]
[[Node: transformer/body/model/parallel_1/body/decoder/layer_4/self_attention/multihead_attention/output_transform/Tensordot/Gather/_5817 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_18318_transformer/body/model/parallel_1/body/decoder/layer_4/self_attention/multihead_attention/output_transform/Tensordot/Gather", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]
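For context, this is TensorFlow's generic tf.split check failing: the size of the split dimension (1 here) must be evenly divisible by num_split (4 here, one shard per GPU). A minimal TensorFlow 1.x sketch, not taken from this setup, that triggers the same error:

import tensorflow as tf

# The leading (batch) dimension is only known at run time, so the check
# happens inside the Split op rather than at graph-construction time.
x = tf.placeholder(tf.int32, shape=[None, 3])
shards = tf.split(x, num_or_size_splits=4, axis=0)  # one shard per GPU

with tf.Session() as sess:
    try:
        # A batch of size 1 cannot be split 4 ways along axis 0.
        sess.run(shards, feed_dict={x: [[1, 2, 3]]})
    except tf.errors.InvalidArgumentError as e:
        print(e)  # "Number of ways to split should evenly divide the split dimension ..."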

The training script is the following:
t2t-trainer \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --hparams='hidden_size=1024,batch_size=4096,num_heads=16,attention_key_channels=64,attention_value_channels=64' \
  --train_steps=500000 \
  --worker_gpu_memory_fraction=0.98 \
  --worker_gpu=4 \
  --output_dir=$TRAIN_DIR

What could be the problem here?

Thanks.

@mehmedes

This seems to be the same issue as #266.
Try setting --schedule=train to disable evaluation, or apply the workaround mentioned in #266.
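For example, with the same variables as in your original command, the invocation with evaluation disabled would look like this (a sketch, assuming the rest of the setup is unchanged):

t2t-trainer \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --hparams='hidden_size=1024,batch_size=4096,num_heads=16,attention_key_channels=64,attention_value_channels=64' \
  --train_steps=500000 \
  --worker_gpu_memory_fraction=0.98 \
  --worker_gpu=4 \
  --schedule=train \
  --output_dir=$TRAIN_DIR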

@rsepassi
Contributor

Thank you @mehmedes. Closing in favor of #266.
