This repository has been archived by the owner on Jul 7, 2023. It is now read-only.
Hi, all:
I cannot apply tensor2tensor to train a translation model with multiple GPUs (worker_gpu=4) on a server. The TensorFlow and tensor2tensor versions are 1.4 and 1.3.2, respectively.
The error is as follows:
InvalidArgumentError (see above for traceback): Number of ways to split should evenly divide the split dimension, but got split_dim 0 (size = 1) and num_split 4
[[Node: transformer/split = Split[T=DT_INT32, num_split=4, _device="/job:localhost/replica:0/task:0/device:CPU:0"](transformer/split/split_dim, input_fn/ExpandDims_1)]]
[[Node: transformer/body/model/parallel_1/body/decoder/layer_4/self_attention/multihead_attention/output_transform/Tensordot/Gather/_5817 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_18318_transformer/body/model/parallel_1/body/decoder/layer_4/self_attention/multihead_attention/output_transform/Tensordot/Gather", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]
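For context, the error message is about the Split op's divisibility constraint: a tensor whose split dimension has size 1 is being split into worker_gpu=4 shards, and 1 is not divisible by 4. A minimal, framework-free sketch of that constraint (split_batch is a hypothetical helper for illustration, not a tensor2tensor function):

```python
def split_batch(batch, num_shards):
    """Split a list-of-examples batch evenly across num_shards devices.

    Mirrors the TensorFlow Split op's requirement that the size of the
    split dimension be evenly divisible by num_split.
    """
    if len(batch) % num_shards != 0:
        raise ValueError(
            "Number of ways to split should evenly divide the split "
            "dimension, but got size %d and num_split %d"
            % (len(batch), num_shards))
    shard_size = len(batch) // num_shards
    return [batch[i * shard_size:(i + 1) * shard_size]
            for i in range(num_shards)]

# A batch of 8 examples splits cleanly across 4 shards of 2:
shards = split_batch(list(range(8)), 4)
print(shards)  # [[0, 1], [2, 3], [4, 5], [6, 7]]

# A batch of size 1 cannot be split 4 ways -- the failure mode above:
try:
    split_batch([0], 4)
except ValueError as e:
    print("error:", e)
```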
The training script is the following:

t2t-trainer \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --hparams='hidden_size=1024,batch_size=4096,num_heads=16,attention_key_channels=64,attention_value_channels=64' \
  --train_steps=500000 \
  --worker_gpu_memory_fraction=0.98 \
  --worker_gpu=4 \
  --output_dir=$TRAIN_DIR
What is causing this problem?
Thanks.