Tensors are allocated on different devices #1

npclu0609 · 2024-09-18T06:05:37Z

Hello, I get the following error when running the code: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Do you know what is going wrong?
2024-09-18 13:51:50: Experiment log path in: /Project/Experiment-Project/STGormer/experiments/NYCBike1/20240918-135150
2024-09-18 13:51:50: Experiment configs are: Namespace(seed=31, device='cuda', mode='train', best_path=None, debug=False, data_dir='data', dataset='NYCBike1', input_length=19, output_length=1, batch_size=32, test_batch_size=32, graph_file='data/NYCBike1/adj_mx.npz', num_nodes=128, num_timestamps=168, tod_scaler=1, steps_per_day=24, layers=['S', 'T'], layer_depth=3, pos_embed_T='timepos', cen_embed_S=True, attn_bias_S=True, attn_mask_S=False, attn_mask_T=False, moe_status='SoftMoE', moe_mlr=False, num_experts=6, moe_dropout=0.1, top_k=1, moe_add_ff=False, moe_position='Full', expertWeightsAda=False, expertWeights=[0.8, 0.2], d_input=4, d_output=2, d_model=64, d_time_embed=24, d_space_embed=24, num_heads=4, mlp_ratio=4, dropout=0.1, yita=0.5, fft_status=False, epochs=200, lr_init=0.001, scheduler='StepLR', step_size=25, milestones=[1, 60, 90, 120, 150], factor=0.8, patience=10, gamma=0.5, mask_value_train=5.0, mask_value_test=5.0, early_stop=True, early_stop_patience=30, grad_norm=True, max_grad_norm=5, use_dwa=False, temp=4, save_path=None, num_shortpath=16, num_node_deg=9, log_dir='/Project/Experiment-Project/STGormer/experiments/NYCBike1/20240918-135150')
2024-09-18 13:51:50: Traceback (most recent call last):
File "/Project/Experiment-Project/STGormer/main.py", line 87, in model_supervisor
results = trainer.train() # best_eval_loss, best_epoch
^^^^^^^^^^^^^^^
File "/Project/Experiment-Project/STGormer/model/trainer.py", line 107, in train
train_epoch_loss = self.train_epoch(epoch)
^^^^^^^^^^^^^^^^^^^^^^^
File "/Project/Experiment-Project/STGormer/model/trainer.py", line 57, in train_epoch
repr, aux_loss = self.model(data, self.graph) # [B,N,C]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.conda/envs/stgormer/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.conda/envs/stgormer/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Project/Experiment-Project/STGormer/model/models.py", line 48, in forward
repr, aux_loss = self.encoder(view, graph) #[B, N, T, D]
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.conda/envs/stgormer/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.conda/envs/stgormer/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Project/Experiment-Project/STGormer/model/layers.py", line 75, in forward
encoder_input, _ = self.positional_encoding_1d(encoder_input) # BN, T, D
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.conda/envs/stgormer/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.conda/envs/stgormer/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Project/Experiment-Project/STGormer/model/positional_encoding.py", line 13, in forward
pos_enc = tp_enc_1d(input_data)
^^^^^^^^^^^^^^^^^^^^^
File "/.conda/envs/stgormer/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.conda/envs/stgormer/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.conda/envs/stgormer/lib/python3.12/site-packages/positional_encodings/torch_encodings.py", line 41, in forward
sin_inp_x = torch.einsum("i,j->ij", pos_x, self.inv_freq)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.conda/envs/stgormer/lib/python3.12/site-packages/torch/functional.py", line 386, in einsum
return _VF.einsum(equation, operands) # type: ignore[attr-defined]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

jasonz5 · 2024-09-28T16:00:06Z

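Hello, sorry for the late reply.
The current version of the code runs without errors in my local tests. As for the tensors not being on the same device, this may be caused by differences in device configuration between environments.
To fix this kind of problem, I usually use the error message to locate the exact line, set a breakpoint there <import ipdb; ipdb.set_trace()>, check where the tensor or model lives <tensor.device, or next(model.parameters()).device for a model>, and then move the tensors or model onto the same device <tensor1.to(tensor2.device)>. Note that device is an attribute, not a method, so it takes no parentheses.
Hope this helps.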
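The inspect-then-move steps above can be sketched as follows. This is only a minimal illustration under stated assumptions: `ToyEncoding` is a hypothetical stand-in for a module that holds a frequency buffer. It is not the `positional_encodings` class from the traceback, and the thread does not confirm that this is the actual cause in STGormer.

```python
import torch
from torch import nn


class ToyEncoding(nn.Module):
    """Hypothetical module holding a buffer, used only for illustration."""

    def __init__(self, channels):
        super().__init__()
        # Registered buffers follow the module when .to(device) is called on it.
        self.register_buffer("inv_freq", torch.ones(channels))

    def forward(self, x):
        # pos_x is created on the input's device; inv_freq lives wherever the module is.
        pos_x = torch.arange(x.shape[1], device=x.device, dtype=self.inv_freq.dtype)
        return torch.einsum("i,j->ij", pos_x, self.inv_freq)


# Use the GPU when available; fall back to CPU so the sketch still runs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.zeros(2, 4, 3, device=device)  # input already on the target device

enc = ToyEncoding(channels=3)  # a freshly constructed module starts on the CPU

# Step 1: inspect where each piece lives (.device is an attribute, not a method).
print("input:", x.device, "buffer:", enc.inv_freq.device)

# Step 2: move the module, and with it its buffers, onto the input's device.
enc = enc.to(x.device)
out = enc(x)
print(out.shape, out.device)
```

On a CUDA machine, calling `enc(x)` before the `enc.to(x.device)` line would mix a `cuda:0` tensor with a CPU buffer inside `torch.einsum`, which is the same kind of mismatch reported in the traceback. For a plain tensor rather than a module, `tensor1 = tensor1.to(tensor2.device)` plays the same role; unlike `Module.to`, `Tensor.to` is not in-place, so the result must be reassigned.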

npclu0609 · 2024-09-30T05:28:36Z

OK, I will try your method. Thank you very much!
