
LightGCN model: trained on GPU, loading and predicting in GPU mode raises an error, while CPU mode works #834

Closed
dxjjhm opened this issue May 14, 2021 · 12 comments
Labels: bug (Something isn't working)

@dxjjhm commented May 14, 2021

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

  1. extra yaml file
  2. your code
  3. script for running

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Colab Links
If applicable, add links to Colab or other Jupyter laboratory platforms that can reproduce the bug.

Desktop (please complete the following information):

  • OS: [e.g. Linux, macOS or Windows]
  • RecBole Version [e.g. 0.1.0]
  • Python Version [e.g. 3.7.9]
  • PyTorch Version [e.g. 1.6.0]
  • cudatoolkit Version [e.g. 9.2, none]
dxjjhm added the bug (Something isn't working) label May 14, 2021
@chenyushuo (Collaborator)

Could you provide the error message and your yaml file?

@dxjjhm (Author) commented May 24, 2021

Traceback (most recent call last):
  File "/data/home/anaconda3/envs/pytorch/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/data/home/anaconda3/envs/pytorch/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/data/home/anaconda3/envs/pytorch/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/data/home/anaconda3/envs/pytorch/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/data/home/anaconda3/envs/pytorch/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/data/home/anaconda3/envs/pytorch/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "lightgcnSDYD.py", line 60, in predict
    topk_score, topk_iid_list = full_sort_topk(uid_series, lightgcn_model, test_data, 10)
  File "/data/home/anaconda3/envs/pytorch/lib/python3.7/site-packages/recbole-0.2.1-py3.7.egg/recbole/utils/case_study.py", line 87, in full_sort_topk
    scores = full_sort_scores(uid_series, model, test_data)
  File "/data/home/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/data/home/anaconda3/envs/pytorch/lib/python3.7/site-packages/recbole-0.2.1-py3.7.egg/recbole/utils/case_study.py", line 45, in full_sort_scores
    history_row = torch.cat([torch.full_like(hist_iid, i) for i, hist_iid in enumerate(history_item)])
RuntimeError: There were no tensor arguments to this function (e.g., you passed an empty list of Tensors), but no fallback function is registered for schema aten::_cat.  This usually means that this function requires a non-empty list of Tensors.  Available functions are [CPU, CUDA, QuantizedCPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].

CPU: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/build/aten/src/ATen/CPUType.cpp:2127 [kernel]
CUDA: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/build/aten/src/ATen/CUDAType.cpp:2983 [kernel]
QuantizedCPU: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/build/aten/src/ATen/QuantizedCPUType.cpp:297 [kernel]
BackendSelect: fallthrough registered at /opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Named: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
AutogradOther: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/torch/csrc/autograd/generated/VariableType_2.cpp:8078 [autograd kernel]
AutogradCPU: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/torch/csrc/autograd/generated/VariableType_2.cpp:8078 [autograd kernel]
AutogradCUDA: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/torch/csrc/autograd/generated/VariableType_2.cpp:8078 [autograd kernel]
AutogradXLA: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/torch/csrc/autograd/generated/VariableType_2.cpp:8078 [autograd kernel]
AutogradPrivateUse1: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/torch/csrc/autograd/generated/VariableType_2.cpp:8078 [autograd kernel]
AutogradPrivateUse2: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/torch/csrc/autograd/generated/VariableType_2.cpp:8078 [autograd kernel]
AutogradPrivateUse3: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/torch/csrc/autograd/generated/VariableType_2.cpp:8078 [autograd kernel]
Tracer: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/torch/csrc/autograd/generated/TraceType_2.cpp:9654 [kernel]
Autocast: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/autocast_mode.cpp:258 [kernel]
Batched: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/BatchingRegistrations.cpp:511 [backend fallback]
VmapMode: fallthrough registered at /opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
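
For reference, the failing call reduces to torch.cat over an empty list of tensors. A minimal standalone sketch (outside RecBole, assuming any recent PyTorch build) that reproduces the same RuntimeError:

import torch

# Hypothetical: no history rows were collected for the requested user(s),
# so the comprehension below yields an empty list and torch.cat raises
# "There were no tensor arguments to this function ...".
history_item = []
history_row = torch.cat(
    [torch.full_like(hist_iid, i) for i, hist_iid in enumerate(history_item)]
)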

@dxjjhm (Author) commented May 24, 2021

When I train on GPU and then load and predict on GPU, the error above is thrown; switching to CPU for loading and prediction works without error.

@chenyushuo (Collaborator)

File "/data/home/anaconda3/envs/pytorch/lib/python3.7/site-packages/recbole-0.2.1-py3.7.egg/recbole/utils/case_study.py", line 45, in full_sort_scores
history_row = torch.cat([torch.full_like(hist_iid, i) for i, hist_iid in enumerate(history_item)])

From this traceback, the error looks like it is caused by history_item being an empty list. Could you check the values of uid_series and test_data.uid2history_item inside the full_sort_scores function, or provide more of your input so we can investigate?
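
A minimal defensive sketch of that idea (not RecBole's actual code, and not necessarily the eventual fix; the names follow the snippet quoted above, and the tensor layout is assumed):

import torch

def build_history_index(history_item, device):
    # history_item is assumed to be an iterable of 1-D item-id tensors,
    # one tensor per requested user; any of them may be empty.
    pairs = [(i, hist_iid) for i, hist_iid in enumerate(history_item) if len(hist_iid) > 0]
    if not pairs:
        # no recorded interactions at all: return empty index tensors
        # instead of calling torch.cat on an empty list
        empty = torch.empty(0, dtype=torch.int64, device=device)
        return empty, empty
    history_row = torch.cat([torch.full_like(hist_iid, i) for i, hist_iid in pairs])
    history_col = torch.cat([hist_iid for _, hist_iid in pairs])
    return history_row, history_col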

@dxjjhm (Author) commented May 25, 2021

But the point is that switching to CPU mode works without any problem.
I load the checkpoint with the following statement:
checkpoint = torch.load(light_model_path, map_location='cpu')
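
For context, a sketch of the two loading modes being compared here (the checkpoint path and the state-dict key are assumptions, not the reporter's exact script):

import torch

light_model_path = 'saved/LightGCN.pth'  # hypothetical path to the saved checkpoint

# CPU mode: reported to work
checkpoint_cpu = torch.load(light_model_path, map_location='cpu')

# GPU mode: the variant that triggers the error above
checkpoint_gpu = torch.load(light_model_path, map_location='cuda:0')

# either way, the weights are then restored into the model, e.g.:
# lightgcn_model.load_state_dict(checkpoint_gpu['state_dict'])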

dxjjhm changed the title from "LightGCN model: trained on GPU, loaded on CPU, prediction error" to "LightGCN model: trained on GPU, loading and predicting in GPU mode raises an error, while CPU mode works" May 27, 2021
@2017pxy (Member) commented Jul 19, 2021

@dxjjhm Hi, the information you provided is very limited and we cannot reproduce the bug. Please provide your complete code and the details needed for reproduction.

@zrg1993 commented Oct 4, 2021

I got the same error. I only added "repeatable: True" to the knowledge_base.yaml file in ./recbole/properties/quick_start_config/ and ran the KGAT model.

@Sherry-XLL (Member)

@zrg1993 Hello, I modified the knowledge_base.yaml file and ran the KGAT model on the GPU with the command python run_recbole.py --model=KGAT, but there was no error. Could you please provide the complete error message?

@zrg1993 commented Oct 6, 2021

I found that this error happens when an item has only one interaction with a single user in the dataset, so I think this case leads to an error in the train/test dataset split. Try adding one new item to ml-100k.item and one <user, new item> row to ml-100k.inter, then run case_study.py on the new item with any user, and you will see the error, as sketched below.
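
A rough sketch of that reproduction step (the paths and column values are illustrative assumptions; RecBole's atomic files are tab-separated, so match the fields to the header line of each file in your copy):

new_item_id = "9999"       # assumed to be an item id that does not exist yet
existing_user_id = "196"   # any user already present in the dataset

# append one new item (fields must match ml-100k.item's own header)
with open("dataset/ml-100k/ml-100k.item", "a") as f:
    f.write(f"{new_item_id}\tDummy Movie\t1999\tDrama\n")

# append a single <user, new item> interaction (fields must match ml-100k.inter's header)
with open("dataset/ml-100k/ml-100k.inter", "a") as f:
    f.write(f"{existing_user_id}\t{new_item_id}\t5\t880000000\n")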

@zrg1993 commented Oct 6, 2021

I also found that this example runs well with uid_series = dataset.token2id(dataset.uid_field, ['196', '186']), whereas uid_series = dataset.token2id(dataset.uid_field, ['196']) raises the error.
(screenshot of the error attached)

Sherry-XLL added a commit to Sherry-XLL/RecBole that referenced this issue Oct 6, 2021
Sherry-XLL added a commit that referenced this issue Oct 7, 2021
@Sherry-XLL (Member)

@zrg1993 We have fixed case_study.py in #986, thanks for your attention and suggestions.

@zrg1993 commented Oct 17, 2021

@Sherry-XLL Many thanks. It's working well now.
