
LightGCN model: trained on GPU, loading and predicting in GPU mode raises an error, while CPU mode works #834

Closed
dxjjhm opened this issue May 14, 2021 · 12 comments
Labels: bug (Something isn't working)

@dxjjhm commented May 14, 2021

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

  1. extra yaml file
  2. your code
  3. script for running

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Colab Links
If applicable, add links to Colab or other Jupyter laboratory platforms that can reproduce the bug.

Desktop (please complete the following information):

  • OS: [e.g. Linux, macOS or Windows]
  • RecBole Version [e.g. 0.1.0]
  • Python Version [e.g. 3.7.9]
  • PyTorch Version [e.g. 1.6.0]
  • cudatoolkit Version [e.g. 9.2, none]
dxjjhm added the bug (Something isn't working) label May 14, 2021
@chenyushuo (Collaborator)

Could you provide the error message and your yaml file?

@dxjjhm (Author) commented May 24, 2021

Traceback (most recent call last):
  File "/data/home/anaconda3/envs/pytorch/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/data/home/anaconda3/envs/pytorch/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/data/home/anaconda3/envs/pytorch/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/data/home/anaconda3/envs/pytorch/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/data/home/anaconda3/envs/pytorch/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/data/home/anaconda3/envs/pytorch/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "lightgcnSDYD.py", line 60, in predict
    topk_score, topk_iid_list = full_sort_topk(uid_series, lightgcn_model, test_data, 10)
  File "/data/home/anaconda3/envs/pytorch/lib/python3.7/site-packages/recbole-0.2.1-py3.7.egg/recbole/utils/case_study.py", line 87, in full_sort_topk
    scores = full_sort_scores(uid_series, model, test_data)
  File "/data/home/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/data/home/anaconda3/envs/pytorch/lib/python3.7/site-packages/recbole-0.2.1-py3.7.egg/recbole/utils/case_study.py", line 45, in full_sort_scores
    history_row = torch.cat([torch.full_like(hist_iid, i) for i, hist_iid in enumerate(history_item)])
RuntimeError: There were no tensor arguments to this function (e.g., you passed an empty list of Tensors), but no fallback function is registered for schema aten::_cat.  This usually means that this function requires a non-empty list of Tensors.  Available functions are [CPU, CUDA, QuantizedCPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].

CPU: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/build/aten/src/ATen/CPUType.cpp:2127 [kernel]
CUDA: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/build/aten/src/ATen/CUDAType.cpp:2983 [kernel]
QuantizedCPU: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/build/aten/src/ATen/QuantizedCPUType.cpp:297 [kernel]
BackendSelect: fallthrough registered at /opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Named: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
AutogradOther: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/torch/csrc/autograd/generated/VariableType_2.cpp:8078 [autograd kernel]
AutogradCPU: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/torch/csrc/autograd/generated/VariableType_2.cpp:8078 [autograd kernel]
AutogradCUDA: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/torch/csrc/autograd/generated/VariableType_2.cpp:8078 [autograd kernel]
AutogradXLA: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/torch/csrc/autograd/generated/VariableType_2.cpp:8078 [autograd kernel]
AutogradPrivateUse1: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/torch/csrc/autograd/generated/VariableType_2.cpp:8078 [autograd kernel]
AutogradPrivateUse2: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/torch/csrc/autograd/generated/VariableType_2.cpp:8078 [autograd kernel]
AutogradPrivateUse3: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/torch/csrc/autograd/generated/VariableType_2.cpp:8078 [autograd kernel]
Tracer: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/torch/csrc/autograd/generated/TraceType_2.cpp:9654 [kernel]
Autocast: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/autocast_mode.cpp:258 [kernel]
Batched: registered at /opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/BatchingRegistrations.cpp:511 [backend fallback]
VmapMode: fallthrough registered at /opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
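
For reference, the failing call reduces to torch.cat over an empty list of tensors. A minimal standalone sketch (outside RecBole, assuming any recent PyTorch build) that reproduces the same RuntimeError:

import torch

# Hypothetical: no history rows were collected for the requested user(s),
# so the comprehension below yields an empty list and torch.cat raises
# "There were no tensor arguments to this function ...".
history_item = []
history_row = torch.cat(
    [torch.full_like(hist_iid, i) for i, hist_iid in enumerate(history_item)]
)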

@dxjjhm (Author) commented May 24, 2021

When I train on GPU and then load and predict on GPU, the error above is thrown; switching to CPU for loading and prediction works without error.

@chenyushuo (Collaborator)

File "/data/home/anaconda3/envs/pytorch/lib/python3.7/site-packages/recbole-0.2.1-py3.7.egg/recbole/utils/case_study.py", line 45, in full_sort_scores
history_row = torch.cat([torch.full_like(hist_iid, i) for i, hist_iid in enumerate(history_item)])

From this traceback, the error looks like it is caused by history_item being an empty list. Could you check the values of uid_series and test_data.uid2history_item inside the full_sort_scores function, or provide more of your input so we can investigate?
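
A minimal defensive sketch of that idea (not RecBole's actual code, and not necessarily the eventual fix; the names follow the snippet quoted above, and the tensor layout is assumed):

import torch

def build_history_index(history_item, device):
    # history_item is assumed to be an iterable of 1-D item-id tensors,
    # one tensor per requested user; any of them may be empty.
    pairs = [(i, hist_iid) for i, hist_iid in enumerate(history_item) if len(hist_iid) > 0]
    if not pairs:
        # no recorded interactions at all: return empty index tensors
        # instead of calling torch.cat on an empty list
        empty = torch.empty(0, dtype=torch.int64, device=device)
        return empty, empty
    history_row = torch.cat([torch.full_like(hist_iid, i) for i, hist_iid in pairs])
    history_col = torch.cat([hist_iid for _, hist_iid in pairs])
    return history_row, history_col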

@dxjjhm (Author) commented May 25, 2021

But the point is that switching to CPU mode works without any problem.
I load the checkpoint with the following statement:
checkpoint = torch.load(light_model_path, map_location='cpu')
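
For context, a sketch of the two loading modes being compared here (the checkpoint path and the state-dict key are assumptions, not the reporter's exact script):

import torch

light_model_path = 'saved/LightGCN.pth'  # hypothetical path to the saved checkpoint

# CPU mode: reported to work
checkpoint_cpu = torch.load(light_model_path, map_location='cpu')

# GPU mode: the variant that triggers the error above
checkpoint_gpu = torch.load(light_model_path, map_location='cuda:0')

# either way, the weights are then restored into the model, e.g.:
# lightgcn_model.load_state_dict(checkpoint_gpu['state_dict'])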

dxjjhm changed the title from "LightGCN model: trained on GPU, loaded on CPU, prediction error" to "LightGCN model: trained on GPU, loading and predicting in GPU mode raises an error, while CPU mode works" May 27, 2021
@2017pxy (Member) commented Jul 19, 2021

@dxjjhm Hi, the information you provided is very limited and we cannot reproduce the bug. Please provide your complete code and the details needed for reproduction.

@zrg1993 commented Oct 4, 2021

I got the same error. I only added "repeatable: True" to the knowledge_base.yaml file in ./recbole/properties/quick_start_config/ and ran the KGAT model.

@Sherry-XLL (Member)

@zrg1993 Hello, I modified the knowledge_base.yaml file and ran the KGAT model on the GPU with the command python run_recbole.py --model=KGAT, but there was no error. Could you please provide the complete error message?

@zrg1993 commented Oct 6, 2021

I found that this error happens when an item has only one interaction with a single user in the dataset, so I think this case leads to an error in the train/test dataset split. Try adding one new item to ml-100k.item and one <user, new item> row to ml-100k.inter, then run case_study.py on the new item with any user, and you will see the error, as sketched below.
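
A rough sketch of that reproduction step (the paths and column values are illustrative assumptions; RecBole's atomic files are tab-separated, so match the fields to the header line of each file in your copy):

new_item_id = "9999"       # assumed to be an item id that does not exist yet
existing_user_id = "196"   # any user already present in the dataset

# append one new item (fields must match ml-100k.item's own header)
with open("dataset/ml-100k/ml-100k.item", "a") as f:
    f.write(f"{new_item_id}\tDummy Movie\t1999\tDrama\n")

# append a single <user, new item> interaction (fields must match ml-100k.inter's header)
with open("dataset/ml-100k/ml-100k.inter", "a") as f:
    f.write(f"{existing_user_id}\t{new_item_id}\t5\t880000000\n")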

@zrg1993 commented Oct 6, 2021

I also found that this example runs well with uid_series = dataset.token2id(dataset.uid_field, ['196', '186']), whereas uid_series = dataset.token2id(dataset.uid_field, ['196']) raises the error.
(screenshot of the error attached)

Sherry-XLL added a commit to Sherry-XLL/RecBole that referenced this issue Oct 6, 2021
Sherry-XLL added a commit that referenced this issue Oct 7, 2021
@Sherry-XLL (Member)

@zrg1993 We have fixed case_study.py in #986, thanks for your attention and suggestions.

@zrg1993 commented Oct 17, 2021

@Sherry-XLL Many thanks. It's working well now.
