[X-CLIP] Some CUDA-related errors during runtime #113

Open
zhengzehong331 opened this issue Sep 17, 2023 · 5 comments

Comments

@zhengzehong331

Thank you for your great work! I ran into this problem when training the model on the HMDB-51 dataset:

[2023-09-17 14:18:47 ViT-B/16](main.py 181): INFO Train: [0/50][0/3383]	eta 0:49:54 lr 0.000000000	time 0.8851 (0.8851)	tot_loss 2.6029 (2.6029)	mem 8942MB
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "main.py", line 278, in <module>
    main(config)
  File "main.py", line 104, in main
    train_one_epoch(epoch, model, criterion, optimizer, lr_scheduler, train_loader, text_labels, config, mixup_fn)
  File "main.py", line 144, in train_one_epoch
    images, label_id = mixup_fn(images, label_id)
  File "/root/autodl-tmp/VideoX/X-CLIP/datasets/blending.py", line 57, in __call__
    **kwargs)
  File "/root/autodl-tmp/VideoX/X-CLIP/datasets/blending.py", line 214, in do_blending
    return self.do_mixup(imgs, label)
  File "/root/autodl-tmp/VideoX/X-CLIP/datasets/blending.py", line 202, in do_mixup
    mixed_imgs = lam * imgs + (1 - lam) * imgs[rand_index, :]
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from createEvent at ../aten/src/ATen/cuda/CUDAEvent.h:174 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fdb60c737d2 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x267df7a (0x7fdbb3c92f7a in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: <unknown function> + 0x301898 (0x7fdc1608c898 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #3: c10::TensorImpl::release_resources() + 0x175 (0x7fdb60c5c005 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x1edf69 (0x7fdc15f78f69 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x4e5818 (0x7fdc16270818 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x299 (0x7fdc16270b19 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: /root/miniconda3/envs/xclip/bin/python() [0x4a0a87]
frame #8: /root/miniconda3/envs/xclip/bin/python() [0x4b0858]
frame #9: /root/miniconda3/envs/xclip/bin/python() [0x4c5b50]
frame #10: /root/miniconda3/envs/xclip/bin/python() [0x4c5b66]
frame #11: /root/miniconda3/envs/xclip/bin/python() [0x4c5b66]
frame #12: /root/miniconda3/envs/xclip/bin/python() [0x4946f7]
frame #13: PyDict_SetItemString + 0x61 (0x499261 in /root/miniconda3/envs/xclip/bin/python)
frame #14: PyImport_Cleanup + 0x89 (0x56f719 in /root/miniconda3/envs/xclip/bin/python)
frame #15: Py_FinalizeEx + 0x67 (0x56b1a7 in /root/miniconda3/envs/xclip/bin/python)
frame #16: /root/miniconda3/envs/xclip/bin/python() [0x53fc79]
frame #17: _Py_UnixMain + 0x3c (0x53fb3c in /root/miniconda3/envs/xclip/bin/python)
frame #18: __libc_start_main + 0xf3 (0x7fdc1897d083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #19: /root/miniconda3/envs/xclip/bin/python() [0x53f9ee]

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 15424) of binary: /root/miniconda3/envs/xclip/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/xclip/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/miniconda3/envs/xclip/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
main.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-17_14:18:51
  host      : autodl-container-7850119152-163467d4
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 15424)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 15424
======================================================

I was very confused and tried many approaches, but I could not solve it.

My GPU is a single NVIDIA GeForce RTX 2080 Ti.
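
For anyone debugging the same thing: the scatter assertion "index out of bounds" that precedes the device-side assert usually means some label id falls outside the range [0, num_classes) at the point where mixup one-hot encodes the labels (HMDB-51 has 51 classes, so the configured class count should be 51). The helper below is only my own sketch of that check, not part of the X-CLIP code; it assumes label_id is the integer label tensor passed to mixup_fn in train_one_epoch.

import torch

def assert_labels_in_range(label_id: torch.Tensor, num_classes: int) -> None:
    # Raise a readable Python-side error instead of a CUDA device-side assert.
    bad = label_id[(label_id < 0) | (label_id >= num_classes)]
    if bad.numel() > 0:
        raise ValueError(
            f"label ids {bad.unique().tolist()} are outside [0, {num_classes}); "
            "check that the configured class count matches the HMDB-51 label list"
        )

Calling this on label_id right before mixup_fn(images, label_id), and rerunning with CUDA_LAUNCH_BLOCKING=1, reports the failing batch synchronously instead of producing the misleading asynchronous traceback above.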

@okkkkkkkkkkkkk

I ran into the same problem. How did you solve yours?

@zhangyu-ntl

Hello, which torch version are you using?

@okkkkkkkkkkkkk

okkkkkkkkkkkkk commented Dec 3, 2024 via email

@zhangyu-ntl

zhangyu-ntl commented Dec 4, 2024 via email

@zhangyu-ntl

My problem is solved. The torch version I used is torch 1.10.1+cu111.

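For anyone comparing environments, a quick sanity check (my own snippet, not from the repo) to confirm the installed build matches that combination:

import torch

print(torch.__version__)          # expect 1.10.1+cu111
print(torch.version.cuda)         # expect 11.1
print(torch.cuda.is_available())  # should be True on the RTX 2080 Ti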

Hey, were you able to get this to run?
