[X-CLIP] Some CUDA-related errors during runtime #113

Open
zhengzehong331 opened this issue Sep 17, 2023 · 5 comments

Comments

@zhengzehong331

Thank you for your great work! I ran into this problem when training the model on the HMDB-51 dataset:

[2023-09-17 14:18:47 ViT-B/16](main.py 181): INFO Train: [0/50][0/3383]	eta 0:49:54 lr 0.000000000	time 0.8851 (0.8851)	tot_loss 2.6029 (2.6029)	mem 8942MB
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "main.py", line 278, in <module>
    main(config)
  File "main.py", line 104, in main
    train_one_epoch(epoch, model, criterion, optimizer, lr_scheduler, train_loader, text_labels, config, mixup_fn)
  File "main.py", line 144, in train_one_epoch
    images, label_id = mixup_fn(images, label_id)
  File "/root/autodl-tmp/VideoX/X-CLIP/datasets/blending.py", line 57, in __call__
    **kwargs)
  File "/root/autodl-tmp/VideoX/X-CLIP/datasets/blending.py", line 214, in do_blending
    return self.do_mixup(imgs, label)
  File "/root/autodl-tmp/VideoX/X-CLIP/datasets/blending.py", line 202, in do_mixup
    mixed_imgs = lam * imgs + (1 - lam) * imgs[rand_index, :]
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from createEvent at ../aten/src/ATen/cuda/CUDAEvent.h:174 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fdb60c737d2 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x267df7a (0x7fdbb3c92f7a in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: <unknown function> + 0x301898 (0x7fdc1608c898 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #3: c10::TensorImpl::release_resources() + 0x175 (0x7fdb60c5c005 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x1edf69 (0x7fdc15f78f69 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x4e5818 (0x7fdc16270818 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x299 (0x7fdc16270b19 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: /root/miniconda3/envs/xclip/bin/python() [0x4a0a87]
frame #8: /root/miniconda3/envs/xclip/bin/python() [0x4b0858]
frame #9: /root/miniconda3/envs/xclip/bin/python() [0x4c5b50]
frame #10: /root/miniconda3/envs/xclip/bin/python() [0x4c5b66]
frame #11: /root/miniconda3/envs/xclip/bin/python() [0x4c5b66]
frame #12: /root/miniconda3/envs/xclip/bin/python() [0x4946f7]
frame #13: PyDict_SetItemString + 0x61 (0x499261 in /root/miniconda3/envs/xclip/bin/python)
frame #14: PyImport_Cleanup + 0x89 (0x56f719 in /root/miniconda3/envs/xclip/bin/python)
frame #15: Py_FinalizeEx + 0x67 (0x56b1a7 in /root/miniconda3/envs/xclip/bin/python)
frame #16: /root/miniconda3/envs/xclip/bin/python() [0x53fc79]
frame #17: _Py_UnixMain + 0x3c (0x53fb3c in /root/miniconda3/envs/xclip/bin/python)
frame #18: __libc_start_main + 0xf3 (0x7fdc1897d083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #19: /root/miniconda3/envs/xclip/bin/python() [0x53f9ee]

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 15424) of binary: /root/miniconda3/envs/xclip/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/xclip/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/miniconda3/envs/xclip/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
main.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-17_14:18:51
  host      : autodl-container-7850119152-163467d4
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 15424)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 15424
======================================================

I was very confused and tried many approaches, but I could not solve it.

My GPU is a single NVIDIA GeForce RTX 2080 Ti.
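
For anyone debugging the same thing: the scatter assertion "index out of bounds" that precedes the device-side assert usually means some label id falls outside the range [0, num_classes) at the point where mixup one-hot encodes the labels (HMDB-51 has 51 classes, so the configured class count should be 51). The helper below is only my own sketch of that check, not part of the X-CLIP code; it assumes label_id is the integer label tensor passed to mixup_fn in train_one_epoch.

import torch

def assert_labels_in_range(label_id: torch.Tensor, num_classes: int) -> None:
    # Raise a readable Python-side error instead of a CUDA device-side assert.
    bad = label_id[(label_id < 0) | (label_id >= num_classes)]
    if bad.numel() > 0:
        raise ValueError(
            f"label ids {bad.unique().tolist()} are outside [0, {num_classes}); "
            "check that the configured class count matches the HMDB-51 label list"
        )

Calling this on label_id right before mixup_fn(images, label_id), and rerunning with CUDA_LAUNCH_BLOCKING=1, reports the failing batch synchronously instead of producing the misleading asynchronous traceback above.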

@okkkkkkkkkkkkk

I ran into the same problem. How did you solve yours?

@zhangyu-ntl

Hello, which torch version are you using?

@okkkkkkkkkkkkk

okkkkkkkkkkkkk commented Dec 3, 2024 via email

@zhangyu-ntl

zhangyu-ntl commented Dec 4, 2024 via email

@zhangyu-ntl

My problem is solved. The torch version I used is torch 1.10.1+cu111.

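For anyone comparing environments, a quick sanity check (my own snippet, not from the repo) to confirm the installed build matches that combination:

import torch

print(torch.__version__)          # expect 1.10.1+cu111
print(torch.version.cuda)         # expect 11.1
print(torch.cuda.is_available())  # should be True on the RTX 2080 Ti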

Hey, were you able to get this to run?
