RuntimeError: CUDA error: an illegal memory access was encountered #156

Open
jpainam opened this issue Feb 12, 2024 · 0 comments

jpainam commented Feb 12, 2024

Hi,
Has anyone managed to train on multiple GPUs? I'm using this command:
python train_3d.py --outdir=./outdir --data=shapenet_get3d/img/03790512 --camera_path shapenet_get3d/camera --gpus=8 --batch=32 --gamma=40 --data_camera_mode shapenet_motorbike --dmtet_scale 1.0 --use_shapenet_split 1 --one_3d_generator 0 --img_res=256 --kimg=200 --workers 1

Constructing networks...
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:158, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
Setting up augmentation...
Distributing across 8 GPUs...
Traceback (most recent call last):
  File "train_3d.py", line 339, in <module>
    main()  # pylint: disable=no-value-for-parameter
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "train_3d.py", line 333, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "train_3d.py", line 107, in launch_training
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(c, temp_dir), nprocs=c.num_gpus)
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "~/GET3D/train_3d.py", line 51, in subprocess_fn
    training_loop_3d.training_loop(rank=rank, **c)
  File "~/GET3D/training/training_loop_3d.py", line 159, in training_loop
    G = dnnlib.util.construct_class_by_name(**G_kwargs, **common_kwargs).train().requires_grad_(False).to(
  File "~/GET3D/dnnlib/util.py", line 306, in construct_class_by_name
    return call_func_by_name(*args, func_name=class_name, **kwargs)
  File "~/GET3D/dnnlib/util.py", line 301, in call_func_by_name
    return func_obj(*args, **kwargs)
  File "~/GET3D/torch_utils/persistence.py", line 105, in __init__
    super().__init__(*args, **kwargs)
  File "~/GET3D/training/networks_get3d.py", line 599, in __init__
    self.synthesis = DMTETSynthesisNetwork(
  File "~/GET3D/torch_utils/persistence.py", line 105, in __init__
    super().__init__(*args, **kwargs)
  File "~/GET3D/training/networks_get3d.py", line 81, in __init__
    self.dmtet_geometry = DMTetGeometry(
  File "~/GET3D/uni_rep/rep_3d/dmtet.py", line 423, in __init__
    all_edges_sorted = torch.sort(all_edges, dim=1)[0]
RuntimeError: CUDA error: an illegal memory access was encountered
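
One way to narrow this down might be to relaunch the same command with synchronous CUDA error reporting and NCCL logging turned on, first with a single GPU and then with all eight. CUDA errors are normally reported asynchronously, so the `torch.sort` line in the traceback is not necessarily where the illegal access actually happened. The sketch below is only a launcher under stated assumptions: `build_debug_env` and `with_gpu_count` are hypothetical helper names, not part of GET3D, and it assumes `train_3d.py` is run from the repository root. `CUDA_LAUNCH_BLOCKING` and `NCCL_DEBUG` are standard CUDA and NCCL environment variables.

```python
import os
import shlex

# The command from this report, unchanged.
TRAIN_CMD = (
    "python train_3d.py --outdir=./outdir --data=shapenet_get3d/img/03790512 "
    "--camera_path shapenet_get3d/camera --gpus=8 --batch=32 --gamma=40 "
    "--data_camera_mode shapenet_motorbike --dmtet_scale 1.0 "
    "--use_shapenet_split 1 --one_3d_generator 0 --img_res=256 --kimg=200 "
    "--workers 1"
)


def build_debug_env(base_env=None):
    """Return a copy of the environment with CUDA/NCCL debugging enabled.

    Hypothetical helper for illustration, not part of GET3D.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Make kernel launches synchronous so the Python traceback points at
    # the operation that actually failed.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    # Ask NCCL to log its setup and any underlying CUDA failures.
    env["NCCL_DEBUG"] = "INFO"
    return env


def with_gpu_count(cmd, gpus):
    """Return the command as an argument list with --gpus=<n> replaced.

    Hypothetical helper for illustration, not part of GET3D.
    """
    args = shlex.split(cmd)
    return [f"--gpus={gpus}" if a.startswith("--gpus=") else a for a in args]


if __name__ == "__main__":
    env = build_debug_env()
    args = with_gpu_count(TRAIN_CMD, 1)
    print("CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=INFO", shlex.join(args))
    # To actually launch the run, uncomment the two lines below:
    # import subprocess
    # subprocess.run(args, env=env, check=True)
```

If the single-GPU run gets past "Constructing networks...", that would suggest the problem is specific to the multi-process setup rather than to the `DMTetGeometry` constructor itself. Note that `--batch=32` may not fit on one GPU, so the batch size might need lowering for that check.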