I can only train the model on a single GPU #64

Open
Eternal-Br opened this issue Dec 27, 2024 · 0 comments
Hi, I can train with mbyolo_train.py on a single GPU and that works fine.
But when I try to train on 2 GPUs with DDP, both of the commands below fail with errors.
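For reference, the working single-GPU run is essentially the same command with a single device id (the id 0 here is just an example):

python mbyolo_train.py --batch_size 32 --device 0 --workers 4 --epoch 32 --name mamba_20241227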

Command 1, attempting DDP by passing both GPUs to Ultralytics (--device 0,1):

python mbyolo_train.py --batch_size 32 --device 0,1 --workers 4 --epoch 32 --name mamba_20241227

Trace:
DDP: debug command /usr/bin/python -m torch.distributed.run --nproc_per_node 2 --master_port 33452 /root/.config/Ultralytics/DDP/_temp_pqb5bgkr140140367604032.py
Traceback (most recent call last):
File "/root/.config/Ultralytics/DDP/_temp_pqb5bgkr140140367604032.py", line 6, in
from ultralytics.models.yolo.detect.train import DetectionTrainer
ModuleNotFoundError: No module named 'ultralytics'
Traceback (most recent call last):
File "/root/.config/Ultralytics/DDP/_temp_pqb5bgkr140140367604032.py", line 6, in
from ultralytics.models.yolo.detect.train import DetectionTrainer
ModuleNotFoundError: No module named 'ultralytics'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2263223) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 798, in
main()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/root/.config/Ultralytics/DDP/_temp_pqb5bgkr140140367604032.py FAILED

Failures:
[1]:
time : 2024-12-27_09:20:52
host : my-host.com
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2263230)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-12-27_09:20:52
host : my-host.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2263223)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Traceback (most recent call last):
File "mbyolo_train.py", line 44, in
"train": YOLO(model_conf).train(**args),
File "/path/to/Mamba-YOLO/ultralytics/engine/model.py", line 674, in train
self.trainer.train()
File "/path/to/Mamba-YOLO/ultralytics/engine/trainer.py", line 194, in train
raise e
File "/path/to//Mamba-YOLO/ultralytics/engine/trainer.py", line 192, in train
subprocess.run(cmd, check=True)
File "/usr/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/bin/python', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '33452', '/root/.config/Ultralytics/DDP/_temp_pqb5bgkr140140367604032.py']' returned non-zero exit status 1.
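My reading of this trace: passing --device 0,1 makes Ultralytics write a temporary DDP script under /root/.config/Ultralytics/DDP/ and relaunch it with torch.distributed.run, and that child process fails on "from ultralytics.models.yolo.detect.train import DetectionTrainer". The parent process can import ultralytics because it lives inside the Mamba-YOLO repo next to mbyolo_train.py, but it is not installed in site-packages, so I think the spawned workers simply do not have the repo on their import path. Below is a minimal sketch of what I believe would make it visible to those workers (the repo path is a placeholder, and this is a guess, not a confirmed fix):

import os, sys

REPO_ROOT = "/path/to/Mamba-YOLO"  # placeholder, replace with the actual clone location

# Make the repo-local `ultralytics` importable in this (parent) process ...
sys.path.insert(0, REPO_ROOT)

# ... and export it so the torch.distributed.run child processes
# (which execute the temp DDP script, not mbyolo_train.py) inherit it too.
os.environ["PYTHONPATH"] = REPO_ROOT + os.pathsep + os.environ.get("PYTHONPATH", "")

Installing the repo's package in editable mode (pip install -e . from the repo root, if it ships a setup file) should presumably have the same effect.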

Command 2 for training on 2 GPUs:

python -m torch.distributed.launch --nproc_per_node 2 mbyolo_train.py --batch_size 32 --device 1,2 --workers 4 --epoch 32 --name mamba_20241227
Trace:
/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


usage: mbyolo_train.py [-h] [--data DATA] [--config CONFIG] [--batch_size BATCH_SIZE] [--imgsz IMGSZ] [--task TASK] [--device DEVICE]
[--workers WORKERS] [--epochs EPOCHS] [--optimizer OPTIMIZER] [--amp] [--project PROJECT] [--name NAME] [--half]
[--dnn]
mbyolo_train.py: error: unrecognized arguments: --local-rank=0
usage: mbyolo_train.py [-h] [--data DATA] [--config CONFIG] [--batch_size BATCH_SIZE] [--imgsz IMGSZ] [--task TASK] [--device DEVICE]
[--workers WORKERS] [--epochs EPOCHS] [--optimizer OPTIMIZER] [--amp] [--project PROJECT] [--name NAME] [--half]
[--dnn]
mbyolo_train.py: error: unrecognized arguments: --local-rank=1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 2268165) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 196, in
main()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

mbyolo_train.py FAILED

Failures:
[1]:
time : 2024-12-27_09:24:52
host : my-host.com
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 2268166)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-12-27_09:24:52
host : my-host.com
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 2268165)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
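
The second failure looks simpler: torch.distributed.launch still passes --local-rank=N to every worker, and the argparse setup in mbyolo_train.py (see the usage text above) does not define that option, so each worker exits with code 2 before training even starts. If I understand it right, either switching to torchrun (which sets LOCAL_RANK in the environment instead, as the deprecation warning says) or teaching the parser to accept the flag should get past this. A minimal sketch of the argparse change; only the option name comes from the trace, the rest of the parser is assumed:

import argparse

parser = argparse.ArgumentParser()
# ... the existing mbyolo_train.py options (--data, --config, --batch_size, ...) stay as they are ...

# Accept the flag that torch.distributed.launch injects so argparse does not exit with code 2.
# torchrun does not pass it, so the default of -1 simply applies there.
parser.add_argument("--local-rank", "--local_rank", dest="local_rank", type=int, default=-1,
                    help="set automatically by torch.distributed.launch")

args = parser.parse_args()

That said, since Ultralytics already launches DDP itself when --device lists more than one GPU, command 1 is probably the intended route once the import problem above is solved.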

Do you have any ideas? Thanks.
