2024-06-21T18:04:56 | tasks.shared_utils_ds: Auto resuming
2024-06-21T18:04:56 | tasks.shared_utils_ds: Not found checkpoint in ./out0611/
2024-06-21T18:04:56 | tasks.shared_utils_ds: Use deepspeed to initialize model!!!
[2024-06-21 18:04:56,884] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.14.2, git-hash=unknown, git-branch=unknown
[2024-06-21 18:04:56,884] [INFO] [comm.py:637:init_distributed] cdb=None
Traceback (most recent call last):
  File "/workspace/data/code/vc2_hd/tasks/train_it_ds.py", line 292, in <module>
    main(cfg)
  File "/workspace/data/code/vc2_hd/tasks/train_it_ds.py", line 205, in main
    ) = setup_model(
  File "/workspace/data/code/vc2_hd/tasks/shared_utils_ds.py", line 126, in setup_model
    model, optimizer, _, _ = deepspeed.initialize(
  File "/root/anaconda3/envs/vhd/lib/python3.10/site-packages/deepspeed/__init__.py", line 164, in initialize
    assert config is not None, "DeepSpeed requires --deepspeed_config to specify configuration file"
AssertionError: DeepSpeed requires --deepspeed_config to specify configuration file
Traceback (most recent call last):
  File "/workspace/data/code/vc2_hd/tasks/train_it_ds.py", line 292, in <module>
    main(cfg)
  File "/workspace/data/code/vc2_hd/tasks/train_it_ds.py", line 205, in main
    ) = setup_model(
  File "/workspace/data/code/vc2_hd/tasks/shared_utils_ds.py", line 126, in setup_model
    model, optimizer, _, _ = deepspeed.initialize(
  File "/root/anaconda3/envs/vhd/lib/python3.10/site-packages/deepspeed/__init__.py", line 164, in initialize
    assert config is not None, "DeepSpeed requires --deepspeed_config to specify configuration file"
AssertionError: DeepSpeed requires --deepspeed_config to specify configuration file
Traceback (most recent call last):
  File "/workspace/data/code/vc2_hd/tasks/train_it_ds.py", line 292, in <module>
    main(cfg)
  File "/workspace/data/code/vc2_hd/tasks/train_it_ds.py", line 205, in main
    ) = setup_model(
  File "/workspace/data/code/vc2_hd/tasks/shared_utils_ds.py", line 126, in setup_model
    model, optimizer, _, _ = deepspeed.initialize(
  File "/root/anaconda3/envs/vhd/lib/python3.10/site-packages/deepspeed/__init__.py", line 164, in initialize
    assert config is not None, "DeepSpeed requires --deepspeed_config to specify configuration file"
AssertionError: DeepSpeed requires --deepspeed_config to specify configuration file
[2024-06-21 18:05:03,247] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 54 closing signal SIGTERM
[2024-06-21 18:05:03,512] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 51) of binary: /root/anaconda3/envs/vhd/bin/python3.10
Traceback (most recent call last):
  File "/root/anaconda3/envs/vhd/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/vhd/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/vhd/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/root/anaconda3/envs/vhd/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/anaconda3/envs/vhd/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/vhd/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
tasks/train_it_ds.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-06-21_18:05:03
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 52)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-06-21_18:05:03
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 53)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-06-21_18:05:03
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 55)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-06-21_18:05:03
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 56)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-06-21_18:05:03
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 57)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-06-21_18:05:03
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 58)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-21_18:05:03
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 51)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================