2024-06-13T14:55:40 | __main__: Start training
2024-06-13T14:55:40 | __main__: Epoch: 0
2024-06-13T14:55:40 | dataset.dataloader: Do not skip steps for any dataloader!
2024-06-13T14:55:42 | dataset.dataloader: MetaLoader has 2 dataloaders, 153073 batches in total
    dataloader index=0 name=image, batch-size=1 length(#batches)=26674
    dataloader index=1 name=video, batch-size=1 length(#batches)=126399
WARNING 2024-06-13T14:55:43 | py.warnings: /workspace/data/code/vc2_hd/utils/distributed.py:18: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
    builtin_warn(*args, **kwargs)
The current flash attention version does not support sliding window attention, for a more memory efficient implementation make sure to upgrade flash-attn library.
WARNING 2024-06-13T14:55:44 | py.warnings: /root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/autograd/__init__.py:266: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
    grad.sizes() = [1, 32, 768], strides() = [73728, 768, 1]
    bucket_view.sizes() = [1, 32, 768], strides() = [24576, 768, 1]
    (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:322.)
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
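The "Grad strides do not match bucket view strides" warning is, by its own text, a performance note rather than an error: DDP's gradient layout contract expects a parameter's gradient to keep the parameter's memory layout, and here a [1, 32, 768] grad arrives with different strides than the reducer's bucket view. A small diagnostic sketch (the helper name is made up for illustration, it is not part of the repo) can list which parameters diverge after one backward pass:

    import torch.nn as nn

    def report_grad_stride_mismatches(model: nn.Module) -> None:
        # Heuristic check of DDP's gradient layout contract: a parameter's .grad
        # is expected to share the parameter's strides. Call this once after
        # loss.backward() on a single batch to see which parameters diverge.
        for name, p in model.named_parameters():
            if p.grad is not None and p.grad.stride() != p.stride():
                print(f"{name}: param strides {p.stride()}, grad strides {p.grad.stride()}")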
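The repeated torch.utils.checkpoint warning asks for an explicit use_reentrant argument; the non-reentrant variant (use_reentrant=False) is both the recommended choice and the one that sidesteps the DDP failure below. A minimal sketch of what that looks like at a checkpoint call site, using an illustrative block rather than the actual vc2_hd vision encoder:

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    class CheckpointedBlock(nn.Module):
        # Illustrative module, not the actual vision_encoder block.
        def __init__(self, dim: int = 768):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(dim, 4 * dim),
                nn.GELU(),
                nn.Linear(4 * dim, dim),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # use_reentrant=False selects the non-reentrant implementation,
            # which avoids the reentrant backward passes that DDP complains
            # about in the traceback below.
            return checkpoint(self.mlp, x, use_reentrant=False)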
Traceback (most recent call last):
  File "/workspace/data/code/vc2_hd/tasks/train_it_ds.py", line 291, in <module>
    main(cfg)
  File "/workspace/data/code/vc2_hd/tasks/train_it_ds.py", line 226, in main
    global_step = train(
  File "/workspace/data/code/vc2_hd/tasks/train_it_ds.py", line 85, in train
    scaler.scale(loss).backward()
  File "/root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/autograd/function.py", line 289, in apply
    return user_fn(self, *args)
  File "/root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 319, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 302 with name vision_encoder.encoder.blocks.22.mlp.fc2.bias has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
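The RuntimeError itself names two routes: stop reusing the checkpointed parameters across reentrant backward passes (the use_reentrant=False sketch above), or declare the graph static. A hedged sketch of the second route using DDP's public static_graph flag; the wrapper function is illustrative, assumes the process group is already initialized, and the real wrapping lives in tasks/train_it_ds.py:

    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def wrap_for_ddp(model: nn.Module, local_rank: int) -> DDP:
        # static_graph=True is the public counterpart of the _set_static_graph()
        # workaround named in the error message. It is only safe when the set of
        # used parameters and the autograd graph do not change across iterations.
        return DDP(model.cuda(local_rank), device_ids=[local_rank], static_graph=True)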
[2024-06-13 14:55:50,634] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 185782) of binary: /root/anaconda3/envs/vm/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/vm/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
tasks/train_it_ds.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-06-13_14:55:50
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 185783)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-06-13_14:55:50
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 185784)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-06-13_14:55:50
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 185785)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-06-13_14:55:50
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 185786)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-06-13_14:55:50
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 185787)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-06-13_14:55:50
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 185788)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-06-13_14:55:50
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 185789)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-13_14:55:50
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 185782)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
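Every per-rank entry above reports no error_file and points to the elastic errors documentation for enabling tracebacks. A minimal sketch, assuming the entry point is the main(cfg) function shown in the traceback: decorating it with torch.distributed.elastic's record makes torchrun write each failing rank's traceback into its error file.

    from torch.distributed.elastic.multiprocessing.errors import record

    @record  # torchrun then records each failing rank's traceback in its error_file
    def main(cfg):
        ...  # training loop as in tasks/train_it_ds.py (omitted here)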