2024-06-13T14:55:40 | __main__: Start training
2024-06-13T14:55:40 | __main__: Epoch: 0
2024-06-13T14:55:40 | dataset.dataloader: Do not skip steps for any dataloader!
2024-06-13T14:55:42 | dataset.dataloader: MetaLoader has 2 dataloaders, 153073 batches in total
    dataloader index=0 name=image, batch-size=1 length(#batches)=26674
    dataloader index=1 name=video, batch-size=1 length(#batches)=126399
WARNING 2024-06-13T14:55:43 | py.warnings: /workspace/data/code/vc2_hd/utils/distributed.py:18: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
    builtin_warn(*args, **kwargs)
The current flash attention version does not support sliding window attention, for a more memory efficient implementation make sure to upgrade flash-attn library.
WARNING 2024-06-13T14:55:44 | py.warnings: /root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/autograd/__init__.py:266: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
    grad.sizes() = [1, 32, 768], strides() = [73728, 768, 1]
    bucket_view.sizes() = [1, 32, 768], strides() = [24576, 768, 1]
    (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:322.)
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
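The "Grad strides do not match bucket view strides" warning is, by its own text, a performance note rather than an error: DDP's gradient layout contract expects a parameter's gradient to keep the parameter's memory layout, and here a [1, 32, 768] grad arrives with different strides than the reducer's bucket view. A small diagnostic sketch (the helper name is made up for illustration, it is not part of the repo) can list which parameters diverge after one backward pass:

    import torch.nn as nn

    def report_grad_stride_mismatches(model: nn.Module) -> None:
        # Heuristic check of DDP's gradient layout contract: a parameter's .grad
        # is expected to share the parameter's strides. Call this once after
        # loss.backward() on a single batch to see which parameters diverge.
        for name, p in model.named_parameters():
            if p.grad is not None and p.grad.stride() != p.stride():
                print(f"{name}: param strides {p.stride()}, grad strides {p.grad.stride()}")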
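The repeated torch.utils.checkpoint warning asks for an explicit use_reentrant argument; the non-reentrant variant (use_reentrant=False) is both the recommended choice and the one that sidesteps the DDP failure below. A minimal sketch of what that looks like at a checkpoint call site, using an illustrative block rather than the actual vc2_hd vision encoder:

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    class CheckpointedBlock(nn.Module):
        # Illustrative module, not the actual vision_encoder block.
        def __init__(self, dim: int = 768):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(dim, 4 * dim),
                nn.GELU(),
                nn.Linear(4 * dim, dim),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # use_reentrant=False selects the non-reentrant implementation,
            # which avoids the reentrant backward passes that DDP complains
            # about in the traceback below.
            return checkpoint(self.mlp, x, use_reentrant=False)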
Traceback (most recent call last):
  File "/workspace/data/code/vc2_hd/tasks/train_it_ds.py", line 291, in <module>
    main(cfg)
  File "/workspace/data/code/vc2_hd/tasks/train_it_ds.py", line 226, in main
    global_step = train(
  File "/workspace/data/code/vc2_hd/tasks/train_it_ds.py", line 85, in train
    scaler.scale(loss).backward()
  File "/root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/autograd/function.py", line 289, in apply
    return user_fn(self, *args)
  File "/root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 319, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 302 with name vision_encoder.encoder.blocks.22.mlp.fc2.bias has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
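The RuntimeError itself names two routes: stop reusing the checkpointed parameters across reentrant backward passes (the use_reentrant=False sketch above), or declare the graph static. A hedged sketch of the second route using DDP's public static_graph flag; the wrapper function is illustrative, assumes the process group is already initialized, and the real wrapping lives in tasks/train_it_ds.py:

    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def wrap_for_ddp(model: nn.Module, local_rank: int) -> DDP:
        # static_graph=True is the public counterpart of the _set_static_graph()
        # workaround named in the error message. It is only safe when the set of
        # used parameters and the autograd graph do not change across iterations.
        return DDP(model.cuda(local_rank), device_ids=[local_rank], static_graph=True)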
[2024-06-13 14:55:50,634] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 185782) of binary: /root/anaconda3/envs/vm/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/vm/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/vm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
tasks/train_it_ds.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-06-13_14:55:50
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 185783)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-06-13_14:55:50
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 185784)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-06-13_14:55:50
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 185785)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-06-13_14:55:50
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 185786)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-06-13_14:55:50
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 185787)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-06-13_14:55:50
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 185788)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-06-13_14:55:50
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 185789)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-13_14:55:50
  host      : hgx-a800-091.nxchinamobile.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 185782)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
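Every per-rank entry above reports no error_file and points to the elastic errors documentation for enabling tracebacks. A minimal sketch, assuming the entry point is the main(cfg) function shown in the traceback: decorating it with torch.distributed.elastic's record makes torchrun write each failing rank's traceback into its error file.

    from torch.distributed.elastic.multiprocessing.errors import record

    @record  # torchrun then records each failing rank's traceback in its error_file
    def main(cfg):
        ...  # training loop as in tasks/train_it_ds.py (omitted here)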