I'm using Colab, and I'm only using CommonsenseConversation as my dataset. Everything was going fine until it started padding. For some reason, padding stopped midway and the process ended in a "Killed" state. What could be the cause of this? Here are the output/logs.
/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
[2023-11-12 22:40:53,080] torch.distributed.run: [WARNING]
[2023-11-12 22:40:53,080] torch.distributed.run: [WARNING] *****************************************
[2023-11-12 22:40:53,080] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-11-12 22:40:53,080] torch.distributed.run: [WARNING] *****************************************
OPENAI_LOGDIR=diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 TOKENIZERS_PARALLELISM=false python train.py --checkpoint_path diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 --dataset qqp --data_dir datasets/CC --vocab bert --use_plm_init no --lr 0.0001 --batch_size 2048 --microbatch 64 --diffusion_steps 2000 --noise_schedule sqrt --schedule_sampler lossaware --resume_checkpoint none --seq_len 128 --hidden_t_dim 128 --seed 102 --hidden_dim 128 --learning_steps 50000 --save_interval 10000 --config_name bert-base-uncased --notes test-qqp20231112-22:40:53
OPENAI_LOGDIR=diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 TOKENIZERS_PARALLELISM=false python train.py --checkpoint_path diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 --dataset qqp --data_dir datasets/CC --vocab bert --use_plm_init no --lr 0.0001 --batch_size 2048 --microbatch 64 --diffusion_steps 2000 --noise_schedule sqrt --schedule_sampler lossaware --resume_checkpoint none --seq_len 128 --hidden_t_dim 128 --seed 102 --hidden_dim 128 --learning_steps 50000 --save_interval 10000 --config_name bert-base-uncased --notes test-qqp20231112-22:40:53
OPENAI_LOGDIR=diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 TOKENIZERS_PARALLELISM=false python train.py --checkpoint_path diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 --dataset qqp --data_dir datasets/CC --vocab bert --use_plm_init no --lr 0.0001 --batch_size 2048 --microbatch 64 --diffusion_steps 2000 --noise_schedule sqrt --schedule_sampler lossaware --resume_checkpoint none --seq_len 128 --hidden_t_dim 128 --seed 102 --hidden_dim 128 --learning_steps 50000 --save_interval 10000 --config_name bert-base-uncased --notes test-qqp20231112-22:40:53
OPENAI_LOGDIR=diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 TOKENIZERS_PARALLELISM=false python train.py --checkpoint_path diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 --dataset qqp --data_dir datasets/CC --vocab bert --use_plm_init no --lr 0.0001 --batch_size 2048 --microbatch 64 --diffusion_steps 2000 --noise_schedule sqrt --schedule_sampler lossaware --resume_checkpoint none --seq_len 128 --hidden_t_dim 128 --seed 102 --hidden_dim 128 --learning_steps 50000 --save_interval 10000 --config_name bert-base-uncased --notes test-qqp20231112-22:40:53
2023-11-12 22:41:06.178406: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-12 22:41:06.178464: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-12 22:41:06.178509: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-12 22:41:06.199910: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-12 22:41:06.199966: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-12 22:41:06.200004: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-12 22:41:06.217535: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-12 22:41:06.234851: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-12 22:41:06.370428: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-12 22:41:06.370482: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-12 22:41:06.370521: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-12 22:41:06.393033: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-12 22:41:06.547999: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-12 22:41:06.550180: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-12 22:41:06.550241: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-12 22:41:06.597710: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-12 22:41:10.572711: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-11-12 22:41:10.665962: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-11-12 22:41:10.727451: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-11-12 22:41:10.986807: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Traceback (most recent call last):
  File "/content/DiffuSeq/train.py", line 115, in <module>
    main()
  File "/content/DiffuSeq/train.py", line 37, in main
    dist_util.setup_dist()
  File "/content/DiffuSeq/diffuseq/utils/dist_util.py", line 41, in setup_dist
    th.cuda.set_device(dev())
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Traceback (most recent call last):
  File "/content/DiffuSeq/train.py", line 115, in <module>
    main()
  File "/content/DiffuSeq/train.py", line 37, in main
    dist_util.setup_dist()
  File "/content/DiffuSeq/diffuseq/utils/dist_util.py", line 41, in setup_dist
    th.cuda.set_device(dev())
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Traceback (most recent call last):
  File "/content/DiffuSeq/train.py", line 115, in <module>
    main()
  File "/content/DiffuSeq/train.py", line 37, in main
    dist_util.setup_dist()
  File "/content/DiffuSeq/diffuseq/utils/dist_util.py", line 41, in setup_dist
    th.cuda.set_device(dev())
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Logging to diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53
Creating data loader...
(…)cased/resolve/main/tokenizer_config.json: 100% 28.0/28.0 [00:00<00:00, 116kB/s]
(…)rt-base-uncased/resolve/main/config.json: 100% 570/570 [00:00<00:00, 2.54MB/s]
(…)bert-base-uncased/resolve/main/vocab.txt: 100% 232k/232k [00:00<00:00, 5.43MB/s]
(…)base-uncased/resolve/main/tokenizer.json: 100% 466k/466k [00:00<00:00, 7.70MB/s]
initializing the random embeddings Embedding(30522, 128)
##############################
Loading text data...
##############################
Loading dataset qqp from datasets/CC...
Loading form the TRAIN set...
Data samples...
['jesus , what kind of concerts do you go to where people sucker punch you for being born tall ?', 'almost all of those sound awful . dr . ken sounds like it could be good , but that description is too vague to really tell anything . in chang we trust , or something .'] ['the kind that allow bitter short people in . so basically all of them .', "if he 's anything like his knocked up character i 'm sure it 'll be pretty funny ."]
RAM used: 1986.24 MB
Dataset({
    features: ['src', 'trg'],
    num_rows: 3382137
})
RAM used: 2643.07 MB
Running tokenizer on dataset (num_proc=4): 100% 3382137/3382137 [12:08<00:00, 4643.12 examples/s]