I'm using Colab, and I'm only using CommonsenseConversation as my dataset. Everything was going fine until it started padding. For some reason, padding stopped midway and the process ended in a "Killed" state. What could be the cause of this? Here are the output/logs.
/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
[2023-11-12 22:40:53,080] torch.distributed.run: [WARNING]
[2023-11-12 22:40:53,080] torch.distributed.run: [WARNING] *****************************************
[2023-11-12 22:40:53,080] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-11-12 22:40:53,080] torch.distributed.run: [WARNING] *****************************************
OPENAI_LOGDIR=diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 TOKENIZERS_PARALLELISM=false python train.py --checkpoint_path diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 --dataset qqp --data_dir datasets/CC --vocab bert --use_plm_init no --lr 0.0001 --batch_size 2048 --microbatch 64 --diffusion_steps 2000 --noise_schedule sqrt --schedule_sampler lossaware --resume_checkpoint none --seq_len 128 --hidden_t_dim 128 --seed 102 --hidden_dim 128 --learning_steps 50000 --save_interval 10000 --config_name bert-base-uncased --notes test-qqp20231112-22:40:53
OPENAI_LOGDIR=diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 TOKENIZERS_PARALLELISM=false python train.py --checkpoint_path diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 --dataset qqp --data_dir datasets/CC --vocab bert --use_plm_init no --lr 0.0001 --batch_size 2048 --microbatch 64 --diffusion_steps 2000 --noise_schedule sqrt --schedule_sampler lossaware --resume_checkpoint none --seq_len 128 --hidden_t_dim 128 --seed 102 --hidden_dim 128 --learning_steps 50000 --save_interval 10000 --config_name bert-base-uncased --notes test-qqp20231112-22:40:53
OPENAI_LOGDIR=diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 TOKENIZERS_PARALLELISM=false python train.py --checkpoint_path diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 --dataset qqp --data_dir datasets/CC --vocab bert --use_plm_init no --lr 0.0001 --batch_size 2048 --microbatch 64 --diffusion_steps 2000 --noise_schedule sqrt --schedule_sampler lossaware --resume_checkpoint none --seq_len 128 --hidden_t_dim 128 --seed 102 --hidden_dim 128 --learning_steps 50000 --save_interval 10000 --config_name bert-base-uncased --notes test-qqp20231112-22:40:53
OPENAI_LOGDIR=diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 TOKENIZERS_PARALLELISM=false python train.py --checkpoint_path diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 --dataset qqp --data_dir datasets/CC --vocab bert --use_plm_init no --lr 0.0001 --batch_size 2048 --microbatch 64 --diffusion_steps 2000 --noise_schedule sqrt --schedule_sampler lossaware --resume_checkpoint none --seq_len 128 --hidden_t_dim 128 --seed 102 --hidden_dim 128 --learning_steps 50000 --save_interval 10000 --config_name bert-base-uncased --notes test-qqp20231112-22:40:53
2023-11-12 22:41:06.178406: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-12 22:41:06.178464: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-12 22:41:06.178509: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-12 22:41:06.199910: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-12 22:41:06.199966: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-12 22:41:06.200004: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-12 22:41:06.217535: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-12 22:41:06.234851: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-12 22:41:06.370428: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-12 22:41:06.370482: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-12 22:41:06.370521: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-12 22:41:06.393033: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-12 22:41:06.547999: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-12 22:41:06.550180: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-12 22:41:06.550241: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-12 22:41:06.597710: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-12 22:41:10.572711: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-11-12 22:41:10.665962: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-11-12 22:41:10.727451: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-11-12 22:41:10.986807: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Traceback (most recent call last):
  File "/content/DiffuSeq/train.py", line 115, in <module>
    main()
  File "/content/DiffuSeq/train.py", line 37, in main
    dist_util.setup_dist()
  File "/content/DiffuSeq/diffuseq/utils/dist_util.py", line 41, in setup_dist
    th.cuda.set_device(dev())
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Traceback (most recent call last):
  File "/content/DiffuSeq/train.py", line 115, in <module>
    main()
  File "/content/DiffuSeq/train.py", line 37, in main
    dist_util.setup_dist()
  File "/content/DiffuSeq/diffuseq/utils/dist_util.py", line 41, in setup_dist
    th.cuda.set_device(dev())
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Traceback (most recent call last):
  File "/content/DiffuSeq/train.py", line 115, in <module>
    main()
  File "/content/DiffuSeq/train.py", line 37, in main
    dist_util.setup_dist()
  File "/content/DiffuSeq/diffuseq/utils/dist_util.py", line 41, in setup_dist
    th.cuda.set_device(dev())
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Logging to diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53
Creating data loader...
(…)cased/resolve/main/tokenizer_config.json: 100% 28.0/28.0 [00:00<00:00, 116kB/s]
(…)rt-base-uncased/resolve/main/config.json: 100% 570/570 [00:00<00:00, 2.54MB/s]
(…)bert-base-uncased/resolve/main/vocab.txt: 100% 232k/232k [00:00<00:00, 5.43MB/s]
(…)base-uncased/resolve/main/tokenizer.json: 100% 466k/466k [00:00<00:00, 7.70MB/s]
initializing the random embeddings Embedding(30522, 128)
##############################
Loading text data...
##############################
Loading dataset qqp from datasets/CC...
Loading form the TRAIN set...
Data samples...
['jesus , what kind of concerts do you go to where people sucker punch you for being born tall ?', 'almost all of those sound awful . dr . ken sounds like it could be good , but that description is too vague to really tell anything . in chang we trust , or something .'] ['the kind that allow bitter short people in . so basically all of them .', "if he 's anything like his knocked up character i 'm sure it 'll be pretty funny ."]
RAM used: 1986.24 MB
Dataset({
    features: ['src', 'trg'],
    num_rows: 3382137
})
RAM used: 2643.07 MB
Running tokenizer on dataset (num_proc=4): 100% 3382137/3382137 [12:08<00:00, 4643.12 examples/s]