Following the BERT fine-tuning tutorial results in `ImportError` or `IsADirectoryError` for run_squad_baseline.sh #474

Santosh-Gupta · 2020-10-16T05:58:35Z

I followed the getting started directions here

https://www.deepspeed.ai/tutorials/bert-finetuning/

I pulled the docker image and started a container.

I ran the following commands in a Jupyter notebook (server running in the container)

%set_env CUDA_VISIBLE_DEVICES=0,1,2
%cd /home/santosh/Projects/MsZeroTS
!git clone https://github.com/microsoft/DeepSpeed
!mkdir tfModel
!mkdir hfModel

#Save a HF version of the model
!pip install transformers
from transformers import BertModel
model = BertModel.from_pretrained('bert-base-cased')
model.save_pretrained('/home/santosh/Projects/MsZeroTS/hfModel')

#Save a tf version of the model 
%cd /home/santosh/Projects/MsZeroTS/tfModel
!wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip
!unzip -q cased_L-12_H-768_A-12.zip
%cd ..

%cd DeepSpeed
!git submodule update --init --recursive
%cd DeepSpeedExamples/BingBertSquad

#Download data 
!mkdir Data
%cd Data
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
%cd ..

#try tf version
!bash run_squad_baseline.sh 3 /home/santosh/Projects/MsZeroTS/TestModel/cased_L-12_H-768_A-12 /home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data /home/santosh/Projects/MsZeroTS/output1

#try hf version 
!bash run_squad_baseline.sh 3 /home/santosh/Projects/MsZeroTS/hfModel /home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data /home/santosh/Projects/MsZeroTS/output2

Neither the tf nor the hf version of the model works. This is a sample output from the baseline runs:

lr is 0.00003
seed is 12345
master port is 29500
dropout is 0.1
deepspeed --num_nodes 1 --num_gpus 3 --master_port=29500 --hostfile /dev/null nvidia_run_squad_deepspeed.py --bert_model bert-large-uncased --do_train --do_lower_case --predict_batch_size 3 --do_predict --train_file /home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/train-v1.1.json --predict_file /home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/dev-v1.1.json --train_batch_size 8 --learning_rate 0.00003 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir /home/santosh/Projects/MicrosoftZero/DeepSpeed/output --job_name deepspeed_3GPUs_24batch_size --gradient_accumulation_steps 2 --fp16 --deepspeed --deepspeed_config onebit_deepspeed_bsz24_config.json --dropout 0.1 --model_file /home/santosh/Projects/MicrosoftZero/TestModel/cased_L-12_H-768_A-12 --seed 12345 --preln
[2020-10-16 04:30:19,447] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-16 04:30:19,465] [WARNING] [engine.py:48:<module>] Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-16 04:30:19,469] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2020-10-16 04:30:19,533] [INFO] [runner.py:355:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=29500 nvidia_run_squad_deepspeed.py --bert_model bert-large-uncased --do_train --do_lower_case --predict_batch_size 3 --do_predict --train_file /home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/train-v1.1.json --predict_file /home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/dev-v1.1.json --train_batch_size 8 --learning_rate 0.00003 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir /home/santosh/Projects/MicrosoftZero/DeepSpeed/output --job_name deepspeed_3GPUs_24batch_size --gradient_accumulation_steps 2 --fp16 --deepspeed --deepspeed_config onebit_deepspeed_bsz24_config.json --dropout 0.1 --model_file /home/santosh/Projects/MicrosoftZero/TestModel/cased_L-12_H-768_A-12 --seed 12345 --preln
[2020-10-16 04:30:20,172] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-16 04:30:20,190] [WARNING] [engine.py:48:<module>] Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-16 04:30:20,193] [INFO] [launch.py:71:main] 0 NCCL_VERSION 2.6.4
[2020-10-16 04:30:20,193] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
[2020-10-16 04:30:20,194] [INFO] [launch.py:87:main] nnodes=1, num_local_procs=3, node_rank=0
[2020-10-16 04:30:20,194] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2020-10-16 04:30:20,194] [INFO] [launch.py:100:main] dist_world_size=3
[2020-10-16 04:30:20,194] [INFO] [launch.py:103:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
[2020-10-16 04:30:20,899] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-16 04:30:20,917] [WARNING] [engine.py:48:<module>] Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-16 04:30:20,927] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-16 04:30:20,945] [WARNING] [engine.py:48:<module>] Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-16 04:30:20,985] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-16 04:30:21,003] [WARNING] [engine.py:48:<module>] Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
10/16/2020 04:30:21 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
10/16/2020 04:30:21 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
10/16/2020 04:30:21 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
Traceback (most recent call last):
  File "nvidia_run_squad_deepspeed.py", line 1143, in <module>
    main()
  File "nvidia_run_squad_deepspeed.py", line 816, in main
    model = BertForQuestionAnsweringPreLN(bert_config, args)
  File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 1500, in __init__
    self.bert = BertModel(config, args)
  File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 927, in __init__
    self.embeddings = BertEmbeddings(config)
  File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 354, in __init__
    self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
  File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 658, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 922, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E
Traceback (most recent call last):
  File "nvidia_run_squad_deepspeed.py", line 1143, in <module>
    main()
  File "nvidia_run_squad_deepspeed.py", line 816, in main
    model = BertForQuestionAnsweringPreLN(bert_config, args)
  File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 1500, in __init__
    self.bert = BertModel(config, args)
  File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 927, in __init__
    self.embeddings = BertEmbeddings(config)
  File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 354, in __init__
    self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
  File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 658, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 922, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E
Traceback (most recent call last):
  File "nvidia_run_squad_deepspeed.py", line 1143, in <module>
    main()
  File "nvidia_run_squad_deepspeed.py", line 816, in main
    model = BertForQuestionAnsweringPreLN(bert_config, args)
  File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 1500, in __init__
    self.bert = BertModel(config, args)
  File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 927, in __init__
    self.embeddings = BertEmbeddings(config)
  File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 354, in __init__
    self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
  File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 658, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 922, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E

I tried both the hf and tf versions of the model because it looked like the error was related to the model initialization.

This info might be helpful; in the same notebook I ran another pytorch training script without any errors.
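
An `undefined symbol` error when importing a compiled extension such as `fused_layer_norm_cuda` often means the extension was built against a different PyTorch build than the one currently installed, which would be consistent with plain PyTorch scripts still running fine. The following is a minimal sketch for checking which modules import cleanly in a given environment. `probe_extension` is a hypothetical helper, not part of apex or DeepSpeed, and the module names in the loop are simply the ones that appear in the traceback above.

```python
import importlib


def probe_extension(module_name):
    """Try to import a module and report whether it loaded.

    Returns (ok, detail): detail is the module's file path on success,
    or the exception type and message on failure.
    """
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # e.g. ImportError raised for a broken .so
        return False, f"{type(exc).__name__}: {exc}"
    return True, getattr(module, "__file__", None) or "<no file>"


if __name__ == "__main__":
    # Module names taken from the traceback; adjust for your environment.
    for name in ("torch", "apex", "fused_layer_norm_cuda"):
        ok, detail = probe_extension(name)
        print(f"{name:24s} {'OK' if ok else 'FAILED'}  {detail}")
```

If `torch` and `apex` import but `fused_layer_norm_cuda` fails, that points at the compiled apex extension rather than at the model files passed to the script.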

I also tried running run_squad_baseline.sh outside the Jupyter notebook, directly in a terminal. For both the hf and tf versions I get a different error; it looks like the script is not able to load the model from the directory. Here is a sample output.

/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad$ bash run_squad_baseline.sh 3 /home/santosh/Projects/MsZeroTS/hfModel/ /home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data /home/santosh/Projects/MsZeroTS/output12
deepspeed --num_nodes 1 --num_gpus 3 nvidia_run_squad_baseline.py --bert_model bert-large-uncased --do_train --do_lower_case --do_predict --train_file /home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/train-v1.1.json --predict_file /home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/dev-v1.1.json --train_batch_size 8 --learning_rate 3e-5 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir /home/santosh/Projects/MsZeroTS/output12 --job_name baseline_3GPUs_24batch_size --gradient_accumulation_steps 1 --fp16 --model_file /home/santosh/Projects/MsZeroTS/hfModel/
[2020-10-16 05:23:17,877] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2020-10-16 05:23:17,910] [INFO] [runner.py:355:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=29500 nvidia_run_squad_baseline.py --bert_model bert-large-uncased --do_train --do_lower_case --do_predict --train_file /home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/train-v1.1.json --predict_file /home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/dev-v1.1.json --train_batch_size 8 --learning_rate 3e-5 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir /home/santosh/Projects/MsZeroTS/output12 --job_name baseline_3GPUs_24batch_size --gradient_accumulation_steps 1 --fp16 --model_file /home/santosh/Projects/MsZeroTS/hfModel/
[2020-10-16 05:23:18,468] [INFO] [launch.py:71:main] 0 NCCL_VERSION 2.6.4
[2020-10-16 05:23:18,469] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
[2020-10-16 05:23:18,469] [INFO] [launch.py:87:main] nnodes=1, num_local_procs=3, node_rank=0
[2020-10-16 05:23:18,469] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2020-10-16 05:23:18,469] [INFO] [launch.py:100:main] dist_world_size=3
[2020-10-16 05:23:18,469] [INFO] [launch.py:103:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
10/16/2020 05:23:19 - INFO - __main__ -   device: cuda:1 n_gpu: 1, distributed training: True, 16-bits training: True
10/16/2020 05:23:19 - INFO - __main__ -   device: cuda:2 n_gpu: 1, distributed training: True, 16-bits training: True
10/16/2020 05:23:19 - INFO - __main__ -   device: cuda:0 n_gpu: 1, distributed training: True, 16-bits training: True
10/16/2020 05:23:19 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/Projects/MicrosoftZero/DeepSpeed/cache/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
10/16/2020 05:23:19 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/Projects/MicrosoftZero/DeepSpeed/cache/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
10/16/2020 05:23:19 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/Projects/MicrosoftZero/DeepSpeed/cache/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
VOCAB SIZE: 30528
10/16/2020 05:23:32 - INFO - __main__ -   Loading Pretrained Bert Encoder from: /home/santosh/Projects/MsZeroTS/hfModel/
Traceback (most recent call last):
  File "nvidia_run_squad_baseline.py", line 1158, in <module>
    main()
  File "nvidia_run_squad_baseline.py", line 872, in main
    map_location=torch.device("cpu"))
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 381, in load
    f = open(f, 'rb')
IsADirectoryError: [Errno 21] Is a directory: '/home/santosh/Projects/MsZeroTS/hfModel/'
VOCAB SIZE: 30528
10/16/2020 05:23:32 - INFO - __main__ -   Loading Pretrained Bert Encoder from: /home/santosh/Projects/MsZeroTS/hfModel/
Traceback (most recent call last):
  File "nvidia_run_squad_baseline.py", line 1158, in <module>
    main()
  File "nvidia_run_squad_baseline.py", line 872, in main
    map_location=torch.device("cpu"))
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 381, in load
    f = open(f, 'rb')
IsADirectoryError: [Errno 21] Is a directory: '/home/santosh/Projects/MsZeroTS/hfModel/'
VOCAB SIZE: 30528
10/16/2020 05:23:32 - INFO - __main__ -   Loading Pretrained Bert Encoder from: /home/santosh/Projects/MsZeroTS/hfModel/
Traceback (most recent call last):
  File "nvidia_run_squad_baseline.py", line 1158, in <module>
    main()
  File "nvidia_run_squad_baseline.py", line 872, in main
    map_location=torch.device("cpu"))
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 381, in load
    f = open(f, 'rb')
IsADirectoryError: [Errno 21] Is a directory: '/home/santosh/Projects/MsZeroTS/hfModel/'

tjruwase · 2020-10-17T22:46:38Z

@Santosh-Gupta, thanks for using DeepSpeed.

The second argument to the script is the model file itself, not a folder. Please see here and here for details.
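
The `IsADirectoryError` in the log comes from `torch.load` being handed a directory. A small pre-flight check along these lines can catch that before launching. This is only a sketch: `check_squad_inputs` is a hypothetical helper, not part of the example scripts, and the expected file names `train-v1.1.json` and `dev-v1.1.json` are taken from the `--train_file` and `--predict_file` arguments shown in the logs.

```python
import os


def check_squad_inputs(model_file, data_dir):
    """Return a list of problems with the arguments; an empty list means OK."""
    problems = []
    if os.path.isdir(model_file):
        problems.append(f"model_file is a directory, expected a file: {model_file}")
    elif not os.path.isfile(model_file):
        problems.append(f"model_file does not exist: {model_file}")
    # File names assumed from the --train_file / --predict_file log lines.
    for name in ("train-v1.1.json", "dev-v1.1.json"):
        path = os.path.join(data_dir, name)
        if not os.path.isfile(path):
            problems.append(f"missing data file: {path}")
    return problems
```

Calling it with the folder from the report, for example `check_squad_inputs("/home/santosh/Projects/MsZeroTS/hfModel/", data_dir)`, would flag the first argument as a directory instead of failing later inside `torch.load`.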

Santosh-Gupta · 2020-10-17T23:30:31Z

Thanks for the info. It looks like I need to point to the checkpoint file specifically. So for a TensorFlow model, I point to model.ckpt.index (or is it model.ckpt.meta?), and for a Hugging Face model, I just point to the .bin file.

Some of the model types seem to need more than one file to be fully defined. I'm guessing the library searches the containing folder for any other files it needs, such as the config files. Is that what is going on, or is it somehow using just the checkpoint file?

tjruwase · 2020-10-22T21:57:00Z

@Santosh-Gupta Did you report a Default process group is not initialized error?

Santosh-Gupta · 2020-10-23T09:19:36Z

> @Santosh-Gupta Did you report a Default process group is not initialized error?

Yes, sorry. I noticed in the code that the model used was bert-large-cased, whereas I was using bert-base-uncased, so I wanted to see if switching the model made a difference, but I'm still getting errors.

For the following, I pointed the model file path to the Hugging Face .bin file and ran run_squad_deepspeed.sh.

I first tried running the code in a Jupyter notebook, with the server running in the DeepSpeed container. This was the full output:

lr is 0.00003
seed is 12345
master port is 29500
dropout is 0.1
deepspeed --num_nodes 1 --num_gpus 3 --master_port=29500 --hostfile /dev/null nvidia_run_squad_deepspeed.py --bert_model bert-large-uncased --do_train --do_lower_case --predict_batch_size 3 --do_predict --train_file /home/santosh/projects/deepSpeed/testData/train-v1.1.json --predict_file /home/santosh/projects/deepSpeed/testData/dev-v1.1.json --train_batch_size 8 --learning_rate 0.00003 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir /home/santosh/projects/deepSpeed/outputs/a1 --job_name deepspeed_3GPUs_24batch_size --gradient_accumulation_steps 2 --fp16 --deepspeed --deepspeed_config onebit_deepspeed_bsz24_config.json --dropout 0.1 --model_file /home/santosh/projects/deepSpeed/testModels/hf/pytorch_model.bin --seed 12345 --preln
[2020-10-23 08:33:23,956] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2020-10-23 08:33:24,011] [INFO] [runner.py:355:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=29500 nvidia_run_squad_deepspeed.py --bert_model bert-large-uncased --do_train --do_lower_case --predict_batch_size 3 --do_predict --train_file /home/santosh/projects/deepSpeed/testData/train-v1.1.json --predict_file /home/santosh/projects/deepSpeed/testData/dev-v1.1.json --train_batch_size 8 --learning_rate 0.00003 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir /home/santosh/projects/deepSpeed/outputs/a1 --job_name deepspeed_3GPUs_24batch_size --gradient_accumulation_steps 2 --fp16 --deepspeed --deepspeed_config onebit_deepspeed_bsz24_config.json --dropout 0.1 --model_file /home/santosh/projects/deepSpeed/testModels/hf/pytorch_model.bin --seed 12345 --preln
[2020-10-23 08:33:24,576] [INFO] [launch.py:71:main] 0 NCCL_VERSION 2.6.4
[2020-10-23 08:33:24,576] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
[2020-10-23 08:33:24,576] [INFO] [launch.py:87:main] nnodes=1, num_local_procs=3, node_rank=0
[2020-10-23 08:33:24,576] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2020-10-23 08:33:24,576] [INFO] [launch.py:100:main] dist_world_size=3
[2020-10-23 08:33:24,576] [INFO] [launch.py:103:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
10/23/2020 08:33:25 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at "/home/santosh"/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
10/23/2020 08:33:25 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at "/home/santosh"/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
10/23/2020 08:33:25 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at "/home/santosh"/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
Traceback (most recent call last):
  File "nvidia_run_squad_deepspeed.py", line 1143, in <module>
    main()
  File "nvidia_run_squad_deepspeed.py", line 816, in main
    model = BertForQuestionAnsweringPreLN(bert_config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 1500, in __init__
    self.bert = BertModel(config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 930, in __init__
    self.apply(self.init_bert_weights)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 293, in apply
    module.apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 293, in apply
    module.apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 293, in apply
    module.apply(fn)
  [Previous line repeated 3 more times]
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 294, in apply
    fn(self)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 730, in init_bert_weights
    if torch.distributed.get_rank() == 0:
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 562, in get_rank
    _check_default_pg()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 191, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
Traceback (most recent call last):
  File "nvidia_run_squad_deepspeed.py", line 1143, in <module>
    main()
  File "nvidia_run_squad_deepspeed.py", line 816, in main
    model = BertForQuestionAnsweringPreLN(bert_config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 1500, in __init__
    self.bert = BertModel(config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 930, in __init__
    self.apply(self.init_bert_weights)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 293, in apply
    module.apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 293, in apply
    module.apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 293, in apply
    module.apply(fn)
  [Previous line repeated 3 more times]
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 294, in apply
    fn(self)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 730, in init_bert_weights
    if torch.distributed.get_rank() == 0:
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 562, in get_rank
    _check_default_pg()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 191, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
Traceback (most recent call last):
  File "nvidia_run_squad_deepspeed.py", line 1143, in <module>
    main()
  File "nvidia_run_squad_deepspeed.py", line 816, in main
    model = BertForQuestionAnsweringPreLN(bert_config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 1500, in __init__
    self.bert = BertModel(config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 930, in __init__
    self.apply(self.init_bert_weights)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 293, in apply
    module.apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 293, in apply
    module.apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 293, in apply
    module.apply(fn)
  [Previous line repeated 3 more times]
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 294, in apply
    fn(self)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 730, in init_bert_weights
    if torch.distributed.get_rank() == 0:
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 562, in get_rank
    _check_default_pg()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 191, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized

I then tried running it directly in a terminal in the DeepSpeed Docker container, in case there was an issue with Jupyter. This produces a different error.

lr is 0.00003
seed is 12345
master port is 29500
dropout is 0.1
deepspeed --num_nodes 1 --num_gpus 3 --master_port=29500 --hostfile /dev/null nvidia_run_squad_deepspeed.py --bert_model bert-large-uncased --do_train --do_lower_case --predict_batch_size 3 --do_predict --train_file /home/santosh/projects/deepSpeed/testData/train-v1.1.json --predict_file /home/santosh/projects/deepSpeed/testData/dev-v1.1.json --train_batch_size 8 --learning_rate 0.00003 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir /home/santosh/projects/deepSpeed/outputs/a1ab --job_name deepspeed_3GPUs_24batch_size --gradient_accumulation_steps 2 --fp16 --deepspeed --deepspeed_config onebit_deepspeed_bsz24_config.json --dropout 0.1 --model_file /home/santosh/projects/deepSpeed/testModels/hf/pytorch_model.bin --seed 12345 --preln
[2020-10-23 09:11:34,436] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-23 09:11:34,454] [WARNING] [engine.py:48:<module>] Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-23 09:11:34,458] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2020-10-23 09:11:34,514] [INFO] [runner.py:355:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=29500 nvidia_run_squad_deepspeed.py --bert_model bert-large-uncased --do_train --do_lower_case --predict_batch_size 3 --do_predict --train_file /home/santosh/projects/deepSpeed/testData/train-v1.1.json --predict_file /home/santosh/projects/deepSpeed/testData/dev-v1.1.json --train_batch_size 8 --learning_rate 0.00003 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir /home/santosh/projects/deepSpeed/outputs/a1ab --job_name deepspeed_3GPUs_24batch_size --gradient_accumulation_steps 2 --fp16 --deepspeed --deepspeed_config onebit_deepspeed_bsz24_config.json --dropout 0.1 --model_file /home/santosh/projects/deepSpeed/testModels/hf/pytorch_model.bin --seed 12345 --preln
[2020-10-23 09:11:35,131] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-23 09:11:35,150] [WARNING] [engine.py:48:<module>] Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-23 09:11:35,153] [INFO] [launch.py:71:main] 0 NCCL_VERSION 2.6.4
[2020-10-23 09:11:35,154] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
[2020-10-23 09:11:35,154] [INFO] [launch.py:87:main] nnodes=1, num_local_procs=3, node_rank=0
[2020-10-23 09:11:35,154] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2020-10-23 09:11:35,154] [INFO] [launch.py:100:main] dist_world_size=3
[2020-10-23 09:11:35,154] [INFO] [launch.py:103:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
[2020-10-23 09:11:35,886] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-23 09:11:35,897] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-23 09:11:35,905] [WARNING] [engine.py:48:<module>] Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-23 09:11:35,916] [WARNING] [engine.py:48:<module>] Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-23 09:11:35,953] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-23 09:11:35,971] [WARNING] [engine.py:48:<module>] Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
10/23/2020 09:11:36 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
10/23/2020 09:11:36 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
10/23/2020 09:11:36 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
Traceback (most recent call last):
  File "nvidia_run_squad_deepspeed.py", line 1143, in <module>
    main()
  File "nvidia_run_squad_deepspeed.py", line 816, in main
    model = BertForQuestionAnsweringPreLN(bert_config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 1500, in __init__
    self.bert = BertModel(config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 927, in __init__
    self.embeddings = BertEmbeddings(config)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 354, in __init__
    self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
  File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 658, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 922, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E
Traceback (most recent call last):
  File "nvidia_run_squad_deepspeed.py", line 1143, in <module>
    main()
  File "nvidia_run_squad_deepspeed.py", line 816, in main
    model = BertForQuestionAnsweringPreLN(bert_config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 1500, in __init__
    self.bert = BertModel(config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 927, in __init__
    self.embeddings = BertEmbeddings(config)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 354, in __init__
    self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
  File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 658, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 922, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E
Traceback (most recent call last):
  File "nvidia_run_squad_deepspeed.py", line 1143, in <module>
    main()
  File "nvidia_run_squad_deepspeed.py", line 816, in main
    model = BertForQuestionAnsweringPreLN(bert_config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 1500, in __init__
    self.bert = BertModel(config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 927, in __init__
    self.embeddings = BertEmbeddings(config)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 354, in __init__
    self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
  File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 658, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 922, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E

I get the same errors when pointing --model_file at a TensorFlow .ckpt.index file.

In both cases, the failure seems to happen while loading the model. If it helps, I am able to run other PyTorch training code in the container.

I also tried running run_squad_baseline.sh and got the same ImportError:

deepspeed --num_nodes 1 --num_gpus 3 nvidia_run_squad_baseline.py --bert_model bert-large-uncased --do_train --do_lower_case --do_predict --train_file /home/santosh/projects/deepSpeed/testData/train-v1.1.json --predict_file /home/santosh/projects/deepSpeed/testData/dev-v1.1.json --train_batch_size 8 --learning_rate 3e-5 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir /home/santosh/projects/deepSpeed/outputs/a1a --job_name baseline_3GPUs_24batch_size --gradient_accumulation_steps 1 --fp16 --model_file /home/santosh/projects/deepSpeed/testModels/hf/pytorch_model.bin
[2020-10-23 09:23:19,668] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-23 09:23:19,685] [WARNING] [engine.py:48:<module>] Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-23 09:23:19,689] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2020-10-23 09:23:19,744] [INFO] [runner.py:355:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=29500 nvidia_run_squad_baseline.py --bert_model bert-large-uncased --do_train --do_lower_case --do_predict --train_file /home/santosh/projects/deepSpeed/testData/train-v1.1.json --predict_file /home/santosh/projects/deepSpeed/testData/dev-v1.1.json --train_batch_size 8 --learning_rate 3e-5 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir /home/santosh/projects/deepSpeed/outputs/a1a --job_name baseline_3GPUs_24batch_size --gradient_accumulation_steps 1 --fp16 --model_file /home/santosh/projects/deepSpeed/testModels/hf/pytorch_model.bin
[2020-10-23 09:23:20,360] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-23 09:23:20,377] [WARNING] [engine.py:48:<module>] Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-23 09:23:20,380] [INFO] [launch.py:71:main] 0 NCCL_VERSION 2.6.4
[2020-10-23 09:23:20,380] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
[2020-10-23 09:23:20,380] [INFO] [launch.py:87:main] nnodes=1, num_local_procs=3, node_rank=0
[2020-10-23 09:23:20,380] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2020-10-23 09:23:20,380] [INFO] [launch.py:100:main] dist_world_size=3
[2020-10-23 09:23:20,380] [INFO] [launch.py:103:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
10/23/2020 09:23:21 - INFO - __main__ -   device: cuda:1 n_gpu: 1, distributed training: True, 16-bits training: True
10/23/2020 09:23:21 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
10/23/2020 09:23:22 - INFO - __main__ -   device: cuda:2 n_gpu: 1, distributed training: True, 16-bits training: True
10/23/2020 09:23:22 - INFO - __main__ -   device: cuda:0 n_gpu: 1, distributed training: True, 16-bits training: True
10/23/2020 09:23:22 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
10/23/2020 09:23:22 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
Traceback (most recent call last):
  File "nvidia_run_squad_baseline.py", line 1158, in <module>
    main()
  File "nvidia_run_squad_baseline.py", line 866, in main
    model = BertForQuestionAnswering(bert_config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 1472, in __init__
    self.bert = BertModel(config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 900, in __init__
    self.embeddings = BertEmbeddings(config)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 347, in __init__
    self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
  File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 658, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 922, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E
Traceback (most recent call last):
  File "nvidia_run_squad_baseline.py", line 1158, in <module>
    main()
  File "nvidia_run_squad_baseline.py", line 866, in main
    model = BertForQuestionAnswering(bert_config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 1472, in __init__
    self.bert = BertModel(config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 900, in __init__
    self.embeddings = BertEmbeddings(config)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 347, in __init__
    self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
  File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 658, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 922, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E
Traceback (most recent call last):
  File "nvidia_run_squad_baseline.py", line 1158, in <module>
    main()
  File "nvidia_run_squad_baseline.py", line 866, in main
    model = BertForQuestionAnswering(bert_config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 1472, in __init__
    self.bert = BertModel(config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 900, in __init__
    self.embeddings = BertEmbeddings(config)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 347, in __init__
    self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
  File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 658, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 922, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E

From a terminal I get:

deepspeed --num_nodes 1 --num_gpus 3 nvidia_run_squad_baseline.py --bert_model bert-large-uncased --do_train --do_lower_case --do_predict --train_file /home/santosh/projects/deepSpeed/testData/train-v1.1.json --predict_file /home/santosh/projects/deepSpeed/testData/dev-v1.1.json --train_batch_size 8 --learning_rate 3e-5 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir /home/santosh/projects/deepSpeed/outputs/a1ak --job_name baseline_3GPUs_24batch_size --gradient_accumulation_steps 1 --fp16 --model_file /home/santosh/projects/deepSpeed/testModels/hf/pytorch_model.bin
[2020-10-23 09:26:40,968] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-23 09:26:40,987] [WARNING] [engine.py:48:<module>] Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-23 09:26:40,991] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2020-10-23 09:26:41,046] [INFO] [runner.py:355:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=29500 nvidia_run_squad_baseline.py --bert_model bert-large-uncased --do_train --do_lower_case --do_predict --train_file /home/santosh/projects/deepSpeed/testData/train-v1.1.json --predict_file /home/santosh/projects/deepSpeed/testData/dev-v1.1.json --train_batch_size 8 --learning_rate 3e-5 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir /home/santosh/projects/deepSpeed/outputs/a1ak --job_name baseline_3GPUs_24batch_size --gradient_accumulation_steps 1 --fp16 --model_file /home/santosh/projects/deepSpeed/testModels/hf/pytorch_model.bin
[2020-10-23 09:26:41,712] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-23 09:26:41,730] [WARNING] [engine.py:48:<module>] Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-23 09:26:41,733] [INFO] [launch.py:71:main] 0 NCCL_VERSION 2.6.4
[2020-10-23 09:26:41,733] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
[2020-10-23 09:26:41,733] [INFO] [launch.py:87:main] nnodes=1, num_local_procs=3, node_rank=0
[2020-10-23 09:26:41,733] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2020-10-23 09:26:41,733] [INFO] [launch.py:100:main] dist_world_size=3
[2020-10-23 09:26:41,733] [INFO] [launch.py:103:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
10/23/2020 09:26:42 - INFO - __main__ -   device: cuda:1 n_gpu: 1, distributed training: True, 16-bits training: True
10/23/2020 09:26:42 - INFO - __main__ -   device: cuda:2 n_gpu: 1, distributed training: True, 16-bits training: True
10/23/2020 09:26:42 - INFO - __main__ -   device: cuda:0 n_gpu: 1, distributed training: True, 16-bits training: True
10/23/2020 09:26:42 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
10/23/2020 09:26:42 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
10/23/2020 09:26:42 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
Traceback (most recent call last):
  File "nvidia_run_squad_baseline.py", line 1158, in <module>
    main()
  File "nvidia_run_squad_baseline.py", line 866, in main
    model = BertForQuestionAnswering(bert_config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 1472, in __init__
    self.bert = BertModel(config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 900, in __init__
    self.embeddings = BertEmbeddings(config)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 347, in __init__
    self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
  File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 658, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 922, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E
Traceback (most recent call last):
  File "nvidia_run_squad_baseline.py", line 1158, in <module>
    main()
  File "nvidia_run_squad_baseline.py", line 866, in main
    model = BertForQuestionAnswering(bert_config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 1472, in __init__
    self.bert = BertModel(config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 900, in __init__
    self.embeddings = BertEmbeddings(config)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 347, in __init__
    self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
  File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 658, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 922, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E
Traceback (most recent call last):
  File "nvidia_run_squad_baseline.py", line 1158, in <module>
    main()
  File "nvidia_run_squad_baseline.py", line 866, in main
    model = BertForQuestionAnswering(bert_config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 1472, in __init__
    self.bert = BertModel(config, args)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 900, in __init__
    self.embeddings = BertEmbeddings(config)
  File "/home/santosh/projects/deepSpeed/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 347, in __init__
    self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
  File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 658, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 922, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E

tjruwase · 2020-10-23T23:37:14Z

These new import errors suggest a mismatch in cuda, apex, or torch versions. Can you double check those?

ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E
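
To see which library the missing symbol belongs to, the mangled name can be decoded. The sketch below is a hand-written decoder for the simple length-prefixed form used here (the helper name `demangle_nested_name` is made up for illustration, and templates or function signatures are not handled). The result sits in the `caffe2` namespace, which ships inside PyTorch's own libraries, so this suggests the apex extension expects a different torch build than the one installed.

```python
def demangle_nested_name(symbol):
    """Decode a simple Itanium-ABI nested name of the form _ZN<len><id>...E.

    Only plain length-prefixed identifiers are handled; templates, function
    parameters, and substitutions are out of scope for this sketch.
    """
    prefix = "_ZN"
    if not symbol.startswith(prefix):
        raise ValueError("not a nested mangled name: %r" % symbol)
    pos = len(prefix)
    parts = []
    while pos < len(symbol) and symbol[pos] != "E":
        # Read the decimal length that precedes each identifier.
        start = pos
        while pos < len(symbol) and symbol[pos].isdigit():
            pos += 1
        if pos == start:
            raise ValueError("unsupported construct at offset %d" % pos)
        length = int(symbol[start:pos])
        parts.append(symbol[pos:pos + length])
        pos += length
    return "::".join(parts)


symbol = "_ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E"
print(demangle_nested_name(symbol))
# prints: caffe2::detail::_typeMetaDataInstance_preallocated_32
```

For anything more complex than this name, a real demangler such as `c++filt` is the better tool.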

Santosh-Gupta · 2020-10-26T00:07:29Z

These new import errors suggest a mismatch in cuda, apex, or torch versions. Can you double check those?

ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E

torch is version 1.6.0
apex is version 0.1
CUDA is version 10.0.130

I see that the latest CUDA version is 11.1. I'll upgrade and check whether that solves the issue.

tjruwase · 2020-10-26T04:50:16Z

Actually, can you try this sequence of commands in Python to test the compatibility of cuda, torch, and apex FusedLayerNorm?

>>> import torch
>>> import apex
>>> input = torch.randn(20, 5, 10, 10)
>>> m = apex.normalization.FusedLayerNorm(input.size()[1:])
>>> output = m(input)
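
The same check can be wrapped in a small script so that it reports a one-line result instead of a full traceback. This is only a sketch: the function name `check_fused_layer_norm` is an illustrative choice, and it catches errors broadly so that a missing package and a broken extension are both reported rather than raised.

```python
def check_fused_layer_norm():
    """Run the FusedLayerNorm smoke test and return (ok, message)."""
    try:
        import torch
        from apex.normalization import FusedLayerNorm
    except ImportError as exc:
        # torch or apex itself is not importable in this environment.
        return False, "import failed: %s" % exc
    try:
        x = torch.randn(20, 5, 10, 10)
        layer = FusedLayerNorm(x.size()[1:])  # loads the compiled extension
        layer(x)
    except Exception as exc:
        # An undefined-symbol ImportError from the extension lands here.
        return False, "%s: %s" % (type(exc).__name__, exc)
    return True, "torch %s with apex FusedLayerNorm works" % torch.__version__


ok, message = check_fused_layer_norm()
print("OK" if ok else "FAILED", "-", message)
```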

Santosh-Gupta · 2020-10-26T06:05:43Z

import torch
import apex
input = torch.randn(20, 5, 10, 10)
m = apex.normalization.FusedLayerNorm(input.size()[1:])
output = m(input)

Running this resulted in an error on the 4th line. Here's the output:

>>> import torch
>>> import apex
>>> input = torch.randn(20, 5, 10, 10)
>>> m = apex.normalization.FusedLayerNorm(input.size()[1:])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 658, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 922, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E

tjruwase · 2020-10-26T06:44:51Z

This confirms an incompatibility issue independent of deepspeed. I vaguely recall that either torch 1.6.0 or apex 0.1 requires cuda 10.1, and so upgrading cuda should fix the problem. For reference my cuda/torch/apex versions are
cuda 10.1
torch 1.6.0
apex 0.1
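
To compare an environment against the versions listed above, the following sketch gathers the same three values from Python. The helper names are illustrative. `torch.version.cuda` reports the CUDA version torch was built against, which may differ from what `nvcc -V` prints, and the fallback to `pkg_resources` is there because `importlib.metadata` is only available from Python 3.8.

```python
def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        from importlib import metadata  # Python 3.8+
    except ImportError:
        metadata = None
    if metadata is not None:
        try:
            return metadata.version(dist_name)
        except metadata.PackageNotFoundError:
            return None
    try:
        import pkg_resources  # fallback for Python 3.6/3.7
        return pkg_resources.get_distribution(dist_name).version
    except Exception:
        return None


def environment_report():
    """Collect torch, apex, and torch's CUDA build version in one dict."""
    report = {
        "torch": installed_version("torch"),
        "apex": installed_version("apex"),
        "torch_cuda_build": None,
    }
    try:
        import torch
        report["torch_cuda_build"] = torch.version.cuda  # None on CPU builds
    except Exception:
        pass  # torch missing or broken; leave the entry as None
    return report


for key, value in sorted(environment_report().items()):
    print("%-18s %s" % (key, value))
```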

Santosh-Gupta · 2020-10-26T18:09:23Z

Great, thanks tjruwase, I'll upgrade it and report back the results.

Santosh-Gupta · 2020-10-28T19:53:42Z

This confirms an incompatibility issue independent of deepspeed. I vaguely recall that either torch 1.6.0 or apex 0.1 requires cuda 10.1, and so upgrading cuda should fix the problem. For reference my cuda/torch/apex versions are
cuda 10.1
torch 1.6.0
apex 0.1

I am wondering whether the deepspeed docker image has an outdated version of CUDA. That is what the first line of the Dockerfile suggests:

https://github.com/microsoft/DeepSpeed/blob/master/docker/Dockerfile#L1

Currently nvcc -V in the deepspeed container is showing

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

This is even though we recently installed CUDA 11.1 on our machine. I created a fresh docker container from the original image, and it still shows V10.0.130.
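
For scripted checks, the release can be pulled out of the `nvcc -V` text with a regular expression. The sketch below parses the output pasted above; the function name `parse_nvcc_release` is illustrative. Note that `nvcc` inside a container reports the toolkit that ships in the image, so a newer toolkit installed on the host would not be expected to change this output.

```python
import re

NVCC_OUTPUT = """\
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
"""


def parse_nvcc_release(text):
    """Return (release, full_version) from `nvcc -V` output, or None."""
    match = re.search(r"release (\d+\.\d+), V(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_nvcc_release(NVCC_OUTPUT))
# prints: ('10.0', '10.0.130')
```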

tjruwase · 2020-10-28T20:08:55Z

Yes, the deepspeed docker image is cuda 10.0, which is a bit confusing since it does not work with torch 1.6.0. However, it does work with torch 1.5.0 which the deepspeed release was tested against. So it seems the options are (1) Downgrade to torch 1.5.0 to use cuda 10.0, or (2) Upgrade docker file to cuda 10.1 to use torch 1.6.0. Do either of these options work for you?

Santosh-Gupta · 2020-10-28T20:29:07Z

Yes, the deepspeed docker image is cuda 10.0, which is a bit confusing since it does not work with torch 1.6.0. However, it does work with torch 1.5.0 which the deepspeed release was tested against. So it seems the options are (1) Downgrade to torch 1.5.0 to use cuda 10.0, or (2) Upgrade docker file to cuda 10.1 to use torch 1.6.0. Do either of these options work for you?

Ahh I see. Yeah, downgrading torch should work; CUDA seems to be very tricky to work with on our machines. I'll downgrade torch and report back the results.

Santosh-Gupta · 2020-11-05T07:53:59Z

Yes, the deepspeed docker image is cuda 10.0, which is a bit confusing since it does not work with torch 1.6.0. However, it does work with torch 1.5.0 which the deepspeed release was tested against. So it seems the options are (1) Downgrade to torch 1.5.0 to use cuda 10.0, or (2) Upgrade docker file to cuda 10.1 to use torch 1.6.0. Do either of these options work for you?

I downgraded torch to 1.5.0 to work with the official docker image, but the compatibility test snippet still fails.

nvcc -V gives

release 10.0, V10.0.130

and torch.__version__ gives 1.5.0

but

import torch
import apex
input = torch.randn(20, 5, 10, 10)
m = apex.normalization.FusedLayerNorm(input.size()[1:])
output = m(input)

gives

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-8-ff19223b4d5c> in <module>
      4 import apex
      5 input = torch.randn(20, 5, 10, 10)
----> 6 m = apex.normalization.FusedLayerNorm(input.size()[1:])
      7 output = m(input)

/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py in __init__(self, normalized_shape, eps, elementwise_affine)
    131 
    132         global fused_layer_norm_cuda
--> 133         fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
    134 
    135         if isinstance(normalized_shape, numbers.Integral):

/usr/lib/python3.6/importlib/__init__.py in import_module(name, package)
    124                 break
    125             level += 1
--> 126     return _bootstrap._gcd_import(name[level:], package, level)
    127 
    128 

/usr/lib/python3.6/importlib/_bootstrap.py in _gcd_import(name, package, level)

/usr/lib/python3.6/importlib/_bootstrap.py in _find_and_load(name, import_)

/usr/lib/python3.6/importlib/_bootstrap.py in _find_and_load_unlocked(name, import_)

/usr/lib/python3.6/importlib/_bootstrap.py in _load_unlocked(spec)

/usr/lib/python3.6/importlib/_bootstrap.py in module_from_spec(spec)

/usr/lib/python3.6/importlib/_bootstrap_external.py in create_module(self, spec)

/usr/lib/python3.6/importlib/_bootstrap.py in _call_with_frames_removed(f, *args, **kwds)

ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E
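
The missing symbol in that last line is a mangled C++ name, and decoding it shows which library it should come from. Below is a minimal sketch of a decoder; `demangle_nested_name` is a hypothetical helper that handles only plain length-prefixed nested names under the Itanium C++ ABI (no templates or function parameters), which is enough for this symbol.

```python
def demangle_nested_name(symbol):
    """Decode a simple Itanium-ABI nested name of the form _ZN<len><id>...E.

    Only plain length-prefixed identifiers are supported.
    """
    prefix = "_ZN"
    if not symbol.startswith(prefix):
        raise ValueError("not a nested mangled name: " + symbol)

    pos = len(prefix)
    parts = []
    while pos < len(symbol) and symbol[pos] != "E":
        # Read the decimal length that precedes each identifier.
        start = pos
        while pos < len(symbol) and symbol[pos].isdigit():
            pos += 1
        if pos == start:
            raise ValueError("unsupported mangling at offset %d" % pos)
        length = int(symbol[start:pos])
        # Take exactly `length` characters as the identifier.
        parts.append(symbol[pos:pos + length])
        pos += length

    return "::".join(parts)


if __name__ == "__main__":
    name = "_ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E"
    print(demangle_nested_name(name))
    # prints caffe2::detail::_typeMetaDataInstance_preallocated_32
```

The symbol lives in the `caffe2::detail` namespace, so it is expected to be provided by the torch libraries rather than by apex. An undefined symbol of this kind usually indicates that the `fused_layer_norm_cuda` extension was compiled against a different torch build than the one now installed, although nothing in this thread confirms that this is the cause here.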

