Following the Bert-finetuning tutorial results in ImportError or IsADirectoryError for run_squad_baseline.sh
#474
Comments
@Santosh-Gupta, thanks for using DeepSpeed. The second argument to script is the model file itself rather than a folder. Please see here and here for details. |
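On Linux, `IsADirectoryError` is what Python raises when code opens a directory as if it were a file, which fits the explanation above. A minimal sketch of a pre-flight check follows. The helper name `check_model_argument` is hypothetical and is not part of DeepSpeed or its example scripts.

```python
from pathlib import Path


def check_model_argument(path_str):
    """Return the path if it is a regular file; otherwise raise a descriptive error.

    Hypothetical helper for illustration only.
    """
    path = Path(path_str)
    if path.is_dir():
        # List the files in the folder so the caller can pick the right one.
        files = sorted(p.name for p in path.iterdir() if p.is_file())
        raise ValueError(
            f"{path} is a directory; pass the model file itself. "
            f"Files found: {files}"
        )
    if not path.is_file():
        raise FileNotFoundError(f"{path} does not exist")
    return path
```

Calling this on the argument before launching the script would turn the late failure into an early message that names the candidate files in the folder.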
Thanks for the info. It looks like I need to point to the checkpoint file itself. So for a TensorFlow model, I would point to model.ckpt.index (or is it model.ckpt.meta?), and for a Hugging Face model, to the model.bin. Some model types need more than one file to be fully defined, so I'm guessing the library searches the containing folder for any other files it needs, such as the config files. Is that what is going on, or is it using only the checkpoint file? |
@Santosh-Gupta Did you report a |
Yes, sorry. I noticed in the code that the model used was bert-large-cased whereas I was using bert-base-uncased, so I wanted to see if switching the model made a difference, but I'm still getting errors. For the following, I pointed the model file path to the Hugging Face .bin file and ran run_squad_deepspeed.sh. I first tried running the code in a Jupyter notebook, with the server running in the deepspeed container. This was the full output
I then tried running it directly in the deepspeed docker container terminal, in case there was an issue with jupyter, since there seems to be a different error.
I get the same errors when pointing to a TensorFlow .ckpt.index file. In both cases, the issue seems to be due to loading the model. If it helps, I am able to run other PyTorch training code in the container. I also tried running run_squad_baseline.sh, and also got errors
And for terminal I am getting
|
These new import errors suggest a mismatch in cuda, apex, or torch versions. Can you double check those?
|
torch is version 1.6.0. I see that the latest CUDA version is 11.1; I'll upgrade to it and check whether that solves the issue. |
Actually, can you try out this sequence of commands in Python to test the compatibility of cuda, torch, and apex FusedLayerNorm?
|
Running this resulted in an error for the 4th line, here's the output
|
This confirms an incompatibility issue independent of deepspeed. I vaguely recall that either torch 1.6.0 or apex 0.1 requires cuda 10.1, and so upgrading cuda should fix the problem. For reference my cuda/torch/apex versions are |
Great, thanks tjruwase, I'll upgrade it and report back the results. |
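For anyone reproducing this kind of check, here is a minimal sketch that collects the torch, CUDA, and apex details in one report. It is not the command sequence posted in this thread, and the function name `collect_versions` is hypothetical. It also assumes that constructing `FusedLayerNorm` is enough to make apex load its compiled extension, which may not hold for every apex version.

```python
import importlib


def collect_versions():
    """Report what this interpreter sees for torch, CUDA, and apex FusedLayerNorm."""
    report = {
        "torch": None,
        "torch_cuda": None,
        "cuda_available": None,
        "apex_fused_layer_norm": None,
    }
    try:
        torch = importlib.import_module("torch")
    except Exception as exc:
        report["torch"] = f"not importable: {type(exc).__name__}: {exc}"
        return report

    report["torch"] = torch.__version__
    # CUDA version torch was built against (None for CPU-only builds).
    report["torch_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()

    try:
        module = importlib.import_module("apex.normalization.fused_layer_norm")
        # Assumption: building a small layer forces the compiled extension to load.
        module.FusedLayerNorm(8)
        report["apex_fused_layer_norm"] = "ok"
    except Exception as exc:
        report["apex_fused_layer_norm"] = f"failed: {type(exc).__name__}: {exc}"
    return report


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value}")
```

Each failure is recorded as text instead of stopping the script, so a single run shows which of the three components is the one that breaks.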
I am wondering if the deepspeed docker image has an outdated version of cuda, that's what it seems like here https://github.com/microsoft/DeepSpeed/blob/master/docker/Dockerfile#L1 Currently
Even though we have recently installed 11.1 on our machine. I created a fresh docker container from the original image, and it is still showing V10.0.130. |
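To compare the toolkit version inside the container with the one torch expects, the release number can be read from `nvcc --version` programmatically. A minimal sketch, assuming `nvcc` prints a line containing `release X.Y` (as in the `V10.0.130` report above); both function names are hypothetical.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract the 'release X.Y' number from `nvcc --version` output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def toolkit_release():
    """Return the CUDA toolkit release reported by nvcc on PATH, or None if absent."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run(
        [nvcc, "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)
```

Run inside the container, `toolkit_release()` reflects the toolkit shipped in the image, not the one installed on the host, which would be consistent with a fresh container still reporting 10.0 after the host was upgraded to 11.1.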
Yes, the deepspeed docker image is cuda 10.0, which is a bit confusing since it does not work with torch 1.6.0. However, it does work with torch 1.5.0 which the deepspeed release was tested against. So it seems the options are (1) Downgrade to torch 1.5.0 to use cuda 10.0, or (2) Upgrade docker file to cuda 10.1 to use torch 1.6.0. Do either of these options work for you? |
Ahh I see. Yeah, downgrading torch should work; CUDA seems to be very tricky to work with on our machines. I'll downgrade torch and report back the results. |
I downgraded my torch version to 1.5.0 to work with the official docker image, but I am still getting an error from the compatibility test snippet.
and 'torch.version' gives 1.5.0 but
gives
|
I followed the getting started directions here
https://www.deepspeed.ai/tutorials/bert-finetuning/
I pulled the docker image and started a container.
I ran the following commands in a Jupyter notebook (server running in the container)
Neither the TF nor the HF versions of the models are working. This is a sample output from the baselines
I tried both the hf and tf versions of the model because it looked like the error was related to the model initialization.
This info might be helpful; in the same notebook I ran another pytorch training script without any errors.
I tried running run_squad_baseline.sh outside the Jupyter notebook, directly in the terminal. For both the HF and TF versions, I get a different error; it looks like it's not able to load the model from the directory. Here is a sample output.