fixing the run and model scripts for running the BingBertSquad #58

RezaYazdaniAminabadi · 2020-10-07T02:14:28Z

After refactoring the run scripts, we forgot to pass the local_rank argument, which the Transformer kernel requires to run on multiple GPUs. I added it to the transformer_kernel configuration. Also, torch.distributed needs to be initialized before the model is created in nvidia_run_squad_deepspeed.py; otherwise, it fails when running the baseline. The rest of the changes are formatting only.
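
For reviewers, here is a rough sketch of the ordering described above. It is not the actual diff: `parse_args`, `init_distributed`, `build_kernel_config` and the dictionary keys are hypothetical names used only for illustration, and the real script builds its transformer kernel configuration differently. The only points it is meant to show are that `local_rank` is carried into the kernel configuration and that the process group is set up before any model is constructed.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Distributed launchers conventionally pass --local_rank to each process.
    parser.add_argument("--local_rank", type=int, default=-1)
    return parser.parse_args(argv)


def init_distributed(local_rank):
    """Initialize torch.distributed; call this before creating the model."""
    # Imported lazily so the argument handling can be exercised without torch.
    import torch
    import torch.distributed as dist

    if local_rank == -1 or dist.is_initialized():
        return False  # single-process run, or already initialized
    torch.cuda.set_device(local_rank)
    # Assumes the launcher has set MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE.
    dist.init_process_group(backend="nccl")
    return True


def build_kernel_config(args, hidden_size=1024, heads=16):
    # Hypothetical stand-in for the transformer kernel configuration.
    # The relevant part is that local_rank is forwarded instead of dropped.
    return {
        "hidden_size": hidden_size,
        "heads": heads,
        "local_rank": args.local_rank,
    }


def main(argv=None):
    args = parse_args(argv)
    init_distributed(args.local_rank)          # 1. process group first
    kernel_config = build_kernel_config(args)  # 2. config carries local_rank
    # 3. The model would be created here, after steps 1 and 2.
    return kernel_config
```

With this ordering, each process selects its own GPU before the kernel is instantiated, which is what the multi-GPU runs depend on.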

fixing the run and model scripts for running the BingBertSquad

0a2dd4e

RezaYazdaniAminabadi requested review from arashashari, awan-10, cli99, conglongli, eltonzheng, jeffra, minjiaz, niumanar, samyam, ShadenSmith and tjruwase as code owners October 7, 2020 02:14

This was referenced Oct 7, 2020

CUDA Error when run with multiple GPUs microsoft/DeepSpeed#454

Closed

CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx when using deepspeed tranformer kernel microsoft/DeepSpeed#294

Closed
