CUDA BFloat16 Refactor #10085
Conversation
Thanks for making the change. In general, it looks good to me.
  explicit BFloat16(uint16_t v) : val(v) {}
  explicit BFloat16(float v) {
#if defined(USE_ROCM)
  ORT_HOST_DEVICE BFloat16() = default;
Out of curiosity, why is the line above specific to ROCM?
I saw PyTorch does it this way for all default constructors, so I followed the same pattern. Maybe hipcc requires this? But I didn't find any documentation to support that.
OK, let's leave it as it is and revisit it when we support BF16 on AMD GPUs.
  BFloat16() = default;
#endif

  struct FromBitsT {};
What is the reason to introduce struct FromBitsT?
This idea is from PyTorch. If the instance is constructed with the FromBitsT tag, the bits are assigned to val directly (so the real value of the BFloat16 instance is not equal to bits). Without the tag, for example BFloat16(unsigned short value), the constructor is expected to produce a BFloat16 whose value equals value (while the val member holds the encoded bits, not value itself). This is critical for some casting cases, for example BFloat16(1), which casts an int to BFloat16: without FromBitsT, the compiler reports an ambiguous-constructor error because it cannot decide between BFloat16(unsigned short) and BFloat16(float). Even if there were no ambiguity, if the compiler picked BFloat16(unsigned short) and assigned the 1 to the val member directly, we would get a wrong BFloat16 instance. Our MLFloat16 actually has the same bug, but since we don't have code such as MLFloat16(1), we haven't hit the compiler error so far. The tag-dispatch pattern is sketched below.
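To make that concrete, here is a minimal sketch of the pattern, assuming a simplified type (BFloat16Sketch, its FromBits() helper, and the truncating float conversion are illustrative, not the actual onnxruntime definitions):

```cpp
#include <cstdint>
#include <cstring>

struct BFloat16Sketch {
  struct FromBitsT {};
  static constexpr FromBitsT FromBits() { return FromBitsT(); }

  uint16_t val = 0;

  BFloat16Sketch() = default;

  // Raw-bit constructor: the tag says "these are already bfloat16 bits",
  // so they are stored in val as-is.
  constexpr BFloat16Sketch(uint16_t bits, FromBitsT) : val(bits) {}

  // Value constructor: keep the upper 16 bits of the IEEE-754 float
  // representation (plain truncation here, for brevity).
  explicit BFloat16Sketch(float v) {
    uint32_t bits32;
    std::memcpy(&bits32, &v, sizeof(bits32));
    val = static_cast<uint16_t>(bits32 >> 16);
  }
};
```

With this layout, BFloat16Sketch(1) has only one viable constructor (the int converts to float) and represents the value 1.0, while BFloat16Sketch(0x3F80, BFloat16Sketch::FromBits()) stores the bit pattern 0x3F80 in val directly.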
The previous code cast BFloat16 to CUDA's nv_bfloat16 type for computation, which required an A100 to run because nv_bfloat16 arithmetic is only available there. PyTorch uses its own c10::BFloat16 type for computation, and this PR refactors our code to follow the same idea and use our own onnxruntime::BFloat16. The general implementation casts BFloat16 to float for computation, and switches to nv_bfloat16 on A100 via the __CUDA_ARCH__ >= 800 guard, as sketched below.
With this implementation, we can support BFloat16 on most NVIDIA devices, not just A100.
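A minimal sketch of that dispatch, assuming the bfloat16 value is carried as raw uint16_t bits (AddBF16Bits is a hypothetical helper name, not the actual onnxruntime code):

```cuda
#include <cuda_bf16.h>
#include <cstdint>

// Compute in float everywhere, but use native __nv_bfloat16 math when the
// kernel is compiled for sm_80 (A100) or newer.
__device__ __forceinline__ uint16_t AddBF16Bits(uint16_t a, uint16_t b) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
  // Native bfloat16 add on Ampere and newer.
  __nv_bfloat16 r = __hadd(__ushort_as_bfloat16(a), __ushort_as_bfloat16(b));
  return __bfloat16_as_ushort(r);
#else
  // Fallback: bfloat16 is the top 16 bits of an IEEE-754 float, so widen,
  // add in float, and truncate back.
  float fa = __uint_as_float(static_cast<uint32_t>(a) << 16);
  float fb = __uint_as_float(static_cast<uint32_t>(b) << 16);
  return static_cast<uint16_t>(__float_as_uint(fa + fb) >> 16);
#endif
}
```

The same __CUDA_ARCH__ >= 800 split generalizes to the other operators, which is what lets the kernels run on pre-Ampere GPUs while still using the native path on A100.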
Tested the code using ORTModule in two ways (requires the latest nightly PyTorch and ONNX for some BFloat16 support):
Note that PyTorch also uses its own c10::Float16 type for float16 computation in CUDA, while ORT casts to CUDA's half type. This is fine because half is supported by most NVIDIA devices. This PR doesn't touch any logic related to the float16 case.