BERT fp16 accuracy problem #1196
Comments
Hello @chenzhanyiczy, also, have you set strict types when trying mixed precision?
If we want to experiment with accuracy-sensitive layers, we sometimes also need to set the input (the output of the previous layer) to FP32.
Another experiment worth doing: generate an ONNX model with FP16 weights and try running it on onnxruntime. That is the upper bound you can get if you run your model entirely in FP16. You can then focus on running more layers in FP32 to meet a higher accuracy requirement.
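For readers following along, here is a minimal sketch of what "strict types plus per-layer FP32" can look like with the TensorRT Python API; the layer-name substrings below are illustrative assumptions, not names taken from this model:

```python
import tensorrt as trt

# assumes `builder`, `network` and `config` already exist (e.g. from the ONNX parser)
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.STRICT_TYPES)  # ask TRT to obey the per-layer precision requests

for i in range(network.num_layers):
    layer = network.get_layer(i)
    # pin accuracy-sensitive layers (and hence the inputs of their consumers) to FP32;
    # the substrings are only examples of names a TF-exported BERT might use
    if any(s in layer.name for s in ("Softmax", "Tanh", "Erf", "SquaredDifference")):
        layer.precision = trt.DataType.FLOAT
        layer.set_output_type(0, trt.DataType.FLOAT)
```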
@chenzhanyiczy, could you do another experiment and make the whole gelu expression run in FP32 precision? Thanks!
Yes, I also use this flag (STRICT_TYPES) and set the previous layer's output type to FLOAT, but the accuracy still differs a lot. The builder's code is similar to the following.
The network structure is like this:
How does onnx generate fp16 weights? I did not find such a function in onnx; can you provide a reference or a script? Thanks.
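One common way to get an FP16-weight ONNX model (outside of TensorRT) is the float16 converter in the onnxconverter-common package; a rough sketch, with hypothetical file names, assuming that package is installed:

```python
import onnx
from onnxconverter_common import float16

model = onnx.load("bert_base_fp32.onnx")             # hypothetical input path
# keep_io_types=True leaves the model inputs/outputs in FP32 and casts internally
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)
onnx.save(model_fp16, "bert_base_fp16.onnx")         # run this model in onnxruntime to estimate the FP16 upper bound
```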
No, because the output of layer2 is now different. And the activation function in the pooler layer is TanH, not gelu.
@chenzhanyiczy I see you have
I use TensorFlow. Do you have an example for TensorFlow?
@ttyio
Have you checked the output data range distribution for each layer in each encoder? Is it possible that encoder0 and encoder1 are within the fp16 range, but we overflow fp16 starting from encoder2?
Sorry no.
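(For anyone who wants to run that range check, a minimal sketch, assuming the per-encoder outputs of an FP32 run have already been dumped to NumPy files; the file names are hypothetical.)

```python
import numpy as np

FP16_MAX = 65504.0  # largest finite fp16 value

for i in range(12):  # BERT base has 12 encoder layers
    out = np.load(f"encoder_{i}_output.npy")  # hypothetical dump of this encoder's FP32 output
    peak = float(np.abs(out).max())
    print(f"encoder {i}: min={out.min():.3e} max={out.max():.3e} "
          f"overflows fp16: {peak > FP16_MAX}")
```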
Not batchNorm. Could you set FP32 for the
Sorry, no.
@chenzhanyiczy, do you have the verbose log when tanh, pow and softmax are all in fp32? I want to make sure these nodes really run in FP32 precision.
The verbose file is very large; take softmax as an example:
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (SoftMax)
The build code is as follows, plus this line:
config.flags = config.flags | 1 << int(trt.BuilderFlag.STRICT_TYPES)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (SoftMax)
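(Such timing-runner lines are produced when the builder is created with a verbose logger; a minimal sketch of enabling it:)

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)  # prints per-layer timing and tactic selection during build
builder = trt.Builder(TRT_LOGGER)
```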
@chenzhanyiczy
You can first leave only conv/gemm in fp16 precision and run the rest of the nodes in fp32.
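A rough sketch of that experiment, assuming FP16 and strict types are already enabled on `config`: everything except convolution, fully-connected and matrix-multiply layers is pinned to FP32.

```python
fp16_allowed = {
    trt.LayerType.CONVOLUTION,
    trt.LayerType.FULLY_CONNECTED,
    trt.LayerType.MATRIX_MULTIPLY,
}

for i in range(network.num_layers):
    layer = network.get_layer(i)
    if layer.type in fp16_allowed:
        continue
    # skip layers whose output is INT32 (shape/index tensors); their output type cannot be changed
    if layer.get_output(0).dtype == trt.int32:
        continue
    layer.precision = trt.DataType.FLOAT
    layer.set_output_type(0, trt.DataType.FLOAT)
```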
These ops are in the pooler layer. The current accuracy difference is in the layer_xxx layers.
I tried this:
@chenzhanyiczy
@ttyio
@chenzhanyiczy
@ttyio
Either way, the result is wrong. 2 is better than 1, because 1 is wrong at layer_0/output/LayerNorm/moments/variance, while 2 is wrong at layer_2/output/LayerNorm/moments/variance. I don't understand which object strict_type acts on?
@chenzhanyiczy Back to your experiments: the precision setting in 2 is ignored in the final kernel selection; 3 failed, and the error message already tells us why: some layer requires fp16, but fp16 is not enabled in the build config.
@ttyio
@chenzhanyiczy
@ttyio And back to the original case (the output of layer_2/output/LayerNorm/moments/variance), what should I do? I have tried almost everything possible.
Hello @chenzhanyiczy
I tried two cases: strict_type + FP16 mode + all layers run in FP32 (layer precision and output type), and FP32 mode. The results of the two still differ a lot. The original case is still used (the output of layer_2/output/LayerNorm/moments/variance). Why is it so?
@chenzhanyiczy, could you provide the verbose build logs for the two runs? Thanks.
@ttyio OK. The following files are the fp32 mode (default behavior) and the fp16 mode + strict_type + all layers fp32 (precision and output type). Thanks.
Hello @chenzhanyiczy,
@ttyio
And some reformats are automatically added:
I don't know why. I have already set the output type and precision of all layers to float, except these:
These layers' output type and precision can't be set to float, because doing so reports an error such as the following:
Thanks.
Hello @chenzhanyiczy, could you only use
Yes, like this:
But this 'NotEqual__' can't be targeted via the Unary layer type, because, for example, the erf function is also a Unary layer.
@chenzhanyiczy
There seems to be no problem... let me take a look.
Can you use trt to automatically build a bert-base model? No matter how I set things, there is always a big difference. Thank you.
Hello @chenzhanyiczy,
If it is the above example, then the following is the value of this output.
No matter how I set things, fp16 + strict_type + all layers (output type + precision) vs fp32, the results always differ a lot.
We use bert to generate embeddings; this involves algorithmic metrics, so it's not easy to say. :)
Hello @chenzhanyiczy, I checked the internal TRT tests and found that the tolerance for TF bert is
The model we used in our test has no nodes with the name
@ttyio
Hello @chenzhanyiczy
Has this problem been solved? I have the same problem: the results of FP16 and FP32 are very different. I use trt 8.0.3 and a 4-layer bert. @ttyio @chenzhanyiczy
I modified the skipln layer to use float32 dtype, and the result difference is smaller (< 0.0002). @chenzhanyiczy @ttyio
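(For later readers asking what "skipln" means: it is the skip-connection + LayerNorm pattern, i.e. the residual add followed by LayerNorm in each encoder block. A hedged sketch of keeping those nodes in FP32 when building from an ONNX graph, matching layers by name; the name patterns are assumptions about how the exporter names these nodes, not the commenter's actual code.)

```python
for i in range(network.num_layers):
    layer = network.get_layer(i)
    # pin every op that belongs to a LayerNorm subgraph (and the residual add feeding it) to FP32
    if "LayerNorm" in layer.name or layer.name.endswith("/add"):
        if layer.get_output(0).dtype != trt.int32:
            layer.precision = trt.DataType.FLOAT
            layer.set_output_type(0, trt.DataType.FLOAT)
```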
Same problem here, any suggestion?
@ttyio I added the following settings:

```python
'strict_types': trt.BuilderFlag.STRICT_TYPES,
'fp16': trt.BuilderFlag.FP16,
```

and added the following code; can this realize the mixed-precision setting?

```python
for i in range(network.num_layers):
    if network.get_layer(i).type != trt.LayerType.FULLY_CONNECTED and network.get_layer(i).type != trt.LayerType.MATRIX_MULTIPLY and network.get_layer(i).type != trt.LayerType.SOFTMAX:
        network.get_layer(i).precision = trt.DataType.FLOAT
        network.get_layer(i).set_output_type(0, trt.DataType.FLOAT)
```

Unfortunately, I encountered this error:

```
......
[03/02/2022-07:41:14] [TRT] [E] [layers.h::setOutputType::1219] Error Code 3: API Usage Error (Parameter check failed at: /_src/build/cuda-11.4/8.2/x86_64/release/optimizer/api/layers.h::setOutputType::1219, condition: dataType == DataType::kINT32)
[03/02/2022-07:41:14] [TRT] [E] [layers.h::setOutputType::1219] Error Code 3: API Usage Error (Parameter check failed at: /_src/build/cuda-11.4/8.2/x86_64/release/optimizer/api/layers.h::setOutputType::1219, condition: dataType == DataType::kINT32)
......
```

I think it's because the output of the op is DataType::kINT32, but I forced it to DataType::FLOAT. How can this be avoided? Thank you very much.
@zhaohb In your case, don't call
@chenzhanyiczy Could you try TRT 8.2/8.4 and see if the issue still exists? If it does, we will debug it. Thanks
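(Presumably the advice is not to call set_output_type on layers whose output tensor is INT32, which is what the error condition above points at; a sketch of adjusting the loop accordingly:)

```python
for i in range(network.num_layers):
    layer = network.get_layer(i)
    if layer.type in (trt.LayerType.FULLY_CONNECTED,
                      trt.LayerType.MATRIX_MULTIPLY,
                      trt.LayerType.SOFTMAX):
        continue  # leave these to run in FP16
    # INT32 outputs (shapes, gather indices, constants) must keep their type;
    # forcing them to FLOAT triggers the setOutputType API Usage Error above
    if layer.get_output(0).dtype == trt.int32:
        continue
    layer.precision = trt.DataType.FLOAT
    layer.set_output_type(0, trt.DataType.FLOAT)
```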
Closing due to >14 days without activity. Please feel free to reopen if the issue still exists. Thanks
Hello, I met the same problem. Could you please explain what the skipln is and where to modify this code?
Description
When using trt to build an fp16 model, the inference accuracy differs too much from fp32. The model is BERT base. Why?
Environment
TensorRT Version: 7.2.1
NVIDIA GPU: T4
NVIDIA Driver Version: 440.59
CUDA Version: 10.2
CUDNN Version: 8.0.4
Operating System: centos7
Python Version (if applicable): 3.6
Tensorflow Version (if applicable): 1.15.4
PyTorch Version (if applicable):
Baremetal or Container (if so, version):
Steps To Reproduce
Proceed as follows:
1. tf (freeze mode) -> onnx (version: 1.8.1) -> trt engine
2. When building the trt engine, set these parameters:

```python
with builder.create_builder_config() as config:
    config.flags = config.flags | 1 << int(trt.BuilderFlag.FP16)
    ...
```

3. At the same time, I also tried to set the precision on these layers (such as LayerNorm/moments/SquaredDifference, intermediate/dense/Erf, pooler/dense/Tanh, query_head_contrastive/Relu, and so on):

```python
network.get_layer(i).precision = trt.DataType.FLOAT
```

BUT it had no effect (see the sketch below).
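(As referenced in step 3, a sketch of the combination that is usually needed for the per-layer request to stick: the FP16 and STRICT_TYPES flags on the config, plus setting both the layer precision and its output type. The name matching is only an example.)

```python
with builder.create_builder_config() as config:
    config.flags = config.flags | 1 << int(trt.BuilderFlag.FP16)
    config.flags = config.flags | 1 << int(trt.BuilderFlag.STRICT_TYPES)  # without this, precision is only a hint
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if any(s in layer.name for s in ("SquaredDifference", "Erf", "Tanh", "Relu")):
            layer.precision = trt.DataType.FLOAT
            layer.set_output_type(0, trt.DataType.FLOAT)
    # ... then build the engine with this config
```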
I also found something very strange: when I compared layer0 and layer1, the accuracy difference was small, but at layer2 there is a big difference. This model has 12 layers and each layer has the same structure.