
BERT fp16 accuracy problem #1196

Closed
chenzhanyiczy opened this issue Apr 15, 2021 · 43 comments
Labels
triaged Issue has been triaged by maintainers

Comments

@chenzhanyiczy

Description

When I use TRT to build an FP16 engine, the inference results differ far too much from FP32. The model is BERT-base. Why?

Environment

TensorRT Version: 7.2.1
NVIDIA GPU: T4
NVIDIA Driver Version: 440.59
CUDA Version: 10.2
CUDNN Version: 8.0.4
Operating System: centos7
Python Version (if applicable): 3.6
Tensorflow Version (if applicable): 1.15.4
PyTorch Version (if applicable):
Baremetal or Container (if so, version):

Steps To Reproduce

Proceed as follows:
1. TensorFlow (frozen graph) -> ONNX (version 1.8.1) -> TRT engine
2. When building the TRT engine, set these parameters:
    with builder.create_builder_config() as config:
        config.flags = config.flags | 1 << int(trt.BuilderFlag.FP16)
        ...
3. At the same time, I also tried to force FP32 precision on certain layers (such as LayerNorm/moments/SquaredDifference, intermediate/dense/Erf, pooler/dense/Tanh, query_head_contrastive/Relu, and so on):
    network.get_layer(i).precision = trt.DataType.FLOAT
but it had no effect.

I also found something very strange: when I compare the outputs of layer 0 and layer 1, the difference is small, but at layer 2 the difference becomes large. The model has 12 layers, and every layer has the same structure.

@chenzhanyiczy chenzhanyiczy changed the title fp16 accuracy problem BERT fp16 accuracy problem Apr 15, 2021
@ttyio
Collaborator

ttyio commented Apr 16, 2021

Hello @chenzhanyiczy ,
What metric do you use to verify the accuracy? Usually we need a benchmark, e.g., the F1 score on SQuAD. Sometimes a bit-level mismatch won't hurt the final accuracy.

Also, have you set strict types when trying mixed precision?

    config.flags = config.flags | 1<<int(trt.BuilderFlag.STRICT_TYPES)

If we want to experiment on accuracy-sensitive layers, sometimes we also need to set the layer's input (the output of the previous layer) to FP32:

    prev_layer.get_output(0).dtype = trt.DataType.FLOAT

Another experiment worth doing: generate an ONNX model with FP16 weights and try running it in onnxruntime. That is the upper bound you can get if you run your model entirely in FP16 precision. You can then focus on running more layers in FP32 precision to meet a higher accuracy requirement.
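Putting the flags and the per-layer overrides above together, a minimal build sketch might look like the following (the ONNX path and the is_sensitive() filter are hypothetical placeholders, not taken from this issue):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)

def is_sensitive(layer):
    # hypothetical filter: pick the layers that should stay in FP32
    return "LayerNorm" in layer.name or "Softmax" in layer.name

explicit_batch = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
with trt.Builder(TRT_LOGGER) as builder, \
        builder.create_network(explicit_batch) as network, \
        trt.OnnxParser(network, TRT_LOGGER) as parser, \
        builder.create_builder_config() as config:
    with open("model.onnx", "rb") as f:  # placeholder path
        parser.parse(f.read())

    # allow FP16 kernels, but make TRT honor the per-layer overrides below
    config.flags |= 1 << int(trt.BuilderFlag.FP16)
    config.flags |= 1 << int(trt.BuilderFlag.STRICT_TYPES)

    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if is_sensitive(layer):
            layer.precision = trt.DataType.FLOAT
            # pin the output tensor too, so the next layer receives FP32
            layer.get_output(0).dtype = trt.DataType.FLOAT

    engine = builder.build_engine(network, config)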

@ttyio ttyio added Precision: FP16 triaged Issue has been triaged by maintainers labels Apr 16, 2021
@ttyio
Collaborator

ttyio commented Apr 16, 2021

@chenzhanyiczy , could you do another experiment and make the whole GELU expression run in FP32 precision? thanks!

@chenzhanyiczy
Author

chenzhanyiczy commented Apr 16, 2021

Hello @chenzhanyiczy ,
What metric do you use to verify the accuracy? Usually we need a benchmark, e.g., the F1 score on SQuAD. Sometimes a bit-level mismatch won't hurt the final accuracy.

Also, have you set strict types when trying mixed precision?

    config.flags = config.flags | 1<<int(trt.BuilderFlag.STRICT_TYPES)

If we want to experiment on accuracy-sensitive layers, sometimes we also need to set the layer's input (the output of the previous layer) to FP32:

    prev_layer.get_output(0).dtype = trt.DataType.FLOAT

Another experiment worth doing: generate an ONNX model with FP16 weights and try running it in onnxruntime. That is the upper bound you can get if you run your model entirely in FP16 precision. You can then focus on running more layers in FP32 precision to meet a higher accuracy requirement.

Yes, I also use this flag (STRICT_TYPES) and set the previous layer's output type to FLOAT, but the accuracy still differs a lot.

The builder code is similar to the following.
Suppose I want to check the output of layer_2/output/LayerNorm/moments/variance in layer 2; the node before it is SquaredDifference. The strange thing is that the output of this node (variance) in layer 0 and layer 1 is fine; in other words, their accuracy is good.

if network.get_layer(i).name.find("output/LayerNorm/moments/SquaredDifference") != -1 \
        or network.get_layer(i).name.find("intermediate/dense/Erf") != -1:
    for idx in range(network.get_layer(i).num_outputs):
        network.get_layer(i).set_output_type(idx, trt.DataType.FLOAT)
    network.get_layer(i).precision = trt.DataType.FLOAT
....
with builder.create_builder_config() as config:
    config.flags = config.flags | 1 << int(trt.BuilderFlag.FP16)
    ....
    with builder.build_engine(network, config) as engine:
        ...

The network structure is like this:
[image: query_emb_model2 network structure diagram]

How do I generate FP16 weights in ONNX? I did not find such a function in onnx; can you provide a reference or a script? Thanks.

@chenzhanyiczy
Author

@chenzhanyiczy , could you do another experiment and make the whole GELU expression run in FP32 precision? thanks!

No, because the output of layer 2 is already different at that point. Also, the activation function in the pooler layer is Tanh, not GELU.

@ttyio
Collaborator

ttyio commented Apr 16, 2021

@chenzhanyiczy
the ONNX FP16 generation should look like the PyTorch example in onnx/onnx-tensorrt#235

I see you have Erf; is it part of GELU? If it is hard to match the patterns, you could try marking all the Tanh, Pow, and Softmax nodes to run in FP32 precision.
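For the ONNX FP16 question above, one possible route (a sketch, assuming the onnxconverter-common package is available; the file names are placeholders) is to convert an existing FP32 ONNX model offline and then run it in onnxruntime:

import onnx
from onnxconverter_common import float16

# load the FP32 model exported from TF and convert weights/ops to FP16
model = onnx.load("model.onnx")  # placeholder path
model_fp16 = float16.convert_float_to_float16(model)
onnx.save(model_fp16, "model_fp16.onnx")

Running model_fp16.onnx in onnxruntime then gives the all-FP16 accuracy baseline mentioned earlier in this thread.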

@chenzhanyiczy
Author

@chenzhanyiczy
the ONNX FP16 generation should look like the PyTorch example in onnx/onnx-tensorrt#235

I see you have Erf; is it part of GELU? If it is hard to match the patterns, you could try marking all the Tanh, Pow, and Softmax nodes to run in FP32 precision.

I use TensorFlow. Do you have a TensorFlow example?
I tried things like setting all the ops of the batch-norm part to FP32, but it had no effect.
The results of layer 0 and layer 1 are fine (see above), so why is the result of layer 2 different? Their structure is the same.

@chenzhanyiczy
Author

@ttyio
Do you have an example of generating a BERT engine through automatic TRT conversion (i.e., not like demoBERT)?

@ttyio
Collaborator

ttyio commented Apr 19, 2021

@chenzhanyiczy

The results of layer 0 and layer 1 are fine (see above), so why is the result of layer 2 different?

Have you checked the output data range distribution of each layer in each encoder? Is it possible that encoder 0 and encoder 1 stay within the FP16 range, but we overflow FP16 starting from encoder 2?

Do you have a TensorFlow example?

Sorry no.

I tried things like setting all the ops of the batch-norm part to FP32, but it had no effect.

Not batch norm; could you set FP32 for the Tanh, Pow, and Softmax nodes?

Do you have an example of generating a BERT engine through automatic TRT conversion (i.e., not like demoBERT)?

Sorry no.
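Regarding the overflow question above: one way to check it is to dump the intermediate tensors from the FP32 run (for example by marking them as extra network outputs) and compare their ranges against the FP16 limits. A rough sketch, assuming the activations are already available as NumPy arrays:

import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)    # 65504.0
FP16_TINY = float(np.finfo(np.float16).tiny)  # ~6.1e-5, smaller magnitudes lose precision

def check_fp16_range(name, x):
    # x is the FP32 reference activation for one node
    abs_x = np.abs(x)
    amax = abs_x.max()
    nonzero = abs_x[abs_x > 0]
    amin = nonzero.min() if nonzero.size else 0.0
    if amax > FP16_MAX:
        print(f"{name}: overflow risk, max |x| = {amax:.3e}")
    if 0 < amin < FP16_TINY:
        print(f"{name}: underflow risk, min nonzero |x| = {amin:.3e}")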

@chenzhanyiczy
Author

@ttyio

Have you checked the output data range distribution of each layer in each encoder? Is it possible that encoder 0 and encoder 1 stay within the FP16 range, but we overflow FP16 starting from encoder 2?

The structure of each layer is attention -> intermediate -> output, just like BERT-base.
I checked the output of layer_2/output/LayerNorm/moments/SquaredDifference under FP32 and FP16 respectively, and they are basically the same. BUT the outputs of layer_2/output/LayerNorm/moments/variance are totally different (vanishingly small under FP16).
[screenshot of the FP32 vs FP16 variance outputs]

could you set FP32 for the Tanh, Pow, and Softmax nodes?

Yes, I did; it had no effect.

@ttyio
Collaborator

ttyio commented Apr 19, 2021

@chenzhanyiczy , do you have the verbose build log from when Tanh, Pow, and Softmax are all set to FP32? I want to make sure these nodes really run in FP32 precision.

@chenzhanyiczy chenzhanyiczy reopened this Apr 19, 2021
@chenzhanyiczy
Author

do you have the verbose build log from when Tanh, Pow, and Softmax are all set to FP32? I want to make sure these nodes really run in FP32 precision.

The verbose log file is very large; take Softmax as an example:

[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: For layer text/bert/encoder/layer_0/attention/self/Softmax a non-conforming implementation was chosen than was requested i.e. requested layer computation precision and output precision types were ignored because it resulted in faster network performance. Enable strict mode to try force choose a conforming implementation.
[TensorRT] VERBOSE: For layer text/bert/encoder/layer_1/attention/self/Softmax a non-conforming implementation was chosen than was requested i.e. requested layer computation precision and output precision types were ignored because it resulted in faster network performance. Enable strict mode to try force choose a conforming implementation.
[TensorRT] VERBOSE: For layer text/bert/encoder/layer_2/attention/self/Softmax a non-conforming implementation was chosen than was requested i.e. requested layer computation precision and output precision types were ignored because it resulted in faster network performance. Enable strict mode to try force choose a conforming implementation.
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_0/attention/self/Softmax (type=ExtSoftMax, tactic=0)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_1/attention/self/Softmax (type=ExtSoftMax, tactic=0)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_2/attention/self/Softmax (type=ExtSoftMax, tactic=0)
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_0/attention/self/Softmax Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_1/attention/self/Softmax Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_2/attention/self/Softmax Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer(ExtSoftMax): text/bert/encoder/layer_0/attention/self/Softmax, Tactic: 0, (Unnamed Layer* 212) [Shuffle]_output[Half(32)] -> (Unnamed Layer* 213) [Softmax]_output[Half(32)]
[TensorRT] VERBOSE: Layer(MatrixMultiply): text/bert/encoder/layer_0/attention/self/MatMul_1, Tactic: 1, text/bert/encoder/layer_0/attention/self/Softmax:0[Half(12,32,32)], text/bert/encoder/layer_0/attention/self/transpose_2:0[Half(12,32,64)] -> text/bert/encoder/layer_0/attention/self/MatMul_1:0[Half(12,32,64)]
[TensorRT] VERBOSE: Layer(ExtSoftMax): text/bert/encoder/layer_1/attention/self/Softmax, Tactic: 0, (Unnamed Layer* 381) [Shuffle]_output[Half(32)] -> (Unnamed Layer* 382) [Softmax]_output[Half(32)]
[TensorRT] VERBOSE: Layer(MatrixMultiply): text/bert/encoder/layer_1/attention/self/MatMul_1, Tactic: 1, text/bert/encoder/layer_1/attention/self/Softmax:0[Half(12,32,32)], text/bert/encoder/layer_1/attention/self/transpose_2:0[Half(12,32,64)] -> text/bert/encoder/layer_1/attention/self/MatMul_1:0[Half(12,32,64)]
[TensorRT] VERBOSE: Layer(ExtSoftMax): text/bert/encoder/layer_2/attention/self/Softmax, Tactic: 0, (Unnamed Layer* 550) [Shuffle]_output[Half(32)] -> (Unnamed Layer* 551) [Softmax]_output[Half(32)]
[TensorRT] VERBOSE: Layer(MatrixMultiply): text/bert/encoder/layer_2/attention/self/MatMul_1, Tactic: 1, text/bert/encoder/layer_2/attention/self/Softmax:0[Half(12,32,32)], text/bert/encoder/layer_2/attention/self/transpose_2:0[Half(12,32,64)] -> text/bert/encoder/layer_2/attention/self/MatMul_1:0[Half(12,32,64)]

The build code is as follows:

if network.get_layer(i).name.find("attention/self/Softmax") != -1:
    for idx in range(network.get_layer(i).num_outputs):
        network.get_layer(i).set_output_type(idx, trt.DataType.FLOAT)
    network.get_layer(i).precision = trt.DataType.FLOAT
....
config.flags = config.flags | 1 << int(trt.BuilderFlag.FP16)
...

After additionally setting config.flags = config.flags | 1 << int(trt.BuilderFlag.STRICT_TYPES), the Softmax verbose output is:

[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: Adding reformat layer: text/bert/encoder/layer_0/attention/self/Softmax reformatted input 0 ((Unnamed Layer* 212) [Shuffle]_output) from Half(1,32) to Float(1,32)
[TensorRT] VERBOSE: Adding reformat layer: (Unnamed Layer* 214) [Shuffle] reformatted input 0 ((Unnamed Layer* 213) [Softmax]_output) from Float(1,32) to Half(1,32)
[TensorRT] VERBOSE: Adding reformat layer: text/bert/encoder/layer_1/attention/self/Softmax reformatted input 0 ((Unnamed Layer* 381) [Shuffle]_output) from Half(1,32) to Float(1,32)
[TensorRT] VERBOSE: Adding reformat layer: (Unnamed Layer* 383) [Shuffle] reformatted input 0 ((Unnamed Layer* 382) [Softmax]_output) from Float(1,32) to Half(1,32)
[TensorRT] VERBOSE: Adding reformat layer: text/bert/encoder/layer_2/attention/self/Softmax reformatted input 0 ((Unnamed Layer* 550) [Shuffle]_output) from Half(1,32) to Float(1,32)
[TensorRT] VERBOSE: Adding reformat layer: (Unnamed Layer* 552) [Shuffle] reformatted input 0 ((Unnamed Layer* 551) [Softmax]_output) from Float(1,32) to Half(1,32)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_0/attention/self/Softmax input reformatter 0 (type=Reformat, tactic=0)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_0/attention/self/Softmax (type=ExtSoftMax, tactic=0)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_1/attention/self/Softmax input reformatter 0 (type=Reformat, tactic=0)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_1/attention/self/Softmax (type=ExtSoftMax, tactic=0)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_2/attention/self/Softmax input reformatter 0 (type=Reformat, tactic=0)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_2/attention/self/Softmax (type=ExtSoftMax, tactic=0)
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_0/attention/self/Softmax input reformatter 0 Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_0/attention/self/Softmax Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_1/attention/self/Softmax input reformatter 0 Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_1/attention/self/Softmax Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_2/attention/self/Softmax input reformatter 0 Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_2/attention/self/Softmax Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer(Reformat): text/bert/encoder/layer_0/attention/self/Softmax input reformatter 0, Tactic: 0, (Unnamed Layer* 212) [Shuffle]_output[Half(32)] -> text/bert/encoder/layer_0/attention/self/Softmax reformatted input 0[Float(32)]
[TensorRT] VERBOSE: Layer(ExtSoftMax): text/bert/encoder/layer_0/attention/self/Softmax, Tactic: 0, text/bert/encoder/layer_0/attention/self/Softmax reformatted input 0[Float(32)] -> (Unnamed Layer* 213) [Softmax]_output[Float(32)]
[TensorRT] VERBOSE: Layer(Reformat): (Unnamed Layer* 214) [Shuffle] input reformatter 0, Tactic: 0, (Unnamed Layer* 213) [Softmax]_output[Float(32)] -> (Unnamed Layer* 214) [Shuffle] reformatted input 0[Half(32)]
[TensorRT] VERBOSE: Layer(MatrixMultiply): text/bert/encoder/layer_0/attention/self/MatMul_1, Tactic: 1, text/bert/encoder/layer_0/attention/self/Softmax:0[Half(12,32,32)], text/bert/encoder/layer_0/attention/self/transpose_2:0[Half(12,32,64)] -> text/bert/encoder/layer_0/attention/self/MatMul_1:0[Half(12,32,64)]
[TensorRT] VERBOSE: Layer(Reformat): text/bert/encoder/layer_1/attention/self/Softmax input reformatter 0, Tactic: 0, (Unnamed Layer* 381) [Shuffle]_output[Half(32)] -> text/bert/encoder/layer_1/attention/self/Softmax reformatted input 0[Float(32)]
[TensorRT] VERBOSE: Layer(ExtSoftMax): text/bert/encoder/layer_1/attention/self/Softmax, Tactic: 0, text/bert/encoder/layer_1/attention/self/Softmax reformatted input 0[Float(32)] -> (Unnamed Layer* 382) [Softmax]_output[Float(32)]
[TensorRT] VERBOSE: Layer(Reformat): (Unnamed Layer* 383) [Shuffle] input reformatter 0, Tactic: 0, (Unnamed Layer* 382) [Softmax]_output[Float(32)] -> (Unnamed Layer* 383) [Shuffle] reformatted input 0[Half(32)]
[TensorRT] VERBOSE: Layer(MatrixMultiply): text/bert/encoder/layer_1/attention/self/MatMul_1, Tactic: 1, text/bert/encoder/layer_1/attention/self/Softmax:0[Half(12,32,32)], text/bert/encoder/layer_1/attention/self/transpose_2:0[Half(12,32,64)] -> text/bert/encoder/layer_1/attention/self/MatMul_1:0[Half(12,32,64)]
[TensorRT] VERBOSE: Layer(Reformat): text/bert/encoder/layer_2/attention/self/Softmax input reformatter 0, Tactic: 0, (Unnamed Layer* 550) [Shuffle]_output[Half(32)] -> text/bert/encoder/layer_2/attention/self/Softmax reformatted input 0[Float(32)]
[TensorRT] VERBOSE: Layer(ExtSoftMax): text/bert/encoder/layer_2/attention/self/Softmax, Tactic: 0, text/bert/encoder/layer_2/attention/self/Softmax reformatted input 0[Float(32)] -> (Unnamed Layer* 551) [Softmax]_output[Float(32)]
[TensorRT] VERBOSE: Layer(Reformat): (Unnamed Layer* 552) [Shuffle] input reformatter 0, Tactic: 0, (Unnamed Layer* 551) [Softmax]_output[Float(32)] -> (Unnamed Layer* 552) [Shuffle] reformatted input 0[Half(32)]
[TensorRT] VERBOSE: Layer(MatrixMultiply): text/bert/encoder/layer_2/attention/self/MatMul_1, Tactic: 1, text/bert/encoder/layer_2/attention/self/Softmax:0[Half(12,32,32)], text/bert/encoder/layer_2/attention/self/transpose_2:0[Half(12,32,64)] -> text/bert/encoder/layer_2/attention/self/MatMul_1:0[Half(12,32,64)]

@ttyio
Collaborator

ttyio commented Apr 19, 2021

@chenzhanyiczy
How did you find all the Tanh and Pow nodes? A more general way is to check

  network.get_layer(i).type

You could first leave only the conv/gemm layers in FP16 precision and run the rest of the nodes in FP32.
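A quick way to see which layer types are present, so the filter can be written against network.get_layer(i).type instead of layer names, is a simple dump along these lines (a sketch, assuming the network has already been parsed):

for i in range(network.num_layers):
    layer = network.get_layer(i)
    print(i, layer.type, layer.name)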

@chenzhanyiczy
Author

@ttyio

How did you find all the Tanh and Pow nodes?

Those ops are in the pooler layer. The accuracy already differs in the layer_xxx encoder layers.

You could first leave only the conv/gemm layers in FP16 precision and run the rest of the nodes in FP32.

I tried this:

if network.get_layer(i).type == trt.LayerType.FULLY_CONNECTED \
        or network.get_layer(i).type == trt.LayerType.MATRIX_MULTIPLY \
        or network.get_layer(i).type == trt.LayerType.SOFTMAX:
    network.get_layer(i).precision = trt.DataType.HALF
...

with the other layers left in FP32 (the default). When building, I still need to specify the FP16 flag, otherwise I get this error:
[TensorRT] ERROR: fp16 precision has been set for a layer or layer output, but fp16 is not configured in the builder
But specifying FP16 causes all layers to run in FP16...

@ttyio
Collaborator

ttyio commented Apr 19, 2021

@chenzhanyiczy
here's the correct way to use mixed precision:

  1. add the FP16 and STRICT_TYPES flags in the builder config

  2. set the higher precision for the nodes that you do not want to run in the lower precision

     if network.get_layer(i).type != trt.LayerType.FULLY_CONNECTED and ....:
         network.get_layer(i).precision = trt.DataType.FLOAT
         network.get_layer(i).get_output(0).dtype = trt.DataType.FLOAT
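As a more concrete sketch of that recipe (not a drop-in solution; in practice some layers, e.g. shuffles, constants, or layers that compute indices, have to be skipped, as discussed later in this thread):

config.flags |= 1 << int(trt.BuilderFlag.FP16)
config.flags |= 1 << int(trt.BuilderFlag.STRICT_TYPES)

KEEP_FP16 = (trt.LayerType.CONVOLUTION,
             trt.LayerType.FULLY_CONNECTED,
             trt.LayerType.MATRIX_MULTIPLY)

for i in range(network.num_layers):
    layer = network.get_layer(i)
    if layer.type in KEEP_FP16:
        continue  # leave conv/gemm-like layers free to run in FP16
    layer.precision = trt.DataType.FLOAT
    for j in range(layer.num_outputs):
        # don't force INT32 tensors (indices, masks) to FLOAT
        if layer.get_output(j).dtype != trt.DataType.INT32:
            layer.set_output_type(j, trt.DataType.FLOAT)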
    

@chenzhanyiczy
Author

chenzhanyiczy commented Apr 19, 2021

@ttyio
I tried it; with that setting, even the result of layer 0 is no longer accurate. strict_type restricts TRT's kernel selection and forces the FP16 type directly, which is even more harmful.
Can you try using the automatic TRT conversion to build BERT-base? Our model is also based on BERT-base.

@ttyio
Collaborator

ttyio commented Apr 20, 2021

@chenzhanyiczy
If you set the FP16 flag in the builder and mark all layers as FP32 using the code in #1196 (comment), the engine should run all layers in FP32, so why would it be more harmful?

@chenzhanyiczy
Author

@ttyio
I tried the following experiments (the output under inspection is still layer_2/output/LayerNorm/moments/variance).

  1. FP16 mode + strict_type; FULLY_CONNECTED + MATRIX_MULTIPLY + SOFTMAX + ... have precision and output type set to FP16, and the remaining ops have precision and output type set to FP32.

  2. FP16 mode only; FULLY_CONNECTED + MATRIX_MULTIPLY + SOFTMAX + ... have precision and output type set to FP16, and the remaining ops have precision and output type set to FP32.

  3. FP32 mode; FULLY_CONNECTED + MATRIX_MULTIPLY + SOFTMAX + ... have precision and output type set to FP16. The builder reports the following error:
    [TensorRT] ERROR: fp16 precision has been set for a layer or layer output, but fp16 is not configured in the builder

Either way, the result is wrong. 2 is better than 1, because 1 is already wrong at layer_0/output/LayerNorm/moments/variance, while 2 only goes wrong at layer_2/output/LayerNorm/moments/variance.

I don't understand which object strict_type acts on. For example:

config.flags = 1 << int(trt.BuilderFlag.FP16) | 1 << int(trt.BuilderFlag.STRICT_TYPES)
if network.get_layer(i).type == trt.LayerType.SOFTMAX:
    network.get_layer(i).precision = trt.DataType.FLOAT
    network.get_layer(i).set_output_type(0, trt.DataType.FLOAT)
...

Does strict_type here restrict the other layers' precision and output to FP16, or does it restrict the Softmax precision and output to FP32?

@ttyio
Collaborator

ttyio commented Apr 21, 2021

@chenzhanyiczy
let me explain strict_type.
When we set a precision flag in the builder config, this tells TRT which precisions besides FP32 are allowed for the nodes in the network; TRT then selects the fastest kernels.
When a layer has a specified precision and strict_type is not added, this only changes the fusion logic in TRT; TRT will still ignore the precision and select the fastest kernels.
When a layer has a specified precision and strict_type is added, it also affects the final kernel selection: a kernel that matches the user's precision requirement will be selected even if it is not the fastest one.

Back to your experiments: the precision settings in 2 are ignored during final kernel selection; 3 failed, and the error message already tells us why: some layer has an FP16 requirement, but FP16 is not enabled in the builder config.

@chenzhanyiczy
Author

@ttyio
I'm a bit confused. For example, with the FP16 flag + strict_type, I set the precision of the Softmax layer to FP32, like this:
softmax_layer.precision = trt.DataType.FLOAT

  1. Will Softmax choose an FP32 kernel?
  2. For the other ops whose precision is not set manually (in the network parsed by the ONNX parser), what is the behavior? Will they all choose FP16 kernels?

@ttyio
Collaborator

ttyio commented Apr 22, 2021

@chenzhanyiczy
the code should look like this:

  softmax_layer.precision = trt.DataType.FLOAT
  softmax_layer.get_output(0).dtype = trt.DataType.FLOAT

  1. Yes.
  2. They choose the fastest path.

@chenzhanyiczy
Author

@ttyio
Thanks. So strict_type only applies to layers whose precision and output type are set manually, right?

And back to the original case (the output of layer_2/output/LayerNorm/moments/variance), what should I do? I have tried almost everything possible.

@ttyio
Collaborator

ttyio commented Apr 23, 2021

Hello @chenzhanyiczy
Since FP32 precision works, I expect that setting both strict_type and FP16 in the builder flags and marking all layers to run in FP32 should also work. We can then use this as a baseline and move more layers into FP16 precision; eventually we get a mixed-precision network where all the sensitive layers run in FP32 and the remaining ones run in FP16. This is the first step; please start with this, thanks!

@chenzhanyiczy
Author

@ttyio

Since FP32 precision works, I expect that setting both strict_type and FP16 in the builder flags and marking all layers to run in FP32 should also work.

I tried two cases: strict_type + FP16 mode with all layers set to FP32 (layer precision and output type), and plain FP32 mode. The results of the two still differ a lot. I am still looking at the original case (the output of layer_2/output/LayerNorm/moments/variance). Why is that?

@ttyio
Collaborator

ttyio commented Apr 23, 2021

@chenzhanyiczy , could you provide the verbose build log for the 2 runs? thanks.

@chenzhanyiczy
Author

chenzhanyiczy commented Apr 25, 2021

@chenzhanyiczy , could you provide the verbose build log for the 2 runs? thanks.

@ttyio OK. The following files are for FP32 mode (default behavior) and FP16 mode + strict_type + all layers set to FP32 (precision and output type). Thanks.
build_fp32_layer2_output_LayerNorm_moments_variance.tar.gz
build_fp16_layer2_output_LayerNorm_moments_variance.tar.gz

@ttyio
Collaborator

ttyio commented Apr 26, 2021

Hello @chenzhanyiczy ,
Check the Engine Layer Information section of the log; there are still layers not running in FP32.
For some layers, like the OneHot plugin, I think you only need to set the output type, because float is not acceptable as the layer precision there.
The MatMul layers before and after GELU are also in FP16. You can grep for dense/Erf to find the GELU, then check the MatMul layers before and after it; you can see they run from half to half. Could you make sure they are all set correctly? Thanks!

@chenzhanyiczy
Author

@ttyio
Yes, some are still in FP16, because the builder emits these warnings:

[TensorRT] WARNING: No implementation of layer text/bert/encoder/layer_2/intermediate/dense/MatMul + text/bert/encoder/layer_2/intermediate/dense/bias__57 + (Unnamed Layer* 604) [Shuffle] + unsqueeze_node_after_text/bert/encoder/layer_2/intermediate/dense/bias__57 + (Unnamed Layer* 604) [Shuffle] + text/bert/encoder/layer_2/intermediate/dense/BiasAdd obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation

And some reformats are automatically added:

[TensorRT] VERBOSE: Adding reformat layer: PWN(PWN(PWN(text/bert/encoder/layer_0/intermediate/dense/add/x:0_14 + (Unnamed Layer* 442) [Shuffle], PWN(PWN(PWN(text/bert/encoder/layer_1/intermediate/dense/Sqrt__41_13 + (Unnamed Layer* 438) [Shuffle], text/bert/encoder/layer_1/intermediate/dense/truediv), text/bert/encoder/layer_1/intermediate/dense/Erf), text/bert/encoder/layer_1/intermediate/dense/add)), PWN(text/bert/encoder/layer_0/intermediate/dense/mul/x:0_15 + (Unnamed Layer* 445) [Shuffle], text/bert/encoder/layer_1/intermediate/dense/mul)), text/bert/encoder/layer_1/intermediate/dense/mul_1) reformatted input 0 (text/bert/encoder/layer_1/intermediate/dense/BiasAdd:0) from Half(1,3072) to Float(1,3072)

I don't know why. I have already set the output type and precision of all layers to float, except for these:

if network.get_layer(i).name.find("zeros_like/Const") != -1 \
        or network.get_layer(i).name.find("NotEqual/y") != -1 \
        or network.get_layer(i).name.find("const_fold_opt") != -1 \
        or network.get_layer(i).name.find("Concat__") != -1 \
        or network.get_layer(i).name.find("NotEqual__") != -1 \
        or network.get_layer(i).type == trt.LayerType.CONCATENATION \
        or network.get_layer(i).type == trt.LayerType.SHUFFLE \
        or network.get_layer(i).type == trt.LayerType.IDENTITY:
    continue
...

These layers cannot have their output type and precision set to float, because doing so reports an error such as the following:

INFO:root:layer name = [(Unnamed Layer* 5) [Shuffle]], layer type = [LayerType.SHUFFLE] precision = [DataType.FLOAT]
...
[TensorRT] ERROR: (Unnamed Layer* 5) [Shuffle]: cannot use precision Float for layer that computes indices
[TensorRT] ERROR: Layer (Unnamed Layer* 5) [Shuffle] failed validation

thanks.

@ttyio
Collaborator

ttyio commented May 11, 2021

Hello @chenzhanyiczy

could you use only network.get_layer(i).type as the filter condition? Filtering on network.get_layer(i).name seems risky to me, thanks!

@chenzhanyiczy
Author

@ttyio

could you use only network.get_layer(i).type as the filter condition?

Yes, like this:

if network.get_layer(i).name.find("NotEqual__") != -1 \
        or network.get_layer(i).type == trt.LayerType.CONSTANT \
        or network.get_layer(i).type == trt.LayerType.CONCATENATION \
        or network.get_layer(i).type == trt.LayerType.SHUFFLE \
        or network.get_layer(i).type == trt.LayerType.IDENTITY:
    continue
....

But the 'NotEqual__' condition cannot be replaced with trt.LayerType.UNARY, because other unary ops, for example the Erf function, are also of that layer type.

@ttyio
Collaborator

ttyio commented May 12, 2021

@chenzhanyiczy
Could you elaborate on why we cannot force the unary layers to run in FP32 precision? thanks

@chenzhanyiczy
Author

@ttyio

Could you elaborate on why we cannot force the unary layers to run in FP32 precision?

There seems to be no problem with that... let me take another look.
Even with these set, under FP16 + strict_type + all layers (output type + precision) forced to FP32, the result is still different from FP32. And during the build there are all these warnings; in other words, is an FP16 implementation perhaps still being selected?

[TensorRT] WARNING: No implementation of layer text/bert/encoder/layer_2/attention/output/dense/MatMul + text/bert/encoder/layer_2/attention/output/dense/bias__53 + (Unnamed Layer* 570) [Shuffle] + unsqueeze_node_after_text/bert/encoder/layer_2/attention/output/dense/bias__53 + (Unnamed Layer* 570) [Shuffle] + text/bert/encoder/layer_2/attention/output/dense/BiasAdd obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.
[TensorRT] WARNING: No implementation of layer text/bert/encoder/layer_2/intermediate/dense/MatMul + text/bert/encoder/layer_2/intermediate/dense/bias__57 + (Unnamed Layer* 604) [Shuffle] + unsqueeze_node_after_text/bert/encoder/layer_2/intermediate/dense/bias__57 + (Unnamed Layer* 604) [Shuffle] + text/bert/encoder/layer_2/intermediate/dense/BiasAdd obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.
[TensorRT] WARNING: No implementation of layer text/bert/encoder/layer_2/output/dense/MatMul + text/bert/encoder/layer_2/output/dense/bias__60 + (Unnamed Layer* 630) [Shuffle] + unsqueeze_node_after_text/bert/encoder/layer_2/output/dense/bias__60 + (Unnamed Layer* 630) [Shuffle] + text/bert/encoder/layer_2/output/dense/BiasAdd obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.

Can you build a BERT-base model with the automatic TRT conversion yourself? No matter what I set, there is always a big difference. Thank you.

@ttyio
Collaborator

ttyio commented May 19, 2021

Hello @chenzhanyiczy ,
What metric do you use to check the accuracy? Do you have data like the SQuAD F1 value? Thanks

@chenzhanyiczy
Author

@ttyio

What metric do you use to check the accuracy?

If we are talking about the example above, the metric is simply the values of this output (see the files below).

@ttyio OK. The following files are for FP32 mode (default behavior) and FP16 mode + strict_type + all layers set to FP32 (precision and output type). Thanks.
build_fp32_layer2_output_LayerNorm_moments_variance.tar.gz
build_fp16_layer2_output_LayerNorm_moments_variance.tar.gz

No matter how I set things, comparing FP16 + strict_type + all layers (output type + precision) against FP32, the results always differ a lot.

do you have data like the SQuAD F1 value?

We use BERT to generate embeddings; the downstream metric involves our own algorithmic indicators, so it's not easy to share. :)
Have you tried automatically building a BERT-base model with TRT? I think that should make this easier to reproduce.
Thank you!

@ttyio
Collaborator

ttyio commented May 20, 2021

Hello @chenzhanyiczy , I checked the internal TRT tests, and the tolerance we use for TF BERT is

          rtol=1e-3, atol=1.5

The model we use in our test has no nodes named zeros_like, so there may be some differences from yours. Have you tried training your model in FP16? Thanks
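For reference, a comparison against that tolerance can be as simple as the following sketch, where the output dumps are assumed to come from the TRT FP16 run and an FP32 reference (the .npy file names are placeholders):

import numpy as np

trt_out = np.load("trt_fp16_output.npy")  # placeholder dump from the TRT FP16 run
ref_out = np.load("fp32_reference.npy")   # placeholder dump from the FP32 reference

ok = np.allclose(trt_out, ref_out, rtol=1e-3, atol=1.5)
max_abs_err = float(np.max(np.abs(trt_out - ref_out)))
print(f"within tolerance: {ok}, max abs error: {max_abs_err:.4f}")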

@chenzhanyiczy
Author

@ttyio
Can you share the code for the automatic TRT build? Do you change any layer to FP32? Thanks.
We see large differences under FP16 (FP16 + strict_type + all layers (output type + precision)), on the order of rtol=1e-2.
The zeros_like layer is only used for padding, so it should not affect the accuracy. Training in FP16 is more difficult...

@ttyio
Collaborator

ttyio commented May 26, 2021

Hello @chenzhanyiczy
The TRT test is simple: we just use Polygraphy to run the network with TRT FP16 and onnxruntime FP32; it does not cover any strict_type setting.

@mdztravelling

Has this problem been solved? I have the same problem: the FP16 and FP32 results differ a lot. I use TRT 8.0.3 and a 4-layer BERT. @ttyio @chenzhanyiczy

@mdztravelling

I modified the skipln layer to use the float32 dtype, and the result difference is smaller (< 0.0002). @chenzhanyiczy @ttyio

def skipln(prefix, config, init_dict, network, input_tensor, skip, is_last_skipln=False):
    """ 
    Add the skip layer
    """
    hidden_size = config.hidden_size
    #dtype = config.get_trt_dtype()
    dtype = trt.float32    # modify here
   ...

@yushcs

yushcs commented Mar 1, 2022

Same problem here, any suggestions?

@zhaohb

zhaohb commented Mar 2, 2022

@ttyio
Hi, I also want to achieve mixed precision in TRT.

I added the following settings:

'strict_types': trt.BuilderFlag.STRICT_TYPES,
'fp16': trt.BuilderFlag.FP16,

And I added the following code. Can this achieve the mixed-precision setting?

        for i in range(network.num_layers):
            if network.get_layer(i).type != trt.LayerType.FULLY_CONNECTED and network.get_layer(i).type != trt.LayerType.MATRIX_MULTIPLY and network.get_layer(i).type != trt.LayerType.SOFTMAX:
                network.get_layer(i).precision = trt.DataType.FLOAT
                network.get_layer(i).set_output_type(0, trt.DataType.FLOAT)

Unfortunately, I encountered this error:

......
[03/02/2022-07:41:14] [TRT] [E] [layers.h::setOutputType::1219] Error Code 3: API Usage Error (Parameter check failed at: /_src/build/cuda-11.4/8.2/x86_64/release/optimizer/api/layers.h::setOutputType::1219, condition: dataType == DataType::kINT32
)
[03/02/2022-07:41:14] [TRT] [E] [layers.h::setOutputType::1219] Error Code 3: API Usage Error (Parameter check failed at: /_src/build/cuda-11.4/8.2/x86_64/release/optimizer/api/layers.h::setOutputType::1219, condition: dataType == DataType::kINT32
......

I think it's because the output of the op is DataType::kINT32, but I forcibly change it to DataType::FLOAT. How can this be avoided? Thank you very much.

@nvpohanh
Collaborator

@zhaohb In your case, don't call set_output_type if layer.get_output_type(0) returns kINT32.

@chenzhanyiczy Could you try TRT 8.2/8.4 and see if the issue still exists? If it does, we will debug it. Thanks
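Following that suggestion, the per-layer loop from the earlier comment can guard against INT32 outputs before overriding the type. A rough sketch (layers that compute indices may also need to be skipped entirely):

for i in range(network.num_layers):
    layer = network.get_layer(i)
    if layer.type in (trt.LayerType.FULLY_CONNECTED,
                      trt.LayerType.MATRIX_MULTIPLY,
                      trt.LayerType.SOFTMAX):
        continue
    layer.precision = trt.DataType.FLOAT
    for j in range(layer.num_outputs):
        # skip index/int tensors; forcing them to FLOAT triggers the
        # setOutputType API usage error shown above
        if layer.get_output_type(j) != trt.DataType.INT32:
            layer.set_output_type(j, trt.DataType.FLOAT)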

@nvpohanh
Collaborator

nvpohanh commented Jul 1, 2022

Closing due to >14 days without activity. Please feel free to reopen if the issue still exists. Thanks

@ArtemisZGL

I modified the skipln layer to use the float32 dtype, and the result difference is smaller (< 0.0002). @chenzhanyiczy @ttyio

def skipln(prefix, config, init_dict, network, input_tensor, skip, is_last_skipln=False):
    """ 
    Add the skip layer
    """
    hidden_size = config.hidden_size
    #dtype = config.get_trt_dtype()
    dtype = trt.float32    # modify here
   ...

Hello, I ran into the same problem. Could you please explain what skipln is and where to modify this code?

Labels
triaged Issue has been triaged by maintainers

7 participants