
BERT fp16 accuracy problem #1196

Closed
chenzhanyiczy opened this issue Apr 15, 2021 · 43 comments
Labels
triaged Issue has been triaged by maintainers

Comments

@chenzhanyiczy

Description

When I use TRT to build an FP16 engine, the inference results differ far too much from FP32. The model is BERT-base. Why?

Environment

TensorRT Version: 7.2.1
NVIDIA GPU: T4
NVIDIA Driver Version: 440.59
CUDA Version: 10.2
CUDNN Version: 8.0.4
Operating System: centos7
Python Version (if applicable): 3.6
Tensorflow Version (if applicable): 1.15.4
PyTorch Version (if applicable):
Baremetal or Container (if so, version):

Steps To Reproduce

Proceed as follows:
1. TensorFlow (frozen graph) -> ONNX (version 1.8.1) -> TRT engine
2. When building the TRT engine, set these parameters:
    with builder.create_builder_config() as config:
        config.flags = config.flags | 1 << int(trt.BuilderFlag.FP16)
        ...
3. At the same time, I also tried to force FP32 precision on certain layers (such as LayerNorm/moments/SquaredDifference, intermediate/dense/Erf, pooler/dense/Tanh, query_head_contrastive/Relu, and so on):
    network.get_layer(i).precision = trt.DataType.FLOAT
but it had no effect.

I also found something very strange: when I compare the outputs of layer 0 and layer 1, the difference is small, but at layer 2 the difference becomes large. The model has 12 layers, and every layer has the same structure.

@chenzhanyiczy chenzhanyiczy changed the title fp16 accuracy problem BERT fp16 accuracy problem Apr 15, 2021
@ttyio
Collaborator

ttyio commented Apr 16, 2021

Hello @chenzhanyiczy ,
What metric do you use to verify the accuracy? Usually we need a benchmark, e.g., the F1 score on SQuAD. Sometimes a bit-level mismatch won't hurt the final accuracy.

Also, have you set strict types when trying mixed precision?

    config.flags = config.flags | 1<<int(trt.BuilderFlag.STRICT_TYPES)

If we want to experiment on accuracy-sensitive layers, sometimes we also need to set the layer's input (the output of the previous layer) to FP32:

    prev_layer.get_output(0).dtype = trt.DataType.FLOAT

Another experiment worth doing: generate an ONNX model with FP16 weights and try running it in onnxruntime. That is the upper bound you can get if you run your model entirely in FP16 precision. You can then focus on running more layers in FP32 precision to meet a higher accuracy requirement.
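Putting the flags and the per-layer overrides above together, a minimal build sketch might look like the following (the ONNX path and the is_sensitive() filter are hypothetical placeholders, not taken from this issue):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)

def is_sensitive(layer):
    # hypothetical filter: pick the layers that should stay in FP32
    return "LayerNorm" in layer.name or "Softmax" in layer.name

explicit_batch = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
with trt.Builder(TRT_LOGGER) as builder, \
        builder.create_network(explicit_batch) as network, \
        trt.OnnxParser(network, TRT_LOGGER) as parser, \
        builder.create_builder_config() as config:
    with open("model.onnx", "rb") as f:  # placeholder path
        parser.parse(f.read())

    # allow FP16 kernels, but make TRT honor the per-layer overrides below
    config.flags |= 1 << int(trt.BuilderFlag.FP16)
    config.flags |= 1 << int(trt.BuilderFlag.STRICT_TYPES)

    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if is_sensitive(layer):
            layer.precision = trt.DataType.FLOAT
            # pin the output tensor too, so the next layer receives FP32
            layer.get_output(0).dtype = trt.DataType.FLOAT

    engine = builder.build_engine(network, config)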

@ttyio ttyio added Precision: FP16 triaged Issue has been triaged by maintainers labels Apr 16, 2021
@ttyio
Collaborator

ttyio commented Apr 16, 2021

@chenzhanyiczy , could you do another experiment and make the whole GELU expression run in FP32 precision? thanks!

@chenzhanyiczy
Author

chenzhanyiczy commented Apr 16, 2021

Hello @chenzhanyiczy ,
What metric do you use to verify the accuracy? Usually we need a benchmark, e.g., the F1 score on SQuAD. Sometimes a bit-level mismatch won't hurt the final accuracy.

Also, have you set strict types when trying mixed precision?

    config.flags = config.flags | 1<<int(trt.BuilderFlag.STRICT_TYPES)

If we want to experiment on accuracy-sensitive layers, sometimes we also need to set the layer's input (the output of the previous layer) to FP32:

    prev_layer.get_output(0).dtype = trt.DataType.FLOAT

Another experiment worth doing: generate an ONNX model with FP16 weights and try running it in onnxruntime. That is the upper bound you can get if you run your model entirely in FP16 precision. You can then focus on running more layers in FP32 precision to meet a higher accuracy requirement.

Yes, I also use this flag (STRICT_TYPES) and set the previous layer's output type to FLOAT, but the accuracy still differs a lot.

The builder code is similar to the following.
Suppose I want to check the output of layer_2/output/LayerNorm/moments/variance in layer 2; the node before it is SquaredDifference. The strange thing is that the output of this node (variance) in layer 0 and layer 1 is fine; in other words, their accuracy is good.

if network.get_layer(i).name.find("output/LayerNorm/moments/SquaredDifference") != -1 \
        or network.get_layer(i).name.find("intermediate/dense/Erf") != -1:
    for idx in range(network.get_layer(i).num_outputs):
        network.get_layer(i).set_output_type(idx, trt.DataType.FLOAT)
    network.get_layer(i).precision = trt.DataType.FLOAT
....
with builder.create_builder_config() as config:
    config.flags = config.flags | 1 << int(trt.BuilderFlag.FP16)
    ....
    with builder.build_engine(network, config) as engine:
        ...

The network structure is like this:
[image: query_emb_model2 network structure diagram]

How do I generate FP16 weights in ONNX? I did not find such a function in onnx; can you provide a reference or a script? Thanks.

@chenzhanyiczy
Author

@chenzhanyiczy , could you do another experiment and make the whole GELU expression run in FP32 precision? thanks!

No, because the output of layer 2 is already different at that point. Also, the activation function in the pooler layer is Tanh, not GELU.

@ttyio
Collaborator

ttyio commented Apr 16, 2021

@chenzhanyiczy
the ONNX FP16 generation should look like the PyTorch example in onnx/onnx-tensorrt#235

I see you have Erf; is it part of GELU? If it is hard to match the patterns, you could try marking all the Tanh, Pow, and Softmax nodes to run in FP32 precision.
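For the ONNX FP16 question above, one possible route (a sketch, assuming the onnxconverter-common package is available; the file names are placeholders) is to convert an existing FP32 ONNX model offline and then run it in onnxruntime:

import onnx
from onnxconverter_common import float16

# load the FP32 model exported from TF and convert weights/ops to FP16
model = onnx.load("model.onnx")  # placeholder path
model_fp16 = float16.convert_float_to_float16(model)
onnx.save(model_fp16, "model_fp16.onnx")

Running model_fp16.onnx in onnxruntime then gives the all-FP16 accuracy baseline mentioned earlier in this thread.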

@chenzhanyiczy
Author

@chenzhanyiczy
the ONNX FP16 generation should look like the PyTorch example in onnx/onnx-tensorrt#235

I see you have Erf; is it part of GELU? If it is hard to match the patterns, you could try marking all the Tanh, Pow, and Softmax nodes to run in FP32 precision.

I use TensorFlow. Do you have a TensorFlow example?
I tried things like setting all the ops of the batch-norm part to FP32, but it had no effect.
The results of layer 0 and layer 1 are fine (see above), so why is the result of layer 2 different? Their structure is the same.

@chenzhanyiczy
Author

@ttyio
Do you have an example of generating a BERT engine through automatic TRT conversion (i.e., not like demoBERT)?

@ttyio
Collaborator

ttyio commented Apr 19, 2021

@chenzhanyiczy

The results of layer 0 and layer 1 are fine (see above), so why is the result of layer 2 different?

Have you checked the output data range distribution of each layer in each encoder? Is it possible that encoder 0 and encoder 1 stay within the FP16 range, but we overflow FP16 starting from encoder 2?

Do you have a TensorFlow example?

Sorry no.

I tried things like setting all the ops of the batch-norm part to FP32, but it had no effect.

Not batch norm; could you set FP32 for the Tanh, Pow, and Softmax nodes?

Do you have an example of generating a BERT engine through automatic TRT conversion (i.e., not like demoBERT)?

Sorry no.
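Regarding the overflow question above: one way to check it is to dump the intermediate tensors from the FP32 run (for example by marking them as extra network outputs) and compare their ranges against the FP16 limits. A rough sketch, assuming the activations are already available as NumPy arrays:

import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)    # 65504.0
FP16_TINY = float(np.finfo(np.float16).tiny)  # ~6.1e-5, smaller magnitudes lose precision

def check_fp16_range(name, x):
    # x is the FP32 reference activation for one node
    abs_x = np.abs(x)
    amax = abs_x.max()
    nonzero = abs_x[abs_x > 0]
    amin = nonzero.min() if nonzero.size else 0.0
    if amax > FP16_MAX:
        print(f"{name}: overflow risk, max |x| = {amax:.3e}")
    if 0 < amin < FP16_TINY:
        print(f"{name}: underflow risk, min nonzero |x| = {amin:.3e}")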

@chenzhanyiczy
Author

@ttyio

Have you checked the output data range distribution of each layer in each encoder? Is it possible that encoder 0 and encoder 1 stay within the FP16 range, but we overflow FP16 starting from encoder 2?

The structure of each layer is attention -> intermediate -> output, just like BERT-base.
I checked the output of layer_2/output/LayerNorm/moments/SquaredDifference under FP32 and FP16 respectively, and they are basically the same. BUT the outputs of layer_2/output/LayerNorm/moments/variance are totally different (vanishingly small under FP16).
[screenshot of the FP32 vs FP16 variance outputs]

could you set FP32 for the Tanh, Pow, and Softmax nodes?

Yes, I did; it had no effect.

@ttyio
Collaborator

ttyio commented Apr 19, 2021

@chenzhanyiczy , do you have the verbose build log from when Tanh, Pow, and Softmax are all set to FP32? I want to make sure these nodes really run in FP32 precision.

@chenzhanyiczy chenzhanyiczy reopened this Apr 19, 2021
@chenzhanyiczy
Author

do you have the verbose build log from when Tanh, Pow, and Softmax are all set to FP32? I want to make sure these nodes really run in FP32 precision.

The verbose log file is very large; take Softmax as an example:

[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: For layer text/bert/encoder/layer_0/attention/self/Softmax a non-conforming implementation was chosen than was requested i.e. requested layer computation precision and output precision types were ignored because it resulted in faster network performance. Enable strict mode to try force choose a conforming implementation.
[TensorRT] VERBOSE: For layer text/bert/encoder/layer_1/attention/self/Softmax a non-conforming implementation was chosen than was requested i.e. requested layer computation precision and output precision types were ignored because it resulted in faster network performance. Enable strict mode to try force choose a conforming implementation.
[TensorRT] VERBOSE: For layer text/bert/encoder/layer_2/attention/self/Softmax a non-conforming implementation was chosen than was requested i.e. requested layer computation precision and output precision types were ignored because it resulted in faster network performance. Enable strict mode to try force choose a conforming implementation.
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_0/attention/self/Softmax (type=ExtSoftMax, tactic=0)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_1/attention/self/Softmax (type=ExtSoftMax, tactic=0)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_2/attention/self/Softmax (type=ExtSoftMax, tactic=0)
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_0/attention/self/Softmax Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_1/attention/self/Softmax Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_2/attention/self/Softmax Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer(ExtSoftMax): text/bert/encoder/layer_0/attention/self/Softmax, Tactic: 0, (Unnamed Layer* 212) [Shuffle]_output[Half(32)] -> (Unnamed Layer* 213) [Softmax]_output[Half(32)]
[TensorRT] VERBOSE: Layer(MatrixMultiply): text/bert/encoder/layer_0/attention/self/MatMul_1, Tactic: 1, text/bert/encoder/layer_0/attention/self/Softmax:0[Half(12,32,32)], text/bert/encoder/layer_0/attention/self/transpose_2:0[Half(12,32,64)] -> text/bert/encoder/layer_0/attention/self/MatMul_1:0[Half(12,32,64)]
[TensorRT] VERBOSE: Layer(ExtSoftMax): text/bert/encoder/layer_1/attention/self/Softmax, Tactic: 0, (Unnamed Layer* 381) [Shuffle]_output[Half(32)] -> (Unnamed Layer* 382) [Softmax]_output[Half(32)]
[TensorRT] VERBOSE: Layer(MatrixMultiply): text/bert/encoder/layer_1/attention/self/MatMul_1, Tactic: 1, text/bert/encoder/layer_1/attention/self/Softmax:0[Half(12,32,32)], text/bert/encoder/layer_1/attention/self/transpose_2:0[Half(12,32,64)] -> text/bert/encoder/layer_1/attention/self/MatMul_1:0[Half(12,32,64)]
[TensorRT] VERBOSE: Layer(ExtSoftMax): text/bert/encoder/layer_2/attention/self/Softmax, Tactic: 0, (Unnamed Layer* 550) [Shuffle]_output[Half(32)] -> (Unnamed Layer* 551) [Softmax]_output[Half(32)]
[TensorRT] VERBOSE: Layer(MatrixMultiply): text/bert/encoder/layer_2/attention/self/MatMul_1, Tactic: 1, text/bert/encoder/layer_2/attention/self/Softmax:0[Half(12,32,32)], text/bert/encoder/layer_2/attention/self/transpose_2:0[Half(12,32,64)] -> text/bert/encoder/layer_2/attention/self/MatMul_1:0[Half(12,32,64)]

The build code is as follows:

if network.get_layer(i).name.find("attention/self/Softmax") != -1:
    for idx in range(network.get_layer(i).num_outputs):
        network.get_layer(i).set_output_type(idx, trt.DataType.FLOAT)
    network.get_layer(i).precision = trt.DataType.FLOAT
....
config.flags = config.flags | 1 << int(trt.BuilderFlag.FP16)
...

After additionally setting config.flags = config.flags | 1 << int(trt.BuilderFlag.STRICT_TYPES), the Softmax verbose output is:

[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_0/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_1/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (SoftMax)
[TensorRT] VERBOSE: --------------- Timing Runner: text/bert/encoder/layer_2/attention/self/Softmax (ExtSoftMax)
[TensorRT] VERBOSE: >>>>>>>>>>>>>>> Chose Runner Type: ExtSoftMax Tactic: 0
[TensorRT] VERBOSE: Adding reformat layer: text/bert/encoder/layer_0/attention/self/Softmax reformatted input 0 ((Unnamed Layer* 212) [Shuffle]_output) from Half(1,32) to Float(1,32)
[TensorRT] VERBOSE: Adding reformat layer: (Unnamed Layer* 214) [Shuffle] reformatted input 0 ((Unnamed Layer* 213) [Softmax]_output) from Float(1,32) to Half(1,32)
[TensorRT] VERBOSE: Adding reformat layer: text/bert/encoder/layer_1/attention/self/Softmax reformatted input 0 ((Unnamed Layer* 381) [Shuffle]_output) from Half(1,32) to Float(1,32)
[TensorRT] VERBOSE: Adding reformat layer: (Unnamed Layer* 383) [Shuffle] reformatted input 0 ((Unnamed Layer* 382) [Softmax]_output) from Float(1,32) to Half(1,32)
[TensorRT] VERBOSE: Adding reformat layer: text/bert/encoder/layer_2/attention/self/Softmax reformatted input 0 ((Unnamed Layer* 550) [Shuffle]_output) from Half(1,32) to Float(1,32)
[TensorRT] VERBOSE: Adding reformat layer: (Unnamed Layer* 552) [Shuffle] reformatted input 0 ((Unnamed Layer* 551) [Softmax]_output) from Float(1,32) to Half(1,32)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_0/attention/self/Softmax input reformatter 0 (type=Reformat, tactic=0)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_0/attention/self/Softmax (type=ExtSoftMax, tactic=0)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_1/attention/self/Softmax input reformatter 0 (type=Reformat, tactic=0)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_1/attention/self/Softmax (type=ExtSoftMax, tactic=0)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_2/attention/self/Softmax input reformatter 0 (type=Reformat, tactic=0)
[TensorRT] VERBOSE: Debug synchronize completed successfully after build for layer: text/bert/encoder/layer_2/attention/self/Softmax (type=ExtSoftMax, tactic=0)
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_0/attention/self/Softmax input reformatter 0 Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_0/attention/self/Softmax Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_1/attention/self/Softmax input reformatter 0 Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_1/attention/self/Softmax Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_2/attention/self/Softmax input reformatter 0 Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer: text/bert/encoder/layer_2/attention/self/Softmax Weights: 0 HostPersistent: 0 DevicePersistent: 0
[TensorRT] VERBOSE: Layer(Reformat): text/bert/encoder/layer_0/attention/self/Softmax input reformatter 0, Tactic: 0, (Unnamed Layer* 212) [Shuffle]_output[Half(32)] -> text/bert/encoder/layer_0/attention/self/Softmax reformatted input 0[Float(32)]
[TensorRT] VERBOSE: Layer(ExtSoftMax): text/bert/encoder/layer_0/attention/self/Softmax, Tactic: 0, text/bert/encoder/layer_0/attention/self/Softmax reformatted input 0[Float(32)] -> (Unnamed Layer* 213) [Softmax]_output[Float(32)]
[TensorRT] VERBOSE: Layer(Reformat): (Unnamed Layer* 214) [Shuffle] input reformatter 0, Tactic: 0, (Unnamed Layer* 213) [Softmax]_output[Float(32)] -> (Unnamed Layer* 214) [Shuffle] reformatted input 0[Half(32)]
[TensorRT] VERBOSE: Layer(MatrixMultiply): text/bert/encoder/layer_0/attention/self/MatMul_1, Tactic: 1, text/bert/encoder/layer_0/attention/self/Softmax:0[Half(12,32,32)], text/bert/encoder/layer_0/attention/self/transpose_2:0[Half(12,32,64)] -> text/bert/encoder/layer_0/attention/self/MatMul_1:0[Half(12,32,64)]
[TensorRT] VERBOSE: Layer(Reformat): text/bert/encoder/layer_1/attention/self/Softmax input reformatter 0, Tactic: 0, (Unnamed Layer* 381) [Shuffle]_output[Half(32)] -> text/bert/encoder/layer_1/attention/self/Softmax reformatted input 0[Float(32)]
[TensorRT] VERBOSE: Layer(ExtSoftMax): text/bert/encoder/layer_1/attention/self/Softmax, Tactic: 0, text/bert/encoder/layer_1/attention/self/Softmax reformatted input 0[Float(32)] -> (Unnamed Layer* 382) [Softmax]_output[Float(32)]
[TensorRT] VERBOSE: Layer(Reformat): (Unnamed Layer* 383) [Shuffle] input reformatter 0, Tactic: 0, (Unnamed Layer* 382) [Softmax]_output[Float(32)] -> (Unnamed Layer* 383) [Shuffle] reformatted input 0[Half(32)]
[TensorRT] VERBOSE: Layer(MatrixMultiply): text/bert/encoder/layer_1/attention/self/MatMul_1, Tactic: 1, text/bert/encoder/layer_1/attention/self/Softmax:0[Half(12,32,32)], text/bert/encoder/layer_1/attention/self/transpose_2:0[Half(12,32,64)] -> text/bert/encoder/layer_1/attention/self/MatMul_1:0[Half(12,32,64)]
[TensorRT] VERBOSE: Layer(Reformat): text/bert/encoder/layer_2/attention/self/Softmax input reformatter 0, Tactic: 0, (Unnamed Layer* 550) [Shuffle]_output[Half(32)] -> text/bert/encoder/layer_2/attention/self/Softmax reformatted input 0[Float(32)]
[TensorRT] VERBOSE: Layer(ExtSoftMax): text/bert/encoder/layer_2/attention/self/Softmax, Tactic: 0, text/bert/encoder/layer_2/attention/self/Softmax reformatted input 0[Float(32)] -> (Unnamed Layer* 551) [Softmax]_output[Float(32)]
[TensorRT] VERBOSE: Layer(Reformat): (Unnamed Layer* 552) [Shuffle] input reformatter 0, Tactic: 0, (Unnamed Layer* 551) [Softmax]_output[Float(32)] -> (Unnamed Layer* 552) [Shuffle] reformatted input 0[Half(32)]
[TensorRT] VERBOSE: Layer(MatrixMultiply): text/bert/encoder/layer_2/attention/self/MatMul_1, Tactic: 1, text/bert/encoder/layer_2/attention/self/Softmax:0[Half(12,32,32)], text/bert/encoder/layer_2/attention/self/transpose_2:0[Half(12,32,64)] -> text/bert/encoder/layer_2/attention/self/MatMul_1:0[Half(12,32,64)]

@ttyio
Collaborator

ttyio commented Apr 19, 2021

@chenzhanyiczy
How did you find all the Tanh and Pow nodes? A more general way is to check

  network.get_layer(i).type

You could first leave only the conv/gemm layers in FP16 precision and run the rest of the nodes in FP32.
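A quick way to see which layer types are present, so the filter can be written against network.get_layer(i).type instead of layer names, is a simple dump along these lines (a sketch, assuming the network has already been parsed):

for i in range(network.num_layers):
    layer = network.get_layer(i)
    print(i, layer.type, layer.name)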

@chenzhanyiczy
Author

@ttyio

How did you find all the Tanh and Pow nodes?

Those ops are in the pooler layer. The accuracy already differs in the layer_xxx encoder layers.

You could first leave only the conv/gemm layers in FP16 precision and run the rest of the nodes in FP32.

I tried this:

if network.get_layer(i).type == trt.LayerType.FULLY_CONNECTED \
        or network.get_layer(i).type == trt.LayerType.MATRIX_MULTIPLY \
        or network.get_layer(i).type == trt.LayerType.SOFTMAX:
    network.get_layer(i).precision = trt.DataType.HALF
...

with the other layers left in FP32 (the default). When building, I still need to specify the FP16 flag, otherwise I get this error:
[TensorRT] ERROR: fp16 precision has been set for a layer or layer output, but fp16 is not configured in the builder
But specifying FP16 causes all layers to run in FP16...

@ttyio
Collaborator

ttyio commented Apr 19, 2021

@chenzhanyiczy
here's the correct way to use mixed precision:

  1. add the FP16 and STRICT_TYPES flags in the builder config

  2. set the higher precision for the nodes that you do not want to run in the lower precision

     if network.get_layer(i).type != trt.LayerType.FULLY_CONNECTED and ....:
         network.get_layer(i).precision = trt.DataType.FLOAT
         network.get_layer(i).get_output(0).dtype = trt.DataType.FLOAT
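As a more concrete sketch of that recipe (not a drop-in solution; in practice some layers, e.g. shuffles, constants, or layers that compute indices, have to be skipped, as discussed later in this thread):

config.flags |= 1 << int(trt.BuilderFlag.FP16)
config.flags |= 1 << int(trt.BuilderFlag.STRICT_TYPES)

KEEP_FP16 = (trt.LayerType.CONVOLUTION,
             trt.LayerType.FULLY_CONNECTED,
             trt.LayerType.MATRIX_MULTIPLY)

for i in range(network.num_layers):
    layer = network.get_layer(i)
    if layer.type in KEEP_FP16:
        continue  # leave conv/gemm-like layers free to run in FP16
    layer.precision = trt.DataType.FLOAT
    for j in range(layer.num_outputs):
        # don't force INT32 tensors (indices, masks) to FLOAT
        if layer.get_output(j).dtype != trt.DataType.INT32:
            layer.set_output_type(j, trt.DataType.FLOAT)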
    

@chenzhanyiczy
Author

chenzhanyiczy commented Apr 19, 2021

@ttyio
I tried it; with that setting, even the result of layer 0 is no longer accurate. strict_type restricts TRT's kernel selection and forces the FP16 type directly, which is even more harmful.
Can you try using the automatic TRT conversion to build BERT-base? Our model is also based on BERT-base.

@ttyio
Collaborator

ttyio commented Apr 20, 2021

@chenzhanyiczy
If you set the FP16 flag in the builder and mark all layers as FP32 using the code in #1196 (comment), the engine should run all layers in FP32, so why would it be more harmful?

@chenzhanyiczy
Author

@ttyio
I tried the following experiments (the output under inspection is still layer_2/output/LayerNorm/moments/variance).

  1. FP16 mode + strict_type; FULLY_CONNECTED + MATRIX_MULTIPLY + SOFTMAX + ... have precision and output type set to FP16, and the remaining ops have precision and output type set to FP32.

  2. FP16 mode only; FULLY_CONNECTED + MATRIX_MULTIPLY + SOFTMAX + ... have precision and output type set to FP16, and the remaining ops have precision and output type set to FP32.

  3. FP32 mode; FULLY_CONNECTED + MATRIX_MULTIPLY + SOFTMAX + ... have precision and output type set to FP16. The builder reports the following error:
    [TensorRT] ERROR: fp16 precision has been set for a layer or layer output, but fp16 is not configured in the builder

Either way, the result is wrong. 2 is better than 1, because 1 is already wrong at layer_0/output/LayerNorm/moments/variance, while 2 only goes wrong at layer_2/output/LayerNorm/moments/variance.

I don't understand which object strict_type acts on. For example:

config.flags = 1 << int(trt.BuilderFlag.FP16) | 1 << int(trt.BuilderFlag.STRICT_TYPES)
if network.get_layer(i).type == trt.LayerType.SOFTMAX:
    network.get_layer(i).precision = trt.DataType.FLOAT
    network.get_layer(i).set_output_type(0, trt.DataType.FLOAT)
...

Does strict_type here restrict the other layers' precision and output to FP16, or does it restrict the Softmax precision and output to FP32?

@ttyio
Collaborator

ttyio commented Apr 21, 2021

@chenzhanyiczy
let me explain strict_type.
When we set a precision flag in the builder config, this tells TRT which precisions besides FP32 are allowed for the nodes in the network; TRT then selects the fastest kernels.
When a layer has a specified precision and strict_type is not added, this only changes the fusion logic in TRT; TRT will still ignore the precision and select the fastest kernels.
When a layer has a specified precision and strict_type is added, it also affects the final kernel selection: a kernel that matches the user's precision requirement will be selected even if it is not the fastest one.

Back to your experiments: the precision settings in 2 are ignored during final kernel selection; 3 failed, and the error message already tells us why: some layer has an FP16 requirement, but FP16 is not enabled in the builder config.

@chenzhanyiczy
Author

@ttyio
I'm a bit confused. For example, with the FP16 flag + strict_type, I set the precision of the Softmax layer to FP32, like this:
softmax_layer.precision = trt.DataType.FLOAT

  1. Will Softmax choose an FP32 kernel?
  2. For the other ops whose precision is not set manually (in the network parsed by the ONNX parser), what is the behavior? Will they all choose FP16 kernels?

@ttyio
Collaborator

ttyio commented Apr 22, 2021

@chenzhanyiczy
the code should look like this:

  softmax_layer.precision = trt.DataType.FLOAT
  softmax_layer.get_output(0).dtype = trt.DataType.FLOAT

  1. Yes.
  2. They choose the fastest path.

@chenzhanyiczy
Author

@ttyio
Thanks. So strict_type only applies to layers whose precision and output type are set manually, right?

And back to the original case (the output of layer_2/output/LayerNorm/moments/variance), what should I do? I have tried almost everything possible.

@ttyio
Collaborator

ttyio commented Apr 23, 2021

Hello @chenzhanyiczy
Since FP32 precision works, I expect that setting both strict_type and FP16 in the builder flags and marking all layers to run in FP32 should also work. We can then use this as a baseline and move more layers into FP16 precision; eventually we get a mixed-precision network where all the sensitive layers run in FP32 and the remaining ones run in FP16. This is the first step; please start with this, thanks!

@chenzhanyiczy
Author

@ttyio

Since FP32 precision works, I expect that setting both strict_type and FP16 in the builder flags and marking all layers to run in FP32 should also work.

I tried two cases: strict_type + FP16 mode with all layers set to FP32 (layer precision and output type), and plain FP32 mode. The results of the two still differ a lot. I am still looking at the original case (the output of layer_2/output/LayerNorm/moments/variance). Why is that?

@ttyio
Collaborator

ttyio commented Apr 23, 2021

@chenzhanyiczy , could you provide the verbose build log for the 2 runs? thanks.

@chenzhanyiczy
Author

chenzhanyiczy commented Apr 25, 2021

@chenzhanyiczy , could you provide the verbose build log for the 2 runs? thanks.

@ttyio OK. The following files are for FP32 mode (default behavior) and FP16 mode + strict_type + all layers set to FP32 (precision and output type). Thanks.
build_fp32_layer2_output_LayerNorm_moments_variance.tar.gz
build_fp16_layer2_output_LayerNorm_moments_variance.tar.gz

@ttyio
Collaborator

ttyio commented Apr 26, 2021

Hello @chenzhanyiczy ,
Check the Engine Layer Information section of the log; there are still layers not running in FP32.
For some layers, like the OneHot plugin, I think you only need to set the output type, because float is not acceptable as the layer precision there.
The MatMul layers before and after GELU are also in FP16. You can grep for dense/Erf to find the GELU, then check the MatMul layers before and after it; you can see they run from half to half. Could you make sure they are all set correctly? Thanks!

@chenzhanyiczy
Author

@ttyio
Yes, some are still in FP16, because the builder emits these warnings:

[TensorRT] WARNING: No implementation of layer text/bert/encoder/layer_2/intermediate/dense/MatMul + text/bert/encoder/layer_2/intermediate/dense/bias__57 + (Unnamed Layer* 604) [Shuffle] + unsqueeze_node_after_text/bert/encoder/layer_2/intermediate/dense/bias__57 + (Unnamed Layer* 604) [Shuffle] + text/bert/encoder/layer_2/intermediate/dense/BiasAdd obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation

And some reformats are automatically added:

[TensorRT] VERBOSE: Adding reformat layer: PWN(PWN(PWN(text/bert/encoder/layer_0/intermediate/dense/add/x:0_14 + (Unnamed Layer* 442) [Shuffle], PWN(PWN(PWN(text/bert/encoder/layer_1/intermediate/dense/Sqrt__41_13 + (Unnamed Layer* 438) [Shuffle], text/bert/encoder/layer_1/intermediate/dense/truediv), text/bert/encoder/layer_1/intermediate/dense/Erf), text/bert/encoder/layer_1/intermediate/dense/add)), PWN(text/bert/encoder/layer_0/intermediate/dense/mul/x:0_15 + (Unnamed Layer* 445) [Shuffle], text/bert/encoder/layer_1/intermediate/dense/mul)), text/bert/encoder/layer_1/intermediate/dense/mul_1) reformatted input 0 (text/bert/encoder/layer_1/intermediate/dense/BiasAdd:0) from Half(1,3072) to Float(1,3072)

I don't know why. I have already set the output type and precision of all layers to float, except for these:

if network.get_layer(i).name.find("zeros_like/Const") != -1 \
        or network.get_layer(i).name.find("NotEqual/y") != -1 \
        or network.get_layer(i).name.find("const_fold_opt") != -1 \
        or network.get_layer(i).name.find("Concat__") != -1 \
        or network.get_layer(i).name.find("NotEqual__") != -1 \
        or network.get_layer(i).type == trt.LayerType.CONCATENATION \
        or network.get_layer(i).type == trt.LayerType.SHUFFLE \
        or network.get_layer(i).type == trt.LayerType.IDENTITY:
    continue
...

These layers cannot have their output type and precision set to float, because doing so reports an error such as the following:

INFO:root:layer name = [(Unnamed Layer* 5) [Shuffle]], layer type = [LayerType.SHUFFLE] precision = [DataType.FLOAT]
...
[TensorRT] ERROR: (Unnamed Layer* 5) [Shuffle]: cannot use precision Float for layer that computes indices
[TensorRT] ERROR: Layer (Unnamed Layer* 5) [Shuffle] failed validation

thanks.

@ttyio
Collaborator

ttyio commented May 11, 2021

Hello @chenzhanyiczy

could you use only network.get_layer(i).type as the filter condition? Filtering on network.get_layer(i).name seems risky to me, thanks!

@chenzhanyiczy
Author

@ttyio

could you use only network.get_layer(i).type as the filter condition?

Yes, like this:

if network.get_layer(i).name.find("NotEqual__") != -1 \
        or network.get_layer(i).type == trt.LayerType.CONSTANT \
        or network.get_layer(i).type == trt.LayerType.CONCATENATION \
        or network.get_layer(i).type == trt.LayerType.SHUFFLE \
        or network.get_layer(i).type == trt.LayerType.IDENTITY:
    continue
....

But the 'NotEqual__' condition cannot be replaced with trt.LayerType.UNARY, because other unary ops, for example the Erf function, are also of that layer type.

@ttyio
Collaborator

ttyio commented May 12, 2021

@chenzhanyiczy
Could you elaborate on why we cannot force the unary layers to run in FP32 precision? thanks

@chenzhanyiczy
Author

@ttyio

Could you elaborate on why we cannot force the unary layers to run in FP32 precision?

There seems to be no problem with that... let me take another look.
Even with these set, under FP16 + strict_type + all layers (output type + precision) forced to FP32, the result is still different from FP32. And during the build there are all these warnings; in other words, is an FP16 implementation perhaps still being selected?

[TensorRT] WARNING: No implementation of layer text/bert/encoder/layer_2/attention/output/dense/MatMul + text/bert/encoder/layer_2/attention/output/dense/bias__53 + (Unnamed Layer* 570) [Shuffle] + unsqueeze_node_after_text/bert/encoder/layer_2/attention/output/dense/bias__53 + (Unnamed Layer* 570) [Shuffle] + text/bert/encoder/layer_2/attention/output/dense/BiasAdd obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.
[TensorRT] WARNING: No implementation of layer text/bert/encoder/layer_2/intermediate/dense/MatMul + text/bert/encoder/layer_2/intermediate/dense/bias__57 + (Unnamed Layer* 604) [Shuffle] + unsqueeze_node_after_text/bert/encoder/layer_2/intermediate/dense/bias__57 + (Unnamed Layer* 604) [Shuffle] + text/bert/encoder/layer_2/intermediate/dense/BiasAdd obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.
[TensorRT] WARNING: No implementation of layer text/bert/encoder/layer_2/output/dense/MatMul + text/bert/encoder/layer_2/output/dense/bias__60 + (Unnamed Layer* 630) [Shuffle] + unsqueeze_node_after_text/bert/encoder/layer_2/output/dense/bias__60 + (Unnamed Layer* 630) [Shuffle] + text/bert/encoder/layer_2/output/dense/BiasAdd obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.

Can you build a BERT-base model with the automatic TRT conversion yourself? No matter what I set, there is always a big difference. Thank you.

@ttyio
Collaborator

ttyio commented May 19, 2021

Hello @chenzhanyiczy ,
What metric do you use to check the accuracy? Do you have data like the SQuAD F1 value? Thanks

@chenzhanyiczy
Author

@ttyio

What metric do you use to check the accuracy?

If we are talking about the example above, the metric is simply the values of this output (see the files below).

@ttyio OK. The following files are for FP32 mode (default behavior) and FP16 mode + strict_type + all layers set to FP32 (precision and output type). Thanks.
build_fp32_layer2_output_LayerNorm_moments_variance.tar.gz
build_fp16_layer2_output_LayerNorm_moments_variance.tar.gz

No matter how I set things, comparing FP16 + strict_type + all layers (output type + precision) against FP32, the results always differ a lot.

do you have data like the SQuAD F1 value?

We use BERT to generate embeddings; the downstream metric involves our own algorithmic indicators, so it's not easy to share. :)
Have you tried automatically building a BERT-base model with TRT? I think that should make this easier to reproduce.
Thank you!

@ttyio
Collaborator

ttyio commented May 20, 2021

Hello @chenzhanyiczy , I checked the internal TRT tests, and the tolerance we use for TF BERT is

          rtol=1e-3, atol=1.5

The model we use in our test has no nodes named zeros_like, so there may be some differences from yours. Have you tried training your model in FP16? Thanks
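For reference, a comparison against that tolerance can be as simple as the following sketch, where the output dumps are assumed to come from the TRT FP16 run and an FP32 reference (the .npy file names are placeholders):

import numpy as np

trt_out = np.load("trt_fp16_output.npy")  # placeholder dump from the TRT FP16 run
ref_out = np.load("fp32_reference.npy")   # placeholder dump from the FP32 reference

ok = np.allclose(trt_out, ref_out, rtol=1e-3, atol=1.5)
max_abs_err = float(np.max(np.abs(trt_out - ref_out)))
print(f"within tolerance: {ok}, max abs error: {max_abs_err:.4f}")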

@chenzhanyiczy
Author

@ttyio
Can you share the code for the automatic TRT build? Do you change any layer to FP32? Thanks.
We see large differences under FP16 (FP16 + strict_type + all layers (output type + precision)), on the order of rtol=1e-2.
The zeros_like layer is only used for padding, so it should not affect the accuracy. Training in FP16 is more difficult...

@ttyio
Collaborator

ttyio commented May 26, 2021

Hello @chenzhanyiczy
The TRT test is simple: we just use Polygraphy to run the network with TRT FP16 and onnxruntime FP32; it does not cover any strict_type setting.

@mdztravelling

Has this problem been solved? I have the same problem: the FP16 and FP32 results differ a lot. I use TRT 8.0.3 and a 4-layer BERT. @ttyio @chenzhanyiczy

@mdztravelling

I modified the skipln layer to use the float32 dtype, and the result difference is smaller (< 0.0002). @chenzhanyiczy @ttyio

def skipln(prefix, config, init_dict, network, input_tensor, skip, is_last_skipln=False):
    """ 
    Add the skip layer
    """
    hidden_size = config.hidden_size
    #dtype = config.get_trt_dtype()
    dtype = trt.float32    # modify here
   ...

@yushcs

yushcs commented Mar 1, 2022

Same problem here, any suggestions?

@zhaohb

zhaohb commented Mar 2, 2022

@ttyio
Hi, I also want to achieve mixed precision in TRT.

I added the following settings:

'strict_types': trt.BuilderFlag.STRICT_TYPES,
'fp16': trt.BuilderFlag.FP16,

And I added the following code. Can this achieve the mixed-precision setting?

        for i in range(network.num_layers):
            if network.get_layer(i).type != trt.LayerType.FULLY_CONNECTED and network.get_layer(i).type != trt.LayerType.MATRIX_MULTIPLY and network.get_layer(i).type != trt.LayerType.SOFTMAX:
                network.get_layer(i).precision = trt.DataType.FLOAT
                network.get_layer(i).set_output_type(0, trt.DataType.FLOAT)

Unfortunately, I encountered this error:

......
[03/02/2022-07:41:14] [TRT] [E] [layers.h::setOutputType::1219] Error Code 3: API Usage Error (Parameter check failed at: /_src/build/cuda-11.4/8.2/x86_64/release/optimizer/api/layers.h::setOutputType::1219, condition: dataType == DataType::kINT32
)
[03/02/2022-07:41:14] [TRT] [E] [layers.h::setOutputType::1219] Error Code 3: API Usage Error (Parameter check failed at: /_src/build/cuda-11.4/8.2/x86_64/release/optimizer/api/layers.h::setOutputType::1219, condition: dataType == DataType::kINT32
......

I think it's because the output of the op is DataType::kINT32, but I forcibly change it to DataType::FLOAT. How can this be avoided? Thank you very much.

@nvpohanh
Collaborator

@zhaohb In your case, don't call set_output_type if layer.get_output_type(0) returns kINT32.

@chenzhanyiczy Could you try TRT 8.2/8.4 and see if the issue still exists? If it does, we will debug it. Thanks
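Following that suggestion, the per-layer loop from the earlier comment can guard against INT32 outputs before overriding the type. A rough sketch (layers that compute indices may also need to be skipped entirely):

for i in range(network.num_layers):
    layer = network.get_layer(i)
    if layer.type in (trt.LayerType.FULLY_CONNECTED,
                      trt.LayerType.MATRIX_MULTIPLY,
                      trt.LayerType.SOFTMAX):
        continue
    layer.precision = trt.DataType.FLOAT
    for j in range(layer.num_outputs):
        # skip index/int tensors; forcing them to FLOAT triggers the
        # setOutputType API usage error shown above
        if layer.get_output_type(j) != trt.DataType.INT32:
            layer.set_output_type(j, trt.DataType.FLOAT)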

@nvpohanh
Collaborator

nvpohanh commented Jul 1, 2022

Closing due to >14 days without activity. Please feel free to reopen if the issue still exists. Thanks

@ArtemisZGL

I modified the skipln layer to use the float32 dtype, and the result difference is smaller (< 0.0002). @chenzhanyiczy @ttyio

def skipln(prefix, config, init_dict, network, input_tensor, skip, is_last_skipln=False):
    """ 
    Add the skip layer
    """
    hidden_size = config.hidden_size
    #dtype = config.get_trt_dtype()
    dtype = trt.float32    # modify here
   ...

Hello, I ran into the same problem. Could you please explain what skipln is and where to modify this code?

Labels
triaged Issue has been triaged by maintainers

7 participants