GPU Latency failure for FP16, INT8, mixed precision (FP16+INT8) models of TensorRT 8.6 when running trtexec on GPU A100 #3593
Comments
How is the perf of …
@zerollzeng I tried this but the latency is still higher than just fp16...
Could you please share the onnx here? If it's a QAT model, …
You may have hit a known issue in TRT 8.6 that is fixed in TRT 9.2. Could you please try the latest TRT 9.2? https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/9.2.0/tensorrt-9.2.0.5.linux.x86_64-gnu.cuda-11.8.tar.gz
@zerollzeng After converting to trt files, FP16 turns out to be faster in execution speed than the mixed-precision model. BTW, what exactly is the issue in TRT 8.6 that is fixed in TRT 9.2? Thanks
You just hit a bug that is fixed in TRT 9.2 :-)
@zerollzeng Do you maybe have any other suggestions?
Could you please share the onnx that can reproduce this issue?
@zerollzeng I'm also interested to know when TRT v9.2 will be released in Docker images.
We didn't release it in the official Docker image since it's a limited EA release, but you can build the Docker image yourself using https://github.com/NVIDIA/TensorRT/blob/release/9.2/docker/ubuntu-20.04.Dockerfile
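For reference, building that Dockerfile from a checkout of the release/9.2 branch might look roughly like the sketch below; the image tag is a placeholder and the Dockerfile may expect additional `--build-arg` values per the repo's instructions, so treat this as an unverified sketch rather than the maintainer's exact commands.

```bash
git clone -b release/9.2 https://github.com/NVIDIA/TensorRT.git
cd TensorRT
# Build the EA image from the referenced Dockerfile; extra --build-arg flags
# (e.g. CUDA version) may be required depending on the branch.
docker build -f docker/ubuntu-20.04.Dockerfile -t tensorrt-9.2-ubuntu20.04 .
```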
May I ask why I see 2 onnx models here?
@zerollzeng
That's weird, you should only need 1 onnx. What if you compare the perf using only 1 onnx? Just set full fp16 and mixed precision separately.
@zerollzeng For further QAT I must specify the num_bits parameter. Shouldn't I make it so that all layers have …? I mean, how does trtexec know which layers I need to convert to int8 precision if I don't specify it anywhere? I have two onnx files because to convert to FP16 I specify …
@ttyio for the above questions.
@zerollzeng @ttyio
Description
Hello,
I'm trying to do a torch -> onnx -> trt model conversion.
I convert the model to fp16, to int8, and to mixed precision (fp16 + int8). However, after the conversion completes, the fp16 model has the lowest latency, i.e. the fp16 engine is faster than both the int8 and the mixed-precision engines. Why is that?
Environment
TensorRT Version: 8.6
NVIDIA GPU: A100
NVIDIA Driver Version: 530.30.02
CUDA Version: 12.1
CUDNN Version:
Operating System: Ubuntu 22.04
Python Version (if applicable): 3.10
PyTorch Version (if applicable): 2.1
Baremetal or Container (if so, version): nvcr.io/nvidia/pytorch:23.08-py3
Relevant Files
Model link: "vit_base_patch32_224_clip_laion2b" model from timm.models
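For reference, a minimal way to instantiate the model in question with timm; pretrained weights and the 224x224 dummy input are assumptions, not details given in the issue.

```python
import timm
import torch

# Baseline FP32 model; the quantization setup sketched under
# "Steps To Reproduce" has to be applied before/around this call.
model = timm.create_model("vit_base_patch32_224_clip_laion2b", pretrained=True)
model.eval()
dummy_input = torch.randn(1, 3, 224, 224)
```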
Steps To Reproduce
(Optionally) If the precision is int8 rather than fp16, then specify num_bits=8 in step 1, like this:
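Step 1 itself is not included in the issue text; judging from the num_bits / input_quantizer / weight_quantizer terminology it uses NVIDIA's pytorch-quantization toolkit. Under that assumption, forcing 8-bit fake quantization for the quantized layer types might look like this (a sketch, not the author's exact code):

```python
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

# Use 8-bit fake quantization for the activations of the common layer types.
quant_desc_input = QuantDescriptor(num_bits=8)
quant_nn.QuantLinear.set_default_quant_desc_input(quant_desc_input)
quant_nn.QuantConv2d.set_default_quant_desc_input(quant_desc_input)

# Replace torch layers with their quantized counterparts; this must run
# before the timm model is constructed so FakeQuant nodes are inserted.
quant_modules.initialize()
```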
(Optionally) If we want mixed precision, we initially create everything with num_bits=16 and then selectively switch the input_quantizer and weight_quantizer of individual layers to 8 bits, like this:
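A rough sketch of that per-layer switch; the layer names are hypothetical, and mutating the quantizer's private `_num_bits` attribute in place is an assumption about the toolkit's internals (rebuilding the affected layers with an 8-bit QuantDescriptor is a cleaner alternative).

```python
# Hypothetical subset of layers that should run in int8; the actual
# selection is not shown in the issue.
INT8_LAYERS = {"blocks.0.attn.qkv", "blocks.0.mlp.fc1"}

for name, module in model.named_modules():
    if name in INT8_LAYERS and hasattr(module, "input_quantizer"):
        # Switch this layer's fake-quant nodes from 16-bit to 8-bit.
        module.input_quantizer._num_bits = 8
        module.weight_quantizer._num_bits = 8
```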
We calibrate FakeQuant nodes and do QAT.
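The calibration step usually follows the pattern from the pytorch-quantization examples: put the TensorQuantizers into statistics-collection mode, run some data through the model, then load the computed amax values. A condensed sketch, with the data loader and batch count as placeholders:

```python
import torch
from pytorch_quantization import nn as quant_nn

def calibrate(model, data_loader, num_batches=32):
    # Collect statistics instead of quantizing.
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()

    with torch.no_grad():
        for i, (images, _) in enumerate(data_loader):
            model(images)
            if i + 1 >= num_batches:
                break

    # Load amax from the calibrators and re-enable fake quantization.
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.load_calib_amax()
            module.enable()
```

QAT itself is then an ordinary fine-tuning loop on the fake-quantized model.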
Do torch.onnx.export.
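The export is standard torch.onnx.export; the QAT-specific detail is telling TensorQuantizer to emit ONNX QuantizeLinear/DequantizeLinear nodes. The file name and opset below are placeholders:

```python
import torch
from pytorch_quantization import nn as quant_nn

# Export the FakeQuant nodes as ONNX Q/DQ operators.
quant_nn.TensorQuantizer.use_fb_fake_quant = True

model.eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "vit_qat.onnx",        # placeholder file name
    opset_version=13,      # opset 13+ supports per-channel Q/DQ
    input_names=["input"],
    output_names=["output"],
)
```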
Simplify the onnx model.
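The simplification tool is not named in the issue; assuming it is onnx-simplifier (onnxsim), this step might look like:

```python
import onnx
from onnxsim import simplify

onnx_model = onnx.load("vit_qat.onnx")               # placeholder file names
simplified_model, check_ok = simplify(onnx_model)
assert check_ok, "onnx-simplifier failed to validate the simplified model"
onnx.save(simplified_model, "vit_qat_sim.onnx")
```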
If it is fp16 or int8 precision, then we build the engine with trtexec for fp16 and int8 respectively, as follows:
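The exact commands are not included in the issue; plausible trtexec invocations for the fp16 and int8 builds (onnx and engine file names are placeholders) are:

```bash
# fp16 engine
trtexec --onnx=vit_qat_sim.onnx --fp16 --saveEngine=vit_fp16.engine

# int8 engine (scales come from the Q/DQ nodes in the QAT onnx;
# --fp16 is often added alongside --int8 for the non-quantized layers)
trtexec --onnx=vit_qat_sim.onnx --int8 --saveEngine=vit_int8.engine
```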
If it is mixed precision, then we first create a string variable LAYERS_PRECISION and collect the per-layer precisions in it by iterating over the layers of the onnx model.
The result is something like:
LAYERS_PRECISION="layer1:int8,layer2:int8,layer3:fp16,...,layerN:fp16,"
And then we execute the following command:
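The command itself is missing from the issue; with trtexec's per-layer precision flags it would plausibly be something like the following (whether these exact flags were used is an assumption):

```bash
# LAYERS_PRECISION holds entries of the form "layerName:precision", as above.
trtexec --onnx=vit_qat_sim.onnx --fp16 --int8 \
        --precisionConstraints=obey \
        --layerPrecisions="${LAYERS_PRECISION}" \
        --saveEngine=vit_mixed.engine
```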
Having done all of the above, we get the trt engine files. When checked both with trtexec and with the model-analyzer utility for trt-server, the int8 and mixed-precision engines run slower than the fp16 engine.
Commands or scripts: see above
Have you tried the latest release?: yes
Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): N/A