Is it possible to convert the onnx model to fp16 model? #489
Comments
Hi @yuananf! At the moment the onnx pipeline is less optimized than its pytorch counterpart, so all computation happens in float32.
Thank you for your response!
To avoid data copy between cpu and gpu, onnxruntime provides the IOBinding feature: https://onnxruntime.ai/docs/api/python/api_summary.html#data-on-device
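A minimal sketch of that IOBinding usage (the model path, tensor names, and shape are illustrative assumptions, not values from this thread):

```python
import numpy as np
import onnxruntime as ort

# Hypothetical single-input model; check your own input/output names in Netron.
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
binding = session.io_binding()

# Place the input on the GPU once and bind it, so ORT does not copy it
# from host memory on every run.
x = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((1, 3, 224, 224), dtype=np.float32), "cuda", 0
)
binding.bind_ortvalue_input("input", x)

# Keep the output on the GPU as well; copy back only when needed.
binding.bind_output("output", "cuda")

session.run_with_iobinding(binding)
result = binding.get_outputs()[0].numpy()  # explicit device-to-host copy
```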
It's possible to convert the individual ONNX submodels to fp16.
@wareya could you share the details of your scripts / conversion process?
First, get the full-precision onnx model locally from the onnx exporter. Then modify the model to use fp16. However, on my system, this crashes inside the runtime.
Latest script can be found here: https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/stable_diffusion/README.md

Example script to convert FP32 to FP16:
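A rough sketch of such a conversion using onnxruntime's transformers tooling (the paths, num_heads/hidden_size values, and op_block_list entries are assumptions to adapt to your export; see the linked README for the maintained script):

```python
from onnxruntime.transformers.optimizer import optimize_model

# Load and graph-optimize the FP32 UNet (fusion rules here are transformer-generic).
m = optimize_model(
    "stable-diffusion-onnx/unet/model.onnx",  # assumed diffusers export path
    model_type="bert",
    num_heads=8,       # assumption for SD v1
    hidden_size=768,   # assumption for SD v1
    use_gpu=False,
)

# Convert weights to FP16; keep graph inputs/outputs in FP32 and leave
# numerically sensitive ops in FP32 via op_block_list.
m.convert_float_to_float16(
    keep_io_types=True,
    op_block_list=["RandomNormalLike"],  # extend for accuracy, see discussion below
)

m.save_model_to_file("stable-diffusion-onnx-fp16/unet/model.onnx")
```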
To get best performance, please set providers like the following:
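For example, a sketch of such a provider configuration (cudnn_conv_use_max_workspace is the convolution tuning knob described in the linked doc; the model path is an assumption):

```python
import onnxruntime as ort

# Prefer CUDA with a larger cuDNN convolution workspace, falling back to CPU.
providers = [
    ("CUDAExecutionProvider", {"cudnn_conv_use_max_workspace": "1"}),
    "CPUExecutionProvider",
]
session = ort.InferenceSession("unet/model.onnx", providers=providers)
```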
See https://onnxruntime.ai/docs/performance/tune-performance.html#convolution-heavy-models-and-the-cuda-ep for more info.

Latency (seconds per query) for GPU
Nice! @anton-l do you want to take a look here?
I tried @tianleiwu's script and all my results, while twice as fast, are very washed out. I'm also getting a lot of all-black results. This was on an AMD card using the DML provider on Windows. The files not converted to fp16 work fine.
@cprivitere, thanks for the feedback. To improve accuracy, the onnx model needs to be converted to mixed precision by adding some operators (like LayerNormalization, Gelu, etc.) to op_block_list in the script. The list can be tuned with some A/B testing against the fp32 results using a test set, as sketched below.
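A rough sanity check along those lines (a hedged sketch: the paths are assumptions, keep_io_types=True is assumed so both models take FP32 inputs, and integer inputs are assumed to be int64):

```python
import numpy as np
import onnxruntime as ort

fp32 = ort.InferenceSession("unet/model.onnx", providers=["CPUExecutionProvider"])
fp16 = ort.InferenceSession("unet_fp16/model.onnx", providers=["CPUExecutionProvider"])

# Build random feeds from the FP32 model's declared inputs; dynamic
# dimensions are replaced with 1 just for this check.
feeds = {}
for inp in fp32.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    if "int64" in inp.type:
        feeds[inp.name] = np.zeros(shape, dtype=np.int64)
    else:
        feeds[inp.name] = np.random.randn(*shape).astype(np.float32)

ref = fp32.run(None, feeds)[0]
out = fp16.run(None, feeds)[0]
print("max abs diff:", np.abs(ref - out).max())
# If the difference is large (washed-out or black images), add more ops
# (e.g. LayerNormalization, Gelu) to op_block_list and reconvert.
```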
I've had some time since my last post to actually install and set all this up on Linux; none of the speed increases here even come close to the speed of running the fp16 models on ROCm, sadly. On a 6750 XT we're talking 4.5 it/s using LMS with the fp16 ROCm models on Linux versus 1.4 it/s using LMS with the broken fp16 onnx models.
Any updates here @anton-l?
FP16 models are now supported when tracing on GPU, thanks to @SkyTNT: #932
Just a note for folks that this fp16 conversion of the ONNX models does NOT support AMD GPUs. It only works on NVIDIA.
The one I posted works on AMD GPUs, at least.
@anton-l I ran the FP32-to-FP16 script @tianleiwu provided and was able to convert an ONNX FP32 model to an ONNX FP16 model (Windows 11). When attempting to load the FP16 model with OnnxStableDiffusionPipeline.from_pretrained, an error is raised.
Hi,
@kamalkraj the current ONNX pipeline design hurts GPU latency, so for now its main use cases are CPU inference and supporting environments that torch doesn't support (e.g. some AMD GPUs). Another cause could be that our conversion to ONNX is not taking advantage of all of the optimization features in ONNX and ONNX Runtime. We're working with the Optimum team to improve that.
@averad, you can add "RandomNormalLike" to op_block_list to avoid the error. The latest script is at the README link above.

@kamalkraj, you can run the script to reproduce ~10 seconds per query on a T4.
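A rough timing sketch along those lines using the diffusers ONNX pipeline (not the exact command from the thread; the local model path is an assumption):

```python
import time
from diffusers import OnnxStableDiffusionPipeline

pipe = OnnxStableDiffusionPipeline.from_pretrained(
    "./stable-diffusion-onnx-fp16",        # assumed local FP16 export
    provider="CUDAExecutionProvider",
)

start = time.time()
image = pipe("a photo of an astronaut riding a horse", num_inference_steps=50).images[0]
print(f"{time.time() - start:.1f} s per query")
```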
In case someone is still interested: I used a script inspired by @tianleiwu's to convert just the UNET to fp16, leaving everything else in fp32. Though I had to add only_onnxruntime=True to the arguments of optimize_model to make it work; otherwise it crashed with some tensor-dimension problems.
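A sketch of that variant (the path and num_heads/hidden_size values are assumptions; only the UNet is converted, the other submodels stay in fp32):

```python
from onnxruntime.transformers.optimizer import optimize_model

# only_onnxruntime=True restricts optimization to ONNX Runtime's own fusions,
# which avoided the tensor-dimension crash described above.
unet = optimize_model(
    "stable-diffusion-onnx/unet/model.onnx",
    model_type="bert",
    num_heads=8,
    hidden_size=768,
    only_onnxruntime=True,
)
unet.convert_float_to_float16(keep_io_types=True)
unet.save_model_to_file("stable-diffusion-onnx/unet_fp16/model.onnx")
```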
Hi @tianleiwu, will this also work with Stable Diffusion 2/2.1?
@kamalkraj, I will try SD 2/2.1 and get back to you later.
@tianleiwu When I converted the stable-diffusion v1-4 onnx model from fp32 using the script provided, the converted model size was reduced, but when I loaded the model in Netron I observed that the inputs and outputs are still shown as FP32 precision. Is this expected? Can't we generate a complete fp16 model using the available scripts? While running inference with the CPU_FP16 flag and the OpenVINO execution provider, the device is shown as CPU_OPENVINO_CPU_FP32 instead of CPU_FP16. The reason might be that, since the so-called fp16 model still has fp32 inputs and outputs, inference runs at fp32. Any thoughts on this?
@saikrishna2893, @kamalkraj, you can try out our optimizations for SD 2 or SD 2.1. For SD 2.1, you will need to add Attention (or MultiHeadAttention) to op_block_list so it runs in float32 instead of float16; otherwise, you will see black images. Note that the script contains optimizations for the CUDA EP only, since some optimized operators might not be available to other EPs. See the comments at the beginning of the script for usage. The python environment used in testing is like the following: https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/stable_diffusion/requirements.txt
The main conversion script in the main branch supports FP16 export. @tianleiwu, do we still need your special script to convert to FP16, or is using the main script enough?
The main conversion script can export a checkpoint to an FP16 model. The model is composed of official ONNX operators, so it can be consumed by different execution providers and inference engines (like ONNX Runtime, TensorRT, etc.).

However, inference engines still need to process the model and optimize the graph. For example, they fuse parts of the graph into custom operators (like MultiHeadAttention, which does not exist in the ONNX spec) and then dispatch them to optimized CUDA kernels (like Flash Attention). Such optimization differs slightly between inference engines, and even between execution providers within ONNX Runtime. My script has SD optimizations for the CUDA execution provider of ONNX Runtime only. There is also a benchmark comparing the speed with PyTorch+xFormers and PyTorch 2.0.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
The torch example gives the parameter revision="fp16"; can the onnx model do the same optimization? Current onnx inference (using CUDAExecutionProvider) is slower than the torch version and uses more GPU memory (12 GB vs 4 GB).
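For reference, the PyTorch usage the question refers to looked roughly like this at the time (a sketch, not code from the thread):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the half-precision weights branch and run the model in fp16 on GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a photo of an astronaut riding a horse").images[0]
```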