[TOOLS]: Using transformers.optimizer to optimize a large model causes a segmentation fault (core dumped) #17212

Open
han65487312 opened this issue Aug 18, 2023 · 3 comments
Labels
ep:CUDA (issues related to the CUDA execution provider), ep:TensorRT (issues related to the TensorRT execution provider)

Comments

@han65487312

han65487312 commented Aug 18, 2023

Describe the issue

When I use transformers.optimizer to optimize a UNet model that is larger than 2GB, the remove_useless_cast_nodes pass causes a segfault. I found that the symbolic shape inference in remove_useless_cast_nodes breaks down.

The command is:
python3 -m onnxruntime.transformers.optimizer --input ./unet_onnx/original_model/unet.onnx --output ./unet_onnx/fuse_fp16_model/unet.onnx --model_type unet --opt_level 99 --float16 --use_gpu

And when I turn off some optimizations, the optimized model cannot run on the TensorRT backend; the error message is "onnx.ModelProto exceeded maximum protobuf size of 2GB: 2357166045". The cuDNN backend runs fine.

Here are the library versions I am using:

  • onnx ==1.14.0
  • onnxruntime == 1.16.0
  • torch == 1.12.1
  • protobuf == 3.0.0

To reproduce

The model is too large for me to upload here.

Urgency

No response

Platform

Linux

OS Version

Ubuntu 7.5.0-3ubuntu1~18.04

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.16.0

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU, CUDA, TensorRT

Execution Provider Library Version

No response

github-actions bot added the ep:CUDA and ep:TensorRT labels Aug 18, 2023
@tianleiwu
Contributor

tianleiwu commented Aug 18, 2023

@han65487312,

The segmentation fault (core dumped) might be caused by protobuf. You can downgrade protobuf to 3.20.3 and try again.
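For example (assuming pip manages your environment):

pip install protobuf==3.20.3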

The optimizer is for the CUDA provider; it needs the UNet to be a float32 model, and you should use --opt_level 0.
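For example, the CUDA-EP flow would look like your original command but with --opt_level 0 (the output path below is just illustrative):

python3 -m onnxruntime.transformers.optimizer --input ./unet_onnx/original_model/unet.onnx --output ./unet_onnx/fuse_fp16_cuda_model/unet.onnx --model_type unet --opt_level 0 --float16 --use_gpu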

The optimizer is not intended for the TensorRT EP because TensorRT has its own graph optimization logic.

For the TRT EP, you can try the following for SD 1.5 or 2.1 models:
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/stable_diffusion/onnxruntime_tensorrt_txt2img.py
Basically, it follows the same logic as https://github.com/NVIDIA/TensorRT/tree/release/8.6/demo/Diffusion to generate the ONNX models for the TensorRT backend.

Example code

from onnxruntime.transformers.models.stable_diffusion.onnxruntime_tensorrt_txt2img import OnnxruntimeTensorRTStableDiffusionPipeline
from diffusers.schedulers import DDIMScheduler
import torch

model_name_or_path = "runwayml/stable-diffusion-v1-5"
scheduler = DDIMScheduler.from_pretrained(model_name_or_path, subfolder="scheduler")

pipe = OnnxruntimeTensorRTStableDiffusionPipeline.from_pretrained(
    model_name_or_path,
    revision="fp16",
    torch_dtype=torch.float16,
    scheduler=scheduler,
    image_height=512,
    image_width=512,
    max_batch_size=4,
)

# Re-use cached folder to save ONNX models and TensorRT engines
pipe.set_cached_folder(model_name_or_path, revision="fp16")

pipe = pipe.to("cuda")

prompt = "photorealistic new zealand hills"
image = pipe(prompt).images[0]
image.save("ort_trt_txt2img_new_zealand_hills.png")

For SDXL, we are still working on the optimization.

For more information, see https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/stable_diffusion/README.md

@han65487312
Author

han65487312 commented Aug 21, 2023

Thanks for your reply. Indeed, my UNet is a customized model; it's not in the diffusers repo. I wonder whether there is a way to make attention run on the cuDNN backend and the other optimizations run on the TRT backend. The steps I use with transformers.optimizer are: 1. export my customized fp32 UNet model; 2. use transformers.optimizer to fuse the attention layers; 3. run the model with onnxruntime. If I set --opt_level 0, the attention fusion in step 2 does not work.
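For reference, a minimal sketch of what step 2 could look like through the Python API instead of the CLI (paths and parameters below are placeholders, not from this issue):

from onnxruntime.transformers.optimizer import optimize_model
from onnxruntime.transformers.fusion_options import FusionOptions

# Fuse attention (and other transformer patterns) while keeping ORT graph optimizations off.
fusion_options = FusionOptions("unet")
optimized = optimize_model(
    "unet_fp32.onnx",            # placeholder: exported fp32 UNet from step 1
    model_type="unet",
    opt_level=0,
    optimization_options=fusion_options,
    use_gpu=True,
)
# Save with external data so the result can exceed the 2GB protobuf limit.
optimized.save_model_to_file("unet_fused.onnx", use_external_data_format=True)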

@tianleiwu
Contributor

@han65487312,

I wonder whether there is a way to make attention run on the cuDNN backend and the other optimizations run on the TRT backend.
If you use both the TRT and CUDA providers at session creation, ORT will partition the fused nodes to the CUDA EP and the others to the TRT EP.
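A minimal sketch of such a session creation (the model path is a placeholder):

import onnxruntime as ort

# List the TRT EP first so it takes the subgraphs it supports; the fused attention
# nodes it cannot handle fall back to the CUDA EP, and anything left goes to CPU.
session = ort.InferenceSession(
    "unet_fused.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)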

However, that might not be a good way to use TRT, since TRT needs to convert the whole graph from NCHW to NHWC layout internally. If you use the optimizer for the CUDA EP, TRT cannot reach its full potential because it then only works on subgraphs.

--opt_level 0 is required for ORT releases before 1.16 because previously ORT could not save an optimized model larger than 2GB. This constraint is removed in ORT 1.16 (built from source).

I think TRT could handle a model larger than 2GB, since TRT can run the SDXL model, which is larger than 2GB. @chilo-ms, is there some limitation in the TRT EP?

tianleiwu added a commit that referenced this issue Sep 6, 2023
…ta (#17427)

Some initializers are added without the raw=True flag. As a result, those tensors cannot be saved to external data. If those tensors exceed 2GB in total, the optimized model cannot be saved due to the protobuf limit.

This change saves attention weights and biases as raw data.

Note: using raw data for shape tensors is optional since they are tiny.

### Motivation and Context
#17212
#15349
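To illustrate the raw=True point from the commit above, a hedged sketch with onnx.helper (the tensor name and values are made up):

import numpy as np
import onnx
from onnx import helper

weights = np.random.rand(4, 4).astype(np.float32)  # made-up attention weight

# Values stored as a repeated field; such tensors cannot be offloaded to external data.
inline_init = helper.make_tensor(
    "attn_qkv_weight", onnx.TensorProto.FLOAT, weights.shape, weights.flatten().tolist()
)

# Values stored as raw bytes; these can be saved to an external data file when the
# model would otherwise exceed the 2GB protobuf limit.
raw_init = helper.make_tensor(
    "attn_qkv_weight", onnx.TensorProto.FLOAT, weights.shape, weights.tobytes(), raw=True
)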
tianleiwu added a commit that referenced this issue Oct 31, 2023
kleiti pushed a commit to kleiti/onnxruntime that referenced this issue Mar 22, 2024