
Add IO binding support for custom ORTModel #447

Merged: JingyaHuang merged 8 commits into main from ort-custom-io on Dec 6, 2022

Conversation

JingyaHuang
Contributor

What does this PR do?

In #421, Optimum added IO binding support for task-defined models. In this PR, we will add IO binding support for custom tasks.

For custom cases, Optimum does not know the outputs or their shapes in advance. In this case, we let ONNX Runtime allocate the memory for each output as an OrtValue, so the resulting tensors need to be transferred across frameworks:

  • If onnxruntime-training is installed, the output tensors are transferred directly from ONNX Runtime to PyTorch on the device, thanks to the DLPack support.
  • If onnxruntime-gpu is installed, ownership of the output tensor is transferred from ORT -> CuPy -> PyTorch (see the sketch below).
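For illustration, here is a minimal sketch of the ORT -> CuPy -> PyTorch handoff on a CUDA device (the helper name and the fixed dtype are assumptions for the example, not code from this PR):

```python
import numpy as np
import cupy as cp
from torch.utils.dlpack import from_dlpack


def ort_value_to_torch_via_cupy(ort_value, dtype=np.float32):
    """Wrap an ORT-allocated CUDA OrtValue as a CuPy array, then hand it to
    PyTorch through DLPack, without copying the buffer off the device."""
    shape = tuple(ort_value.shape())
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    # Wrap the raw device pointer still owned by ONNX Runtime (no copy).
    mem = cp.cuda.UnownedMemory(ort_value.data_ptr(), nbytes, ort_value)
    cupy_array = cp.ndarray(shape, dtype=dtype, memptr=cp.cuda.MemoryPointer(mem, 0))
    # Transfer the buffer to PyTorch via the DLPack protocol.
    return from_dlpack(cupy_array.toDlpack())
```

On recent CuPy/PyTorch versions, torch.from_dlpack(cupy_array) achieves the same through the standard DLPack protocol.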

This PR is currently on stand-by and waiting for requests from the community.

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Nov 3, 2022

The documentation is not available anymore as the PR was closed or merged.

@JingyaHuang
Contributor Author

As in PR #539 the output logits have variable shapes depending on the model used for the segmentation task (at least there is no general rule to infer the output shape for all these models), I would like to ship this PR to enable IO binding for outputs whose shape we can't infer in advance. @michaelbenayoun @fxmarty

And FYI @TheoMrc

@TheoMrc
Contributor

TheoMrc commented Dec 5, 2022

Thanks for the heads up!

I wonder: how does IO binding work when you cannot bind a fixed output size?
IO binding, if I understand correctly, consists in preparing GPU memory for the input and output data, allowing output data to be allocated and fetched from GPU memory faster.
Do you plan to run the first inference without output binding and then use the output size for further predictions?
Or maybe output binding would be done just in time, right before the last step of inference?

One last question to satisfy my curiosity: is there no way to read the output size from the model? I remember I could quite easily do so in TensorFlow during my ML debuts (tf.keras.utils.plot_model).

Have a good evening,
Theo

Update: Found this on SO: ONNX graph - How to get output-dimensions

```python
from onnx import shape_inference
# original_model is a loaded onnx.ModelProto (e.g., onnx.load("model.onnx"))
inferred_model = shape_inference.infer_shapes(original_model)
```

and find the shape info in inferred_model.graph.value_info.
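
For completeness, here is a small sketch of reading the output dimensions from the inferred graph (the model path is illustrative); dynamic axes show up as symbolic dim_param entries rather than concrete dim_value ones:

```python
import onnx
from onnx import shape_inference

inferred_model = shape_inference.infer_shapes(onnx.load("model.onnx"))

for output in inferred_model.graph.output:
    dims = [
        # Static axes carry a concrete dim_value; dynamic axes only a symbolic dim_param.
        dim.dim_value if dim.HasField("dim_value") else (dim.dim_param or "?")
        for dim in output.type.tensor_type.shape.dim
    ]
    print(output.name, dims)
```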

@TheoMrc
Contributor

TheoMrc commented Dec 5, 2022

Maybe we could run shape_inference.infer_shapes(model) upon model loading to get the output size in advance for IO binding?

@JingyaHuang
Contributor Author

Hi @TheoMrc,

Under the hood, ONNX Runtime uses a data structure named OrtValue (which can hold tensors or non-tensors). If the shape is properly given, ORT binds the I/O as a tensor owning and managing its memory. Otherwise, the output OrtValue must be allocated on a specific device, and its memory is allocated and managed by ONNX Runtime. You can then get the address and shape information of the OrtValue through either DLPack or CuPy.
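
For illustration, a minimal sketch of the second case with onnxruntime-gpu, where only the device is given so that ONNX Runtime allocates and manages the output buffer (model path, input and output names are made up for the example):

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
io_binding = session.io_binding()

# Copy the input to the CUDA device and bind it by name.
input_ids = np.ones((1, 128), dtype=np.int64)
io_binding.bind_ortvalue_input(
    "input_ids", ort.OrtValue.ortvalue_from_numpy(input_ids, "cuda", 0)
)

# No shape given: ONNX Runtime allocates the output OrtValue on the device itself.
io_binding.bind_output("logits", "cuda")

session.run_with_iobinding(io_binding)
ort_logits = io_binding.get_outputs()[0]  # OrtValue allocated and owned by ORT
```

The returned OrtValue can then be handed to PyTorch through DLPack or CuPy as described above.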

Yes, the aim of IO binding is to prepare data buffers on the device, so as to avoid offloading them to the CPU and paying the data-copy overhead when you want to reuse them for the following computation on the device. That's why adding IO binding is especially important for decoding, as you need to reuse the output of the previous step.

I don't think that you can infer the shape here. You can get that information easily in eager mode, but in this case you are working with the exported graph and the output shapes are dynamic: you will probably get shapes for some initializers and unk_xxx for the output shape. (You can make it static while exporting the ONNX graph, but that is not what Optimum does, as we want the graph to be flexible and reusable.)

@michaelbenayoun (Member) left a comment

LGTM, thanks @JingyaHuang !!

Review threads (outdated, resolved):
docs/source/onnxruntime/package_reference/modeling_ort.mdx
optimum/onnxruntime/modeling_ort.py
optimum/onnxruntime/utils.py
optimum/onnxruntime/utils.py
@fxmarty (Contributor) left a comment

LGTM, thanks!

Makes me think out loud that for the ORTModel we could have either a fixed input/output path or a more flexible one, or something along those lines, to solve e.g. #479.

@JingyaHuang
Contributor Author

JingyaHuang commented Dec 6, 2022

@fxmarty

For the ORT modeling, custom tasks are already supported when using a single ONNX model for inference (though with a penalty for being dynamic), and we could implement a custom one for the seq2seq case. WDYT?

@fxmarty
Contributor

fxmarty commented Dec 6, 2022

What I meant is that if you use ORTModelForSequenceClassification with a custom input/output (e.g., global_attention_mask), we fail!

@JingyaHuang
Contributor Author

JingyaHuang commented Dec 6, 2022

@fxmarty In this case, use ORTModelForCustomTasks. From my understanding, other task-defined models are implemented in a static manner in order to avoid any penalty for the pipeline. @philschmid

What I do agree with is that, to fully enable customizability, the exporter should let users customize which inputs and outputs they want to take in.

@philschmid
Member

> @fxmarty In this case, use ORTModelForCustomTasks. From my understanding, other task-defined models are implemented in a static manner in order to avoid any penalty for the pipeline. @philschmid

Yes, the idea is to keep the ORTModelForXXX classes lean and avoid any dynamic custom features, so that they add the least possible overhead on top of the InferenceSession, even if that means a small percentage of special tasks/models won't be supported. That's why @JingyaHuang created ORTModelForCustomTasks.

@JingyaHuang JingyaHuang merged commit 8b559db into main Dec 6, 2022
@JingyaHuang JingyaHuang deleted the ort-custom-io branch December 6, 2022 17:57