
Add IO binding support for custom ORTModel #447

Merged: JingyaHuang merged 8 commits into main from ort-custom-io on Dec 6, 2022

Conversation

JingyaHuang
Contributor

What does this PR do?

In #421, Optimum added IO binding support for task-defined models. In this PR, we will add IO binding support for custom tasks.

For custom cases, Optimum does not know the outputs or their shapes in advance. In this case, we let ONNX Runtime allocate the memory for each output as an OrtValue, so the resulting tensors need to be transferred across frameworks:

  • If onnxruntime-training is installed, the output tensors are transferred directly from ONNX Runtime to PyTorch on the device, thanks to the DLPack support.
  • If onnxruntime-gpu is installed, ownership of the output tensor is transferred from ORT -> CuPy -> PyTorch (see the sketch below).
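For illustration, here is a minimal sketch of the ORT -> CuPy -> PyTorch handoff on a CUDA device (the helper name and the fixed dtype are assumptions for the example, not code from this PR):

```python
import numpy as np
import cupy as cp
from torch.utils.dlpack import from_dlpack


def ort_value_to_torch_via_cupy(ort_value, dtype=np.float32):
    """Wrap an ORT-allocated CUDA OrtValue as a CuPy array, then hand it to
    PyTorch through DLPack, without copying the buffer off the device."""
    shape = tuple(ort_value.shape())
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    # Wrap the raw device pointer still owned by ONNX Runtime (no copy).
    mem = cp.cuda.UnownedMemory(ort_value.data_ptr(), nbytes, ort_value)
    cupy_array = cp.ndarray(shape, dtype=dtype, memptr=cp.cuda.MemoryPointer(mem, 0))
    # Transfer the buffer to PyTorch via the DLPack protocol.
    return from_dlpack(cupy_array.toDlpack())
```

On recent CuPy/PyTorch versions, torch.from_dlpack(cupy_array) achieves the same through the standard DLPack protocol.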

This PR is currently on stand-by and waiting for requests from the community.

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Nov 3, 2022

The documentation is not available anymore as the PR was closed or merged.

@JingyaHuang
Contributor Author

As in PR #539 the output logits have variable shapes depending on the model used for the segmentation task (at least there is no general rule to infer the output shape for all these models), I would like to ship this PR to enable IO binding for outputs whose shape we can't infer in advance. @michaelbenayoun @fxmarty

And FYI @TheoMrc

@TheoMrc
Contributor

TheoMrc commented Dec 5, 2022

Thanks for the heads up!

I wonder: how does IO binding work when you cannot bind a fixed output size?
IO binding, if I understand correctly, consists in preparing GPU memory for the input and output data, allowing output data to be allocated and fetched from GPU memory faster.
Do you plan to run the first inference without output binding and then use the output size for further predictions?
Or maybe output binding would be done just in time, right before the last step of inference?

One last question to satisfy my curiosity: is there no way to read the output size from the model? I remember I could quite easily do so in TensorFlow during my ML debuts (tf.keras.utils.plot_model).

Have a good evening,
Theo

Update: Found this on SO: ONNX graph - How to get output-dimensions

```python
from onnx import shape_inference
# original_model is a loaded onnx.ModelProto (e.g., onnx.load("model.onnx"))
inferred_model = shape_inference.infer_shapes(original_model)
```

and find the shape info in inferred_model.graph.value_info.
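
For completeness, here is a small sketch of reading the output dimensions from the inferred graph (the model path is illustrative); dynamic axes show up as symbolic dim_param entries rather than concrete dim_value ones:

```python
import onnx
from onnx import shape_inference

inferred_model = shape_inference.infer_shapes(onnx.load("model.onnx"))

for output in inferred_model.graph.output:
    dims = [
        # Static axes carry a concrete dim_value; dynamic axes only a symbolic dim_param.
        dim.dim_value if dim.HasField("dim_value") else (dim.dim_param or "?")
        for dim in output.type.tensor_type.shape.dim
    ]
    print(output.name, dims)
```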

@TheoMrc
Contributor

TheoMrc commented Dec 5, 2022

Maybe we could run shape_inference.infer_shapes(model) upon model loading to get the output size in advance for IO binding?

@JingyaHuang
Contributor Author

Hi @TheoMrc,

Under the hood, ONNX Runtime uses a data structure named OrtValue (which can hold tensors or non-tensors). If the shape is properly given, ORT binds the I/O as a tensor owning and managing its memory. Otherwise, the output OrtValue must be allocated on a specific device, and its memory is allocated and managed by ONNX Runtime. You can then get the address and shape information of the OrtValue through either DLPack or CuPy.
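
For illustration, a minimal sketch of the second case with onnxruntime-gpu, where only the device is given so that ONNX Runtime allocates and manages the output buffer (model path, input and output names are made up for the example):

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
io_binding = session.io_binding()

# Copy the input to the CUDA device and bind it by name.
input_ids = np.ones((1, 128), dtype=np.int64)
io_binding.bind_ortvalue_input(
    "input_ids", ort.OrtValue.ortvalue_from_numpy(input_ids, "cuda", 0)
)

# No shape given: ONNX Runtime allocates the output OrtValue on the device itself.
io_binding.bind_output("logits", "cuda")

session.run_with_iobinding(io_binding)
ort_logits = io_binding.get_outputs()[0]  # OrtValue allocated and owned by ORT
```

The returned OrtValue can then be handed to PyTorch through DLPack or CuPy as described above.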

Yes, the aim of IO binding is to prepare data buffers on the device, so as to avoid offloading them to the CPU and paying the data-copy overhead when you want to reuse them for the following computation on the device. That's why adding IO binding is especially important for decoding, as you need to reuse the output of the previous step.

I don't think that you can infer the shape here. You can get that information easily in eager mode, but in this case you are working with the exported graph and the output shapes are dynamic: you will probably get shapes for some initializers and unk_xxx for the output shape. (You can make it static while exporting the ONNX graph, but that is not what Optimum does, as we want the graph to be flexible and reusable.)

@michaelbenayoun (Member) left a comment

LGTM, thanks @JingyaHuang !!

Review threads (outdated, resolved):
docs/source/onnxruntime/package_reference/modeling_ort.mdx
optimum/onnxruntime/modeling_ort.py
optimum/onnxruntime/utils.py
optimum/onnxruntime/utils.py
@fxmarty (Contributor) left a comment

LGTM, thanks!

Makes me think out loud that for the ORTModel we could have either a fixed input/output path or a more flexible one, or something along those lines, to solve e.g. #479.

@JingyaHuang
Contributor Author

JingyaHuang commented Dec 6, 2022

@fxmarty

For the ORT modeling, custom tasks are already supported when using a single ONNX model for inference (though with a penalty for being dynamic), and we could implement a custom one for the seq2seq case. WDYT?

@fxmarty
Contributor

fxmarty commented Dec 6, 2022

What I meant is that if you use ORTModelForSequenceClassification with a custom input/output (e.g., global_attention_mask), we fail!

@JingyaHuang
Contributor Author

JingyaHuang commented Dec 6, 2022

@fxmarty In this case, use ORTModelForCustomTasks. From my understanding, other task-defined models are implemented in a static manner in order to avoid any penalty for the pipeline. @philschmid

What I do agree with is that, to fully enable customizability, the exporter should let users customize which inputs and outputs they want to take in.

@philschmid
Member

> @fxmarty In this case, use ORTModelForCustomTasks. From my understanding, other task-defined models are implemented in a static manner in order to avoid any penalty for the pipeline. @philschmid

Yes, the idea is to keep the ORTModelForXXX classes lean and avoid any dynamic custom features, so that they add the least possible overhead on top of the InferenceSession, even if that means a small percentage of special tasks/models won't be supported. That's why @JingyaHuang created ORTModelForCustomTasks.

@JingyaHuang JingyaHuang merged commit 8b559db into main Dec 6, 2022
@JingyaHuang JingyaHuang deleted the ort-custom-io branch December 6, 2022 17:57