[doc] Rename MLBuffer => MLTensor for WebNN EP #22039

Merged (1 commit) on Sep 11, 2024
72 changes: 36 additions & 36 deletions docs/tutorials/web/ep-webnn.md
To use WebNN EP, you just need to make 3 small changes:

The WebNN API and WebNN EP are under active development; you might consider installing the latest nightly build of ONNX Runtime Web (onnxruntime-web@dev) to benefit from the latest features and improvements.

## Keep tensor data on WebNN MLTensor (IO binding)

By default, a model's inputs and outputs are tensors that hold data in CPU memory. When you run a session with the WebNN EP using the 'gpu' or 'npu' device type, the data is copied to GPU or NPU memory, and the results are copied back to CPU memory. Copying memory between different devices, as well as between different sessions, adds significant overhead to the inference time. WebNN provides a new opaque, device-specific storage type, MLTensor, to address this issue.
If you get your input data from an MLTensor, or you want to keep the output data on an MLTensor for further processing, you can use IO binding to keep the data on the MLTensor. This is especially helpful when running transformer-based models, which typically run a single model multiple times with the previous output as the next input.

For model input, if your input data is a WebNN MLTensor, you can [create an MLTensor tensor and use it as the input tensor](#create-input-tensor-from-a-mltensor).

For model output, there are 2 ways to use the IO binding feature:
- [Use pre-allocated MLTensor tensors](#use-pre-allocated-mltensor-tensors)
- [Specify the output data location](#specify-the-output-data-location)

Please also check the following topic:
- [MLTensor tensor life cycle management](#mltensor-tensor-life-cycle-management)

**Note:** MLTensor requires a shared MLContext for IO binding. This means the MLContext should be created ahead of time, passed to the WebNN EP as an option, and used across all sessions.
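As a rough sketch of what sharing one MLContext across sessions could look like (the EP option key `context` and the model file names are assumptions here, not taken from this document; check the onnxruntime-web API reference for the exact option name):

```javascript
// Create one MLContext up front and reuse it for every session.
const mlContext = await navigator.ml.createContext({ deviceType: 'gpu' });

// Assumption: the WebNN EP accepts the MLContext via a 'context' option.
const sessionOptions = {
  executionProviders: [
    { name: 'webnn', deviceType: 'gpu', context: mlContext },
  ],
};

// Both sessions share the same MLContext, so MLTensors created on it
// can be bound as inputs or outputs of either session without copies.
const sessionA = await ort.InferenceSession.create('./model-a.onnx', sessionOptions);
const sessionB = await ort.InferenceSession.create('./model-b.onnx', sessionOptions);
```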

### Create input tensor from a MLTensor

If your input data is a WebNN MLTensor, you can create an MLTensor tensor and use it as the input tensor:

```js
const mlContext = await navigator.ml.createContext({deviceType, ...});
const inputMLTensor = await mlContext.createTensor({
  dataType: 'float32',
  dimensions: [1, 3, 224, 224],
  usage: MLTensorUsage.WRITE_TO,
});

mlContext.writeTensor(inputMLTensor, inputArrayBuffer);
const inputTensor = ort.Tensor.fromMLTensor(inputMLTensor, {
  dataType: 'float32',
  dims: [1, 3, 224, 224]
});

```

Use this tensor as a model input (in the feeds) so that the input data stays on the MLTensor.
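For instance, the tensor created above can be passed directly in the feeds (the input name `input` is a placeholder for your model's actual input name):

```javascript
// 'input' is a placeholder; use the input name reported by your model.
const feeds = { input: inputTensor };
const results = await mySession.run(feeds);
```

No CPU copy of the input data happens in this call; the session reads the data directly from the underlying MLTensor.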

### Use pre-allocated MLTensor tensors

If you know the output shape in advance, you can create an MLTensor tensor and use it as the output tensor:

```js

// Create a pre-allocated MLTensor and the corresponding ORT tensor. Assuming that the output shape is [10, 1000].
const mlContext = await navigator.ml.createContext({deviceType, ...});
const myPreAllocatedMLTensor = await mlContext.createTensor({
  dataType: 'float32',
  dimensions: [10, 1000],
  usage: MLTensorUsage.READ_FROM,
});

const myPreAllocatedOutputTensor = ort.Tensor.fromMLTensor(myPreAllocatedMLTensor, {
  dataType: 'float32',
  dims: [10, 1000]
});

const results = await mySession.run(feeds, fetches);

```

By specifying the output tensor in the fetches, ONNX Runtime Web will use the pre-allocated MLTensor as the output tensor. If there is a shape mismatch, the `run()` call will fail.
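As a sketch of how the `feeds` and `fetches` used above might be assembled (the I/O names `input` and `output` are placeholders for your model's actual names):

```javascript
// Placeholders: replace 'input' / 'output' with your model's I/O names.
const feeds = { input: inputTensor };
const fetches = { output: myPreAllocatedOutputTensor };

// run() writes the model output directly into the pre-allocated MLTensor.
const results = await mySession.run(feeds, fetches);
```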

### Specify the output data location

If you don't want to use pre-allocated MLTensor tensors for outputs, you can also specify the output data location in the session options:

```js
const mySessionOptions1 = {
  ...,
  // keep all output data on MLTensor
  preferredOutputLocation: 'ml-tensor'
};

const mySessionOptions2 = {
  ...,
  // alternatively, you can specify the output location for each output tensor
  preferredOutputLocation: {
    'output_0': 'cpu',      // keep output_0 on CPU. This is the default behavior.
    'output_1': 'ml-tensor' // keep output_1 on MLTensor
  }
};
```
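A sketch of how such options might be used end to end (the model path and the output name `output_0` are placeholders):

```javascript
const mySession = await ort.InferenceSession.create('./model.onnx', {
  executionProviders: [{ name: 'webnn', deviceType: 'gpu' }],
  // keep all outputs on MLTensor
  preferredOutputLocation: 'ml-tensor',
});

const results = await mySession.run(feeds);
// The output data lives on an MLTensor; download it to CPU only when needed.
const outputData = await results.output_0.getData();
```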
See [API reference: preferredOutputLocation](https://onnxruntime.ai/docs/api/js/).

## Notes

### MLTensor tensor life cycle management

It is important to understand how the underlying MLTensor is managed so that you can avoid memory leaks and improve tensor usage efficiency.

An MLTensor tensor is created either by user code or by ONNX Runtime Web as a model's output.
- When it is created by user code, it is always created from an existing MLTensor using `Tensor.fromMLTensor()`. In this case, the tensor does not "own" the MLTensor.

  - It is the user's responsibility to make sure the underlying MLTensor is valid during inference, and to call `mlTensor.destroy()` to dispose of the MLTensor when it is no longer needed.
  - Avoid calling `tensor.getData()` and `tensor.dispose()`. Use the MLTensor directly.
  - Using an MLTensor tensor whose underlying MLTensor has been destroyed will cause the session run to fail.
- When it is created by ONNX Runtime Web as a model's output (not a pre-allocated MLTensor tensor), the tensor "owns" the MLTensor.

  - You don't need to worry about the MLTensor being destroyed before the tensor is used.
  - Call `tensor.getData()` to download the data from the MLTensor to the CPU as a typed array.
  - Call `tensor.dispose()` explicitly to destroy the underlying MLTensor when it is no longer needed.
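The two ownership cases can be sketched as follows (the I/O names and shapes are placeholders):

```javascript
// Case 1: user-created MLTensor — the user owns it and must destroy it.
const inputMLTensor = await mlContext.createTensor({
  dataType: 'float32', dimensions: [1, 3, 224, 224], usage: MLTensorUsage.WRITE_TO,
});
const inputTensor = ort.Tensor.fromMLTensor(inputMLTensor, {
  dataType: 'float32', dims: [1, 3, 224, 224],
});

// Case 2: output created by ONNX Runtime Web — the ORT tensor owns its MLTensor.
const results = await mySession.run({ input: inputTensor });
const outputData = await results.output.getData(); // copies the data to CPU
results.output.dispose();  // destroys the MLTensor owned by the output tensor
inputMLTensor.destroy();   // the user-owned MLTensor must be destroyed by the user
```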