From d996853841ca67285c5a02a2050b33362f06bd77 Mon Sep 17 00:00:00 2001 From: Austin Sullivan Date: Tue, 20 Aug 2024 16:35:02 -0700 Subject: [PATCH 01/14] add MLTensor explainer --- mltensor-explainer.md | 415 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 415 insertions(+) create mode 100644 mltensor-explainer.md diff --git a/mltensor-explainer.md b/mltensor-explainer.md new file mode 100644 index 00000000..7feddc5d --- /dev/null +++ b/mltensor-explainer.md @@ -0,0 +1,415 @@ +# `MLTensor` Explainer + +## Authors + +- [Austin Sullivan](asully@chromium.org) (Google) + +## Participate + +- [Issue tracker](https://github.com/webmachinelearning/webnn/issues) + +## Introduction + +This explainer proposes an `MLTensor` interface which represents a tensor which can be passed as an input and output to `MLGraph` inference. + +The machine learning context underlying WebNN may require input and output tensors to be allocated in a specific fashion, such as with a given byte alignment or on a given compute unit (e.g. CPU, GPU, NPU, TPU, etc...). Currently, this requires that the implementation of an `MLGraph` copy in data from the input tensors, execute the graph, and then copy out data from the output tensors. + +An `MLTensor` is an opaque tensor which may be created, written to, and read from independently from `MLGraph` inference. Each of these operations is performed on the [timeline](\#timelines) of the associated MLContext, with a clearly defined order of operations. Passing `MLTensor`s as input and output tensors to `MLGraph` inference - as opposed to passing `ArrayBufferView`s as is done today - allows for a decoupling of the uploading/downloading of model inputs/outputs from the model execution itself. This provides several benefits, such as buffer reuse, chained inference, explicit memory management, and the opportunity to interop with WebGPU. + +## Goals + +- Provide a consistent API for passing tensors to an `MLGraph` which may run arbitrary compute units (e.g. CPU, GPU, NPU, TPU, etc...) +- Improve model throughput by minimizing the need for synchronization of work via JavaScript +- Minimize overall data copies of graph inputs and outputs +- Best-effort buffer-sharing between WebNN and WebGPU +- Allow a tensor to be reused across multiple `MLGraph`s within the same `MLContext` + +## Non-Goals + +* Guarantee *zero-copy* buffer-sharing between WebNN and WebGPU +* Provide partial views over an `MLTensor` + +## Key Scenarios + +### Buffer Reuse + +A user uploads a large image to a website and wants to try applying several different effects on the same input image. + +The current way that WebNN passes input buffers requires a copy to be made _for each inference_. + +```js +// Current approach to reuse a given input buffer, requiring many data copies + +// Copy the data in `imageAsArrayBuffer` to the required format before executing `graph1`. +// Then, copy the model outputs into `outputArrayBuffer1`. +await mlContext.compute(graph1, {'input': imageAsArrayBuffer}, {'output': outputArrayBuffer1}); +// Again, copy all input data in and all output data out. +await mlContext.compute(graph2, {'input': imageAsArrayBuffer}, {'output': outputArrayBuffer2}); +// Yet again copy all input data in and all output data out. +await mlContext.compute(graph3, {'input': imageAsArrayBuffer}, {'output': outputArrayBuffer3}); +``` + +`MLTensor` allows tensors and their contents to be reused. Once the image data is written to an `MLTensor`, no further copies are required. 
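+
+For the snippets below, the output tensors are assumed to have been created up front. A minimal sketch of that setup - the `dataType` and `shape` values are illustrative placeholders, not part of this proposal - might look like:
+
+```js
+// Hypothetical setup for the following snippets: create output tensors whose
+// contents can later be read back to script. The descriptor values are placeholders.
+const outputDescriptor = {dataType: 'float32', shape: [1, 3, 512, 512], usage: MLTensorUsage.READ_FROM};
+const [outputMlTensor1, outputMlTensor2, outputMlTensor3] = await Promise.all([
+  mlContext.createTensor(outputDescriptor),
+  mlContext.createTensor(outputDescriptor),
+  mlContext.createTensor(outputDescriptor)
+]);
+```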
+ +```js +// Proposed approach to reuse a given input buffer, using an input MLTensor + +// Copy the image data into the required format. +const imageAsMlTensor = await mlContext.createTensor({..., usage: MLTensorUsage.WRITE_TO}); +mlContext.writeBuffer(imageAsMlTensor, imageAsArrayBuffer); + +// Execute the graphs - no additional copies! +mlContext.dispatch(graph1, {'input': imageAsMlTensor}, {'output': outputMlTensor1}); +mlContext.dispatch(graph2, {'input': imageAsMlTensor}, {'output': outputMlTensor2}); +mlContext.dispatch(graph3, {'input': imageAsMlTensor}, {'output': outputMlTensor3}); +``` + +### Chained Inference + +You may notice another benefit of the code snippet above: each call to `dispatch()` does not require an `await`. + +```js +// Current approach to execute models repeatedly, requiring many round-trips to script + +// The input and output buffers are transferred. We must wait for the buffers to be returned - +// after the graph is executed and output buffers are copied out - before executing the graph +// with these inputs again. +await mlContext.compute(graph1, {'input': imageAsArrayBuffer}, {'output': outputArrayBuffer1}); +// The ML context sits idle while control returns to script... +await mlContext.compute(graph2, {'input': imageAsArrayBuffer}, {'output': outputArrayBuffer2}); +// ...which just wants to invoke the ML context again. There has to be a better way! +await mlContext.compute(graph3, {'input': imageAsArrayBuffer}, {'output': outputArrayBuffer3}); +``` + +Using `MLTensor`s enables a programming model similar to [WebGPU's](https://www.w3.org/TR/webgpu/#programming-model). Tasks are posted to the ML context's [timeline](#timelines) and are executed as the ML context sees fit - so far as data dependencies are respected. In this example, the ML context should be working continuously from the `writeBuffer()` call until the work for the last `readBuffer()` completes. Better utilization of the ML context will result in significantly better throughput. + +```js +// Proposed approach to queue tasks to the ML context timeline + +// Post a task to the ML context timeline to allocate and zero out a tensor, +// then return to script. +const imageAsMlTensor = await mlContext.createTensor({..., usage: MLTensorUsage.WRITE_TO}); + +// Post a task to the ML context timeline to write to the tensor. Note that we do +// not await completion of this write. The ML context will ensure any operations +// which depend on the contents of `imageAsMlTensor` will queue behind this task. +mlContext.writeBuffer(imageAsMlTensor, imageAsArrayBuffer); + +// Post a task to the ML context timeline to execute the graph. The ML context will +// ensure this queues behind the write above. +mlContext.dispatch(graph1, {'input': imageAsMlTensor}, {'output': outputMlTensor1}); +// Post another task. Since input tensors will not be modified by graph execution, +// the ML context may choose to execute in parallel to the dispatch above. +mlContext.dispatch(graph2, {'input': imageAsMlTensor}, {'output': outputMlTensor2}); +// Post another task, which may also execute in parallel. +mlContext.dispatch(graph3, {'input': imageAsMlTensor}, {'output': outputMlTensor3}); + +// Post tasks to read the output tensors. These tasks will queue behind the +// respective dispatch() calls using each tensor. 
+const outputs = await Promise.all([ + outputMlTensor1, + outputMlTensor2, + outputMlTensor3 + ].map((tensor) => { return mlContext.readBuffer(tensor); })); +``` + +Since the queueing mechanism respects data dependencies, chained inference allows an `MLTensor` to be passed as an output from one graph and then immediately as an input to the next. A collection of graphs and buffers may be repeatedly dispatched without the need for synchronization via script. + +```js +// Computing the Nth Fibonacci number using chained inference + +const builder = new MLGraphBuilder(mlContext); +const fn1 = builder.input('F_n-1', {dataType: "int32", shape: [1]}); +const fn2 = builder.input('F_n-2', {dataType: "int32", shape: [1]}); +const add = builder.add(fn1, fn2); +const graph = await builder.build({'F_n': add}); + +const descriptor = {dataType: "int32", shape: [1], usage: MLTensorUsage.WRITE_TO}; +const tensors = await Promise.all([ + mlContext.createTensor(descriptor), + mlContext.createTensor(descriptor), + mlContext.createTensor(descriptor) +]); + +mlContext.writeBuffer(tensors[0], new Int32Array([0])); // F_0 = 0 +mlContext.writeBuffer(tensors[1], new Int32Array([1])); // F_1 = 1 + +for (let n = 2; n <= N; n++) { + // Each dispatch depends on tensors used in the previous dispatch. + mlContext.dispatch(graph, + {'F_n-1': tensors[(n-1) % 3], 'F_n-2': tensors[(n-2) % 3]}, + {'F_n': tensors[n % 3]}); +} + +const f_n = new Int32Array(await mlContext.readBuffer(tensors[N % 3]))[0]; +``` + +### Resource Management + +Let's continue the example above with a large image being used to generate several other large images. Once the user is satisfied with this feature, they want to caption the image using a speech-to-text model. Holding all these tensors and machine learning models at once may put memory pressure on the system. + +```js +// Current approach to resource management, relying on GC + +await mlContext.compute(graph1, inputs, outputs); +// We're done with `graph1`, which may contain multiple gigabytes of weights... + +// ...Let's hope its memory is garbage-collected soon? + +// Construct a new graph which itself needs a lot of resources. +// If the system is under memory pressure, this may fail. +const builder = new MLGraphBuilder(mlContext); +const constant = builder.constant(descriptor, veryLargeBufferOfWeights); +``` + +An `MLTensor`, `MLGraph`, and `MLContext` all have a respective `destroy()` method. Once these objects are no longer needed, the website may request that the memory associated with these resources be released... possibly so that it can run more models! + +```js +// Proposed approach to resource management, with explicit destroy methods + +mlContext.dispatch(graph1, inputs, outputs); + +// We're done with `graph1`, which may contain multiple gigabytes of weights. +// Explicitly ask for its resources to be released! +graph1.destroy(); + +// We can selectively release only the resources we expect won't be needed. +destroyBuffers(inputs); +// Don't destroy the output tensors yet, in case we want to reuse them later. + +// Construct a new graph which itself needs a lot of resources. Memory pressure +// has been relieved and building this graph is much more likely to succeed. +const builder = new MLGraphBuilder(mlContext); +const constant = builder.constant(descriptor, veryLargeBufferOfWeights); +``` + +### WebGPU Interop + +A privacy-conscious user wants to perform real-time selfie segmentation of a video feed on their local device. 
+ +Currently, using WebNN for this task would require - for each frame - an expensive readback of `GPUBuffer` data to script, uploading the data to the ML context device (which may be the same GPU!), copying the result back to script, and then uploading the frame to be rendered back into a `GPUBuffer`. This is unlikely to be performed in real-time. + +An `MLTensor` may be imported into WebGPU, which in the best case provides zero-copy buffer sharing between the two APIs, and in all cases provides a synchronization mechanism between the respective WebNN and WebGPU [timelines](https://www.w3.org/TR/webgpu/#programming-model-timelines), avoiding the need for expensive synchronization via script. + +```js +// Create a couple MLTensors to be used to facilitate WebGPU interop. +const mlTensor1 = await mlContext.createTensor({..., usage: MLTensorUsage.WEBGPU_INTEROP}); +const mlTensor2 = await mlContext.createTensor({..., usage: MLTensorUsage.WEBGPU_INTEROP}); + +const applyEffectToFrame = async () => { + const gpuVideoTexture = gpuDevice.importExternalTexture({source: video}); + + // Rent out the MLTensor to WebGPU. + const tensorizedGpuBuffer = gpuDevice.importExternalBuffer(mlTensor1); + + // Create a bind group for `gpuVideoTexture`, create a command encoder, etc. + // to "tensorize" `gpuVideoTexture` and store the result in `tensorizedGpuBuffer` + // ... + + gpuDevice.queue.submit([tensorizationCommandEncoder.finish()]); + + // Return the buffer to WebNN. + tensorizedGpuBuffer.destroy(); + + // Perform some inference described by `graph` on the frame + // (e.g. selfie segmentation) + mlContext.dispatch( + graph, + /*inputs=*/{'input': mlTensor1}, + /*outputs=*/{'output': mlTensor2}, + ); + + // Rent the other MLTensor out to WebGPU. + const tensorizedGpuBufferAfterInference = gpuDevice.importExternalBuffer(mlTensor2); + + // Create a bind group for `tensorizedGpuBufferAfterInference`, + // create a command encoder, etc to feed `tensorizedGpuBufferAfterInference` + // into a GPU shader which may blur the frame or replace background sections + // and then render the result + // ... + + gpuDevice.queue.submit([texturizeAndRenderCommandEncoder.finish()]); + + // Return the buffer to WebNN for the next frame. + tensorizedGpuBufferAfterInference.destroy(); + + // Call this method for each frame. + video.requestVideoFrameCallback(applyEffectToFrame); +} +``` + +## Design Discussion + +### Timelines + +WebNN uses a programming model similar to [WebGPU's](https://www.w3.org/TR/webgpu/#programming-model), in that compute tasks are posted to a timeline - which I've referred to as an "ML context timeline" throughout this document - separate from the content timeline (i.e. "script"). See [the WebGPU documentation of timelines](https://gpuweb.github.io/gpuweb/#programming-model-timelines) for more details. + +Specifying WebNN timelines is tracked in [#529](https://github.com/webmachinelearning/webnn/issues/529). + +### Device Affinity and Relationship to a `GPUDevice` + +The user agent decides where the memory backing an `MLTensor` is allocated. The WebNN API allows the developer to provide hints - primarily via `MLTensorUsageFlags` - but these are not binding. + +For example [an `MLContext` may be created with a `GPUDevice`](https://www.w3.org/TR/webnn/#dom-ml-createcontext-gpudevice), and creating an `MLTensor` from this context with the `MLTensorUsage.WEBGPU_INTEROP` flag expresses a clear intention to share the tensor with the given `GPUDevice`. 
However, there is no guarantee that sharing this tensor with WebGPU will be zero-copy. + +The `MLTensorUsage.READ_FROM` and `MLTensorUsage.WRITE_TO` flags likewise are hints to the user agent indicating that the underlying data will be read and written to, respectively, by script. + +### Importing an `MLTensor` to WebGPU + +Any `MLTensor` created with the `MLTensorUsage.WEBGPU_INTEROP` flag may be imported into any `GPUDevice`, though cross-device buffer sharing may require expensive data copies. Sharing the tensor requires coordinating between the respective WebNN and WebGPU timelines. Below is an example of how this handoff might work: + +- Two fences are created: + 1. a "start access" fence which is to be signaled by WebNN and waited on by WebGPU. A data copy may be required alongside the signaling of this fence + 2. an "end access" fence which is to be signaled by WebGPU and waited on by WebNN. A data copy may be required alongside the signaling of this fence +- The `GPUDevice` enqueues a command to its `GPUQueue` to wait for the "start access" fence to be signaled +- WebNN will signal the "start access" fence after the completion of all currently-enqueued operations that use the `MLTensor` which is to be imported (this is very similar to how [`GPUBuffer.mapAsync()`](https://www.w3.org/TR/webgpu/#dom-gpubuffer-mapasync) works) +- Until the "end access" fence is signaled: + - The `GPUDevice` has exclusive, read/write access to the imported buffer + - All WebNN work involving the imported `MLTensor` is blocked +- When the `GPUBuffer` is destroyed, the "end access" fence is signaled and the `MLTensor` may be used again by WebNN + +### `compute()` vs. `dispatch()` + +`compute()` will be deprecated and removed in favor of `dispatch()`. + +### Open Questions + +- How will errors be surfaced? Do we need a concept similar to [WebGPU's error scopes](https://www.w3.org/TR/webgpu/#error-scopes)? See [#477](https://github.com/webmachinelearning/webnn/issues/477) +- On non-UMA systems, does the user agent have enough information to appropriately allocate an `MLTensor` if an `MLDeviceType` is not used for creating an `MLContext`? See [#350](https://github.com/webmachinelearning/webnn/issues/350) and [#749](https://github.com/webmachinelearning/webnn/issues/749) +- Should the `dispatch()` method be a part of the `MLGraph` interface rather than `MLContext`? Should `readBuffer()` and `writeBuffer()` exist on an `MLTensor`? See [#697](https://github.com/webmachinelearning/webnn/issues/697). +- If an `MLContext` is not created from a `GPUDevice`, does there need to be some mechanism - above and beyond the `MLTensorUsage.WEBGPU_INTEROP` flag - for identifying the specific `GPUDevice` with which interop is desired? +- What are the usage flags of a `GPUBuffer` created from an `MLTensor`? + +## Considered Alternatives + +### `MLTensor` as a generic bag of bytes (i.e. `MLBuffer`) + +An `MLTensor` was at one point proposed to be a opaque bag of bytes which could be passed as an input or output buffer to a machine learning graph. However, graph inputs and outputs are expressed not as _buffers_ but _tensors_ with a data type and shape. Generic reinterpretation of an opaque buffer is [not a viable approach on platforms which require typed tensors, such as Core ML](https://github.com/webmachinelearning/webnn/issues/542#issuecomment-2067555410). 
Since the [use case](https://w3ctag.github.io/design-principles/#usecase-oriented-apis) of an `MLTensor` is as a tensor, an `MLTensor` is effectively typed as a tensor.
+
+### Support taking a view over an `MLTensor`
+
+This proposal does not include the ability to arbitrarily reinterpret the attributes or contents of an `MLTensor` - in line with the stance that an `MLTensor` represents a tensor rather than an opaque bag of bytes. Tensor operations such as taking a view over a tensor, reinterpreting a tensor as a new shape, or casting its contents to a new data type map to WebNN's `slice`, `reshape`, and `cast` operators, respectively. To reinterpret an `MLTensor`, these tensor operators may be moved _into the graph itself_ or performed in a separate `MLGraph`.
+
+You may want to write the following code, but taking a view over an `MLTensor` is not supported:
+
+```js
+// If creating a slice of an MLTensor in script was supported
+
+const mlTensor = await mlContext.createTensor({ dataType: 'int32', shape: [4, 3]});
+
+// Create an input which is half the size of `mlTensor`.
+const operandOfDesiredShape = builder.input('a', { dataType: 'int32', shape: [2, 3] });
+// ... build the rest of the graph using `operandOfDesiredShape`...
+
+// Pass a view over the top half of the tensor as a graph input.
+// NOTE: THIS SUBSCRIPT METHOD DOES NOT EXIST
+const mlTensorView = mlTensor[:2];
+mlContext.dispatch(graph, {'a': mlTensorView}, outputs);
+```
+
+One way to work around this is by inserting a `slice` operation _within the graph_.
+
+```js
+// Workaround which inserts a slice operation within the graph itself
+
+const mlTensor = await mlContext.createTensor({ dataType: 'int32', shape: [4, 3]});
+
+// Create an input which is exactly the size of `mlTensor`, then add a slice
+// operation within the graph itself to get the desired view over the input.
+const input = builder.input('a', { dataType: 'int32', shape: [4, 3] });
+const operandOfDesiredShape = builder.slice(input, /*starts=*/[0, 0], /*sizes=*/[2, 3]);
+// ... build the rest of the graph using `operandOfDesiredShape`...
+
+// Pass the MLTensor directly.
+mlContext.dispatch(graph, {'a': mlTensor}, outputs);
+```
+
+It's possible that demand may emerge for some mechanism to copy data between `MLTensor`s, though that is currently not planned.
+
+### Support passing an `MLTensor` as a `constant()`
+
+There may be an appetite for a mechanism to stream constant weights to the ML context without needing to go through an `ArrayBuffer`, but whether that mechanism should use the `MLTensor` interface is not yet clear. It's not a natural fit because an `MLTensor` is scoped to an `MLContext`, whereas a constant `MLOperand` is scoped to a single `MLGraphBuilder` (and subsequently the `MLGraph` it builds), since sharing constant data across graphs is [not reliably supported on some platforms](https://github.com/webmachinelearning/webnn/issues/614#issuecomment-2021581363).
+
+### Allow mapping an `MLTensor` to script
+
+Why doesn't `MLTensor` have a [`mapAsync()`](https://www.w3.org/TR/webgpu/#dom-gpubuffer-mapasync) method, as a `GPUBuffer` does?
+
+`MLTensor`s are used as inputs and outputs to machine learning models. In practice, we expect the size of the model inputs and outputs to be dwarfed by the size of the model weights, which are uploaded via the `MLGraphBuilder.constant()` method.
We may re-evaluate this stance in the future if we discover that reading and writing data to an `MLTensor` is a bottleneck, though even in that case we may prefer to explore a solution which bypasses `ArrayBuffer`s altogether, as discussed [above](#support-passing-an-mltensor-as-a-constant). + +### Hash input buffer contents + +One approach to solve the [buffer reuse](#buffer-reuse) case is for the WebNN implementation to silently avoid making redundant data copies if the same buffer contents are repeatedly passed as inputs. This may be achieved by hashing the contents of each input. This approach has downsides, such as [managing the extra copies in the hash map](#resource-management) and that hashing the buffer contents may be expensive as it requires reading the entire input. This approach also does not address the other use cases. + +## References & Acknowledgements + +Many thanks for valuable feedback and advice from: + +- Bryan Bernhart +- Joshua Bell +- Mike Wyrzykowski +- Ningxin Hu +- Phillis Tang +- Rafael Cintron +- Reilly Grant +- Zoltan Kis + +--- + +## Appendix + +### Tentative IDL + +```javascript +typedef [EnforceRange] unsigned long MLTensorUsageFlags; + +namespace MLTensorUsage { + const MLFlagsConstant READ_FROM = 0x0001; + const MLFlagsConstant WRITE_TO = 0x0002; + const MLFlagsConstant WEBGPU_INTEROP = 0x0004; +}; + +dictionary MLTensorDescriptor : MLOperandDescriptor { + required MLTensorUsageFlags usage; +}; + +typedef record MLNamedTensors; + +interface MLTensor { + readonly attribute MLOperandDataType dataType; + readonly attribute FrozenArray shape; + readonly attribute unsigned long MLTensorUsageFlags usage; + + void destroy(); +}; + +partial interface MLContext { + Promise createTensor(MLTensorDescriptor descriptor); + + void writeBuffer(MLTensor dstTensor, [AllowShared] ArrayBuffer srcData); + void writeBuffer(MLTensor dstTensor, [AllowShared] ArrayBufferView srcData); + + Promise readBuffer(MLTensor srcTensor); + Promise readBuffer(MLTensor srcTensor, [AllowShared] ArrayBuffer dstData); + Promise readBuffer(MLTensor srcTensor, [AllowShared] ArrayBufferView dstData); + + void dispatch(MLGraph graph, MLNamedTensors inputs, MLNamedTensors outputs); +}; + +// For WebGPU Interop + +interface GPUExternalBuffer {}; +GPUExternalBuffer includes GPUObjectBase; + +dictionary GPUExternalBufferDescriptor + : GPUObjectDescriptorBase { + required MLTensor source; +}; + +partial interface GPUDevice { + GPUExternalBuffer importExternalBuffer(GPUExternalBufferDescriptor descriptor); +} + +partial interface ML { + Promise createContext(GPUDevice device); +}; +``` \ No newline at end of file From 68929b9ead47b641b0efb8fb2ae47a9381277158 Mon Sep 17 00:00:00 2001 From: Austin Sullivan Date: Wed, 21 Aug 2024 17:09:56 -0700 Subject: [PATCH 02/14] use correct usages in Fibonacci example + mention a tensor copy is achievable with identify --- mltensor-explainer.md | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/mltensor-explainer.md b/mltensor-explainer.md index 7feddc5d..5f8b3e96 100644 --- a/mltensor-explainer.md +++ b/mltensor-explainer.md @@ -124,11 +124,17 @@ const fn2 = builder.input('F_n-2', {dataType: "int32", shape: [1]}); const add = builder.add(fn1, fn2); const graph = await builder.build({'F_n': add}); -const descriptor = {dataType: "int32", shape: [1], usage: MLTensorUsage.WRITE_TO}; +const usages = [ + MLTensorUsage.WRITE_TO, // To initialize F_0 + MLTensorUsage.WRITE_TO, // To initialize F_1 + 0 +]; +usages[N % 3] |= 
MLTensorUsage.READ_FROM; // To read the output + const tensors = await Promise.all([ - mlContext.createTensor(descriptor), - mlContext.createTensor(descriptor), - mlContext.createTensor(descriptor) + mlContext.createTensor({dataType: "int32", shape: [1], usage: usages[0]}), + mlContext.createTensor({dataType: "int32", shape: [1], usage: usages[1]}), + mlContext.createTensor({dataType: "int32", shape: [1], usage: usages[2]}) ]); mlContext.writeBuffer(tensors[0], new Int32Array([0])); // F_0 = 0 @@ -324,7 +330,7 @@ const operandOfDesiredShape = builder.slice(input, /*starts=*/[0, 0], /*sizes=*/ mlContext.dispatch(graph, {'a': mlTensor}, outputs); ``` -It's possible that demand may emerge for some mechanism to copy data between `MLTensor`s, though that is currently not planned. +It's possible that demand may emerge for some mechanism to copy data between `MLTensor`s, though that is currently not planned. This may be worked around by creating another graph with an `identity` operation. ### Support passing an `MLTensor` as a `constant()` From c1a80ee39d2cd0f1ef54a71fc30a5177765c11ea Mon Sep 17 00:00:00 2001 From: Austin Sullivan Date: Fri, 23 Aug 2024 10:43:29 -0700 Subject: [PATCH 03/14] address bberhar feedback: part 1 --- mltensor-explainer.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mltensor-explainer.md b/mltensor-explainer.md index 5f8b3e96..39ca09df 100644 --- a/mltensor-explainer.md +++ b/mltensor-explainer.md @@ -14,7 +14,7 @@ This explainer proposes an `MLTensor` interface which represents a tensor which The machine learning context underlying WebNN may require input and output tensors to be allocated in a specific fashion, such as with a given byte alignment or on a given compute unit (e.g. CPU, GPU, NPU, TPU, etc...). Currently, this requires that the implementation of an `MLGraph` copy in data from the input tensors, execute the graph, and then copy out data from the output tensors. -An `MLTensor` is an opaque tensor which may be created, written to, and read from independently from `MLGraph` inference. Each of these operations is performed on the [timeline](\#timelines) of the associated MLContext, with a clearly defined order of operations. Passing `MLTensor`s as input and output tensors to `MLGraph` inference - as opposed to passing `ArrayBufferView`s as is done today - allows for a decoupling of the uploading/downloading of model inputs/outputs from the model execution itself. This provides several benefits, such as buffer reuse, chained inference, explicit memory management, and the opportunity to interop with WebGPU. +An `MLTensor` is an opaque tensor which may be created, written to, and read from independently from `MLGraph` inference. Each of these operations is performed on the [timeline](\#timelines) of the associated MLContext, with a clearly defined order of operations. Passing `MLTensor`s as input and output tensors to `MLGraph` inference - as opposed to passing `ArrayBufferView`s as is done today - allows for a decoupling of the uploading/downloading of model inputs/outputs from the model execution itself. This provides several benefits, such as buffer reuse, chained inference, explicit destruction, and the opportunity to interop with WebGPU. 
## Goals @@ -81,7 +81,7 @@ await mlContext.compute(graph2, {'input': imageAsArrayBuffer}, {'output': output await mlContext.compute(graph3, {'input': imageAsArrayBuffer}, {'output': outputArrayBuffer3}); ``` -Using `MLTensor`s enables a programming model similar to [WebGPU's](https://www.w3.org/TR/webgpu/#programming-model). Tasks are posted to the ML context's [timeline](#timelines) and are executed as the ML context sees fit - so far as data dependencies are respected. In this example, the ML context should be working continuously from the `writeBuffer()` call until the work for the last `readBuffer()` completes. Better utilization of the ML context will result in significantly better throughput. +Using `MLTensor`s enables a programming model similar to [WebGPU's](https://www.w3.org/TR/webgpu/#programming-model). Tasks are posted to the ML context's [timeline](#timelines) and are executed as the ML context sees fit - so far as data dependencies are respected such that each `MLTensor` is guaranteed to be modified in the order the methods using the tensor are called from script. In this example, the ML context should be working continuously from the `writeBuffer()` call until the work for the last `readBuffer()` completes. Better utilization of the ML context will result in significantly better throughput. ```js // Proposed approach to queue tasks to the ML context timeline @@ -254,7 +254,7 @@ Specifying WebNN timelines is tracked in [#529](https://github.com/webmachinelea ### Device Affinity and Relationship to a `GPUDevice` -The user agent decides where the memory backing an `MLTensor` is allocated. The WebNN API allows the developer to provide hints - primarily via `MLTensorUsageFlags` - but these are not binding. +The WebNN API requires the developer to declare how an `MLTensor` will be used (via `MLTensorUsageFlags`), which the user agent may use as a hint in deciding where to allocate the memory backing an `MLTensor`. Where the memory is ultimately allocated is up to the user agent. For example [an `MLContext` may be created with a `GPUDevice`](https://www.w3.org/TR/webnn/#dom-ml-createcontext-gpudevice), and creating an `MLTensor` from this context with the `MLTensorUsage.WEBGPU_INTEROP` flag expresses a clear intention to share the tensor with the given `GPUDevice`. However, there is no guarantee that sharing this tensor with WebGPU will be zero-copy. @@ -262,7 +262,7 @@ The `MLTensorUsage.READ_FROM` and `MLTensorUsage.WRITE_TO` flags likewise are hi ### Importing an `MLTensor` to WebGPU -Any `MLTensor` created with the `MLTensorUsage.WEBGPU_INTEROP` flag may be imported into any `GPUDevice`, though cross-device buffer sharing may require expensive data copies. Sharing the tensor requires coordinating between the respective WebNN and WebGPU timelines. Below is an example of how this handoff might work: +Any `MLTensor` created with the `MLTensorUsage.WEBGPU_INTEROP` flag may be imported into any `GPUDevice`, though cross-device buffer sharing may require expensive data copies. Sharing the tensor requires coordinating between the respective WebNN and WebGPU timelines. Below is an example of how the user agent may coordinate this handoff: - Two fences are created: 1. a "start access" fence which is to be signaled by WebNN and waited on by WebGPU. 
A data copy may be required alongside the signaling of this fence From ac407d9fb7a99926250446938a788897fcf60c72 Mon Sep 17 00:00:00 2001 From: Austin Sullivan Date: Fri, 23 Aug 2024 12:58:39 -0700 Subject: [PATCH 04/14] address bbernhar feedback: part 2 --- mltensor-explainer.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mltensor-explainer.md b/mltensor-explainer.md index 39ca09df..a92ffcad 100644 --- a/mltensor-explainer.md +++ b/mltensor-explainer.md @@ -248,7 +248,7 @@ const applyEffectToFrame = async () => { ### Timelines -WebNN uses a programming model similar to [WebGPU's](https://www.w3.org/TR/webgpu/#programming-model), in that compute tasks are posted to a timeline - which I've referred to as an "ML context timeline" throughout this document - separate from the content timeline (i.e. "script"). See [the WebGPU documentation of timelines](https://gpuweb.github.io/gpuweb/#programming-model-timelines) for more details. +WebNN uses a programming model similar to [WebGPU's](https://www.w3.org/TR/webgpu/#programming-model), in that compute tasks are posted to a timeline - which I've referred to as an "ML context timeline" throughout this document - separate from the content timeline (i.e. "script"). See [the WebGPU documentation of timelines](https://www.w3.org/TR/webgpu/#programming-model-timelines) for more details. Specifying WebNN timelines is tracked in [#529](https://github.com/webmachinelearning/webnn/issues/529). @@ -280,7 +280,7 @@ Any `MLTensor` created with the `MLTensorUsage.WEBGPU_INTEROP` flag may be impor ### Open Questions -- How will errors be surfaced? Do we need a concept similar to [WebGPU's error scopes](https://www.w3.org/TR/webgpu/#error-scopes)? See [#477](https://github.com/webmachinelearning/webnn/issues/477) +- How will errors be surfaced? Do we need a concept similar to [WebGPU's error scopes](https://www.w3.org/TR/webgpu/#error-scopes), or is [returning errors via a promise for select operations sufficient](https://github.com/webmachinelearning/webnn/issues/697#issuecomment-2195656878)? See [#477](https://github.com/webmachinelearning/webnn/issues/477) - On non-UMA systems, does the user agent have enough information to appropriately allocate an `MLTensor` if an `MLDeviceType` is not used for creating an `MLContext`? See [#350](https://github.com/webmachinelearning/webnn/issues/350) and [#749](https://github.com/webmachinelearning/webnn/issues/749) - Should the `dispatch()` method be a part of the `MLGraph` interface rather than `MLContext`? Should `readBuffer()` and `writeBuffer()` exist on an `MLTensor`? See [#697](https://github.com/webmachinelearning/webnn/issues/697). - If an `MLContext` is not created from a `GPUDevice`, does there need to be some mechanism - above and beyond the `MLTensorUsage.WEBGPU_INTEROP` flag - for identifying the specific `GPUDevice` with which interop is desired? From d3e2be575d1879c7c7cd5f438ffc6b6f7d7b3c30 Mon Sep 17 00:00:00 2001 From: Austin Sullivan Date: Fri, 23 Aug 2024 13:11:38 -0700 Subject: [PATCH 05/14] address bbernhar feedback: part 3 --- mltensor-explainer.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mltensor-explainer.md b/mltensor-explainer.md index a92ffcad..42beedde 100644 --- a/mltensor-explainer.md +++ b/mltensor-explainer.md @@ -278,6 +278,8 @@ Any `MLTensor` created with the `MLTensorUsage.WEBGPU_INTEROP` flag may be impor `compute()` will be deprecated and removed in favor of `dispatch()`. 
+It's possible `compute()` may have a performance advantage on some platforms for someĀ use cases, such as in one-shot inference (where the benefits of buffer re-use are not relevant) with small inputs/outputs on CPU (where the overhead of task queueing may outweigh the benefits of parallelization). At this time, we do not expect this performance impact to be substantial enough to justify providing two mostly-identical methods for executing an `MLGraph`. + ### Open Questions - How will errors be surfaced? Do we need a concept similar to [WebGPU's error scopes](https://www.w3.org/TR/webgpu/#error-scopes), or is [returning errors via a promise for select operations sufficient](https://github.com/webmachinelearning/webnn/issues/697#issuecomment-2195656878)? See [#477](https://github.com/webmachinelearning/webnn/issues/477) From 56219755f4956bd977f484b3f7b8b8628b7fb8ee Mon Sep 17 00:00:00 2001 From: Austin Sullivan Date: Fri, 6 Sep 2024 17:43:16 -0700 Subject: [PATCH 06/14] make importExternalBuffer() async (among other changes) --- mltensor-explainer.md | 85 ++++++++++++++++++++++--------------------- 1 file changed, 44 insertions(+), 41 deletions(-) diff --git a/mltensor-explainer.md b/mltensor-explainer.md index 42beedde..a3259826 100644 --- a/mltensor-explainer.md +++ b/mltensor-explainer.md @@ -27,6 +27,7 @@ An `MLTensor` is an opaque tensor which may be created, written to, and read fro ## Non-Goals * Guarantee *zero-copy* buffer-sharing between WebNN and WebGPU +* Synchronization of work between WebNN and WebGPU without CPU involvement * Provide partial views over an `MLTensor` ## Key Scenarios @@ -55,13 +56,19 @@ await mlContext.compute(graph3, {'input': imageAsArrayBuffer}, {'output': output // Proposed approach to reuse a given input buffer, using an input MLTensor // Copy the image data into the required format. -const imageAsMlTensor = await mlContext.createTensor({..., usage: MLTensorUsage.WRITE_TO}); -mlContext.writeBuffer(imageAsMlTensor, imageAsArrayBuffer); +const imageAsMlTensor = await mlContext.createTensor({..., usage: MLTensorUsage.WRITE}); +mlContext.writeTensor(imageAsMlTensor, imageAsArrayBuffer); // Execute the graphs - no additional copies! mlContext.dispatch(graph1, {'input': imageAsMlTensor}, {'output': outputMlTensor1}); mlContext.dispatch(graph2, {'input': imageAsMlTensor}, {'output': outputMlTensor2}); mlContext.dispatch(graph3, {'input': imageAsMlTensor}, {'output': outputMlTensor3}); + +await Promise.all([ + mlContext.readTensor(outputMlTensor1, outputArrayBuffer1); + mlContext.readTensor(outputMlTensor2, outputArrayBuffer2); + mlContext.readTensor(outputMlTensor3, outputArrayBuffer3); +]); ``` ### Chained Inference @@ -81,19 +88,19 @@ await mlContext.compute(graph2, {'input': imageAsArrayBuffer}, {'output': output await mlContext.compute(graph3, {'input': imageAsArrayBuffer}, {'output': outputArrayBuffer3}); ``` -Using `MLTensor`s enables a programming model similar to [WebGPU's](https://www.w3.org/TR/webgpu/#programming-model). Tasks are posted to the ML context's [timeline](#timelines) and are executed as the ML context sees fit - so far as data dependencies are respected such that each `MLTensor` is guaranteed to be modified in the order the methods using the tensor are called from script. In this example, the ML context should be working continuously from the `writeBuffer()` call until the work for the last `readBuffer()` completes. Better utilization of the ML context will result in significantly better throughput. 
+Using `MLTensor` enables a programming model similar to [WebGPU's](https://www.w3.org/TR/webgpu/#programming-model). Tasks are posted to the ML context's [timeline](#timelines) and are executed as the ML context sees fit - so far as data dependencies are respected such that each `MLTensor` is guaranteed to be modified in the order the methods using the tensor are called from script. In this example, the ML context should be working continuously from the `writeTensor()` call until the work for the last `readTensor()` completes. Better utilization of the ML context will result in significantly better throughput. ```js // Proposed approach to queue tasks to the ML context timeline // Post a task to the ML context timeline to allocate and zero out a tensor, // then return to script. -const imageAsMlTensor = await mlContext.createTensor({..., usage: MLTensorUsage.WRITE_TO}); +const imageAsMlTensor = await mlContext.createTensor({..., usage: MLTensorUsage.WRITE}); // Post a task to the ML context timeline to write to the tensor. Note that we do // not await completion of this write. The ML context will ensure any operations // which depend on the contents of `imageAsMlTensor` will queue behind this task. -mlContext.writeBuffer(imageAsMlTensor, imageAsArrayBuffer); +mlContext.writeTensor(imageAsMlTensor, imageAsArrayBuffer); // Post a task to the ML context timeline to execute the graph. The ML context will // ensure this queues behind the write above. @@ -110,7 +117,7 @@ const outputs = await Promise.all([ outputMlTensor1, outputMlTensor2, outputMlTensor3 - ].map((tensor) => { return mlContext.readBuffer(tensor); })); + ].map((tensor) => { return mlContext.readTensor(tensor); })); ``` Since the queueing mechanism respects data dependencies, chained inference allows an `MLTensor` to be passed as an output from one graph and then immediately as an input to the next. A collection of graphs and buffers may be repeatedly dispatched without the need for synchronization via script. @@ -125,11 +132,11 @@ const add = builder.add(fn1, fn2); const graph = await builder.build({'F_n': add}); const usages = [ - MLTensorUsage.WRITE_TO, // To initialize F_0 - MLTensorUsage.WRITE_TO, // To initialize F_1 + MLTensorUsage.WRITE, // To initialize F_0 + MLTensorUsage.WRITE, // To initialize F_1 0 ]; -usages[N % 3] |= MLTensorUsage.READ_FROM; // To read the output +usages[N % 3] |= MLTensorUsage.READ; // To read the output const tensors = await Promise.all([ mlContext.createTensor({dataType: "int32", shape: [1], usage: usages[0]}), @@ -137,8 +144,8 @@ const tensors = await Promise.all([ mlContext.createTensor({dataType: "int32", shape: [1], usage: usages[2]}) ]); -mlContext.writeBuffer(tensors[0], new Int32Array([0])); // F_0 = 0 -mlContext.writeBuffer(tensors[1], new Int32Array([1])); // F_1 = 1 +mlContext.writeTensor(tensors[0], new Int32Array([0])); // F_0 = 0 +mlContext.writeTensor(tensors[1], new Int32Array([1])); // F_1 = 1 for (let n = 2; n <= N; n++) { // Each dispatch depends on tensors used in the previous dispatch. @@ -147,7 +154,7 @@ for (let n = 2; n <= N; n++) { {'F_n': tensors[n % 3]}); } -const f_n = new Int32Array(await mlContext.readBuffer(tensors[N % 3]))[0]; +const f_n = new Int32Array(await mlContext.readTensor(tensors[N % 3]))[0]; ``` ### Resource Management @@ -179,7 +186,8 @@ mlContext.dispatch(graph1, inputs, outputs); // Explicitly ask for its resources to be released! graph1.destroy(); -// We can selectively release only the resources we expect won't be needed. 
+// We can selectively release only the resources we expect won't be needed +// by calling destroy() on a subset of MLTensors. destroyBuffers(inputs); // Don't destroy the output tensors yet, in case we want to reuse them later. @@ -193,9 +201,9 @@ const constant = builder.constant(descriptor, veryLargeBufferOfWeights); A privacy-conscious user wants to perform real-time selfie segmentation of a video feed on their local device. -Currently, using WebNN for this task would require - for each frame - an expensive readback of `GPUBuffer` data to script, uploading the data to the ML context device (which may be the same GPU!), copying the result back to script, and then uploading the frame to be rendered back into a `GPUBuffer`. This is unlikely to be performed in real-time. +Currently, using WebNN for this task would require - for each frame - an expensive readback of `GPUBuffer` data to script, uploading the data to the ML context device (which may be the same GPU!), copying the result back to script, and then uploading the frame to be rendered back into a `GPUBuffer`. -An `MLTensor` may be imported into WebGPU, which in the best case provides zero-copy buffer sharing between the two APIs, and in all cases provides a synchronization mechanism between the respective WebNN and WebGPU [timelines](https://www.w3.org/TR/webgpu/#programming-model-timelines), avoiding the need for expensive synchronization via script. +An `MLTensor` may be imported into WebGPU, minimizing the number of buffer copies required to render the results of some ML compute. Zero-copy buffer sharing between the two APIs may be supported in some cases. ```js // Create a couple MLTensors to be used to facilitate WebGPU interop. @@ -205,8 +213,8 @@ const mlTensor2 = await mlContext.createTensor({..., usage: MLTensorUsage.WEBGPU const applyEffectToFrame = async () => { const gpuVideoTexture = gpuDevice.importExternalTexture({source: video}); - // Rent out the MLTensor to WebGPU. - const tensorizedGpuBuffer = gpuDevice.importExternalBuffer(mlTensor1); + // Wait for all ML work involving `mlTensor1` to complete, then rent it out to WebGPU. + const tensorizedGpuBuffer = await gpuDevice.importExternalBuffer(mlTensor1); // Create a bind group for `gpuVideoTexture`, create a command encoder, etc. // to "tensorize" `gpuVideoTexture` and store the result in `tensorizedGpuBuffer` @@ -225,8 +233,8 @@ const applyEffectToFrame = async () => { /*outputs=*/{'output': mlTensor2}, ); - // Rent the other MLTensor out to WebGPU. - const tensorizedGpuBufferAfterInference = gpuDevice.importExternalBuffer(mlTensor2); + // Wait for all ML work involving `mlTensor2` to complete, then rent it out to WebGPU. + const tensorizedGpuBufferAfterInference = await gpuDevice.importExternalBuffer(mlTensor2); // Create a bind group for `tensorizedGpuBufferAfterInference`, // create a command encoder, etc to feed `tensorizedGpuBufferAfterInference` @@ -258,21 +266,15 @@ The WebNN API requires the developer to declare how an `MLTensor` will be used ( For example [an `MLContext` may be created with a `GPUDevice`](https://www.w3.org/TR/webnn/#dom-ml-createcontext-gpudevice), and creating an `MLTensor` from this context with the `MLTensorUsage.WEBGPU_INTEROP` flag expresses a clear intention to share the tensor with the given `GPUDevice`. However, there is no guarantee that sharing this tensor with WebGPU will be zero-copy. 
-The `MLTensorUsage.READ_FROM` and `MLTensorUsage.WRITE_TO` flags likewise are hints to the user agent indicating that the underlying data will be read and written to, respectively, by script. +The `MLTensorUsage.READ` and `MLTensorUsage.WRITE` flags likewise are hints to the user agent indicating that the underlying data will be read and written to, respectively, by script. ### Importing an `MLTensor` to WebGPU -Any `MLTensor` created with the `MLTensorUsage.WEBGPU_INTEROP` flag may be imported into any `GPUDevice`, though cross-device buffer sharing may require expensive data copies. Sharing the tensor requires coordinating between the respective WebNN and WebGPU timelines. Below is an example of how the user agent may coordinate this handoff: +Any `MLTensor` created with the `MLTensorUsage.WEBGPU_INTEROP` flag may be imported into any `GPUDevice`. In the best case, this requires no data copies. If the underlying buffer backing the `MLTensor` is not accessible to the `GPUDevice`, this will require copying the contents of the `MLTensor` to a new buffer, then copying the contents of this buffer back to the `MLTensor` once WebGPU releases its handle to the buffer. + +While an `MLTensor` is rented to a `GPUDevice`, the `GPUDevice` has exclusive, read/write access to the imported buffer. All WebNN work depending - directly or indirectly - on the imported `MLTensor` is blocked until the `GPUDevice` returns the tensor. -- Two fences are created: - 1. a "start access" fence which is to be signaled by WebNN and waited on by WebGPU. A data copy may be required alongside the signaling of this fence - 2. an "end access" fence which is to be signaled by WebGPU and waited on by WebNN. A data copy may be required alongside the signaling of this fence -- The `GPUDevice` enqueues a command to its `GPUQueue` to wait for the "start access" fence to be signaled -- WebNN will signal the "start access" fence after the completion of all currently-enqueued operations that use the `MLTensor` which is to be imported (this is very similar to how [`GPUBuffer.mapAsync()`](https://www.w3.org/TR/webgpu/#dom-gpubuffer-mapasync) works) -- Until the "end access" fence is signaled: - - The `GPUDevice` has exclusive, read/write access to the imported buffer - - All WebNN work involving the imported `MLTensor` is blocked -- When the `GPUBuffer` is destroyed, the "end access" fence is signaled and the `MLTensor` may be used again by WebNN +Importing and returning the `MLTensor` are each points of synchronization between the respective WebNN and WebGPU [timelines](https://www.w3.org/TR/webgpu/#programming-model-timelines). The `importExternalBuffer()` method is asynchronous to allow the user agent to await completion of WebNN operations before posting WebGPU commands with the imported buffer. This is to avoid making WebGPU workloads explicitly dependent on WebNN operations, which is may not be possible on platforms which [don't support enqueuing GPU work that waits on a fence to be later signaled by the CPU](https://github.com/webmachinelearning/webnn/pull/754#discussion_r1740841364) and/or don't express ML compute in terms of GPU commands. ### `compute()` vs. `dispatch()` @@ -282,11 +284,12 @@ It's possible `compute()` may have a performance advantage on some platforms for ### Open Questions -- How will errors be surfaced? 
Do we need a concept similar to [WebGPU's error scopes](https://www.w3.org/TR/webgpu/#error-scopes), or is [returning errors via a promise for select operations sufficient](https://github.com/webmachinelearning/webnn/issues/697#issuecomment-2195656878)? See [#477](https://github.com/webmachinelearning/webnn/issues/477) -- On non-UMA systems, does the user agent have enough information to appropriately allocate an `MLTensor` if an `MLDeviceType` is not used for creating an `MLContext`? See [#350](https://github.com/webmachinelearning/webnn/issues/350) and [#749](https://github.com/webmachinelearning/webnn/issues/749) -- Should the `dispatch()` method be a part of the `MLGraph` interface rather than `MLContext`? Should `readBuffer()` and `writeBuffer()` exist on an `MLTensor`? See [#697](https://github.com/webmachinelearning/webnn/issues/697). +- How will errors be surfaced? Do we need a concept similar to [WebGPU's error scopes](https://www.w3.org/TR/webgpu/#error-scopes), or is [returning errors via a promise for select operations](https://github.com/webmachinelearning/webnn/issues/697#issuecomment-2195656878) and losing the `MLContext` sufficient? See [#477](https://github.com/webmachinelearning/webnn/issues/477) +- Does the user agent have enough information to appropriately allocate an `MLTensor` if an `MLDeviceType` is not used for creating an `MLContext`? See [#350](https://github.com/webmachinelearning/webnn/issues/350) and [#749](https://github.com/webmachinelearning/webnn/issues/749) +- Should the `dispatch()` method be a part of the `MLGraph` interface rather than `MLContext`? Should `readTensor()` and `writeTensor()` exist on an `MLTensor`? See [#697](https://github.com/webmachinelearning/webnn/issues/697). - If an `MLContext` is not created from a `GPUDevice`, does there need to be some mechanism - above and beyond the `MLTensorUsage.WEBGPU_INTEROP` flag - for identifying the specific `GPUDevice` with which interop is desired? - What are the usage flags of a `GPUBuffer` created from an `MLTensor`? +- Is a sync variant of the `importExternalBuffer()` method feasible on platforms where the WebNN timeline _is_ the WebGPU timeline? (i.e. 
ML compute is expressed in terms of GPU commands on the same `GPUDevice`) ## Considered Alternatives @@ -371,8 +374,8 @@ Many thanks for valuable feedback and advice from: typedef [EnforceRange] unsigned long MLTensorUsageFlags; namespace MLTensorUsage { - const MLFlagsConstant READ_FROM = 0x0001; - const MLFlagsConstant WRITE_TO = 0x0002; + const MLFlagsConstant READ = 0x0001; + const MLFlagsConstant WRITE = 0x0002; const MLFlagsConstant WEBGPU_INTEROP = 0x0004; }; @@ -393,12 +396,12 @@ interface MLTensor { partial interface MLContext { Promise createTensor(MLTensorDescriptor descriptor); - void writeBuffer(MLTensor dstTensor, [AllowShared] ArrayBuffer srcData); - void writeBuffer(MLTensor dstTensor, [AllowShared] ArrayBufferView srcData); + void writeTensor(MLTensor tensor, [AllowShared] ArrayBuffer sourceData); + void writeTensor(MLTensor tensor, [AllowShared] ArrayBufferView sourceData); - Promise readBuffer(MLTensor srcTensor); - Promise readBuffer(MLTensor srcTensor, [AllowShared] ArrayBuffer dstData); - Promise readBuffer(MLTensor srcTensor, [AllowShared] ArrayBufferView dstData); + Promise readTensor(MLTensor tensor); + Promise readTensor(MLTensor tensor, [AllowShared] ArrayBuffer outputData); + Promise readTensor(MLTensor tensor, [AllowShared] ArrayBufferView outputData); void dispatch(MLGraph graph, MLNamedTensors inputs, MLNamedTensors outputs); }; @@ -414,7 +417,7 @@ dictionary GPUExternalBufferDescriptor }; partial interface GPUDevice { - GPUExternalBuffer importExternalBuffer(GPUExternalBufferDescriptor descriptor); + Promise importExternalBuffer(GPUExternalBufferDescriptor descriptor); } partial interface ML { From c3f2e6b172e58d3b3eff4f47d8153f9884600a87 Mon Sep 17 00:00:00 2001 From: Austin Sullivan Date: Wed, 11 Sep 2024 17:22:34 -0700 Subject: [PATCH 07/14] Use GPUExternalBuffer with STORAGE usage flag --- mltensor-explainer.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/mltensor-explainer.md b/mltensor-explainer.md index a3259826..cd6a85b2 100644 --- a/mltensor-explainer.md +++ b/mltensor-explainer.md @@ -223,7 +223,7 @@ const applyEffectToFrame = async () => { gpuDevice.queue.submit([tensorizationCommandEncoder.finish()]); // Return the buffer to WebNN. - tensorizedGpuBuffer.destroy(); + tensorizedGpuBuffer.release(); // Perform some inference described by `graph` on the frame // (e.g. selfie segmentation) @@ -245,7 +245,7 @@ const applyEffectToFrame = async () => { gpuDevice.queue.submit([texturizeAndRenderCommandEncoder.finish()]); // Return the buffer to WebNN for the next frame. - tensorizedGpuBufferAfterInference.destroy(); + tensorizedGpuBufferAfterInference.release(); // Call this method for each frame. video.requestVideoFrameCallback(applyEffectToFrame); @@ -272,7 +272,7 @@ The `MLTensorUsage.READ` and `MLTensorUsage.WRITE` flags likewise are hints to t Any `MLTensor` created with the `MLTensorUsage.WEBGPU_INTEROP` flag may be imported into any `GPUDevice`. In the best case, this requires no data copies. If the underlying buffer backing the `MLTensor` is not accessible to the `GPUDevice`, this will require copying the contents of the `MLTensor` to a new buffer, then copying the contents of this buffer back to the `MLTensor` once WebGPU releases its handle to the buffer. -While an `MLTensor` is rented to a `GPUDevice`, the `GPUDevice` has exclusive, read/write access to the imported buffer. All WebNN work depending - directly or indirectly - on the imported `MLTensor` is blocked until the `GPUDevice` returns the tensor. 
+While an `MLTensor` is rented to a `GPUDevice`, the `GPUDevice` has exclusive, read/write access to the imported buffer, which is created as a `GPUExternalBuffer` with `GPUBufferUsageFlags.STORAGE`. All WebNN work depending - directly or indirectly - on the imported `MLTensor` is blocked until the `GPUDevice` returns the tensor. Importing and returning the `MLTensor` are each points of synchronization between the respective WebNN and WebGPU [timelines](https://www.w3.org/TR/webgpu/#programming-model-timelines). The `importExternalBuffer()` method is asynchronous to allow the user agent to await completion of WebNN operations before posting WebGPU commands with the imported buffer. This is to avoid making WebGPU workloads explicitly dependent on WebNN operations, which is may not be possible on platforms which [don't support enqueuing GPU work that waits on a fence to be later signaled by the CPU](https://github.com/webmachinelearning/webnn/pull/754#discussion_r1740841364) and/or don't express ML compute in terms of GPU commands. @@ -288,7 +288,6 @@ It's possible `compute()` may have a performance advantage on some platforms for - Does the user agent have enough information to appropriately allocate an `MLTensor` if an `MLDeviceType` is not used for creating an `MLContext`? See [#350](https://github.com/webmachinelearning/webnn/issues/350) and [#749](https://github.com/webmachinelearning/webnn/issues/749) - Should the `dispatch()` method be a part of the `MLGraph` interface rather than `MLContext`? Should `readTensor()` and `writeTensor()` exist on an `MLTensor`? See [#697](https://github.com/webmachinelearning/webnn/issues/697). - If an `MLContext` is not created from a `GPUDevice`, does there need to be some mechanism - above and beyond the `MLTensorUsage.WEBGPU_INTEROP` flag - for identifying the specific `GPUDevice` with which interop is desired? -- What are the usage flags of a `GPUBuffer` created from an `MLTensor`? - Is a sync variant of the `importExternalBuffer()` method feasible on platforms where the WebNN timeline _is_ the WebGPU timeline? (i.e. 
ML compute is expressed in terms of GPU commands on the same `GPUDevice`) ## Considered Alternatives @@ -408,7 +407,9 @@ partial interface MLContext { // For WebGPU Interop -interface GPUExternalBuffer {}; +interface GPUExternalBuffer { + undefined release(); +}; GPUExternalBuffer includes GPUObjectBase; dictionary GPUExternalBufferDescriptor From f381d21f9e3ee1ac46ef45452081c7ff89d74c18 Mon Sep 17 00:00:00 2001 From: Austin Sullivan Date: Wed, 11 Sep 2024 17:27:03 -0700 Subject: [PATCH 08/14] s/sourceData/inputData --- mltensor-explainer.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mltensor-explainer.md b/mltensor-explainer.md index cd6a85b2..2f2f589a 100644 --- a/mltensor-explainer.md +++ b/mltensor-explainer.md @@ -395,8 +395,8 @@ interface MLTensor { partial interface MLContext { Promise createTensor(MLTensorDescriptor descriptor); - void writeTensor(MLTensor tensor, [AllowShared] ArrayBuffer sourceData); - void writeTensor(MLTensor tensor, [AllowShared] ArrayBufferView sourceData); + void writeTensor(MLTensor tensor, [AllowShared] ArrayBuffer inputData); + void writeTensor(MLTensor tensor, [AllowShared] ArrayBufferView inputData); Promise readTensor(MLTensor tensor); Promise readTensor(MLTensor tensor, [AllowShared] ArrayBuffer outputData); From e74f1aa29fa59ad0e4fe7cc283e94ba15433071b Mon Sep 17 00:00:00 2001 From: Austin Sullivan Date: Wed, 11 Sep 2024 18:04:49 -0700 Subject: [PATCH 09/14] remove open question about cross-GPU-device interop --- mltensor-explainer.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/mltensor-explainer.md b/mltensor-explainer.md index 2f2f589a..214a288a 100644 --- a/mltensor-explainer.md +++ b/mltensor-explainer.md @@ -285,9 +285,8 @@ It's possible `compute()` may have a performance advantage on some platforms for ### Open Questions - How will errors be surfaced? Do we need a concept similar to [WebGPU's error scopes](https://www.w3.org/TR/webgpu/#error-scopes), or is [returning errors via a promise for select operations](https://github.com/webmachinelearning/webnn/issues/697#issuecomment-2195656878) and losing the `MLContext` sufficient? See [#477](https://github.com/webmachinelearning/webnn/issues/477) -- Does the user agent have enough information to appropriately allocate an `MLTensor` if an `MLDeviceType` is not used for creating an `MLContext`? See [#350](https://github.com/webmachinelearning/webnn/issues/350) and [#749](https://github.com/webmachinelearning/webnn/issues/749) +- Does the user agent have enough information to appropriately allocate an `MLTensor` if an `MLDeviceType` or `GPUDevice` is not used to create an `MLContext`? See [#350](https://github.com/webmachinelearning/webnn/issues/350) and [#749](https://github.com/webmachinelearning/webnn/issues/749) - Should the `dispatch()` method be a part of the `MLGraph` interface rather than `MLContext`? Should `readTensor()` and `writeTensor()` exist on an `MLTensor`? See [#697](https://github.com/webmachinelearning/webnn/issues/697). -- If an `MLContext` is not created from a `GPUDevice`, does there need to be some mechanism - above and beyond the `MLTensorUsage.WEBGPU_INTEROP` flag - for identifying the specific `GPUDevice` with which interop is desired? - Is a sync variant of the `importExternalBuffer()` method feasible on platforms where the WebNN timeline _is_ the WebGPU timeline? (i.e. 
ML compute is expressed in terms of GPU commands on the same `GPUDevice`) ## Considered Alternatives From 9b274d2debb63c26d120167c1a50b65986b17d4b Mon Sep 17 00:00:00 2001 From: Austin Sullivan Date: Tue, 1 Oct 2024 11:25:40 -0700 Subject: [PATCH 10/14] address domenic feedback --- mltensor-explainer.md | 57 ++++++++++++++++++++++--------------------- 1 file changed, 29 insertions(+), 28 deletions(-) diff --git a/mltensor-explainer.md b/mltensor-explainer.md index 214a288a..34e9970c 100644 --- a/mltensor-explainer.md +++ b/mltensor-explainer.md @@ -14,7 +14,7 @@ This explainer proposes an `MLTensor` interface which represents a tensor which The machine learning context underlying WebNN may require input and output tensors to be allocated in a specific fashion, such as with a given byte alignment or on a given compute unit (e.g. CPU, GPU, NPU, TPU, etc...). Currently, this requires that the implementation of an `MLGraph` copy in data from the input tensors, execute the graph, and then copy out data from the output tensors. -An `MLTensor` is an opaque tensor which may be created, written to, and read from independently from `MLGraph` inference. Each of these operations is performed on the [timeline](\#timelines) of the associated MLContext, with a clearly defined order of operations. Passing `MLTensor`s as input and output tensors to `MLGraph` inference - as opposed to passing `ArrayBufferView`s as is done today - allows for a decoupling of the uploading/downloading of model inputs/outputs from the model execution itself. This provides several benefits, such as buffer reuse, chained inference, explicit destruction, and the opportunity to interop with WebGPU. +An `MLTensor` is an opaque tensor which may be created, written to, and read from independently from `MLGraph` inference. Each of these operations is performed on the [timeline](\#timelines) of the associated `MLContext`, with a clearly defined order of operations. Passing `MLTensor`s as input and output tensors to `MLGraph` inference - as opposed to passing `ArrayBufferView`s as is done today - allows for a decoupling of the uploading/downloading of model inputs/outputs from the model execution itself. This provides several benefits, such as buffer reuse, chained inference, explicit destruction, and the opportunity to share memory with WebGPU. ## Goals @@ -56,7 +56,7 @@ await mlContext.compute(graph3, {'input': imageAsArrayBuffer}, {'output': output // Proposed approach to reuse a given input buffer, using an input MLTensor // Copy the image data into the required format. -const imageAsMlTensor = await mlContext.createTensor({..., usage: MLTensorUsage.WRITE}); +const imageAsMlTensor = await mlContext.createTensor({..., usage: {writable: true}}); mlContext.writeTensor(imageAsMlTensor, imageAsArrayBuffer); // Execute the graphs - no additional copies! @@ -95,7 +95,7 @@ Using `MLTensor` enables a programming model similar to [WebGPU's](https://www.w // Post a task to the ML context timeline to allocate and zero out a tensor, // then return to script. -const imageAsMlTensor = await mlContext.createTensor({..., usage: MLTensorUsage.WRITE}); +const imageAsMlTensor = await mlContext.createTensor({..., usage: {writable: true}}); // Post a task to the ML context timeline to write to the tensor. Note that we do // not await completion of this write. The ML context will ensure any operations @@ -114,10 +114,10 @@ mlContext.dispatch(graph3, {'input': imageAsMlTensor}, {'output': outputMlTensor // Post tasks to read the output tensors. 
These tasks will queue behind the // respective dispatch() calls using each tensor. const outputs = await Promise.all([ - outputMlTensor1, - outputMlTensor2, - outputMlTensor3 - ].map((tensor) => { return mlContext.readTensor(tensor); })); + mlContext.readTensor(outputMlTensor1), + mlContext.readTensor(outputMlTensor2), + mlContext.readTensor(outputMlTensor3) + ]); ``` Since the queueing mechanism respects data dependencies, chained inference allows an `MLTensor` to be passed as an output from one graph and then immediately as an input to the next. A collection of graphs and buffers may be repeatedly dispatched without the need for synchronization via script. @@ -132,11 +132,11 @@ const add = builder.add(fn1, fn2); const graph = await builder.build({'F_n': add}); const usages = [ - MLTensorUsage.WRITE, // To initialize F_0 - MLTensorUsage.WRITE, // To initialize F_1 - 0 + {writable: true}, // To initialize F_0 + {writable: true}, // To initialize F_1 + {} ]; -usages[N % 3] |= MLTensorUsage.READ; // To read the output +usages[N % 3]['readable'] = true // To read the output const tensors = await Promise.all([ mlContext.createTensor({dataType: "int32", shape: [1], usage: usages[0]}), @@ -206,9 +206,9 @@ Currently, using WebNN for this task would require - for each frame - an expensi An `MLTensor` may be imported into WebGPU, minimizing the number of buffer copies required to render the results of some ML compute. Zero-copy buffer sharing between the two APIs may be supported in some cases. ```js -// Create a couple MLTensors to be used to facilitate WebGPU interop. -const mlTensor1 = await mlContext.createTensor({..., usage: MLTensorUsage.WEBGPU_INTEROP}); -const mlTensor2 = await mlContext.createTensor({..., usage: MLTensorUsage.WEBGPU_INTEROP}); +// Create a couple MLTensors to be shared with WebGPU. +const mlTensor1 = await mlContext.createTensor({..., usage: {importableToWebGPU: true}}); +const mlTensor2 = await mlContext.createTensor({..., usage: {importableToWebGPU: true}}); const applyEffectToFrame = async () => { const gpuVideoTexture = gpuDevice.importExternalTexture({source: video}); @@ -262,19 +262,19 @@ Specifying WebNN timelines is tracked in [#529](https://github.com/webmachinelea ### Device Affinity and Relationship to a `GPUDevice` -The WebNN API requires the developer to declare how an `MLTensor` will be used (via `MLTensorUsageFlags`), which the user agent may use as a hint in deciding where to allocate the memory backing an `MLTensor`. Where the memory is ultimately allocated is up to the user agent. +The WebNN API requires the developer to declare how an `MLTensor` will be used (via `MLTensorUsage`), which the user agent may use as a hint in deciding where to allocate the memory backing an `MLTensor`. Where the memory is ultimately allocated is up to the user agent. -For example [an `MLContext` may be created with a `GPUDevice`](https://www.w3.org/TR/webnn/#dom-ml-createcontext-gpudevice), and creating an `MLTensor` from this context with the `MLTensorUsage.WEBGPU_INTEROP` flag expresses a clear intention to share the tensor with the given `GPUDevice`. However, there is no guarantee that sharing this tensor with WebGPU will be zero-copy. +For example [an `MLContext` may be created with a `GPUDevice`](https://www.w3.org/TR/webnn/#dom-ml-createcontext-gpudevice), and creating an `MLTensor` from this context with the `MLTensorUsage.importableToWebGPU` flag expresses a clear intention to share the tensor with the given `GPUDevice`. 
However, there is no guarantee that sharing this tensor with WebGPU will be zero-copy. -The `MLTensorUsage.READ` and `MLTensorUsage.WRITE` flags likewise are hints to the user agent indicating that the underlying data will be read and written to, respectively, by script. +The `MLTensorUsage.readable` and `MLTensorUsage.writable` flags likewise are hints to the user agent indicating that the underlying data will be read and written to, respectively, by script. ### Importing an `MLTensor` to WebGPU -Any `MLTensor` created with the `MLTensorUsage.WEBGPU_INTEROP` flag may be imported into any `GPUDevice`. In the best case, this requires no data copies. If the underlying buffer backing the `MLTensor` is not accessible to the `GPUDevice`, this will require copying the contents of the `MLTensor` to a new buffer, then copying the contents of this buffer back to the `MLTensor` once WebGPU releases its handle to the buffer. +Any `MLTensor` created with the `MLTensorUsage.importableToWebGPU` flag may be imported into any `GPUDevice`. In the best case, this requires no data copies. If the underlying buffer backing the `MLTensor` is not accessible to the `GPUDevice`, this will require copying the contents of the `MLTensor` to a new buffer, then copying the contents of this buffer back to the `MLTensor` once WebGPU releases its handle to the buffer. While an `MLTensor` is rented to a `GPUDevice`, the `GPUDevice` has exclusive, read/write access to the imported buffer, which is created as a `GPUExternalBuffer` with `GPUBufferUsageFlags.STORAGE`. All WebNN work depending - directly or indirectly - on the imported `MLTensor` is blocked until the `GPUDevice` returns the tensor. -Importing and returning the `MLTensor` are each points of synchronization between the respective WebNN and WebGPU [timelines](https://www.w3.org/TR/webgpu/#programming-model-timelines). The `importExternalBuffer()` method is asynchronous to allow the user agent to await completion of WebNN operations before posting WebGPU commands with the imported buffer. This is to avoid making WebGPU workloads explicitly dependent on WebNN operations, which is may not be possible on platforms which [don't support enqueuing GPU work that waits on a fence to be later signaled by the CPU](https://github.com/webmachinelearning/webnn/pull/754#discussion_r1740841364) and/or don't express ML compute in terms of GPU commands. +Importing and returning the `MLTensor` are each points of synchronization between the respective WebNN and WebGPU [timelines](https://www.w3.org/TR/webgpu/#programming-model-timelines). The `importExternalBuffer()` method is asynchronous to allow the user agent to await completion of WebNN operations before posting WebGPU commands with the imported buffer. This is to avoid making WebGPU workloads - which may involve compositing - explicitly dependent on WebNN operations, which may be inefficient (e.g. if ML compute is not expressed in terms of GPU commands) or impossible (e.g. [some platforms don't support enqueuing GPU work that waits on a fence to be later signaled by the CPU](https://github.com/webmachinelearning/webnn/pull/754#discussion_r1740841364)) on some platforms. ### `compute()` vs. `dispatch()` @@ -287,7 +287,7 @@ It's possible `compute()` may have a performance advantage on some platforms for - How will errors be surfaced? 
Do we need a concept similar to [WebGPU's error scopes](https://www.w3.org/TR/webgpu/#error-scopes), or is [returning errors via a promise for select operations](https://github.com/webmachinelearning/webnn/issues/697#issuecomment-2195656878) and losing the `MLContext` sufficient? See [#477](https://github.com/webmachinelearning/webnn/issues/477) - Does the user agent have enough information to appropriately allocate an `MLTensor` if an `MLDeviceType` or `GPUDevice` is not used to create an `MLContext`? See [#350](https://github.com/webmachinelearning/webnn/issues/350) and [#749](https://github.com/webmachinelearning/webnn/issues/749) - Should the `dispatch()` method be a part of the `MLGraph` interface rather than `MLContext`? Should `readTensor()` and `writeTensor()` exist on an `MLTensor`? See [#697](https://github.com/webmachinelearning/webnn/issues/697). -- Is a sync variant of the `importExternalBuffer()` method feasible on platforms where the WebNN timeline _is_ the WebGPU timeline? (i.e. ML compute is expressed in terms of GPU commands on the same `GPUDevice`) +- Is a sync variant of the `importExternalBuffer()` method feasible (1) on platforms where completion of ML compute can be signaled on a GPU timeline, or (2) when blocking WebGPU workloads which do not themselves block compositing. ## Considered Alternatives @@ -369,16 +369,14 @@ Many thanks for valuable feedback and advice from: ### Tentative IDL ```javascript -typedef [EnforceRange] unsigned long MLTensorUsageFlags; - -namespace MLTensorUsage { - const MLFlagsConstant READ = 0x0001; - const MLFlagsConstant WRITE = 0x0002; - const MLFlagsConstant WEBGPU_INTEROP = 0x0004; +dictionary MLTensorUsage { + boolean readable = false; + boolean writable = false; + boolean importableToWebGPU = false; }; dictionary MLTensorDescriptor : MLOperandDescriptor { - required MLTensorUsageFlags usage; + MLTensorUsage usage; }; typedef record MLNamedTensors; @@ -386,7 +384,10 @@ typedef record MLNamedTensors; interface MLTensor { readonly attribute MLOperandDataType dataType; readonly attribute FrozenArray shape; - readonly attribute unsigned long MLTensorUsageFlags usage; + // Usage getters + readonly attribute boolean readable; + readonly attribute boolean writable; + readonly attribute boolean importableToWebGPU; void destroy(); }; From 9a4fcb395ad84e7e390dee1c2fa874013774f3d9 Mon Sep 17 00:00:00 2001 From: Austin Sullivan Date: Tue, 1 Oct 2024 20:34:55 -0700 Subject: [PATCH 11/14] inline the formerly-nested MLTensorUsage dict into MLTensorDescriptor --- mltensor-explainer.md | 39 +++++++++++++++++---------------------- 1 file changed, 17 insertions(+), 22 deletions(-) diff --git a/mltensor-explainer.md b/mltensor-explainer.md index 34e9970c..316ab3b9 100644 --- a/mltensor-explainer.md +++ b/mltensor-explainer.md @@ -56,7 +56,7 @@ await mlContext.compute(graph3, {'input': imageAsArrayBuffer}, {'output': output // Proposed approach to reuse a given input buffer, using an input MLTensor // Copy the image data into the required format. -const imageAsMlTensor = await mlContext.createTensor({..., usage: {writable: true}}); +const imageAsMlTensor = await mlContext.createTensor({..., writable: true}); mlContext.writeTensor(imageAsMlTensor, imageAsArrayBuffer); // Execute the graphs - no additional copies! @@ -95,7 +95,7 @@ Using `MLTensor` enables a programming model similar to [WebGPU's](https://www.w // Post a task to the ML context timeline to allocate and zero out a tensor, // then return to script. 
-const imageAsMlTensor = await mlContext.createTensor({..., usage: {writable: true}}); +const imageAsMlTensor = await mlContext.createTensor({..., writable: true}); // Post a task to the ML context timeline to write to the tensor. Note that we do // not await completion of this write. The ML context will ensure any operations @@ -131,17 +131,17 @@ const fn2 = builder.input('F_n-2', {dataType: "int32", shape: [1]}); const add = builder.add(fn1, fn2); const graph = await builder.build({'F_n': add}); -const usages = [ - {writable: true}, // To initialize F_0 - {writable: true}, // To initialize F_1 - {} +const descriptors = [ + {dataType: "int32", shape: [1], writable: true}, // To initialize F_0 + {dataType: "int32", shape: [1], writable: true}, // To initialize F_1 + {dataType: "int32", shape: [1]} ]; -usages[N % 3]['readable'] = true // To read the output +descriptors[N % 3]['readable'] = true // To read the output const tensors = await Promise.all([ - mlContext.createTensor({dataType: "int32", shape: [1], usage: usages[0]}), - mlContext.createTensor({dataType: "int32", shape: [1], usage: usages[1]}), - mlContext.createTensor({dataType: "int32", shape: [1], usage: usages[2]}) + mlContext.createTensor(descriptors[0]), + mlContext.createTensor(descriptors[1]), + mlContext.createTensor(descriptors[2]) ]); mlContext.writeTensor(tensors[0], new Int32Array([0])); // F_0 = 0 @@ -207,8 +207,8 @@ An `MLTensor` may be imported into WebGPU, minimizing the number of buffer copie ```js // Create a couple MLTensors to be shared with WebGPU. -const mlTensor1 = await mlContext.createTensor({..., usage: {importableToWebGPU: true}}); -const mlTensor2 = await mlContext.createTensor({..., usage: {importableToWebGPU: true}}); +const mlTensor1 = await mlContext.createTensor({..., importableToWebGPU: true}); +const mlTensor2 = await mlContext.createTensor({..., importableToWebGPU: true}); const applyEffectToFrame = async () => { const gpuVideoTexture = gpuDevice.importExternalTexture({source: video}); @@ -262,15 +262,15 @@ Specifying WebNN timelines is tracked in [#529](https://github.com/webmachinelea ### Device Affinity and Relationship to a `GPUDevice` -The WebNN API requires the developer to declare how an `MLTensor` will be used (via `MLTensorUsage`), which the user agent may use as a hint in deciding where to allocate the memory backing an `MLTensor`. Where the memory is ultimately allocated is up to the user agent. +The WebNN API requires the developer to declare how an `MLTensor` will be used (via `MLTensorDescriptor`), which the user agent may use as a hint in deciding where to allocate the memory backing an `MLTensor`. Where the memory is ultimately allocated is up to the user agent. -For example [an `MLContext` may be created with a `GPUDevice`](https://www.w3.org/TR/webnn/#dom-ml-createcontext-gpudevice), and creating an `MLTensor` from this context with the `MLTensorUsage.importableToWebGPU` flag expresses a clear intention to share the tensor with the given `GPUDevice`. However, there is no guarantee that sharing this tensor with WebGPU will be zero-copy. +For example [an `MLContext` may be created with a `GPUDevice`](https://www.w3.org/TR/webnn/#dom-ml-createcontext-gpudevice), and creating an `MLTensor` from this context with the `MLTensorDescriptor.importableToWebGPU` flag expresses a clear intention to share the tensor with the given `GPUDevice`. However, there is no guarantee that sharing this tensor with WebGPU will be zero-copy. 
-The `MLTensorUsage.readable` and `MLTensorUsage.writable` flags likewise are hints to the user agent indicating that the underlying data will be read and written to, respectively, by script. +The `MLTensorDescriptor.readable` and `MLTensorDescriptor.writable` flags likewise are hints to the user agent indicating that the underlying data will be read and written to, respectively, by script. ### Importing an `MLTensor` to WebGPU -Any `MLTensor` created with the `MLTensorUsage.importableToWebGPU` flag may be imported into any `GPUDevice`. In the best case, this requires no data copies. If the underlying buffer backing the `MLTensor` is not accessible to the `GPUDevice`, this will require copying the contents of the `MLTensor` to a new buffer, then copying the contents of this buffer back to the `MLTensor` once WebGPU releases its handle to the buffer. +Any `MLTensor` created with the `MLTensorDescriptor.importableToWebGPU` flag may be imported into any `GPUDevice`. In the best case, this requires no data copies. If the underlying buffer backing the `MLTensor` is not accessible to the `GPUDevice`, this will require copying the contents of the `MLTensor` to a new buffer, then copying the contents of this buffer back to the `MLTensor` once WebGPU releases its handle to the buffer. While an `MLTensor` is rented to a `GPUDevice`, the `GPUDevice` has exclusive, read/write access to the imported buffer, which is created as a `GPUExternalBuffer` with `GPUBufferUsageFlags.STORAGE`. All WebNN work depending - directly or indirectly - on the imported `MLTensor` is blocked until the `GPUDevice` returns the tensor. @@ -369,22 +369,17 @@ Many thanks for valuable feedback and advice from: ### Tentative IDL ```javascript -dictionary MLTensorUsage { +dictionary MLTensorDescriptor : MLOperandDescriptor { boolean readable = false; boolean writable = false; boolean importableToWebGPU = false; }; -dictionary MLTensorDescriptor : MLOperandDescriptor { - MLTensorUsage usage; -}; - typedef record MLNamedTensors; interface MLTensor { readonly attribute MLOperandDataType dataType; readonly attribute FrozenArray shape; - // Usage getters readonly attribute boolean readable; readonly attribute boolean writable; readonly attribute boolean importableToWebGPU; From 8a603cf7e9dfbc44c7712003f2086a073edade4e Mon Sep 17 00:00:00 2001 From: Austin Sullivan Date: Mon, 14 Oct 2024 20:21:41 -0700 Subject: [PATCH 12/14] don't explicitly state that cross-GPU imports will be supported --- mltensor-explainer.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mltensor-explainer.md b/mltensor-explainer.md index 316ab3b9..bbdd56f0 100644 --- a/mltensor-explainer.md +++ b/mltensor-explainer.md @@ -270,9 +270,9 @@ The `MLTensorDescriptor.readable` and `MLTensorDescriptor.writable` flags likewi ### Importing an `MLTensor` to WebGPU -Any `MLTensor` created with the `MLTensorDescriptor.importableToWebGPU` flag may be imported into any `GPUDevice`. In the best case, this requires no data copies. If the underlying buffer backing the `MLTensor` is not accessible to the `GPUDevice`, this will require copying the contents of the `MLTensor` to a new buffer, then copying the contents of this buffer back to the `MLTensor` once WebGPU releases its handle to the buffer. +An `MLTensor` created with the `MLTensorDescriptor.importableToWebGPU` flag may be imported as a `GPUBuffer` to a `GPUDevice`. In the best case, this requires no data copies. 
If the underlying buffer backing the `MLTensor` is not accessible to the `GPUDevice`, this will require copying the contents of the `MLTensor` to a new buffer, then copying the contents of this buffer back to the `MLTensor` once WebGPU releases its handle to the buffer. -While an `MLTensor` is rented to a `GPUDevice`, the `GPUDevice` has exclusive, read/write access to the imported buffer, which is created as a `GPUExternalBuffer` with `GPUBufferUsageFlags.STORAGE`. All WebNN work depending - directly or indirectly - on the imported `MLTensor` is blocked until the `GPUDevice` returns the tensor. +While an `MLTensor` is rented to a `GPUDevice`, the `GPUDevice` has exclusive, read/write access to the imported buffer, which is created as a `GPUExternalBuffer` with `GPUBufferUsageFlags.STORAGE`, `GPUBufferUsageFlags.COPY_SRC`, and `GPUBufferUsageFlags.COPY_DST`. All WebNN work depending - directly or indirectly - on the imported `MLTensor` is blocked until the `GPUDevice` returns the tensor. Importing and returning the `MLTensor` are each points of synchronization between the respective WebNN and WebGPU [timelines](https://www.w3.org/TR/webgpu/#programming-model-timelines). The `importExternalBuffer()` method is asynchronous to allow the user agent to await completion of WebNN operations before posting WebGPU commands with the imported buffer. This is to avoid making WebGPU workloads - which may involve compositing - explicitly dependent on WebNN operations, which may be inefficient (e.g. if ML compute is not expressed in terms of GPU commands) or impossible (e.g. [some platforms don't support enqueuing GPU work that waits on a fence to be later signaled by the CPU](https://github.com/webmachinelearning/webnn/pull/754#discussion_r1740841364)) on some platforms. From 4af9354e433a871d48a3845177a767190b6c9696 Mon Sep 17 00:00:00 2001 From: Austin Sullivan Date: Wed, 23 Oct 2024 08:56:42 -0700 Subject: [PATCH 13/14] add WGSL code --- mltensor-explainer.md | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-) diff --git a/mltensor-explainer.md b/mltensor-explainer.md index bbdd56f0..9fc1da14 100644 --- a/mltensor-explainer.md +++ b/mltensor-explainer.md @@ -223,7 +223,7 @@ const applyEffectToFrame = async () => { gpuDevice.queue.submit([tensorizationCommandEncoder.finish()]); // Return the buffer to WebNN. - tensorizedGpuBuffer.release(); + tensorizedGpuBuffer.destroy(); // Perform some inference described by `graph` on the frame // (e.g. selfie segmentation) @@ -272,7 +272,15 @@ The `MLTensorDescriptor.readable` and `MLTensorDescriptor.writable` flags likewi An `MLTensor` created with the `MLTensorDescriptor.importableToWebGPU` flag may be imported as a `GPUBuffer` to a `GPUDevice`. In the best case, this requires no data copies. If the underlying buffer backing the `MLTensor` is not accessible to the `GPUDevice`, this will require copying the contents of the `MLTensor` to a new buffer, then copying the contents of this buffer back to the `MLTensor` once WebGPU releases its handle to the buffer. -While an `MLTensor` is rented to a `GPUDevice`, the `GPUDevice` has exclusive, read/write access to the imported buffer, which is created as a `GPUExternalBuffer` with `GPUBufferUsageFlags.STORAGE`, `GPUBufferUsageFlags.COPY_SRC`, and `GPUBufferUsageFlags.COPY_DST`. All WebNN work depending - directly or indirectly - on the imported `MLTensor` is blocked until the `GPUDevice` returns the tensor. 
+While an `MLTensor` is rented to a `GPUDevice`, the `GPUDevice` has exclusive, read/write access to the imported buffer, which is created as a `GPUBuffer` with `GPUBufferUsageFlags.STORAGE`, `GPUBufferUsageFlags.COPY_SRC`, and `GPUBufferUsageFlags.COPY_DST`. All WebNN work depending - directly or indirectly - on the imported `MLTensor` is blocked until the `GPUDevice` returns the tensor. + +The `GPUBuffer` can be accessed as an `array<T>` in WGSL - a 1D packed array of type `T` in GPU memory. The size of the array is determined by the number of bytes of the packed `MLTensor` and `T`. For example, an `MLTensor` with `{dataType: 'int8', shape: [2, 3, 4]}` may be imported as an `array<u32>` of length 6. + +``` +// An example of how to declare the imported MLTensor as +// a GPUBuffer in a WGSL shader. +@group(0) @binding(0) var<storage, read_write> tensor: array<u32>; +``` Importing and returning the `MLTensor` are each points of synchronization between the respective WebNN and WebGPU [timelines](https://www.w3.org/TR/webgpu/#programming-model-timelines). The `importExternalBuffer()` method is asynchronous to allow the user agent to await completion of WebNN operations before posting WebGPU commands with the imported buffer. This is to avoid making WebGPU workloads - which may involve compositing - explicitly dependent on WebNN operations, which may be inefficient (e.g. if ML compute is not expressed in terms of GPU commands) or impossible (e.g. [some platforms don't support enqueuing GPU work that waits on a fence to be later signaled by the CPU](https://github.com/webmachinelearning/webnn/pull/754#discussion_r1740841364)) on some platforms. @@ -288,6 +296,7 @@ It's possible `compute()` may have a performance advantage on some platforms for - Does the user agent have enough information to appropriately allocate an `MLTensor` if an `MLDeviceType` or `GPUDevice` is not used to create an `MLContext`? See [#350](https://github.com/webmachinelearning/webnn/issues/350) and [#749](https://github.com/webmachinelearning/webnn/issues/749) - Should the `dispatch()` method be a part of the `MLGraph` interface rather than `MLContext`? Should `readTensor()` and `writeTensor()` exist on an `MLTensor`? See [#697](https://github.com/webmachinelearning/webnn/issues/697). - Is a sync variant of the `importExternalBuffer()` method feasible (1) on platforms where completion of ML compute can be signaled on a GPU timeline, or (2) when blocking WebGPU workloads which do not themselves block compositing. +- The requirement that an imported `GPUBuffer` may be represented as an `array` in WGSL is very restrictive. Could we instead create a `GPUImportendTensor` type which abstracts away the layout of the underlying tensor? 
## Considered Alternatives @@ -402,18 +411,13 @@ partial interface MLContext { // For WebGPU Interop -interface GPUExternalBuffer { - undefined release(); -}; -GPUExternalBuffer includes GPUObjectBase; - -dictionary GPUExternalBufferDescriptor +dictionary GPUImportedTensorDescriptor : GPUObjectDescriptorBase { required MLTensor source; }; partial interface GPUDevice { - Promise importExternalBuffer(GPUExternalBufferDescriptor descriptor); + Promise importExternalBuffer(GPUImportedTensorDescriptor descriptor); } partial interface ML { From ebbdf4b4a24f4337e4ce2c491666acb89b86352b Mon Sep 17 00:00:00 2001 From: Dwayne Robinson Date: Fri, 25 Oct 2024 16:44:21 -0700 Subject: [PATCH 14/14] Typo GPUImportendTensor --- mltensor-explainer.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mltensor-explainer.md b/mltensor-explainer.md index 9fc1da14..69557c2c 100644 --- a/mltensor-explainer.md +++ b/mltensor-explainer.md @@ -296,7 +296,7 @@ It's possible `compute()` may have a performance advantage on some platforms for - Does the user agent have enough information to appropriately allocate an `MLTensor` if an `MLDeviceType` or `GPUDevice` is not used to create an `MLContext`? See [#350](https://github.com/webmachinelearning/webnn/issues/350) and [#749](https://github.com/webmachinelearning/webnn/issues/749) - Should the `dispatch()` method be a part of the `MLGraph` interface rather than `MLContext`? Should `readTensor()` and `writeTensor()` exist on an `MLTensor`? See [#697](https://github.com/webmachinelearning/webnn/issues/697). - Is a sync variant of the `importExternalBuffer()` method feasible (1) on platforms where completion of ML compute can be signaled on a GPU timeline, or (2) when blocking WebGPU workloads which do not themselves block compositing. -- The requirement that an imported `GPUBuffer` may be represented as an `array` in WGSL is very restrictive. Could we instead create a `GPUImportendTensor` type which abstracts away the layout of the underlying tensor? +- The requirement that an imported `GPUBuffer` may be represented as an `array` in WGSL is very restrictive. Could we instead create a `GPUImportedTensor` type which abstracts away the layout of the underlying tensor? ## Considered Alternatives
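
Taken together, the series above leaves the API in roughly the following shape. Below is a minimal end-to-end sketch: `mlContext`, `graph`, `inputArrayBuffer`, and the tensor names, data types, and shapes are illustrative assumptions, while the calls themselves (`createTensor()`, `writeTensor()`, `dispatch()`, `readTensor()`, and `destroy()`) follow the tentative IDL as revised in these patches.

```js
// Minimal sketch of the post-series API shape. `mlContext`, `graph`, and
// `inputArrayBuffer` are assumed to already exist; the names, data types,
// and shapes below are illustrative only.
const inputTensor = await mlContext.createTensor(
    {dataType: "float32", shape: [1, 224, 224, 3], writable: true});
const outputTensor = await mlContext.createTensor(
    {dataType: "float32", shape: [1, 1000], readable: true});

// Queue a write, a dispatch, and a read on the MLContext timeline.
// Only the read needs to be awaited; the context orders the tasks.
mlContext.writeTensor(inputTensor, inputArrayBuffer);
mlContext.dispatch(graph, {'input': inputTensor}, {'output': outputTensor});
const resultBuffer = await mlContext.readTensor(outputTensor);

// Release the tensors once they are no longer needed.
inputTensor.destroy();
outputTensor.destroy();
```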