WebGL and WebGPU interops #149
Conversation
… WebGPU device.
- Make power preference part of context creation options.
- Constant operands can be created from either WebGL or WebGPU buffers.
- Model inputs and outputs can be bound with WebGL or WebGPU textures.
- Prefix all types with "ML". Simplify "NeuralNetworkContext" to just "MLContext".
- Switch to using a constructor for MLModelBuilder instead of a factory method.
Thanks @wchao1115!
Added some comments on how to make Bikeshed link to externally defined IDL that is not yet in the autolinking database.
index.bs (Outdated)
  // Create a model that encapsulates the composition of operands by identifying the output operands.
- Model createModel(NamedOperands outputs);
+ MLModel createModel(MLNamedOperands outputs);
Should we still keep MLModel building device agnostic and leave the device association to compile time? Say, to support model.compile(glContext) and model.compile(gpuDevice). With that, a developer can build the same model once and compile it for different devices.
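A rough sketch of that usage, purely illustrative: the compile(device) overloads and the descriptor/weight names below are hypothetical, not part of the current PR.
// Hypothetical device-agnostic flow: build once, compile per device.
const builder = new MLModelBuilder();     // no device bound at build time
const x = builder.input('x', {type: 'float32', dimensions: [1, 4]});
const w = builder.constant({type: 'float32', dimensions: [4, 4]}, weights);  // weights: a Float32Array
const y = builder.matmul(x, w);
const model = builder.createModel({y});
const glCompilation  = await model.compile(glContext);   // WebGL-backed compilation
const gpuCompilation = await model.compile(gpuDevice);   // WebGPU-backed compilation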
There are interop use cases where the model constants (e.g. weights) are already uploaded to GPU buffers before the model is created. In that case it would be inefficient to require the caller to round-trip that data back to CPU memory just to construct a device-agnostic model.
As we also position WebNN as a backend API for frameworks, the notion of a device-agnostic graph could get in the way of interop efficiency: by the time the backend graph is initiated (by the framework), the model constants may already be mapped to memory hosted by the framework, and for frameworks already dealing with device resources, those constants could already be uploaded to the device.
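A sketch of that interop case, following the direction of this PR (creating a constant operand directly from a GPUBuffer); the descriptor shapes and the weightsGpuBuffer/gpuDevice names are made up for illustration:
// Weights already live in a GPUBuffer uploaded by the framework.
// Building the constant directly from it avoids the CPU roundtrip described above.
const context = navigator.ml.createContext(gpuDevice);   // WebGPU-backed MLContext
const builder = new MLModelBuilder(context);
const input = builder.input('input', {type: 'float32', dimensions: [1, 3, 224, 224]});
const filter = builder.constant(
    {type: 'float32', dimensions: [32, 3, 3, 3]},
    weightsGpuBuffer);                                    // constant from a GPUBuffer
const output = builder.conv2d(input, filter);
const model = builder.createModel({output});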
It makes sense, thanks for the explanation. Given that MLModel is now device specific, would we consider merging it into MLCompilation? Say, replace MLModel createModel(MLNamedOperands outputs) with Promise<MLCompilation> compile(MLNamedOperands outputs).
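In usage terms, the suggestion would roughly change this (illustrative only; the input/output names are placeholders):
// Per this PR: two steps, createModel() then compile().
const model = builder.createModel({output});
const compilation = await model.compile();
// Suggested merge: compile directly from the builder, no separate MLModel.
const compilation2 = await builder.compile({output});
const results = await compilation2.compute({input: inputData});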
@wchao1115, any thoughts about this idea?
I can see the logic to it. I think it makes sense.
@huningxin, the #1 use case is a good one. I plan to address it by adding a device preference option in MLContextOptions. This option will be complementary to the existing power preference. As for your #2, I'm still a bit confused. I looked at the scenario that was raised in #156 (copied as follows):
c = tf.conv2d(a, b);
e = tf.conv2d(c, d);
h = tf.conv2d(f, g);
output = await h.data();
If the webnn API is going to be used this way, what prevents the caller from compiling all 3 conv2d ops into a single compiled graph before executing it in one go? That is, if the caller has a graph of A + B and wants the platform-specific intermediate result after A to stay intact before it enters B during execution, then that's exactly what a compiled graph of A + B is for. That's precisely why graph execution is most likely more efficient than op-by-op execution.
If the graph is A + b + C where only A and C are defined as webnn ops, but b is something the caller implements themselves, and the caller insists that A and C must be executed separately and sequentially, then the end result will be no worse than having A and C defined elsewhere as standalone API calls outside of the webnn graph definition.
The point is that the end result of executing a publicly defined API must not be platform-specific, i.e. the resulting tensor of the operation must be in a standard layout, whether that public API is implemented as a compiled webnn graph or as a standalone API outside of webnn. Hope this helps clarify my point of view.
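For instance, the three ops from the #156 snippet could be expressed as a single WebNN graph and compiled once; this is only a sketch, with the descriptors (descA, descB, ...) and constant buffers invented for illustration:
// The same computation as the tf.js snippet, built as one graph.
const builder = new MLModelBuilder(context);
const a = builder.input('a', descA);
const b = builder.constant(descB, dataB);
const d = builder.constant(descD, dataD);
const f = builder.input('f', descF);
const g = builder.constant(descG, dataG);
const c = builder.conv2d(a, b);
const e = builder.conv2d(c, d);
const h = builder.conv2d(f, g);
// Compile all three conv2d ops together; only the named outputs are read back.
const compilation = await builder.createModel({e, h}).compile();
const outputs = await compilation.compute({a: dataA, f: dataF});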
> adding a device preference option in the MLContextOptions. This option will be complementary to the existing power preference.
I like it!
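A sketch of how that could look; the devicePreference member and its values are hypothetical here, only the power preference exists in the current draft:
// Hypothetical context creation options combining device and power preferences.
const context = navigator.ml.createContext({
  devicePreference: 'gpu',                  // hypothetical member discussed above
  powerPreference: 'high-performance'
});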
> what prevents the caller from compiling all the 3 conv2d ops into a single compiled graph before executing it in one go?
That's because the framework provides the op API. When executing a graph, the user code just calls the op API to execute the operations of the graph one by one. The framework doesn't have knowledge of the graph; it just executes ops. cc @pyu10055
For this scenario, the framework doesn't know the user code will execute 3 conv2d ops. When the user code calls a conv2d, the framework just computes that conv2d. The framework may use a platform-specific tensor layout in the conv2d implementation. Only when the user code calls h.data() does the framework convert and copy the tensor data into an ArrayBufferView in plain layout.
So if the framework uses WebNN to compute conv2d, it would hurt performance if WebNN always converts and copies the output tensor data to an ArrayBufferView for each compute.
> The point is that the end result of executing a publicly defined API must not be platform-specific
I agree with you; at the interop points, it should use the plain/standard tensor layout.
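To make the concern concrete, a framework-side eager op might look roughly like this (entirely illustrative; whether the result can stay device-resident between calls, rather than being copied into an ArrayBufferView each time, is exactly the open question):
// Framework-internal sketch of an eager conv2d op built on WebNN.
async function webnnConv2d(input, filter) {
  const builder = new MLModelBuilder(context);
  const x = builder.input('x', input.desc);
  const w = builder.constant(filter.desc, filter.data);
  const y = builder.conv2d(x, w);
  const compilation = await builder.createModel({y}).compile();
  // If compute() always returns a plain-layout ArrayBufferView, every
  // intermediate tensor pays a convert-and-copy, even though only
  // h.data() needs CPU-visible data.
  return (await compilation.compute({x: input.data})).y;
}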
> That's because the framework provides the op API. When executing a graph, the user code just calls the op API to execute the operations of the graph one by one.
That doesn't sound like what a framework does. Are you saying that there are frameworks that never compile a graph internally and just rely on individual op execution?
A framework may provide both an op API and a model API; for example, TensorFlow.js provides an Operations API and a Models/Loading API. It depends on which API the user code uses. If the user code uses the op API to execute a graph, the framework won't see the graph, only the individual ops.
Actually, the webnn-polyfill uses the TensorFlow.js op API. I suppose it is a common scenario. cc @pyu10055
@wchao1115 @huningxin Thank you, Ningxin, for bringing up this point. Op-level execution is common for eager execution, especially in a JS environment where users can create the graph on the fly. There is no specific graph boundary, so the framework is not able to know which intermediate tensors the user will not need to access.
index.bs (Outdated)
  Promise<MLNamedOutputs> compute(MLNamedInputs inputs,
                                  optional MLNamedOutputs outputs = {});
+ Promise<MLNamedWebGLOutputs> compute(MLNamedWebGLInputs glInputs,
Since the glInputs (same for gpuInputs) are device specific, what would happen if this MLCompilation is compiled for another device? Would the browser implementation handle the data transfer, or throw an error?
Yes, it will fail. Cross-device, cross-adapter support should not occur at this level. If the context is created with one device/adapter and the input is a resource from another, the API should fail. The callers of WebNN need to ensure that all the device-dependent resources come from the device that is bound to the context.
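For example (illustrative; the exact error surface, a rejected promise or an exception, is not specified by this PR):
// Context bound to a WebGPU device; feeding it a WebGL resource should fail.
const context = navigator.ml.createContext(gpuDevice);    // WebGPU-backed MLContext
const builder = new MLModelBuilder(context);
// ... build a graph and obtain a compilation ...
try {
  await compilation.compute({input: webglTexture});       // resource from a different device
} catch (e) {
  // Expected: the API rejects rather than silently transferring the data.
}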
It makes sense. We may need to specify this behavior in a follow-up PR.
…upport for accepting GPU and WebGLBuffer as inputs.
…pdate examples accordingly.
And there is a Bikeshed build error.
The spec still uses {{MLModel}} in the Compilation section.
I like that!
We may need to define which resource types are acceptable for a specific type of device. For example, if the device is CPU, should resources of typedef (WebGLBuffer or GPUBuffer or WebGLTexture or GPUTexture) GPUResource; be accepted at all?
@wchao1115, I put together the above table with two open questions. Please take a look and feel free to add anything I missed.
… instead of the view in the buffer union type.
… and the GPU buffer resource view.
LGTM. Thanks much for the great work.
@anssiko @huningxin It looks like
I am not aware of a manual way to trigger a GitHub action. Hopefully PR #151 will fix it. Please take a look.
The merge of #151 triggered a rebuild that deployed properly. @wchao1115 @huningxin great work with this PR. This is on our agenda for this week's call (note: it is 1 hour later than usual in the US/Canada!), and that's a good opportunity to summarize the key changes and design considerations discussed in this PR.