
WebGL and WebGPU interops #149

Merged
merged 6 commits into master from wchao/interop on Mar 15, 2021

Conversation

@wchao1115 (Collaborator) commented on Mar 1, 2021



… WebGPU device.

- Make power preference part of the context creation options.
- Constant operands can be created from either WebGL or WebGPU buffers.
- Model inputs and outputs can be bound with WebGL or WebGPU textures.
- Prefix all types with "ML"; simplify "NeuralNetworkContext" to just "MLContext".
- Switch to a constructor for MLModelBuilder instead of a factory method.
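Taken together, the changes above might look like this in script. This is a rough sketch only: the `navigator.ml` entry point, the `constant()` signature, and the `gpuWeightsBuffer` variable are assumptions based on this PR's discussion, not spec text.

```js
// Hypothetical usage sketch; names follow this PR's discussion.
// Power preference is now part of the context creation options.
const context = navigator.ml.createContext({ powerPreference: 'high-performance' });

// The builder is constructed directly instead of via a factory method.
const builder = new MLModelBuilder(context);

// A constant operand created from a WebGPU buffer the caller already owns
// (gpuWeightsBuffer, assumed to be a GPUBuffer), avoiding a roundtrip of
// the weight data through CPU memory.
const weights = builder.constant(
    { type: 'float32', dimensions: [32, 3, 3, 3] }, gpuWeightsBuffer);
```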
@anssiko (Member) left a comment


Thanks @wchao1115!

Added some comments on how to make Bikeshed link to externally defined IDL that is not yet in the autolinking database.

@wchao1115 (Collaborator, Author)

@RafaelCintron

index.bs Outdated

// Create a model that encapsulates the composition of operands by identifying the output operands.
- Model createModel(NamedOperands outputs);
+ MLModel createModel(MLNamedOperands outputs);
Contributor

Should we still keep MLModel building device-agnostic and leave the device association to compile time? Say, to support model.compile(glContext) and model.compile(gpuDevice). With that, developers can build the same model and compile it for different devices.
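For illustration, the flow being suggested might read as follows; the method names and surrounding variables are hypothetical:

```js
// Device-agnostic build, device-specific compile (illustrative only).
const model = builder.createModel({ output });
const glCompilation  = await model.compile(glContext);  // WebGL-backed
const gpuCompilation = await model.compile(gpuDevice);  // WebGPU-backed
```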

@wchao1115 (Collaborator, Author)

There are interop use cases where the model constants (e.g. weights, etc.) are already uploaded to GPU buffers before the model is created. In that case it would be inefficient to require the caller to roundtrip that data back to CPU memory just to construct a device-agnostic model.

As we also position WebNN as a backend API for frameworks, the notion of a device-agnostic graph could get in the way of interop efficiency: by the time the backend graph is initiated by the framework, the model constants may already be mapped to memory hosted by the framework, and for frameworks already dealing with device resources, those constants may already be uploaded to the device.

Contributor

It makes sense, thanks for the explanation. Given that MLModel is now device-specific, would we consider merging it into MLCompilation? Say, replace MLModel createModel(MLNamedOperands outputs) with Promise<MLCompilation> compile(MLNamedOperands outputs).
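In script terms, the suggestion collapses the two steps into one. A sketch only, with the surrounding `builder` and `output` variables assumed:

```js
// Before: create a device-specific model, then compile it.
const model = builder.createModel({ output });
const compilation = await model.compile();

// After (suggested): build and compile in a single call.
const compilation2 = await builder.compile({ output });
```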

Contributor

@wchao1115, any thoughts about this idea?

@wchao1115 (Collaborator, Author)

I can see the logic to it. I think it makes sense.

@wchao1115 (Collaborator, Author) commented on Mar 30, 2021

@huningxin, the #1 use case is a good one. I plan to address it by adding a device preference option to the MLContextOptions. This option will be complementary to the existing power preference. As for your #2, I'm still a bit confused. I looked at the scenario that was raised in #156 (copied as follows):

c = tf.conv2d(a, b);      // op 1
e = tf.conv2d(c, d);      // op 2, consumes c
h = tf.conv2d(f, g);      // op 3, independent of c and e
output = await h.data();  // readback copies the result to CPU memory

If the webnn API is going to be used this way, what prevents the caller from compiling all 3 conv2d ops into a single compiled graph before executing it in one go? i.e. if the caller has a graph of A + B and wants the platform-specific intermediate result after A to stay intact before it enters B during execution, then that's exactly what a compiled graph of 'A + B' is for. That's precisely why graph execution is most likely more efficient than op-by-op execution.

If the graph is A + b + C, where only A and C are defined as webnn ops but b is something the caller implements themselves, and the caller insists that A and C must be executed separately and sequentially, then the end result will be no worse than having A and C defined elsewhere as standalone API calls outside of the webnn graph definition.

The point is that the end result of executing a publicly defined API must not be platform-specific, i.e. the resulting tensor of the operation must be in a standard layout, whether that public API is implemented as a compiled webnn graph or as a standalone API outside of webnn. Hope this helps clarify my point of view.
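To make the comparison concrete, here is a hedged sketch of the same three conv2d ops expressed as one compiled WebNN graph, using the builder-style API discussed in this PR; method names may differ from the final spec:

```js
// Build the whole graph up front, so the intermediate results (c, e) can
// stay on the device in whatever platform-specific layout the backend uses.
const c = builder.conv2d(a, b);
const e = builder.conv2d(c, d);
const h = builder.conv2d(f, g);

// Compile once and execute in one go; only the named output 'h' must be
// produced in the standard layout.
const model = builder.createModel({ h });
const compilation = await model.compile();
const outputs = await compilation.compute(inputs);
```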

@huningxin (Contributor) commented on Mar 31, 2021

> adding a device preference option in the MLContextOptions. This option will be complementary to the existing power preference.

I like it!

> what prevents the caller from compiling all the 3 conv2d ops into a single compiled graph before executing it in one go?

That's because the framework provides the op API. When executing a graph, the user code just calls the op API to execute the operations of the graph one-by-one. The framework doesn't have knowledge of the graph; it just executes ops. cc @pyu10055

For this scenario, the framework doesn't know the user code will execute 3 conv2d ops. When the user code calls a conv2d, the framework just computes that conv2d, and it may use a platform-specific tensor layout in the conv2d implementation. Only when the user code calls h.data() does the framework convert and copy the tensor data into an ArrayBufferView in plain layout.

So if the framework uses WebNN to compute conv2d, it would hurt performance if WebNN always converted and copied the output tensor data to an ArrayBufferView for each compute call.

> The point is that the end result of executing a publicly defined API must not be platform-specific

I agree with you: at the interop points, it should use the plain/standard tensor layout.

@wchao1115 (Collaborator, Author)

> That's because the framework provides the op API. When executing a graph, the user code just calls the op API to execute the operations of the graph one-by-one.

That doesn't sound like what a framework does. Are you saying that there are frameworks that never compile a graph internally and just rely on individual op execution?

@huningxin (Contributor)

A framework may provide both an op API and a model API; for example, TensorFlow.js provides the Operations API and the Models/Loading API. It depends on which API the user code uses. If the user code uses the op API to execute a graph, the framework won't see the graph, only individual ops.

Actually, the webnn-polyfill uses the TensorFlow.js op API. I suppose it is a common scenario. cc @pyu10055.

Collaborator

@wchao1115 @huningxin Thank you Ningxin for bringing up this point. Op-level execution is common for eager execution, especially in a JS environment where users can create graphs on the fly. There is no specific graph boundary, so the framework has no way to know which intermediate tensors users will not need to access.

index.bs Outdated
- Promise<MLNamedOutputs> compute(MLNamedInputs inputs,
-                                 optional MLNamedOutputs outputs = {});
+ Promise<MLNamedWebGLOutputs> compute(MLNamedWebGLInputs glInputs,
@huningxin (Contributor) commented on Mar 3, 2021

Since glInputs (and likewise gpuInputs) are device-specific, what would happen if this MLCompilation is compiled for another device? Would the browser implementation handle the data transfer, or throw an error?

@wchao1115 (Collaborator, Author)

Yes, it will fail. Cross-device, cross-adapter support should not occur at this level. If the context is created with one device/adapter and the input is a resource from another, the API should fail. The callers of WebNN need to ensure that all the device-dependent resources come from the device that is bound to the context.
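In other words (an illustrative sketch, not spec text; the variable names are assumed), binding a resource created on a different device than the context's should reject:

```js
// Context bound to one WebGPU device.
const context = navigator.ml.createContext(gpuDevice);

// Binding an input texture created on a different device should fail
// rather than silently transfer data across devices.
try {
  await compilation.compute({ input: textureFromOtherDevice });
} catch (e) {
  // Expected: the API rejects cross-device resources.
}
```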

Contributor

It makes sense. We may need to specify this behavior in a follow-up PR.

@wchao1115 wchao1115 requested a review from pyu10055 March 4, 2021 16:17
…upport for accepting GPU and WebGLBuffer as inputs.
@huningxin (Contributor)

And there is a Bikeshed build error:

LINK ERROR: No 'idl' refs found for 'MLModel'.
{{MLModel}}

The spec still uses {{MLModel}} in the Compilation section.

@wchao1115 (Collaborator, Author)

> MLModelBuilder -> MLGraphBuilder
> MLModelBuilder.compile -> MLGraphBuilder.build
> MLCompilation -> MLGraph

I like that!
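With those renames, the end-to-end flow would read like this; a sketch under the proposed names, with the surrounding variables assumed:

```js
// Proposed naming, sketched end-to-end.
const builder = new MLGraphBuilder(context);    // was MLModelBuilder
const output = builder.conv2d(input, filter);
const graph = await builder.build({ output });  // was compile(); yields MLGraph
const results = await graph.compute(inputs);
```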

@huningxin (Contributor)

> The constants, inputs, and outputs can still be ArrayBufferViews, in which case the implementation of webnn should still upload the resources itself if it is running on a hardware device.

We may need to define which resource types are acceptable for a specific type of device. For example, if the device is CPU, would WebGLBuffer or GPUBuffer resources not be acceptable?

typedef (WebGLBuffer or GPUBuffer or WebGLTexture or GPUTexture) GPUResource;

| device type | resource type | acceptable |
| ----------- | ------------- | ---------- |
| cpu | ArrayBuffer | ✔️ |
| cpu | GPUResource | ❌ (?) |
| gpu | ArrayBuffer | ✔️ |
| gpu | GPUResource (on the same adapter) | ✔️ |
| gpu | GPUResource (on a different adapter) | ❌ |
| accelerator | ArrayBuffer | ✔️ |
| accelerator | GPUResource | ❌ (?) |

@wchao1115, I put together the above table with two open questions (marked "?"). Please take a look and feel free to add anything I missed.
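A minimal sketch of the validation the table implies, assuming the two "?" cells resolve to rejections; `isOnSameAdapter` is a hypothetical helper, not a real API:

```js
// Hypothetical helper mirroring the table above (illustrative only).
function isAcceptableResource(deviceType, resource) {
  const isGpuResource =
      resource instanceof WebGLBuffer || resource instanceof GPUBuffer ||
      resource instanceof WebGLTexture || resource instanceof GPUTexture;
  if (!isGpuResource) return true;        // ArrayBuffer: acceptable everywhere
  if (deviceType !== 'gpu') return false; // cpu/accelerator: reject (the '?' cells)
  return isOnSameAdapter(resource);       // gpu: same adapter only
}
```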

… instead of the view in the buffer union type.
@huningxin (Contributor) left a comment

LGTM. Thanks much for the great work.

@wchao1115 wchao1115 merged commit c74ec24 into master Mar 15, 2021
@wchao1115 wchao1115 deleted the wchao/interop branch March 16, 2021 00:23
@wchao1115 (Collaborator, Author)

@anssiko @huningxin It looks like deploy.yaml doesn't run on the merged change. So the generated html file is now stale. Is there a way to manually trigger it to run?

@huningxin (Contributor)

> @anssiko @huningxin It looks like deploy.yaml doesn't run on the merged change. So the generated html file is now stale. Is there a way to manually trigger it to run?

I am not aware of a manual way to trigger a GitHub action. Hopefully PR #151 will fix it. Please take a look.

@anssiko (Member) commented on Mar 16, 2021

The merge of #151 triggered a rebuild that deployed properly.

@wchao1115 @huningxin great work with this PR. This is on our agenda for this week's call (note: it is 1 hour later than usual in the US/Canada!), and that's a good opportunity to summarize the key changes and design considerations discussed in this PR.
