Allow checking whether operators/types are supported for a backend before creating a graph #463
Comments
🤔 Or perhaps WebNN needs a weight loading step? This is essentially the pattern I see with popular models (e.g. Real-ESRGAN, Stable Diffusion). The graph describes the architecture and doesn't include weights. If we want to fail before building the graph, I'm afraid the operand methods (e.g. conv2d) need to be async (to allow querying the backend), or we'd have to make every feature feature-detectable (think of sending a feature matrix to the renderer). |
This is a great idea. According to the current WebNN spec, the WebGPU context has We probably can reuse this method for the default context and extend it for weights uploading. For example, if the constants/weights uploading could be decoupled from graph building, developers can build the graph topology/architecture first, before downloading the weights. If there are any unsupported errors, they may handle them or stop there. Only when the graph building succeeds do they proceed to download the weights from the network and upload them to the device by |
Weight loading would substantially change how We need to match constant nodes with weight buffers. Probably doable if we ask clients to retain references to constant MLOperands and pass The overall usage will be:
// Build graph
const x = graphBuilder.input()
let constNode = graphBuilder.constant(shape)
const y = graphBuilder.add(x, constNode)
// Compile the graph on the backend.
// Will reject if any operator in the graph isn't supported.
const graph = await graphBuilder.build(y)
// Download the weights and match them with the constant operands.
const weightNodes = new Map()
weightNodes.set(constNode, arrayBuffer) // Note the `constNode` reference here
// Load the weights and make the graph computable.
await graph.loadWeight(weightNodes)
// Compute
await MLContext.compute(graph, {...inputs})
I'd recommend including this issue in v2. Diffusers / LLMs are both large; Stable Diffusion XL already hit 7 GB (fp16). Maybe its size will double again this year :) |
Alternatively, we may reuse
The usage looks good. I slightly modified the pseudo code by using
// Build graph
const x = graphBuilder.input('x', shape)
let constNode = graphBuilder.constant('constNode', shape)
const y = graphBuilder.add(x, constNode)
// Compile the graph on the backend.
// Will reject if any operator in the graph isn't supported.
const graph = await graphBuilder.build({y})
// Download the weights and match them with array buffers by name.
const weights = {'constNode': arrayBuffer}
// Load the weights and make the graph computable.
await graph.loadWeight(weights)
// Compute
await MLContext.compute(graph, {...inputs}) |
I had this same thought when reviewing Chromium CL 5232604. In that example there are a number of input shapes which aren't supported by XNNPACK. Generating these errors at build time rather than when the In the XNNPACK case it is easy enough to do these checks earlier and in the renderer process, but I can see how for backends with a more elaborate build step this becomes much more difficult. Perhaps a build error API could return a reference to the problematic |
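Purely to make that last idea concrete, continuing the pseudo code above — the `operand` property on the error is hypothetical and not part of any existing API:

```js
try {
  await graphBuilder.build({y});
} catch (e) {
  // Hypothetical: if the error carried the problematic MLOperand, a framework
  // could map it back to the original model node and repartition, or substitute
  // a supported configuration, instead of failing the whole graph.
  if (e.operand) {
    console.warn('Backend rejected operand:', e.operand);
  }
  throw e;
}
```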
What's the expected behavior for the framework when an operator/type is not supported? If we just want to fail fast, then making weight loading separate is sufficient. But if we want to allow frameworks to adjust the graph partitioning or deploy a different model that works for the current WebNN backend, then we do need to be able to feature-detect in some way to surface the particular unsupported features. |
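As an illustration of the second case, a framework-side partitioning pass might look roughly like this; `isOpSupported` stands in for whatever feature-detection mechanism WebNN ends up exposing and is not a real API:

```js
// Hypothetical partitioning: nodes the WebNN backend can't handle stay on the
// framework's own fallback (e.g. a WASM kernel) instead of failing the graph.
function partitionForWebNN(modelNodes, isOpSupported) {
  const webnnNodes = [];
  const fallbackNodes = [];
  for (const node of modelNodes) {
    (isOpSupported(node.opType, node.dataType) ? webnnNodes : fallbackNodes).push(node);
  }
  return {webnnNodes, fallbackNodes};
}
```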
Hi! Given that we have more and more data type support differences across platforms (e.g. #653, #675, #283 (comment)), I'd like to make more progress on this issue. What do folks think about this structure, which starts with just exposing the data type support level?
Probing supported feature limits should be tied to an MLContext, as it's particular to the underlying devices. In this proposal, "dataType" is in a nested field to make it more extensible, so later we can specify other limits like This will be generated using the intersection of the WebNN-supported types (for Chromium, currently defined in graph_validation_utils) and the underlying platform's supported types. Since MLContext creation is async, this information can be gathered at MLContext creation time by querying the backend.
This is also useful even if a context supports all types declared in the WebNN spec, as it provides a programmatic way for frameworks to check the supported types for all ops. Thoughts? @huningxin @fdwr |
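For concreteness, a minimal sketch of how a framework might consume such a structure; the method and field names here (opSupportLimits, the nested dataType list) follow the wording in this thread and should be read as illustrative rather than final:

```js
const context = await navigator.ml.createContext();
// Assumed shape, per this proposal:
// { conv2d: { input: { dataType: ['float32', 'float16'] }, ... }, ... }
const limits = context.opSupportLimits();
const conv2dInputTypes = limits.conv2d?.input?.dataType ?? [];
if (!conv2dInputTypes.includes('float16')) {
  // e.g. fall back to an fp32 model, or keep this op on the framework's own kernel.
}
```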
Thanks @philloooo, that's very useful for the current ONNX Runtime WebNN EP; with this API we don't need to maintain a whitelist of each op's data type constraints. However, in the future we may upgrade the WebNN EP to use the same architecture as the DML EP. This would involve registering the supported operators, as well as their supported data types, with the onnxruntime kernel. This would allow ORT to handle the graph partitioning and help the WebNN EP gain the capability for higher-level graph optimization. If we adopt this new architecture, we would still need to maintain a whitelist. |
I don't understand this conclusion. Can you explain why or what information would be necessary to avoid the need for an allowlist? |
FYI, the DML EP maintains a whitelist at OperatorRegistration.cpp to register its supported ONNX ops, ONNX opset ranges, data type constraints, etc. with the ORT kernel. To correct my last understanding, |
Regarding the ONNX Runtime WebNN EP discussion, it would be ideal if the preferred tensor layout, for example "nchw" or "nhwc" for /cc @fs-eire |
@Honry I still don't understand what the DML architecture is. Does it not allow registering the type constraints at the time we get a context?
Or does this mean that context.opSupportLimits() is compatible with the DML architecture (so a whitelist is not needed)? The difference between the DML backend and WebNN is that WebNN has different type limits across backends, so a single whitelist wouldn't work and we will have to get it through the context. |
The difference is that the DML EP has to register the supported data type list with the ORT kernel, while the current WebNN EP doesn't; it only needs to check whether the data type of the ONNX op it delegates can be supported by the WebNN op.
That's not a problem, we can get the context before registering the type constraints.
Yes, it's compatible with the DML architecture. We can use
That's really helpful for different backends. 👍 |
Can we also expose the device type info? Currently the WebNN TFLite backend still has some constraints, and we make workarounds for them in the WebNN EP. |
Per @reillyeon's CL https://chromium-review.googlesource.com/c/chromium/src/+/5528084, if Chromium can handle the preferred layout for each backend (inserting transposes in Chromium), I think we don't need to expose preferred layout info from |
We should do some measurements of how expensive the inserted transpose operators are before we make this decision. My CL will make the model work, but if there's a version which matches the platform's preferred layout, that might have better performance. |
Currently ORT-Web will also insert the transpose ops if the preferred layout is NHWC. |
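For reference, the inserted transposes under discussion look roughly like this: when the backend prefers NHWC but the model is NCHW, each affected op gets wrapped in layout conversions. A sketch only — it assumes conv2d's inputLayout/filterLayout options and an OIHW filter, and the descriptor field names follow the spec as of this discussion:

```js
const context = await navigator.ml.createContext();
const builder = new MLGraphBuilder(context);
const nchwInput = builder.input('x', {dataType: 'float32', shape: [1, 3, 224, 224]});
const oihwFilter = builder.constant(
    {dataType: 'float32', shape: [64, 3, 3, 3]}, new Float32Array(64 * 3 * 3 * 3));
// NCHW -> NHWC before the op, NHWC -> NCHW after, so the rest of the graph can
// stay in the model's original layout. Each transpose is an extra copy, which
// is the overhead worth measuring here.
const nhwcInput = builder.transpose(nchwInput, {permutation: [0, 2, 3, 1]});
const ohwiFilter = builder.transpose(oihwFilter, {permutation: [0, 2, 3, 1]});
const nhwcOutput = builder.conv2d(nhwcInput, ohwiFilter,
    {inputLayout: 'nhwc', filterLayout: 'ohwi'});
const nchwOutput = builder.transpose(nhwcOutput, {permutation: [0, 3, 1, 2]});
```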
🤔 What does "preferred" here mean? Is this the preferred tensor layout for DML, or for the underlying device? FWIW, one data point from my past experience: the "preferred layout" might differ based on device (different vendor, or same vendor but different generation) and workload (i.e. model architecture); see the threads here: xinntao/Real-ESRGAN#650 (comment) |
In my experience each backend library only supports one tensor layout but if multiple were supported then it should be device-specific so that the developer can load the appropriate model. |
Hi! I've landed the first CL for opSupportLimits in Chromium with three initial fields:
TODOs:
Let me know if you have any feedback. |
Thanks @philloooo! This is what I got from DirectML/Windows. Developers/frameworks can get platform-specific supported data types now.
|
@inexorabletash proposed that rank limits can be expressed in the context's
|
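If rank limits do land next to the data types, a framework check might look something like the following; the rankRange field name is an assumption based on this proposal, not a settled name:

```js
const context = await navigator.ml.createContext();
const gatherInputLimits = context.opSupportLimits().gather?.input;
// Example: a 5-D input from the model we're considering offloading.
const inputShape = [1, 2, 3, 4, 5];
const rank = inputShape.length;
if (gatherInputLimits?.rankRange &&
    (rank < gatherInputLimits.rankRange.min || rank > gatherInputLimits.rankRange.max)) {
  // Keep this node on the framework's fallback path instead of WebNN.
}
```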
😎
You know, the names of input tensor parameters didn't really matter until now (tensors inside Another realization is that multiple tensors share the same data type, like "trueValue" and "falseValue" (and it would be redundant to report them both separately). So do we return them as {"condition", "value"} data types, or as {"condition", "trueValue", "falseValue"}? 🤔 For reflection-like scenarios, using the exact name and returning one dataType list per distinct tensor might simplify tooling, but for other scenarios (like checking data type matching in ORT), returning just one dataType list for both values may be clearer (because returning two distinct lists implies the possibility that |
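To make the two alternatives concrete for an operator like where (the listed data types are placeholders, not actual constraints):

```js
// Option A: collapse tensors that must share a data type into a single entry.
const optionA = {
  where: {
    condition: {dataType: ['uint8']},
    value: {dataType: ['float32', 'float16', 'int32']},
  },
};
// Option B: report each tensor parameter under its exact argument name, which
// implies trueValue and falseValue could, in principle, be constrained differently.
const optionB = {
  where: {
    condition: {dataType: ['uint8']},
    trueValue: {dataType: ['float32', 'float16', 'int32']},
    falseValue: {dataType: ['float32', 'float16', 'int32']},
  },
};
```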
Thanks for the feedback! @fdwr Internally, for the Chromium implementation, we can use the same member in the mojom struct to reduce redundancy. |
Yep, I wasn't implying it was a bad change, just a new consideration.
Naming the outputs may be interesting. For simple cases that return a single output tensor, just using the name "output" makes sense (or plural "outputs" for cases like split and gru), but it's more interesting when we have an operator that returns multiple different tensors (e.g. say a dynamicQuantizeLinear existed that returned three tensors, the input quantized to int8, the scale factor, and zero-point tensor). In such cases, rather than just returning a triplet
So then (to clarify), you mean |
@fdwr I was planning on adding
Yeah that's right! |
@philloooo 🤔 Consistently including the Additionally, it's nice for diagnostic purposes if somebody wants to print out the IO data types into a table to see all the constraints, and a number of other libraries (e.g. StableHLO, DML, ONNX) already list data type constraints for both inputs and outputs. |
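As a small example of that diagnostic use, assuming the nested per-operand dataType lists sketched earlier in this thread:

```js
// Dump an op-by-operand table of supported data types to eyeball differences
// between backends; the structure of `limits` is assumed, not normative.
const context = await navigator.ml.createContext();
const limits = context.opSupportLimits();
const rows = [];
for (const [op, operands] of Object.entries(limits)) {
  if (typeof operands !== 'object' || operands === null) continue;
  for (const [operand, info] of Object.entries(operands)) {
    if (info && Array.isArray(info.dataType)) {
      rows.push({op, operand, dataTypes: info.dataType.join(', ')});
    }
  }
}
console.table(rows);
```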
Agreed - it's helpful in cases like |
Since #456 was folded into this issue, should we add |
Given this issue focuses on data type support and is closed, let's reactivate #456. |
This feedback is from @RafaelCintron (thanks!) from his review of Chromium CL-4828993.
In the current spec and implementation, if any operator/type combination is not supported by a backend (for example, the DirectML backend doesn't support the dilations of the average pooling operator), the errors are thrown when user code calls MLGraphBuilder.build(). Rafael mentioned this might be too late because user code may have already downloaded all the weights.
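The problem as described, sketched end to end: by the time build() rejects, the (potentially multi-gigabyte) weights have already been fetched. Method names follow the WebNN spec; the weight URL, shapes, and the specific failing option are only examples:

```js
// Weights must be fetched up front because constant() takes the buffer data...
const context = await navigator.ml.createContext();
const weightsResponse = await fetch('model_weights.bin');  // hypothetical URL
const weightsBuffer = await weightsResponse.arrayBuffer();

const builder = new MLGraphBuilder(context);
const input = builder.input('input', {dataType: 'float32', shape: [1, 3, 224, 224]});
const filter = builder.constant(
    {dataType: 'float32', shape: [64, 3, 7, 7]},
    new Float32Array(weightsBuffer, 0, 64 * 3 * 7 * 7));
const conv = builder.conv2d(input, filter);
const output = builder.averagePool2d(conv, {dilations: [2, 2]});

// ...but only here, after the download, can the backend reject the graph,
// e.g. because it doesn't support dilated average pooling.
const graph = await builder.build({output});
```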