Integration with real-time video processing #226
Thanks @dontcallmedom for starting this discussion that clearly benefits from coordination between the WebML and WebRTC WGs. A prototype integrating the WebNN API with the mediacapture-transform API (explainer, Chrome Status, crbug) would be an interesting exploration. I think we'll first need to look at the mediacapture-transform API in more detail in this group to understand it better. It would be an informative exercise to evaluate how the integration with the mediacapture-transform API in a worker context affects performance compared to what we currently experience. Looking at our existing work on background blur, we have made available a WebNN Semantic Segmentation sample (source) that uses DeepLab V3 MobileNet V2 from TFLite models. This sample uses the webnn-polyfill, which is based on TensorFlow.js. With the polyfill the sample performs OK-ish, but we expect a substantial improvement in performance with a native implementation. One possibility would be to expand this sample, or to build upon the WebRTC samples @dontcallmedom referenced. @huningxin can possibly share the expected speedup when using the WebNN-GPU or WebNN-CPU backends for semantic segmentation with the above-mentioned model, compared to the polyfill and its CPU/Wasm and GPU/WebGL backends. Currently, we have a test bench based on Electron.js that we can use to approximate the performance of a native browser implementation; we shared some performance data using this test bench earlier for other use cases. There may be some gaps in API coverage in Electron.js for prototyping this, which needs investigation. We're starting to upstream WebNN to Chromium, which should make it easier for us to identify implementation optimization opportunities and issues, but that work takes a while to land. [Edit: updated mediacapture-transform to point to the TR version.]
+1 /cc @ibelem
As shown in the TPAC WebNN demo video (the semantic segmentation demo starts at 1:17, the performance summary at 2:00) shared by @Honry, there was a 3.4x speedup on GPU and a 7.2x speedup on CPU for the DeepLab V3 MobileNet V2 model on the test device. I am curious whether real-time audio processing is in scope. The noise suppression use case is supported by WebNN, e.g., for video conferencing applications. The WebNN sample supports two noise suppression models: RNNoise and NSNet2.
The mediacapture-transform API is based on WHATWG Streams, which enables media processing to be represented as a TransformStream. In this model, audio/video frames (raw samples) are provided as input, and the output is enqueued for processing by the next stage. A presentation on the model and some sample applications are provided here.
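For concreteness, here is a minimal sketch of that model using the API shape Chrome shipped (MediaStreamTrackProcessor / MediaStreamTrackGenerator); processFrame is a hypothetical placeholder for whatever per-frame processing the next stage performs:

```js
// Minimal sketch, assuming Chrome's MediaStreamTrackProcessor/Generator.
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const [track] = stream.getVideoTracks();

const processor = new MediaStreamTrackProcessor({ track });        // ReadableStream of VideoFrames
const generator = new MediaStreamTrackGenerator({ kind: 'video' }); // writable side -> new track

const transformer = new TransformStream({
  async transform(frame, controller) {
    const processed = await processFrame(frame); // hypothetical: e.g. segmentation + blur
    frame.close();                               // release the input frame promptly
    controller.enqueue(processed);               // hand off to the next stage
  }
});

processor.readable.pipeThrough(transformer).pipeTo(generator.writable);
document.querySelector('video').srcObject = new MediaStream([generator]);
```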
Thanks for the presentation @aboba, highly informative. Are there any caveats in the TransformStream or mediacapture-transform implementation we should be aware of for this experiment? Can MediaStreamTracks be transferred yet? Per the demo @huningxin pointed us to, we see ~26 fps background blur using semantic segmentation (internals: Electron.js with the WebNN Node.js binding, GPU backend, Core i7 with integrated graphics). We'll look into experimenting with background blur in a transform stream in a worker context. If there are any specific metrics you're interested in, let us know.
As far as I know, transferable MediaStreamTracks are not implemented yet, so for experimentation you'd need to use transferable streams. The presentation included a demo of a media pipeline implemented entirely in workers using transferable streams, so you can just define another TransformStream and plug it into the pipeline. Wrapping encode/decode and serialize/deserialize in a TransformStream was pretty straightforward, though I have come to understand that additional work is needed to do a good job of cleaning up after errors.
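A sketch of that workaround: since MediaStreamTracks could not be transferred at the time, the stream endpoints are transferred to a worker instead. The worker file name (transform-worker.js) and processFrame are hypothetical names.

```js
// Main thread: transfer the readable/writable ends to a worker.
const processor = new MediaStreamTrackProcessor({ track });
const generator = new MediaStreamTrackGenerator({ kind: 'video' });
const worker = new Worker('transform-worker.js');

// ReadableStream/WritableStream are transferable; list them in the transfer array.
worker.postMessage(
  { readable: processor.readable, writable: generator.writable },
  [processor.readable, generator.writable]
);

// --- transform-worker.js ---
self.onmessage = async ({ data: { readable, writable } }) => {
  const transformer = new TransformStream({
    async transform(frame, controller) {
      const out = await processFrame(frame); // hypothetical per-frame processing
      frame.close();
      controller.enqueue(out);
    }
  });
  await readable.pipeThrough(transformer).pipeTo(writable);
};
```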
@aboba, thanks for the media-pipeline-in-workers demo and for confirming my hunch that we need some workarounds to transfer MediaStreamTracks across. I also want to acknowledge @tomayac, who wrote an article that includes a pointer to a QR code scanner demo, and @dontcallmedom for pointing us to the WebRTC Samples repo initially. Would folks prefer us to take a specific sample, e.g. insertable-streams/video-processing, as a starting point, move the processing to a worker, and add a neural-network-accelerated background blur transform option to it? We could also use our semantic segmentation sample and bolt a TransformStream in a worker onto it. I'm trying to define this prototyping task so it could become more than just a one-off experiment: something that folks interested in this could use to build more experiments with. I think it'd be good to host this (also) in the canonical WebRTC samples repo, if there is one. WebRTC use cases are important to our group working on the WebNN API.
We have implemented a real-time noise suppression (RNNoise) sample based on the mediacapture-transform API. This sample successfully constructs a pipeline that suppresses the noise in the audio data collected by the microphone and sends it to the speaker or earphone of the device.
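A rough sketch of such an audio pipeline, assuming Chrome's MediaStreamTrackProcessor/Generator also handle audio tracks (they yield AudioData chunks); suppressNoise is a hypothetical wrapper around a WebNN noise suppression graph such as RNNoise or NSNet2:

```js
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const [track] = stream.getAudioTracks();

const processor = new MediaStreamTrackProcessor({ track });
const generator = new MediaStreamTrackGenerator({ kind: 'audio' });

const denoiser = new TransformStream({
  async transform(audioData, controller) {
    controller.enqueue(await suppressNoise(audioData)); // AudioData in, AudioData out (hypothetical)
  }
});

processor.readable.pipeThrough(denoiser).pipeTo(generator.writable);

// Play the denoised track back on the default output device.
const audioEl = new Audio();
audioEl.srcObject = new MediaStream([generator]);
audioEl.play();
```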
Great to hear! If I may, I would suggest focusing on video rather than audio processing if at all possible, for two reasons:
I've integrated the mediacapture-transform API for video processing in our semantic segmentation sample. I just reused the original post-processing, which is a bit involved: it calculates a segmentation map for the different detected objects, renders the output into a canvas element, and provides features for filling in customized colors, images, etc. I then converted the output canvas to a VideoFrame. This is the cheapest way to integrate the mediacapture-transform API into the current webnn-samples, but it is not efficient. We may need to figure out new post-processing suitable for this API to improve the performance. Source code: https://github.com/Honry/webnn-samples/tree/mediacapture-transform/semantic_segmentation
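A sketch of that approach: keep the existing canvas-based post-processing and wrap its output back into a VideoFrame for the stream. runSegmentationAndRender stands in for the sample's existing code and is a hypothetical name.

```js
const transformer = new TransformStream({
  async transform(frame, controller) {
    await runSegmentationAndRender(frame, outputCanvas); // hypothetical: existing post-processing
    // A VideoFrame can be constructed from a canvas; carry over the input timestamp.
    const outputFrame = new VideoFrame(outputCanvas, { timestamp: frame.timestamp });
    frame.close();
    controller.enqueue(outputFrame);
  }
});
```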
Thanks @Honry and @miaobin for your work on these prototypes. @Honry, it seems your mediacapture-transform API prototype's performance is on par with the original semantic segmentation sample used as a starting point; I think that was expected. I think the next step would be to move the expensive processing to workers, which, as of now, requires the use of transferable streams, I believe. This task will be challenging due to the limited availability of these work-in-progress APIs in Chromium, so the prototype may need to be revised as the browser implementation of the mediacapture-transform API evolves. @aboba shared some tips above that may be helpful. The WG will review and discuss these prototypes on our next call. Thank you for your contributions, this is important prototyping work.
Thanks indeed, good progress! My reading of the code shows that there is at least a GPU→CPU transfer when turning the video frame into an input tensor; I'm not sure if the model inference is happening on the CPU or the GPU. Ideally, and notwithstanding @anssiko's remarks about making this run in a worker, we would want to write a GPU-only pipeline, if at all possible with no memory copy. Can you already identify gaps in the APIs that would make this hard or impossible?
FWIW, there is a worker sample here: https://webrtc.github.io/samples/src/content/insertable-streams/video-crop/ The API used is not the API that ended up being standardized, but no browser supports the standardized API yet. The APIs used by the sample are stable in Chromium.
According to the WebNN spec, an MLContext can be created from a GPUDevice, which allows a WebNN graph to consume GPU resources. And there are corresponding extensions/proposals for importing a video frame into a GPU texture: the WebGL WEBGL_webcodecs_video_frame extension and the proposal to import a VideoFrame from WebCodecs into WebGPU. So it looks possible for the app to avoid the GPU-CPU transfer by importing the video frame into a GPU texture and feeding it into a WebNN graph created from the same GPU device.
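A rough sketch of that zero-copy idea. Assumptions: the importExternalTexture path for VideoFrame and the WebNN createContext(GPUDevice) overload behave as proposed (both were still in flux at the time of this discussion), and runPreprocessPass is a hypothetical compute pass.

```js
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();

// WebNN context backed by the same GPUDevice as the rest of the pipeline.
const mlContext = await navigator.ml.createContext(device);

function videoFrameToGpuInput(frame) {
  // Import the camera frame as an external texture (no CPU readback), then run a
  // compute pass that writes a normalized input tensor into a GPUBuffer the
  // WebNN graph can consume.
  const externalTexture = device.importExternalTexture({ source: frame });
  return runPreprocessPass(device, externalTexture); // hypothetical: returns a GPUBuffer
}
```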
Thanks @huningxin for the pointers! So, assuming those proposals land, I'm also not clear how we would go from the resulting GPU buffers back to a video frame for the output of the transform. Separately, it seems that all the machinery I have identified as needed so far in WebGPU/WebNN would be available in the worker in which MediaCapture Transform would operate, assuming there is no particular challenge in having the ML model data available in that context. But do we have a mechanism to load the ML model directly into a GPU buffer, or does this necessarily go through some CPU copy first?
Both transferable streams and MediaStreamTrackGenerator / MediaStreamTrackProcessor are released APIs in Chrome (transferable streams were released in Chrome M87 per https://chromestatus.com/feature/5298733486964736). The version that the WG is currently iterating on for MediaStreamTrackGenerator is slightly different from what Chrome implements (among other things, it's called VideoTrackGenerator), but the differences in semantics are very small so far.
If the ML graph supports using a GPUBuffer directly, that raises interop questions. I'm surprised that the interaction with WebGPU is already integrated in WebNN without any discussions with the WebGPU group, and it's not immediately clear how passing a GPUBuffer between the two APIs would work.
In WebCodecs we have PR w3c/webcodecs#412 for conversion of […].
Good suggestion. gpuweb/gpuweb#2500 opened.
The sample could probably use a WebGPU shader to convert the video frame into the input tensor.
The sample today uses JS to render (…).
I proposed the above GPU pipeline processing steps in webmachinelearning/webnn-samples#124.
cc @sandersdan
The on-ramp and off-ramp from […]
Another question that the usage of an RGB-based canvas in the current version of the prototype raises: I'm unclear in the first place about who decides what pixel format / color space is used in a VideoFrame.
When constructing from an […]. The Chrome capture pipeline is optimized for WebRTC use, which means a preference for I420 or NV12 depending on the platform. Capture from canvas is similar: an RGB format is logical, but if we expect to be encoding then it can be more efficient to produce I420 or NV12 directly at capture time. In practice we usually have an RGBA texture from canvas capture and defer pixel-format conversion to encode time.
@sandersdan, when the underlying pixel format and color space were a purely internal matter for optimization by the UA, leaving this entirely under the UA's control made sense; but does it remain workable once we start exposing these internal aspects to developers?
As I see it, apps that want to be efficient should support multiple formats, and all apps should use resampling for formats they do not support. It is rarely going to be more efficient to do the resampling at capture time, and in many cases it would prevent the UA from doing the most efficient thing. Sometimes we don't even know what the underlying format is, such as when it is provided by the platform as an external texture; in that case the only option is to sample it, and deferring that is likely to be more efficient. We could work more on the import ergonomics: sometimes we do have a backing that could directly become a buffer, other times we could hide the sampling and provide a buffer in a requested format. This is moot with just […].
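A sketch of that "support several formats, resample the rest" advice, assuming hypothetical *ToTensor helpers and a 2D canvas context (ctx) used only as the fallback path:

```js
async function frameToTensor(frame) {
  switch (frame.format) {
    case 'I420':
    case 'NV12': {
      const data = new ArrayBuffer(frame.allocationSize());
      await frame.copyTo(data);
      return yuvToTensor(data, frame.format, frame.codedWidth, frame.codedHeight); // hypothetical
    }
    case 'RGBA':
    case 'BGRA': {
      const data = new ArrayBuffer(frame.allocationSize());
      await frame.copyTo(data);
      return rgbaToTensor(data, frame.codedWidth, frame.codedHeight); // hypothetical
    }
    default:
      // Opaque/unknown backing (e.g. an external texture): let the UA resample
      // by drawing the frame, then read back from the canvas.
      ctx.drawImage(frame, 0, 0);
      return canvasToTensor(ctx); // hypothetical
  }
}
```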
(FYI: https://www.w3.org/TR/mediacapture-transform/ was released as a FPWD today, so we have a canonical URL for this spec now. Congrats WebRTC WG!)
I created a background blur sample based on the video processing sample of insertable streams. Main thread version: https://huningxin.github.io/webrtc-samples/src/content/insertable-streams/video-processing/ Currently it supports two transforms (hopefully full-GPU-only processing pipelines):

WebGL segmentation and blur
The details of the WebGL processing pipeline (webgl-background-blur.js):
WebNN segmentation and WebGPU blur
The details of the WebGPU/WebNN processing pipeline (webgpu-background-blur.js):
To test the WebGPU/WebNN processing pipeline, you may need to download the WebNN Chromium prototype; currently only a Windows build is available. This prototype supports the DirectML backend and implements a WebNN/WebGPU interop API that accepts GPUBuffer as WebNN graph constants and inputs/outputs. Other notes (known issues): […]
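Illustrative only: the exact interop surface was specific to the WebNN Chromium prototype and early spec drafts, so the context/compute call shapes below are assumptions rather than the shipped API; buildSegmentationGraph is a hypothetical helper that builds the DeepLab V3 graph with MLGraphBuilder.

```js
const mlContext = await navigator.ml.createContext(gpuDevice);
const builder = new MLGraphBuilder(mlContext);
const graph = await buildSegmentationGraph(builder); // hypothetical

async function segment(inputGpuBuffer, outputGpuBuffer) {
  // With a GPU-backed context, inputs/outputs can be GPU buffers, so the
  // preprocessing, inference and blur passes all stay on the device.
  await mlContext.compute(graph,
    { input: inputGpuBuffer },
    { output: outputGpuBuffer });
}
```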
A screenshot of the WebGPU/WebNN transform running in the WebNN Chromium prototype:
Thanks @huningxin for your continued work on this topic, an impressive proof of concept and WebNN Chromium prototype. I've put this topic on the WG agenda for this week.
Thanks @huningxin for another amazing piece of work! Beyond the implementation limitations you noted, are there new API lessons that have emerged from this round of prototyping? Trying it out, here are some (probably very unreliable) measurements from my laptop:
I haven't investigated either of these aspects in any depth; the CPU usage may be linked to running this in the main thread rather than in a worker, creating contention?
The worker version is now available at: https://huningxin.github.io/webrtc-samples/src/content/insertable-streams/video-processing-worker/ Feel free to test it on your laptop.
AFAIK, running the transform in a worker would not reduce CPU usage. It just offloads the workload from the main thread, with some inter-thread communication overhead. I suppose it would help free the main/UI thread if the transform is a blocking call, e.g., a Wasm function (TF.js Wasm backend?) or the sync version of WebNN graph compute (#229).
This issue has been fixed. The updated WebNN prototype based on Chromium 102.0.4973.0 includes this fix. Thanks much @shaoboyan and @Kangz!
According to my initial profiling, the transform loop spends about 35% of the total time on […]. I'll look into whether […]
To keep everyone up to date: this background blur prototype was reviewed with the WebRTC WG; see the minutes for the next steps.
According to resolution 2 of the WebML WG Teleconference – 16 June 2022, I am going to remove the "cr" label from this issue. This use case depends on the WebNN/WebGPU interop capability. #257 introduced the […]
FYI, @tidoust wrote up some of his extensive research in nearby spaces: https://webrtchacks.com/real-time-video-processing-with-webcodecs-and-streams-processing-pipelines-part-1/ and https://webrtchacks.com/video-frame-processing-on-the-web-webassembly-webgpu-webgl-webcodecs-webnn-and-webtransport/ - the latter mentions WebNN explicitly.
I was about to link these two fantastic articles here, but @dontcallmedom beat me to it. Great work @tidoust and @dontcallmedom! I love the game-of-dominoes analogy. From now on it is my mental model for the video processing pipeline :-)
@dontcallmedom, is it correct that the WebRTC WG has been actively working on https://w3c.github.io/webrtc-encoded-transform/ as a (functionally equivalent?) replacement for the earlier proposal https://alvestrand.github.io/mediacapture-transform/ ? Per https://chromestatus.com/feature/5499415634640896 the earlier proposal shipped in Chrome. I wanted to document the WebRTC WG's most recent direction here should the WebML WG aspire to do further work in this space in the future.
mediacapture-transform (shipping in Chrome; https://w3c.github.io/mediacapture-transform/ is what's been agreed upon but not implemented) is about raw buffers.
@alvestrand thanks for the clarification. I had forgotten mediacapture-transform had transitioned (no pun intended) from an Unofficial Proposal Draft, https://alvestrand.github.io/mediacapture-transform/, to its own Working Draft, https://w3c.github.io/mediacapture-transform/. Feel free to ping this issue when you think it'd be a good time for the WebML WG to review the latest API again.
The WebRTC Working Group is working on an API to allow fast processing of real-time video, with two proposals under discussion towards convergence: https://github.com/alvestrand/mediacapture-transform and https://github.com/jan-ivar/mediacapture-transform (see also relevant issues on convergence). Chromium has started shipping an implementation based on the first proposal, which should allow for initial experimentation with the overall approach.
Since we can expect a lot of this real-time processing to be done with machine learning models, and as suggested by the Web Machine Learning Working Group charter, we should ensure that models loaded via WebNN-backed JS frameworks can be used in the context of that API (in particular, a WHATWG Streams-based API, running in a worker context, with video frames coming from a webcam and likely stored in GPU memory), and that it delivers actual performance improvements (in particular, that any boost from the hardware acceleration provided by WebNN doesn't get overtaken by the cost associated with e.g. memory copies).
My sense is that the best way to determine this would be:
While the real-time video processing framework in WebRTC is still somewhat in flux, I think we have enough convergence on the overall picture and a good enough basis for experimentation with the Chromium implementation to get started with such work. The WebRTC Samples repo has a few examples of that API in action (video-crop in particular exercises it in a worker context). /cc @aboba @alvestrand @jan-ivar