WASM support #164
Hi, thanks for the interest in contributing! I've looked at Candle and Burn a little, but not in a ton of depth. From what I've seen, the code is the model and then you can possibly store weights separately (vs something like TorchScript, where the code defines a model structure that can be exported and stored with the weights). Is that correct? Do either of them have a serializable and loadable format that contains both weights and model structure? In some sense, I guess an executable is that, but the downside is that it might not be super portable. I guess my question is: do both packages allow you to compile everything down, or force you to compile everything down?

## WASM Runners

I think a runner that can run WASM code could be really interesting. That's something I've thought about a little bit, but haven't fully explored. If we build something that supports WebGPU (maybe through wgpu) as well, that could give us a cross-platform runner for these artifacts that supports inference on CPU and GPU. I'm not sure how the details of that would work, but that certainly seems appealing.

On the packaging side, we'd probably define a set of functions that the WASM module has to implement (similar to how we handle arbitrary Python code), and then any program that can compile down to WASM and implement that interface will work with this runner. This seems pretty promising to me. Would you be interested in exploring/helping to implement it?
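To make the "set of functions the WASM module has to implement" idea concrete, here is a minimal Rust sketch. Everything in it (the names, the signatures, the f32-only tensor type) is a hypothetical illustration, not Carton's actual interface.

```rust
// Hypothetical sketch only: illustrates the idea of a small contract that any
// language compiling to WASM could implement. Not Carton's real API.

/// A tensor as it might cross the WASM boundary (illustrative).
pub struct Tensor {
    pub shape: Vec<u64>,
    pub data: Vec<f32>, // a real interface would support multiple dtypes
}

/// The contract a packaged WASM model might be asked to implement.
pub trait WasmModel {
    /// Load the model (weights compiled in or read from packaged artifacts).
    fn load() -> Self
    where
        Self: Sized;

    /// Run inference on a set of named input tensors.
    fn infer(&mut self, inputs: Vec<(String, Tensor)>) -> Vec<(String, Tensor)>;
}
```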
Great insights. I can only speak for Candle, but you are right: the code is not detachable from the model, and you have to compile everything. I don't think there is an idiomatic way to support models in the form of code + weights; it would involve some sort of JIT. It would definitely make more sense to support their compilation targets instead.
Definitely agree. We can define standard APIs for various languages that have wasm targets, which would allow us to create runners for not only wasm but normal binaries as well. We can use the
Sounds good! A few more thoughts:

## WebGPU

As far as I know, there currently isn't a standard way of exposing WebGPU to Wasm. It looks like many projects that use WebGPU from Wasm have JS shims that expose the relevant browser functionality as custom APIs that the Wasm code knows how to use. We could also expose WebGPU as an API to Wasm ourselves, but I don't think that provides much value because users would have to explicitly design and build against that API (vs it being something that just works). So this might be difficult to do until there's a standard way to use GPUs from Wasm. As a workaround, we could implement the API that wgpu expects when running from Wasm. Not sure if that's the best use of time at the moment though.

## Wasm vs Native

I think it's worth exploring Wasm vs native (on CPU) performance for some popular models. Are they fast enough compared to native alternatives that the portability outweighs the lower performance? That is to say, would people actually want to ship a model in Wasm if they aren't just targeting browsers?

## Native binaries

I think this requires us to be quite thoughtful about design:
For native binaries, I think we likely don't want to do the above. I think it could turn into a lot of work for questionable gain. A good balance might be something like "you can package up a

Now that I'm thinking more about it, that approach could work for Wasm too.

We'd also have to spend more time thinking about security and portability. You can already limit a model to specific platforms, so that isn't a huge deal, but native dependencies could make models frustrating to use (e.g. if people don't ship fully statically linked binaries, require a particular system configuration, etc). Wasm avoids a lot of that.

You could solve a lot of the UX problems I just described by letting people package up a Docker/OCI image, but that introduces some other problems. For example, if Carton is already being run in a container, we'd need to use nested containers or talk to the Docker daemon on the host. Both of these have a lot of issues. This is also a Linux-only solution (which might be okay, but it's worth noting).

My TODO list also includes tasks around better model isolation from the system, so I think we'd want to ship those before making it easy for people to run arbitrary native code. Technically, many ML models are just programs (see the TensorFlow security doc) and we already allow arbitrary Python code (which can include native code and PyPI packages), but I think it's important to have a well-thought-through security model before we let people directly package up an arbitrary

## Summary
## Proposal

Maybe a good approach moving forward is:
We can explore all the rest later (WebGPU, native binaries, etc). Thoughts?
Appreciate the insights! My initial suggestion for language-specific APIs was motivated by the difficulty of designing an API that is idiomatic to implement in multiple languages (especially Rust). But yeah, the responsibility for that abstraction could lie in a different module if needed. I almost got a super simple CPU-only runner working.
As for performance, native is obviously going to be faster than wasm, since you can leverage hardware-specific intrinsics and kernels like MKL, cuBLAS, Accelerate, etc. The selling point for wasm/wasi, as you mentioned, is the portability. If your product is a model as a service, you wouldn't really care. But if your product ships with the model embedded, like say an app or a game, you could use wasm as a fallback.
That's great! Looking forward to the PR
Cool :) Not that we're going to implement WebGPU support in a runner right now, but just a note: it looks like the code/project you linked to doesn't have a license yet, so be wary of copying from it.
Yes, you're correct. However, if Wasm is 10x slower than native for your model, it may be tempting to ship a separate native model for each platform rather than just one Wasm one. 10x is almost certainly an exaggeration, but it would be good to know what the actual number is for a few popular models. For example, if it's only 10% slower, the portability and convenience likely outweigh the performance difference for many use cases.
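For what it's worth, one rough way to get that number is to time the same forward pass compiled natively and compiled to a Wasm target. The sketch below is only the timing harness; `forward` is a stand-in, not a real model, and the build/run commands mentioned afterwards assume the `wasm32-wasi` target and Wasmtime are installed.

```rust
use std::time::Instant;

// Stand-in for a model's forward pass; in a real comparison this would be a
// candle/burn model compiled once natively and once to a Wasm target.
fn forward(input: &[f32]) -> f32 {
    input.iter().map(|x| x * x).sum()
}

fn main() {
    let input = vec![1.0f32; 1 << 20];

    // Warm up so the first iterations don't skew the timing.
    for _ in 0..10 {
        std::hint::black_box(forward(&input));
    }

    // Time a fixed number of iterations and report the average.
    let iters: u32 = 100;
    let start = Instant::now();
    for _ in 0..iters {
        std::hint::black_box(forward(&input));
    }
    println!("avg per iteration: {:?}", start.elapsed() / iters);
}
```

Assuming those tools are available, the same file can be measured natively with `cargo build --release`, and under Wasm with `cargo build --release --target wasm32-wasi` followed by `wasmtime run` on the produced `.wasm` binary.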
To make sure I understand correctly: are you saying an app or game that supports platforms

That primarily applies in the context of models that can compile to both Wasm and native code, right? Which brings us back to Burn and Candle, I guess.
Yeah, for example you could have CUDA, Vulkan, etc., and a wasm one, such that if the end user doesn't have the matching hardware deps, the app would still work. In such a case wasm's speed wouldn't matter as long as it's somewhat usable. You should try some of the candle browser examples like LLAMA2, which is surprisingly performant even without WebGPU. Putting that in a game is definitely feasible. I definitely plan on doing benchmarks though; since I'm on Mac, I'll do wasm vs native vs native + Accelerate. My initial prediction is that wasm and vanilla native will be quite similar. I think that, thanks to those libraries, you may start to see apps that offload some computation to the user's device, to save on inference cost, preserve privacy, enable offline usage, etc. I think Carton could be super helpful in making the development of such apps more idiomatic and modular.
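A minimal sketch of that fallback pattern, with hypothetical detection helpers (nothing here is tied to a real app or to Carton's API):

```rust
// Illustrative only: how an app might pick among packaged model variants at
// runtime, falling back to a portable Wasm build when no accelerator matches.

enum ModelVariant {
    Cuda,
    Vulkan,
    Wasm, // portable fallback: runs anywhere a Wasm runtime is available
}

// Hypothetical placeholders; a real app would query the driver / instance.
fn has_cuda() -> bool { false }
fn has_vulkan() -> bool { false }

fn pick_variant() -> ModelVariant {
    if has_cuda() {
        ModelVariant::Cuda
    } else if has_vulkan() {
        ModelVariant::Vulkan
    } else {
        ModelVariant::Wasm
    }
}
```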
This PR adds a WASM runner, which can run WASM models compiled against the interface (subject to change, #175) defined in `../carton-runner-wasm/wit/lib.wit`. The existing implementation is still unoptimized, requiring 2 copies per tensor moved to/from WASM. An example of compiling a compatible model can be found in `carton-runner-wasm/tests/test_model`.

## Limitations

- Only the `wasm32-unknown-unknown` target has been tested to be working.
- Only `infer` is supported for now.
- Packing only supports a single `.wasm` file and no other artifacts.
- No WebGPU, and probably not for a while.

## Test Coverage

All type conversions from Carton to WASM and vice versa are fully covered. Pack, load, and infer are covered in `pack.rs`.

## TODOs

Track in #164
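For readers who haven't opened the test model, a heavily simplified sketch of the general shape of a compatible module is below. The real contract is the WIT interface in `carton-runner-wasm/wit/lib.wit` (with `tests/test_model` as the actual example); the raw `extern "C"` export here is an illustration of "an `infer` entry point operating on data in linear memory", not that interface.

```rust
// Heavily simplified sketch of a model crate built as a wasm32-unknown-unknown
// cdylib (Cargo.toml would set crate-type = ["cdylib"]). The real interface is
// the WIT file in carton-runner-wasm/wit/lib.wit; this raw export is only
// illustrative.

/// Hypothetical export: run the "model" on `len` f32 values starting at `ptr`
/// inside WASM linear memory and write the results in place.
#[no_mangle]
pub extern "C" fn infer(ptr: *mut f32, len: usize) {
    // SAFETY: the host is expected to pass a valid region of linear memory.
    let data = unsafe { std::slice::from_raw_parts_mut(ptr, len) };
    for x in data.iter_mut() {
        *x *= 2.0; // stand-in for the actual model computation
    }
}
```

Such a crate would be built with `cargo build --release --target wasm32-unknown-unknown` (assuming that target is installed).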
Post #173
I think the TODOs
@VivekPanyam I'm going to start implementing
Yeah, these methods aren't super well documented yet. Here's a little snippet from the code for the public interface: `carton/source/carton/src/carton.rs`, lines 102 to 105 and lines 116 to 118 (at bc46639).
Basically, these let users write more efficient pipelines by allowing them to send tensors for inference

This works well for some runners where it's easy to do things in parallel (e.g. the TorchScript runner, where we could in theory start moving tensors for the next inference to the correct devices while the current inference is running in another thread). Even if we're not moving tensors between devices, we could at the very least parallelize the type conversions (which could matter if they involve copies of somewhat large tensors). Currently we don't do anything in the TorchScript runner other than storing the tensors in

In the Python runner, we allow users to optionally implement

Ideally, we'd do something similar for Wasm, but it might be tricky to implement in a way that actually parallelizes things, because it looks like threading and thread safety aren't super mature features of Wasm and Wasmtime. So for now, I think we can add these methods to the interface that users can implement, and then down the road, once threading support is more mature, these methods will have more value (without breaking the interface).

Some references that may be useful to explore:
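A sketch of the pipelining idea described above, using placeholder method names (`seal`, `infer_sealed`) and stub types, since the actual method names and signatures aren't shown here:

```rust
// Hypothetical sketch of the pipelining pattern: hand the next batch's tensors
// to the runner early so conversion/transfer can overlap with the current
// inference. Names and types are placeholders, not Carton's real API.

struct Model;
struct Inputs;         // stand-in for a map of named input tensors
struct Sealed(Inputs); // inputs the runner has already converted/moved
struct Outputs;

impl Model {
    /// Give inputs to the runner early so it can start type conversions or
    /// device transfers ahead of time.
    fn seal(&self, inputs: Inputs) -> Sealed {
        Sealed(inputs)
    }

    /// Run inference on previously sealed inputs.
    async fn infer_sealed(&self, _sealed: Sealed) -> Outputs {
        Outputs
    }
}

async fn pipelined(model: &Model, batches: Vec<Inputs>) -> Vec<Outputs> {
    let mut results = Vec::new();
    let mut iter = batches.into_iter();

    // Seal batch 0 up front.
    let mut next = iter.next().map(|b| model.seal(b));
    while let Some(current) = next.take() {
        // Seal batch n+1 before awaiting inference on batch n, so any
        // background conversion work can overlap with compute.
        next = iter.next().map(|b| model.seal(b));
        results.push(model.infer_sealed(current).await);
    }
    results
}
```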
Have you checked out lunatic? Might be a workaround for the lack of threads.
How easy would it be to add support for Rust-based libraries like Candle and Burn? I'd like to implement this if you aren't already working on it. I'd also appreciate your thoughts on whether this integration is even necessary or useful, since both packages allow you to compile everything down. Maybe it would make more sense to instead create runners for the formats those libraries can produce, like binaries, wasm, and executables.