You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I hope to either use webgpu or emu to parallelise MSM/FFTs. I am not sure what the current implementation of MSM is, but we need to load-balance the different buckets. My target is a 2-10x speedup on a powerful GPU like GTX1060-6GB.
An idea I have which may or may not make sense is to perform the FFT in parallel on the CPU first, then when the butterflies are sufficiently small, move them into small kernels on the GPU. (can't remember which way around it is). Due to lack of data dependencies, we should be able to get very high streaming throughput.
Other problems to think about is how to package the library so that GPU can be compiled for if desired/detected. Shouldn't be too hard utilising some WebGPU/#[cfg] functionality.
The text was updated successfully, but these errors were encountered:
This would be great! I think starting with FFTs might be easier in the short term, just because it only requires field arithmetic. There is some prior work in the snarky library: o1-labs/snarky#356
I think we should add some auto-optimisation to decide how to split the problem between GPU and CPU/multicore or even distributed. Distributed is of value as it could lead to truly fast latencies - maybe 0.1s for offloading proving. This should be a separate issue. Can collab with RISELab maybe. I should look for some bayes opt in rust. I think probably something like Gaussian process optimisation might be fine. Should be relatively low-dimensional optimisation problem. As an alternative, the Simple(x) optimisation (rust crate) algorithm may be good.
I hope to either use webgpu or emu to parallelise MSM/FFTs. I am not sure what the current implementation of MSM is, but we need to load-balance the different buckets. My target is a 2-10x speedup on a powerful GPU like GTX1060-6GB.
An idea I have which may or may not make sense is to perform the FFT in parallel on the CPU first, then when the butterflies are sufficiently small, move them into small kernels on the GPU. (can't remember which way around it is). Due to lack of data dependencies, we should be able to get very high streaming throughput.
Other problems to think about is how to package the library so that GPU can be compiled for if desired/detected. Shouldn't be too hard utilising some WebGPU/
#[cfg]
functionality.The text was updated successfully, but these errors were encountered: