Hi, I am testing out the timing of different compute kernels. I use the same timing method as in the compute example to get the compute pass time. I also run a simple std::time::Instant timer from before the .commit() to after the .wait_until_completed(), and my CPU timer ends up being around 12x longer than the GPU timer. There really shouldn't be any copying between CPU and GPU here, so the only thing I can think of is waiting to dispatch the kernel, but I can't imagine it takes 14ms!
GPU timestamps with sampleTimestamps are complicated in Metal. They're not actually nanoseconds, so it's up to the application to correlate GPU timing with CPU timing somehow and get back to some unit of time that makes sense.
@grovesNL Thanks for the links. Is it normal to take 6ms to complete a kernel that does nothing? I think I might not be setting something up correctly. When I comment out all the code in my kernel, it still takes 6ms to run.
It's hard to say what's going on here for sure without profiling, I'd look at what's happening in a profiler like Metal System Trace or the Xcode profiler. You could try moving all of the buffer creation/copies/other setup outside of run to see if you can tell where the overhead is coming from.
There is definitely non-zero overhead to a GPU dispatch followed by a readback in general (e.g., many programs try to avoid waiting on a GPU read, instead treating all GPU work as completing asynchronously), but it's hard for me to guess what that overhead might be on your system without eliminating everything else here.
Here is my entire reproducible example:
And my output: