I've been trying to reproduce our scaling plots from February to show the advantages of the C++ runtime for TST.
I've found that our scaling has degraded significantly for 1000 independent 1 ms tasks.
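For context, a minimal sketch of this kind of throughput benchmark, assuming parla-experimental's top-level `Parla`/`spawn`/`TaskSpace` API (the exact spawn/await idioms may differ between versions):

```python
import time

from parla import Parla, spawn, TaskSpace

N_TASKS = 1000
TASK_TIME = 0.001  # 1 ms of busy work per task


def busy_wait(seconds):
    # Spin rather than sleep so each task actually occupies a worker.
    end = time.perf_counter() + seconds
    while time.perf_counter() < end:
        pass


def main():
    with Parla():
        @spawn()
        async def bench():
            T = TaskSpace("T")
            start = time.perf_counter()
            for i in range(N_TASKS):
                @spawn(T[i])  # independent tasks: no dependencies
                def work():
                    busy_wait(TASK_TIME)
            await T  # barrier on the whole task space
            elapsed = time.perf_counter() - start
            print(elapsed)


if __name__ == "__main__":
    main()
```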
On an RTX node, we only get a 4.5x speedup on 8 threads.
On the SKX node, we get a 5.9x speedup (down from 7.4x in February).
Disabling contexts gets us back to a 6.57x speedup.
Using a single C++ call for cleanup instead of two (`PARLA_ENABLE_PYTHON_RUNAHEAD=false`) gets us to a 6.9x speedup.
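For anyone reproducing this, the flag can also be set programmatically. A sketch, assuming it is read once when the runtime initializes:

```python
import os

# Assumption: the flag is read at Parla startup, so it must be set before
# the runtime is imported/started (exporting it in the shell works equally well).
os.environ["PARLA_ENABLE_PYTHON_RUNAHEAD"] = "false"
```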
I'm not yet sure where the remaining missing time is. If it's on the Python side, it's possibly in creating device requirements?
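One quick way to check the Python-side theory is to profile the benchmark driver. A sketch using cProfile (caveat: cProfile only sees the calling thread, so Python overhead inside worker threads would need a sampling profiler such as py-spy instead; `main()` here refers to the hypothetical benchmark entry point above):

```python
import cProfile
import pstats

# Profile the benchmark and list the 20 most expensive cumulative calls.
cProfile.run("main()", "bench.prof")
pstats.Stats("bench.prof").sort_stats("cumulative").print_stats(20)
```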
We need to make our benchmarks easy to reproduce and automated again (e.g., fix the Google Benchmark scripts), and to check them before any feature merges.
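A sketch of what such a harness could look like, sweeping thread counts and reporting speedup. The script name and the `PARLA_NUM_THREADS` variable are placeholders for however the benchmark and worker count are actually configured; it assumes the benchmark prints elapsed seconds on stdout:

```python
import os
import subprocess


def run(threads: int) -> float:
    # Hypothetical: control the worker count via an environment variable
    # and read the elapsed time from the benchmark's stdout.
    env = dict(os.environ, PARLA_NUM_THREADS=str(threads))
    result = subprocess.run(
        ["python", "benchmark_independent_tasks.py"],
        env=env, capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


baseline = run(1)
for threads in (2, 4, 8):
    print(f"{threads} threads: {baseline / run(threads):.2f}x speedup")
```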
Yep, this was CPU-only throughput testing.
I'll handle this, since it's really mostly my code that hurt it. I need to move parts of the contexts to C++ and decrease the number of new dictionary allocations on the Python side.
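Illustrative only, not Parla's actual internals: one way to cut per-task dictionary allocations is to replace throwaway dicts with a `__slots__` class (or a pooled/reused object), which avoids allocating a fresh hash table for every spawned task. All names here are hypothetical:

```python
class DeviceRequirement:
    # Fixed fields, no per-instance __dict__ allocation.
    __slots__ = ("device", "memory", "vcus")

    def __init__(self, device, memory, vcus):
        self.device = device
        self.memory = memory
        self.vcus = vcus

# versus the dict-per-task pattern:
#   req = {"device": device, "memory": memory, "vcus": vcus}
```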