# Run Intel VTune against SlowMemcpyWorkload (#6)
### Installing Intel VTune on Ubuntu
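VTune is distributed as part of Intel oneAPI; one way to install it on Ubuntu is via Intel's apt repository:

```bash
# Add Intel's oneAPI apt repository and install VTune.
wget -qO- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
  | sudo gpg --dearmor -o /usr/share/keyrings/oneapi-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" \
  | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install intel-oneapi-vtune
```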
### Running
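A Hotspots collection from the command line looks like this (the script name is a placeholder for the actual benchmark entry point):

```bash
# Put the `vtune` CLI on PATH, then profile the workload.
source /opt/intel/oneapi/setvars.sh
vtune -collect hotspots -result-dir vtune_hotspots -- python slow_memcpy_workload.py
```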
### What does this benchmark do?

Here's the code. It's very simple!
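For readers without the repo to hand, here is a minimal sketch of this kind of workload (hypothetical code, not the benchmark itself): write an LZ4-compressed Zarr array, then read the whole thing back, which forces a decompress-then-memcpy for every chunk.

```python
# Hypothetical sketch of a "slow memcpy" workload (not the repo's code):
# reading a whole LZ4-compressed Zarr array decompresses every chunk and
# then copies it into the final contiguous numpy array.
import numpy as np
import zarr
from numcodecs import LZ4

SHAPE = (4096, 4096)
CHUNKS = (256, 256)

# Write an LZ4-compressed array to local storage.
z = zarr.open(
    "slow_memcpy_workload.zarr",
    mode="w",
    shape=SHAPE,
    chunks=CHUNKS,
    dtype="f4",
    compressor=LZ4(),
)
z[:] = np.random.random(SHAPE).astype("f4")

# This is the part to profile: decompress + copy into one big array.
result = z[:]
```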
### Results

VTune shows that LZ4 decompression takes the most time (not surprising). What is perhaps more surprising - and is consistent with Vincent's observations - is that the second-longest-running function is one of glibc's memmove variants with "unaligned" in its name (more on this below). ("CPI rate" means "Cycles Per Instruction retired". Smaller is better; the best a modern CPU can do is about 0.25.)

### Microarchitecture Exploration

(This type of profiling slows things down quite a lot.)

### Memory usage

The bottom sub-plot in the figure below is the memory bandwidth (y-axis, in GB/sec) over time. It's interesting that the code rarely maxes out the memory bandwidth (although I think VTune is slowing the code down quite a lot here):

[Figure: VTune platform timeline; bottom sub-plot shows memory bandwidth (GB/sec) over time.]
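For reference, the analyses above correspond to these VTune collection types (result-directory names are arbitrary, and the script name is the same placeholder as before):

```bash
# Microarchitecture Exploration (CPI, port utilisation, etc.).
vtune -collect uarch-exploration -result-dir vtune_uarch -- python slow_memcpy_workload.py

# Memory Access analysis (produces the memory-bandwidth timeline).
vtune -collect memory-access -result-dir vtune_mem -- python slow_memcpy_workload.py
```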
In last week's meeting, we discussed the hypothesis that, because the code is using a memmove function with "unaligned" in the name, the data must be unaligned in memory, and hence the system was forced to use a slow "unaligned" memmove; the idea was that we could speed things up by aligning the data in memory. After reading more about memmove, I no longer think this hypothesis is correct. My understanding is that these memmove_unaligned functions do spend the majority of their time moving data very efficiently using SIMD instructions. The "unaligned" in the name just means that the function handles the "ragged ends" at the start and/or end of the buffer. Once those ragged ends are handled, the function powers through the bytes very quickly using aligned SIMD. Lots of good info here: https://squadrick.dev/journal/going-faster-than-memcpy.html

So I can't immediately see any quick wins for Zarr Python: it has to copy data from the uncompressed chunk buffer into the final array. I'm not sure Dask will help, but I'll benchmark Dask too. In a low-level compiled language we could use multiple threads, one per CPU core, to copy each uncompressed chunk into the final array while it is still in CPU cache after decompression (a rough sketch of that idea follows below).
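As a rough illustration of that idea (a sketch, not a proposal for Zarr Python itself): my understanding is that numpy releases the GIL during large contiguous buffer copies, so even in Python a thread pool can overlap several chunk-to-output copies. The helper names below are made up.

```python
# Sketch: copy decompressed chunks into a preallocated output array from
# several threads. If numpy releases the GIL for these large contiguous
# copies (as I believe it does), the memmove-style copies run in parallel.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def copy_chunk(dst, src, dst_slice):
    # Plain slice assignment is essentially a memmove into `dst`.
    dst[dst_slice] = src

def assemble(chunks, shape, dtype, n_threads=8):
    """`chunks` maps a start offset along axis 0 to a decompressed
    numpy chunk; returns the assembled contiguous array."""
    out = np.empty(shape, dtype=dtype)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [
            pool.submit(copy_chunk, out, chunk,
                        slice(start, start + chunk.shape[0]))
            for start, chunk in chunks.items()
        ]
        for f in futures:
            f.result()  # propagate any exceptions
    return out
```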
zref: #5