
Run Intel VTune against SlowMemcpyWorkload #6

Closed
JackKelly opened this issue Oct 3, 2023 · 3 comments
JackKelly commented Oct 3, 2023

zref: #5

JackKelly commented Oct 3, 2023

Installing Intel VTune on Ubuntu:

  1. sudo apt install pkg-config
  2. Install VTune using apt, following these instructions.
  3. Set all of these to 0 (I used sudo emacs <filename>; see the sketch after this list):
     - /proc/sys/kernel/yama/ptrace_scope
     - /proc/sys/kernel/perf_event_paranoid (from here)
     - /proc/sys/kernel/kptr_restrict
  4. Follow Intel's post-install instructions:
     - source /opt/intel/oneapi/vtune/latest/env/vars.sh
     - vtune-self-checker.sh
  5. vtune-gui
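
A minimal sketch of step 3 done from the command line instead of an editor (note these values reset on reboot):

#!/bin/bash

# Relax kernel restrictions so VTune can attach to processes and read
# performance counters and kernel symbol addresses:
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
echo 0 | sudo tee /proc/sys/kernel/kptr_restrict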

Running zarr-benchmark within VTune

I created a very simple shell script which activates the Python venv and runs the benchmark:

#!/bin/bash

# Activate venv:
source /home/jack/python_venvs/perfcapture/bin/activate

# Run zarr-benchmark:
/home/jack/python_venvs/perfcapture/bin/python \
    /home/jack/dev/zarr/perfcapture/scripts/cli.py \
    --data-path /home/jack/temp/perfcapture_data_path \
    --recipe-path /home/jack/dev/zarr/zarr-benchmark/recipes

Then run that shell script from VTune.
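
As an alternative to the GUI, here's a sketch of collecting the same Hotspots profile from the VTune command line (assuming the wrapper script above is saved as run_benchmark.sh and made executable):

# Set up the VTune environment, then profile the benchmark wrapper script:
source /opt/intel/oneapi/vtune/latest/env/vars.sh
vtune -collect hotspots -result-dir ./vtune_results -- ./run_benchmark.sh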

JackKelly commented Oct 3, 2023

What does this benchmark do?

Here's the code. It's very simple!

Results

SlowMemcpyWorkload takes 5.4 seconds to run on my machine. Here's the VTune Hotspots Summary:

[Screenshot: VTune Hotspots Summary]

VTune shows that LZ4 decompression takes the most time (not surprising). What is perhaps more surprising - and is consistent with Vincent's observations - is that the second-longest-running function is __memmove_avx_unaligned_erms. ERMS is a CPUID feature meaning "Enhanced REP MOVSB" (source); I think clear_page_erms is this 4-line ASM function:

[Screenshot: VTune hotspot functions]

("CPI rate" is "Cycles Per Instruction retired". Smaller is better. The best a modern CPU can do is about 0.25.)

[Screenshot]

Microarchitecture Exploration

(This type of profiling slows things down quite a lot)

[Screenshots: Microarchitecture Exploration results]

Memory usage

[Screenshots: memory usage analysis]

The bottom sub-plot in the figure below shows memory bandwidth (y-axis, in GB/sec) over time. It's interesting that the code rarely maxes out the memory bandwidth (although I think VTune is slowing the code down quite a lot here):

[Screenshot: memory bandwidth over time]

JackKelly commented Oct 3, 2023

In last week's meeting, we discussed the hypothesis that, because the code uses a memmove function with "unaligned" in its name, the data must be unaligned in memory, forcing the system to use a slow "unaligned" memmove; and that we could therefore speed things up by aligning the data in memory.

After reading more about memmove, I no longer think this hypothesis is correct. My understanding is that these memmove_unaligned functions spend the majority of their time moving data very efficiently using SIMD instructions. The "unaligned" in the name just means that the function handles the "ragged ends" at the start and/or end of the buffer; once those ragged ends are handled, the function powers through the bytes very quickly using aligned SIMD.

Lots of good info here: https://squadrick.dev/journal/going-faster-than-memcpy.html

So I can't immediately see any quick wins for Zarr Python: it has to copy data from the uncompressed chunk buffer into the final array. I'm not sure Dask will help, but I'll benchmark Dask too.

In a low-level compiled language we could use multiple threads, one per CPU core, to copy each uncompressed chunk into the final array while it is still in the CPU cache after decompression.
