
Run Intel VTune against SlowMemcpyWorkload #6

Closed
JackKelly opened this issue Oct 3, 2023 · 3 comments
JackKelly commented Oct 3, 2023

zref: #5

JackKelly commented Oct 3, 2023

Installing Intel VTune on Ubuntu:

  1. sudo apt install pkg-config
  2. Install VTune using apt, following these instructions.
  3. Set all of these to 0 (I used sudo emacs <filename>; see the sketch after this list):
     - /proc/sys/kernel/yama/ptrace_scope
     - /proc/sys/kernel/perf_event_paranoid (from here)
     - /proc/sys/kernel/kptr_restrict
  4. Follow Intel's post-install instructions:
     - source /opt/intel/oneapi/vtune/latest/env/vars.sh
     - vtune-self-checker.sh
  5. vtune-gui
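
A minimal sketch of step 3 done from the command line instead of an editor (note these values reset on reboot):

#!/bin/bash

# Relax kernel restrictions so VTune can attach to processes and read
# performance counters and kernel symbol addresses:
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
echo 0 | sudo tee /proc/sys/kernel/kptr_restrict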

Running zarr-benchmark within VTune

I created a very simple shell script which activates the Python venv and runs the benchmark:

#!/bin/bash

# Activate venv:
source /home/jack/python_venvs/perfcapture/bin/activate

# Run zarr-benchmark:
/home/jack/python_venvs/perfcapture/bin/python \
    /home/jack/dev/zarr/perfcapture/scripts/cli.py \
    --data-path /home/jack/temp/perfcapture_data_path \
    --recipe-path /home/jack/dev/zarr/zarr-benchmark/recipes

Then run that shell script from VTune.
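
As an alternative to the GUI, here's a sketch of collecting the same Hotspots profile from the VTune command line (assuming the wrapper script above is saved as run_benchmark.sh and made executable):

# Set up the VTune environment, then profile the benchmark wrapper script:
source /opt/intel/oneapi/vtune/latest/env/vars.sh
vtune -collect hotspots -result-dir ./vtune_results -- ./run_benchmark.sh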

JackKelly commented Oct 3, 2023

What does this benchmark do?

Here's the code. It's very simple!

Results

SlowMemcpyWorkload takes 5.4 seconds to run on my machine. Here's the VTune Hotspots Summary:

[Screenshot: VTune Hotspots Summary]

VTune shows that LZ4 decompression takes the most time (not surprising). What is perhaps more surprising - and is consistent with Vincent's observations - is that the second-longest-running function is __memmove_avx_unaligned_erms. ERMS is a CPUID feature meaning "Enhanced REP MOVSB" (source); I think clear_page_erms is this 4-line ASM function:

[Screenshot: VTune hotspot functions]

("CPI rate" is "Cycles Per Instruction retired". Smaller is better. The best a modern CPU can do is about 0.25.)

[Screenshot]

Microarchitecture Exploration

(This type of profiling slows things down quite a lot)

[Screenshots: Microarchitecture Exploration results]

Memory usage

[Screenshots: memory usage analysis]

The bottom sub-plot in the figure below shows memory bandwidth (y-axis, in GB/sec) over time. It's interesting that the code rarely maxes out the memory bandwidth (although I think VTune is slowing the code down quite a lot here):

[Screenshot: memory bandwidth over time]

JackKelly commented Oct 3, 2023

In last week's meeting, we discussed the hypothesis that, because the code uses a memmove function with "unaligned" in its name, the data must be unaligned in memory, forcing the system to use a slow "unaligned" memmove; and that we could therefore speed things up by aligning the data in memory.

After reading more about memmove, I no longer think this hypothesis is correct. My understanding is that these memmove_unaligned functions spend the majority of their time moving data very efficiently using SIMD instructions. The "unaligned" in the name just means that the function handles the "ragged ends" at the start and/or end of the buffer; once those ragged ends are handled, the function powers through the bytes very quickly using aligned SIMD.

Lots of good info here: https://squadrick.dev/journal/going-faster-than-memcpy.html

So I can't immediately see any quick wins for Zarr Python: it has to copy data from the uncompressed chunk buffer into the final array. I'm not sure Dask will help, but I'll benchmark Dask too.

In a low-level compiled language we could use multiple threads, one per CPU core, to copy each uncompressed chunk into the final array while it is still in the CPU cache after decompression.
