Replies: 5 comments
-
**IO bandwidth**

For the I/O data transfer bandwidth in the plot above, the first 4 jobs peak at only about 620 MB/s (about half the bandwidth that the drive is capable of). In contrast, if we use the full
This shows that, even when reading just a single file (as each
-
If anyone's interested, here are the detailed performance metrics output by
-
I think my next step is to benchmark TensorStore. And then to start work on my Rust implementation of Zarr (starting with a bunch of performance experiments 🙂).
-
Fascinating stuff Jack! Thanks for sharing! The memory copies seem like low-hanging fruit. I know that the Arrow community is obsessed with avoiding memory copies. It would be interesting to dig into where unnecessary memory copies might be avoided in Zarr. It looks like the code is attempting to decompress directly into the target array memory when possible:
-
This should be completely eliminated for the case of contiguous (non-strided) reads: you can `readinto` a memory space or decompress directly into it, with no need to copy. numcodecs allows an `out=` argument for exactly this purpose; is it not being used?
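To illustrate the `readinto` half of this, here is a minimal stdlib-only sketch (numcodecs itself is not used here; the temp file and sizes are purely illustrative). The destination buffer is allocated once up front and filled in place, avoiding the intermediate `bytes` object (and the extra copy) that a plain `f.read()` would create:

```python
import os
import tempfile

# 4 KiB of sample data standing in for a chunk on disk.
payload = bytes(range(256)) * 16

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    path = f.name

buf = bytearray(len(payload))          # destination allocated up front
with open(path, "rb", buffering=0) as f:
    n = f.readinto(memoryview(buf))    # fills buf in place, no extra copy

os.unlink(path)
print(n, bytes(buf) == payload)
```

For the decompression half, the analogous numcodecs call is `codec.decode(compressed, out=target_view)`, which (for codecs that support it) writes the decompressed bytes straight into the target array's memory instead of returning a freshly allocated buffer.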
-
Here are some detailed performance profiles of Zarr-Python, NumPy, and `fio`.

In all cases, we read the entirety of a 1 gigabyte dataset from local SSD (using my very modest and aging Intel NUC, with a PCIe v3 SSD, capable of only ~2 GB/s). The 1 GB dataset is a repeated (monochrome) image of a tarte tatin 🙂.
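For reference, a minimal stdlib sketch of how a sequential-read bandwidth figure like the ones below can be measured (the real benchmarks use `fio` and VTune; the file here is scaled down to 16 MiB and its name is made up):

```python
import os
import tempfile
import time

# Hypothetical mini-benchmark: write a file, read it back sequentially,
# and report the effective read bandwidth. (16 MiB here; the real
# benchmark reads a 1 GB dataset.)
SIZE = 16 * 1024 * 1024

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(SIZE))
    path = f.name

start = time.perf_counter()
with open(path, "rb") as f:
    data = f.read()
elapsed = time.perf_counter() - start

os.unlink(path)
print(f"read {len(data) / 1e6:.0f} MB in {elapsed * 1e3:.1f} ms "
      f"({len(data) / elapsed / 1e6:.0f} MB/s)")
```

Note that with a warm page cache this mostly measures memory bandwidth rather than the SSD; `fio`'s `direct=1` option bypasses the cache, which plain Python file I/O cannot easily do portably.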
TL;DR: These analyses add support for what we already knew about Zarr-Python (including `io_uring`).

Details:
The image below shows the Intel VTune "Input and Output" analysis of Zarr-Benchmark running a total of 6 different jobs. Each job is run 3 times (so we can see the variation between runs). The 6 jobs and dataset are defined here. In summary, they include `numpy.load` to read an entire 1 GB `.npy` file into RAM.

The rows in the VTune screenshot show: