
Improve large array allocation speed #14177

Closed
mrocklin opened this issue Aug 1, 2019 · 34 comments · Fixed by #14216

Comments

@mrocklin
Contributor

mrocklin commented Aug 1, 2019

I would like to be able to allocate NumPy arrays quickly. Currently, calling numpy.ones runs at about 2-3 GB/s on consumer laptops (tested on a 2018 MacBook Pro and a ThinkPad Carbon 4th gen with an i7).

My general intuition is that I should be able to allocate memory more quickly than this. Is that true? Is there anything that can be done here?

Reproducing code example:

import numpy

a = numpy.ones(shape=(1000000000), dtype='u1')  # 1 GB

# CPU times: user 293 ms, sys: 286 ms, total: 579 ms
# Wall time: 580 ms

>>> numpy.__version__
'1.17.0'

Numpy/Python version information:

1.17.0 3.7.1 (default, Oct 23 2018, 14:07:42)
[Clang 4.0.1 (tags/RELEASE_401/final)]
@charris
Member

charris commented Aug 1, 2019

There are probably three things involved here: the OS, memory bandwidth (the array is initialized to 1), and the NumPy/Python allocation implementation. How does the speed of empty/zeros compare? @juliantaylor would know more about this.

@mrocklin
Contributor Author

mrocklin commented Aug 1, 2019

How does the speed of empty/zeros compare?

They're free to start, but we'll pay the allocation cost eventually, when we write to those pages.

In [1]: import numpy

In [2]: %%time
   ...: a = numpy.zeros(shape=(1000000000), dtype='u1')  # 1 GB
   ...:
   ...:
CPU times: user 16 µs, sys: 9 µs, total: 25 µs
Wall time: 29.8 µs

In [3]: %%time
   ...: a = numpy.empty(shape=(1000000000), dtype='u1')  # 1 GB
   ...:
   ...:
CPU times: user 17 µs, sys: 11 µs, total: 28 µs
Wall time: 33.1 µs
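A minimal sketch (my own, assuming Linux-style demand paging) showing that the cost is just deferred to the first write:

import time
import numpy

a = numpy.zeros(shape=(1000000000,), dtype='u1')  # returns almost immediately

t0 = time.perf_counter()
a[:] = 1   # first write faults the pages in, paying the deferred allocation cost
print('first write:', time.perf_counter() - t0)

t0 = time.perf_counter()
a[:] = 2   # pages are now resident, so this is closer to raw memory bandwidth
print('second write:', time.perf_counter() - t0)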

@charris
Member

charris commented Aug 1, 2019

I get slightly faster times for the uninitialized allocations; np.ones, however, is much faster here:

In [7]: %%time 
   ...: a = numpy.ones(shape=(1000000000), dtype='u1') 
   ...:  
   ...:                                                                         
CPU times: user 26 ms, sys: 55.5 ms, total: 81.4 ms
Wall time: 81.7 ms

This is on a six-year-old desktop with an Intel(R) Core(TM) i5-4670K CPU @ 3.40GHz and 1866 MHz 2x8 GB (16 GB) memory. It seems that memory bandwidth is the determining factor here.
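A quick back-of-envelope check of the implied fill bandwidth (simple arithmetic from the wall times above for a 1 GB array):

print(1.0 / 0.0817)  # ~12.2 GB/s for the 81.7 ms figure on this desktop
print(1.0 / 0.580)   # ~1.7 GB/s for the 580 ms figure in the original report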

@charris
Member

charris commented Aug 1, 2019

And perhaps the compiler also; this is on Fedora 30, gcc 9.1.1.

@mrocklin
Contributor Author

mrocklin commented Aug 1, 2019

I would love it if I could get that performance :)

I'm now very curious what the determining factors are here.

@charris
Member

charris commented Aug 1, 2019

I'd have to check the BIOS to see how the memory is actually set up; I don't think DDR4 is supported, so I'm not sure anymore what is actually in there. It may be overclocked a bit.

EDIT: G.SKILL · DDR3 · 8 GB · 1,866 MHz · 240-pin

@charris
Member

charris commented Aug 1, 2019

My guess is that it is a combination of compiler and laptop. IIRC, laptops tend to have slow memory; when I bought my current setup, I think common laptops were running at about 1/4 the speed. Power-saving settings can also have a big impact.

@pentschev
Contributor

I don't think that the memory frequency is the determining factor here. I'm on a desktop with an i7-7800X @ 3.50 GHz, 1x16 GB DDR4 @ 2400 MHz, and the CPU governor set to performance, and I am only seeing slightly faster results than @mrocklin:

In [1]: import numpy as np

In [2]: np.__version__
Out[2]: '1.17.0'

In [3]: %%time
   ...: a = np.ones(int(1e9), dtype='u1')
   ...:
   ...:
CPU times: user 128 ms, sys: 208 ms, total: 336 ms
Wall time: 335 ms

I think it may have something to do with the compiler or some Linux configuration.

@charris if you watch top/htop/free, do you see 1 GB more resident memory being allocated after np.ones gets executed?
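(A small sketch of one way to check this from within the process on Linux, using the standard resource module; ru_maxrss is the peak resident set size, reported in KiB there:)

import resource
import numpy as np

def rss_mb():
    # peak resident set size in MB (ru_maxrss is KiB on Linux, bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

print(rss_mb())
a = np.ones(int(1e9), dtype='u1')
print(rss_mb())  # should be roughly 1000 MB higher if the pages are resident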

@charris
Member

charris commented Aug 1, 2019

@pentschev Yes.

@charris
Member

charris commented Aug 1, 2019

I wonder if it may have something to do with disk backup of the files in memory. I'm running two SSDs in RAID 0 (yes, I live dangerously :)

@seberg
Member

seberg commented Aug 1, 2019

(If it is slow) could there also be things about MADV_HUGEPAGE not kicking in, possibly even because of OS settings?

EDIT: There is also a note about hugepages in the 1.16 release notes; on Linux, /sys/kernel/mm/transparent_hugepage/enabled should be madvise or always.
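(A quick way to check the active mode from Python; the bracketed entry in the output is the one in effect:)

with open('/sys/kernel/mm/transparent_hugepage/enabled') as f:
    print(f.read())   # e.g. "always [madvise] never"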

@pentschev
Contributor

@seberg I had checked that before and it was set to madvise. Just for peace of mind I tried setting it to always now, and I do see an improvement after that:

In [3]: %%time
   ...: a = np.ones(int(1e9), dtype='u1')
   ...:
   ...:
CPU times: user 48.1 ms, sys: 93.5 ms, total: 142 ms
Wall time: 141 ms

So it looks like madvise may not be enough for all cases. Also, in NumPy, if MADV_HUGEPAGE is available, it's supposed to be the default for any buffer size >= 4 MB:

https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/alloc.c#L77-L84
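For illustration, the effect of that C code can be approximated from Python 3.8+ on Linux with the mmap module; this is just a hand-rolled sketch of the same kernel hint, not NumPy's actual allocator:

import mmap

n = 1 << 30                       # 1 GiB anonymous mapping, comparable to the arrays above
buf = mmap.mmap(-1, n)
buf.madvise(mmap.MADV_HUGEPAGE)   # ask the kernel to back this region with transparent huge pages
buf[:4096] = b'\x01' * 4096       # touching pages is what actually faults them in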

@pentschev
Contributor

Not really sure why I had to have always set for this to be faster; from the kernel docs:

"madvise" will enter direct reclaim like "always" but only for regions
that have used madvise(MADV_HUGEPAGE). This is the default behaviour.

@charris
Member

charris commented Aug 1, 2019

I should mention that my memory has dual channel access (two slots out of four used), which might also make a difference.

EDIT: Dual channel memory in two slots, memory is overclocked. Specs for the Asus Z87I-DELUXE motherboard.

@charris
Member

charris commented Aug 1, 2019

@pentschev What does the bios say about your memory setup?

@pentschev
Contributor

I only have a single memory stick installed, so no dual channel in that case. I do have access to dual-CPU Xeon @2.20 GHz machines (20+20 cores / 40+40 threads) and 8x32 GB @ 2133 MHz on each CPU, but it's even slower (I guess due to CPU frequency):

Without hugepage:

In [3]: %%time
   ...: a = np.ones(int(1e9), dtype='u1')
   ...:
   ...:
CPU times: user 68 ms, sys: 376 ms, total: 444 ms
Wall time: 443 ms

With hugepage:

In [4]: %%time
   ...: a = np.ones(int(1e9), dtype='u1')
   ...:
   ...:
CPU times: user 40 ms, sys: 208 ms, total: 248 ms
Wall time: 249 ms

@pentschev
Contributor

And by with/without hugepage I mean /sys/kernel/mm/transparent_hugepage/enabled set to always and madvise, respectively.

@pentschev
Contributor

After some more debugging, I found out that the use of hugepages is heavily dependent on the build. If I build it locally in my environment, I can actually enjoy faster allocation with madvise set, but the release packages from conda-forge or pip (apparently there's no Anaconda package for 1.17 yet) don't enable that. To me it seems that this happens because the following checks didn't pass at compile time:

https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/alloc.c#L28-L33

I'm wondering now: should we try to enable this with a runtime check, depending on what the target environment has available, rather than with compile-time checks?

cc @jakirkham

@mrocklin
Contributor Author

mrocklin commented Aug 5, 2019 via email

@pentschev
Contributor

I was reading more about madvise() and its man page says that MADV_HUGEPAGE is only available on Linux (since 2.6.38). That said, my proposal is to do the following:

  1. Check during the build whether it's Linux and only include the necessary headers in that case;
  2. Call madvise() and ignore it if EINVAL is returned (optionally, we could also raise a warning to the user).

If this sounds like an acceptable solution, I can submit a PR, so please let me know if you agree or disagree with the proposed solution.
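To make the idea concrete, here is a rough sketch of such a runtime probe, written in Python for brevity (the real change would of course go into NumPy's C allocator); the function name is made up:

import errno
import mmap

def hugepage_madvise_supported():
    # Probe at runtime instead of deciding at build time.
    if not hasattr(mmap, 'MADV_HUGEPAGE'):
        return False              # constant not exposed on this platform
    probe = mmap.mmap(-1, mmap.PAGESIZE)
    try:
        probe.madvise(mmap.MADV_HUGEPAGE)
        return True
    except OSError:
        # An old kernel typically reports EINVAL; treat any failure as "not supported".
        return False
    finally:
        probe.close()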

@mrocklin
Contributor Author

mrocklin commented Aug 6, 2019

I found out that the use of hugepages is heavily dependent on the build. If I build it locally in my environment, I can actually enjoy faster allocation with madvise set, but the release packages from conda-forge or pip (apparently there's no Anaconda package for 1.17 yet) don't enable that

cc @jjhelmus from Anaconda

@jakirkham
Contributor

cc @hmaarrfk (who may also have interest in this)

@hmaarrfk
Contributor

hmaarrfk commented Aug 7, 2019

Definitely wondering why NumPy was still slow even though that patch was accepted (I think in time for 1.16).

@hmaarrfk
Contributor

hmaarrfk commented Aug 7, 2019

My 2 cents: I don't think you should warn people who are running CentOS 6 about anything related to CentOS 6. They know they are running an old OS. They have their reasons, and they know new kernel features are what they are missing out on.

A runtime check that helps packages give people on modern operating systems the features they deserve would be really appreciated!

@hmaarrfk
Contributor

hmaarrfk commented Aug 7, 2019

20+20 cores / 40+40 threads) and 8x32 GB @ 2133 MHz on each CPU, but it's even slower (I guess due to CPU frequency

I doubt this is the root cause. The processor is likely thrashing, invalidating its own cache, and causing cache misses. Most OSes are not tuned, out of the box, to work well on high-memory systems.

Finally, this post shows that a large difference in performance can in fact be obtained by using the correct instructions for large memory copies.
https://github.com/awreece/memory-bandwidth-demo

@pentschev
Contributor

My 2 cents: I don't think you should warn people who are running CentOS 6 about anything related to CentOS 6. They know they are running an old OS. They have their reasons, and they know new kernel features are what they are missing out on.

I agree, I was just pointing it out as an option.

A runtime check that helps packages give people on modern operating systems the features they deserve would be really appreciated!

That's what I thought; I will work on a PR for that.

@pentschev
Contributor

20+20 cores / 40+40 threads) and 8x32 GB @ 2133 MHz on each CPU, but it's even slower (I guess due to CPU frequency

I doubt this is the root cause. The processor is likely thrashing, invalidating its own cache, and causing cache misses. Most OSes are not tuned, out of the box, to work well on high-memory systems.

I was mentioning frequency as one of the causes of slower allocation, but certainly there will be others, as you duly noted.

Finally, this post shows that a large difference in performance can in fact be obtained by using the correct instructions for large memory copies.
https://github.com/awreece/memory-bandwidth-demo

Thanks for the link, it's really concise work. Perhaps some of those ideas can be integrated in the future to take full advantage of the available hardware, since unfortunately distributed packages may not contain all the auto-generated compiler optimizations one could hope for.

@jjhelmus
Contributor

jjhelmus commented Aug 8, 2019

We (Anaconda) are working on getting 1.17.0 packages out. It may take a few additional days as we work out some issues around building with MKL as the BLAS backend.

@rgommers
Member

rgommers commented Aug 8, 2019

There seem to be multiple BLAS-related build issues; BLIS and ATLAS also seemed to have problems. If it's due to something we changed/broke, perhaps with the introduction of NPY_BLAS_ORDER, please open a new issue for MKL.

@mrocklin
Contributor Author

mrocklin commented Aug 8, 2019

@jjhelmus the ask here isn't "please get 1.17 packages out", it's "when building those packages, please be aware that it would be nice to build with support for these allocation speed improvements".

@mrocklin
Contributor Author

mrocklin commented Aug 8, 2019

(regardless, thanks for managing Numpy packaging)

@hmaarrfk
Contributor

hmaarrfk commented Aug 9, 2019 via email

@jakirkham
Contributor

"when building those packages, please be aware that it would be nice to build with support for these allocation speed improvements"

AIUI this is no longer the ask. Instead, what is being requested is: let's detect the machine's capabilities at runtime and always use these allocation improvements if the machine supports them. IOW, no need to do custom builds of NumPy for this functionality, as it will be there by default. 🙂

@pentschev
Contributor

And just to complement @jakirkham's comment, I think we should try, to the extent possible, to ensure that optimizations are always built in, rather than transferring the responsibility to package maintainers, since it can be difficult to find out about all the existing ones and how to properly enable them at compile time (and even more difficult for both package maintainers and users to verify whether they were properly enabled).
