
Improve large array allocation speed #14177

Closed
mrocklin opened this issue Aug 1, 2019 · 34 comments · Fixed by #14216

Comments

@mrocklin
Contributor

mrocklin commented Aug 1, 2019

I would like to be able to allocate NumPy arrays quickly. Currently, calling numpy.ones runs at about 2-3 GB/s on consumer laptops (tested on a 2018 MacBook Pro and a ThinkPad Carbon 4th gen with an i7).

My general intuition is that I should be able to allocate memory more quickly than this. Is that true? Is there anything that can be done here?

Reproducing code example:

import numpy

a = numpy.ones(shape=(1000000000), dtype='u1')  # 1 GB

# CPU times: user 293 ms, sys: 286 ms, total: 579 ms
# Wall time: 580 ms

>>> numpy.__version__
'1.17.0'

Numpy/Python version information:

1.17.0 3.7.1 (default, Oct 23 2018, 14:07:42)
[Clang 4.0.1 (tags/RELEASE_401/final)]
@charris
Member

charris commented Aug 1, 2019

There are probably three things involved here: the OS, memory bandwidth (the array is initialized to 1), and the NumPy/Python allocation implementation. How does the speed of empty/zeros compare? @juliantaylor would know more about this.

@mrocklin
Contributor Author

mrocklin commented Aug 1, 2019

How does the speed of empty/zeros compare?

They're free to start, but we'll pay the allocation cost eventually, when we write to those pages.

In [1]: import numpy

In [2]: %%time
   ...: a = numpy.zeros(shape=(1000000000), dtype='u1')  # 1 GB
   ...:
   ...:
CPU times: user 16 µs, sys: 9 µs, total: 25 µs
Wall time: 29.8 µs

In [3]: %%time
   ...: a = numpy.empty(shape=(1000000000), dtype='u1')  # 1 GB
   ...:
   ...:
CPU times: user 17 µs, sys: 11 µs, total: 28 µs
Wall time: 33.1 µs
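A minimal sketch (my own, assuming Linux-style demand paging) showing that the cost is just deferred to the first write:

import time
import numpy

a = numpy.zeros(shape=(1000000000,), dtype='u1')  # returns almost immediately

t0 = time.perf_counter()
a[:] = 1   # first write faults the pages in, paying the deferred allocation cost
print('first write:', time.perf_counter() - t0)

t0 = time.perf_counter()
a[:] = 2   # pages are now resident, so this is closer to raw memory bandwidth
print('second write:', time.perf_counter() - t0)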

@charris
Member

charris commented Aug 1, 2019

I get slightly faster times for the uninitialized allocations; np.ones, however, is much faster here:

In [7]: %%time 
   ...: a = numpy.ones(shape=(1000000000), dtype='u1') 
   ...:  
   ...:                                                                         
CPU times: user 26 ms, sys: 55.5 ms, total: 81.4 ms
Wall time: 81.7 ms

This is on a six-year-old desktop with an Intel(R) Core(TM) i5-4670K CPU @ 3.40GHz and 1866 MHz 2x8 GB (16 GB) memory. It seems that memory bandwidth is the determining factor here.
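A quick back-of-envelope check of the implied fill bandwidth (simple arithmetic from the wall times above for a 1 GB array):

print(1.0 / 0.0817)  # ~12.2 GB/s for the 81.7 ms figure on this desktop
print(1.0 / 0.580)   # ~1.7 GB/s for the 580 ms figure in the original report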

@charris
Member

charris commented Aug 1, 2019

And perhaps the compiler also; this is on Fedora 30, gcc 9.1.1.

@mrocklin
Contributor Author

mrocklin commented Aug 1, 2019

I would love it if I could get that performance :)

I'm now very curious what the determining factors are here.

@charris
Member

charris commented Aug 1, 2019

I'd have to check the BIOS to see how the memory is actually set up; I don't think DDR4 is supported, so I'm not sure anymore what is actually in there. It may be overclocked a bit.

EDIT: G.SKILL · DDR3 · 8 GB · 1,866 MHz · 240-pin

@charris
Member

charris commented Aug 1, 2019

My guess is that it is a combination of compiler and laptop. IIRC, laptops tend to have slow memory; when I bought my current setup, I think common laptops were running at about 1/4 the speed. Power-saving settings can also have a big impact.

@pentschev
Contributor

I don't think that the memory frequency is the determining factor here. I'm on a desktop with an i7-7800X @ 3.50 GHz, 1x16 GB DDR4 @ 2400 MHz, and the CPU governor set to performance, and I am only seeing slightly faster results than @mrocklin:

In [1]: import numpy as np

In [2]: np.__version__
Out[2]: '1.17.0'

In [3]: %%time
   ...: a = np.ones(int(1e9), dtype='u1')
   ...:
   ...:
CPU times: user 128 ms, sys: 208 ms, total: 336 ms
Wall time: 335 ms

I think it may have something to do with the compiler or some Linux configuration.

@charris if you watch top/htop/free, do you see 1 GB more resident memory being allocated after np.ones gets executed?
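(A small sketch of one way to check this from within the process on Linux, using the standard resource module; ru_maxrss is the peak resident set size, reported in KiB there:)

import resource
import numpy as np

def rss_mb():
    # peak resident set size in MB (ru_maxrss is KiB on Linux, bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

print(rss_mb())
a = np.ones(int(1e9), dtype='u1')
print(rss_mb())  # should be roughly 1000 MB higher if the pages are resident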

@charris
Member

charris commented Aug 1, 2019

@pentschev Yes.

@charris
Member

charris commented Aug 1, 2019

I wonder if it may have something to do with disk backup of the files in memory. I'm running two SSDs in RAID 0 (yes, I live dangerously :)

@seberg
Member

seberg commented Aug 1, 2019

(If it is slow) could there also be things about MADV_HUGEPAGE not kicking in, possibly even because of OS settings?

EDIT: There is also a note about hugepages in the 1.16 release notes; on Linux, /sys/kernel/mm/transparent_hugepage/enabled should be madvise or always.
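(A quick way to check the active mode from Python; the bracketed entry in the output is the one in effect:)

with open('/sys/kernel/mm/transparent_hugepage/enabled') as f:
    print(f.read())   # e.g. "always [madvise] never"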

@pentschev
Contributor

@seberg I had checked that before and it was set to madvise. Just for peace of mind I tried setting it to always now, and I do see an improvement after that:

In [3]: %%time
   ...: a = np.ones(int(1e9), dtype='u1')
   ...:
   ...:
CPU times: user 48.1 ms, sys: 93.5 ms, total: 142 ms
Wall time: 141 ms

So it looks like madvise may not be enough for all cases. Also, in NumPy, if MADV_HUGEPAGE is available, it's supposed to be the default for any buffer size >= 4 MB:

https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/alloc.c#L77-L84
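For illustration, the effect of that C code can be approximated from Python 3.8+ on Linux with the mmap module; this is just a hand-rolled sketch of the same kernel hint, not NumPy's actual allocator:

import mmap

n = 1 << 30                       # 1 GiB anonymous mapping, comparable to the arrays above
buf = mmap.mmap(-1, n)
buf.madvise(mmap.MADV_HUGEPAGE)   # ask the kernel to back this region with transparent huge pages
buf[:4096] = b'\x01' * 4096       # touching pages is what actually faults them in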

@pentschev
Contributor

Not really sure why I had to have always set for this to be faster; from the kernel docs:

"madvise" will enter direct reclaim like "always" but only for regions
that have used madvise(MADV_HUGEPAGE). This is the default behaviour.

@charris
Member

charris commented Aug 1, 2019

I should mention that my memory has dual channel access (two slots out of four used), which might also make a difference.

EDIT: Dual channel memory in two slots, memory is overclocked. Specs for the Asus Z87I-DELUXE motherboard.

@charris
Member

charris commented Aug 1, 2019

@pentschev What does the bios say about your memory setup?

@pentschev
Contributor

I only have a single memory stick installed, so no dual channel in that case. I do have access to dual-CPU Xeon @2.20 GHz machines (20+20 cores / 40+40 threads) and 8x32 GB @ 2133 MHz on each CPU, but it's even slower (I guess due to CPU frequency):

Without hugepage:

In [3]: %%time
   ...: a = np.ones(int(1e9), dtype='u1')
   ...:
   ...:
CPU times: user 68 ms, sys: 376 ms, total: 444 ms
Wall time: 443 ms

With hugepage:

In [4]: %%time
   ...: a = np.ones(int(1e9), dtype='u1')
   ...:
   ...:
CPU times: user 40 ms, sys: 208 ms, total: 248 ms
Wall time: 249 ms

@pentschev
Contributor

And by with/without hugepage I mean /sys/kernel/mm/transparent_hugepage/enabled set to always and madvise, respectively.

@pentschev
Contributor

After some more debugging, I found out that the use of hugepages is heavily dependent on the build. If I build it locally in my environment, I can actually enjoy faster allocation with madvise set, but the release packages from conda-forge or pip (apparently there's no Anaconda package for 1.17 yet) don't enable that. To me it seems that this happens because the following checks didn't pass at compile time:

https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/alloc.c#L28-L33

I'm wondering now: should we try to enable this with a runtime check, depending on what the target environment has available, rather than with compile-time checks?

cc @jakirkham

@mrocklin
Contributor Author

mrocklin commented Aug 5, 2019 via email

@pentschev
Contributor

I was reading more about madvise() and its man page says that MADV_HUGEPAGE is only available on Linux (since 2.6.38). That said, my proposal is to do the following:

  1. Check during the build whether it's Linux and only include the necessary headers in that case;
  2. Call madvise() and ignore it if EINVAL is returned (optionally, we could also raise a warning to the user).

If this sounds like an acceptable solution, I can submit a PR, so please let me know if you agree or disagree with the proposed solution.
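To make the idea concrete, here is a rough sketch of such a runtime probe, written in Python for brevity (the real change would of course go into NumPy's C allocator); the function name is made up:

import errno
import mmap

def hugepage_madvise_supported():
    # Probe at runtime instead of deciding at build time.
    if not hasattr(mmap, 'MADV_HUGEPAGE'):
        return False              # constant not exposed on this platform
    probe = mmap.mmap(-1, mmap.PAGESIZE)
    try:
        probe.madvise(mmap.MADV_HUGEPAGE)
        return True
    except OSError:
        # An old kernel typically reports EINVAL; treat any failure as "not supported".
        return False
    finally:
        probe.close()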

@mrocklin
Contributor Author

mrocklin commented Aug 6, 2019

I found out that the use of hugepages is heavily dependent on the build. If I build it locally in my environment, I can actually enjoy faster allocation with madvise set, but the release packages from conda-forge or pip (apparently there's no Anaconda package for 1.17 yet) don't enable that

cc @jjhelmus from Anaconda

@jakirkham
Contributor

cc @hmaarrfk (who may also have interest in this)

@hmaarrfk
Contributor

hmaarrfk commented Aug 7, 2019

Definitely wondering why NumPy was still slow even though that patch was accepted (I think in time for 1.16).

@hmaarrfk
Contributor

hmaarrfk commented Aug 7, 2019

My 2 cents: I don't think you should warn people who are running CentOS 6 about anything related to CentOS 6. They know they are running an old OS. They have their reasons, and they know new kernel features are what they are missing out on.

A runtime check that helps packages give people on modern operating systems the features they deserve would be really appreciated!

@hmaarrfk
Contributor

hmaarrfk commented Aug 7, 2019

20+20 cores / 40+40 threads) and 8x32 GB @ 2133 MHz on each CPU, but it's even slower (I guess due to CPU frequency

I doubt this is the root cause. The processor is likely thrashing, invalidating its own cache, and causing cache misses. Most OSes are not tuned, out of the box, to work well on high-memory systems.

Finally, this post shows that a large difference in performance can in fact be obtained by using the correct instructions for large memory copies.
https://github.com/awreece/memory-bandwidth-demo

@pentschev
Contributor

My 2 cents: I don't think you should warn people who are running CentOS 6 about anything related to CentOS 6. They know they are running an old OS. They have their reasons, and they know new kernel features are what they are missing out on.

I agree, I was just pointing it out as an option.

A runtime check that helps packages give people on modern operating systems the features they deserve would be really appreciated!

That's what I thought; I will work on a PR for that.

@pentschev
Contributor

20+20 cores / 40+40 threads) and 8x32 GB @ 2133 MHz on each CPU, but it's even slower (I guess due to CPU frequency

I doubt this is the root cause. The processor is likely thrashing, invalidating its own cache, and causing cache misses. Most OSes are not tuned, out of the box, to work well on high-memory systems.

I was mentioning frequency as one of the causes of slower allocation, but certainly there will be others, as you duly noted.

Finally, this post shows that a large difference in performance can in fact be obtained by using the correct instructions for large memory copies.
https://github.com/awreece/memory-bandwidth-demo

Thanks for the link, it's really concise work. Perhaps some of those ideas can be integrated in the future to take full advantage of the available hardware, since unfortunately distributed packages may not contain all the auto-generated compiler optimizations one could hope for.

@jjhelmus
Contributor

jjhelmus commented Aug 8, 2019

We (Anaconda) are working on getting 1.17.0 packages out. It may take a few additional days as we work out some issues around building with MKL as the BLAS backend.

@rgommers
Member

rgommers commented Aug 8, 2019

There seem to be multiple BLAS-related build issues; BLIS and ATLAS also seemed to have problems. If it's due to something we changed/broke, perhaps with the introduction of NPY_BLAS_ORDER, please open a new issue for MKL.

@mrocklin
Contributor Author

mrocklin commented Aug 8, 2019

@jjhelmus the ask here isn't "please get 1.17 packages out", it's "when building those packages, please be aware that it would be nice to build with support for these allocation speed improvements".

@mrocklin
Contributor Author

mrocklin commented Aug 8, 2019

(regardless, thanks for managing Numpy packaging)

@hmaarrfk
Contributor

hmaarrfk commented Aug 9, 2019 via email

@jakirkham
Contributor

"when building those packages, please be aware that it would be nice to build with support for these allocation speed improvements"

AIUI this is no longer the ask. Instead, what is being requested is: let's detect the machine's capabilities at runtime and always use these allocation improvements if the machine supports them. IOW, no need to do custom builds of NumPy for this functionality, as it will be there by default. 🙂

@pentschev
Contributor

And just to complement @jakirkham's comment, I think we should try, to the extent possible, to ensure that optimizations are always built in, rather than transferring the responsibility to package maintainers, since it can be difficult to find out about all the existing ones and how to properly enable them at compile time (and even more difficult for both package maintainers and users to verify whether they were properly enabled).
