Improve large array allocation speed #14177
There are probably three things involved here: the OS, memory bandwidth (when initializing to 1), and the numpy/python allocation implementation. How does the speed of `empty`/`zeros` compare? @juliantaylor would know more about this.
They're free to start, but we'll suffer the allocation costs eventually when we write to those pages:

```python
In [1]: import numpy

In [2]: %%time
   ...: a = numpy.zeros(shape=(1000000000), dtype='u1')  # 1 GB
   ...:
CPU times: user 16 µs, sys: 9 µs, total: 25 µs
Wall time: 29.8 µs

In [3]: %%time
   ...: a = numpy.empty(shape=(1000000000), dtype='u1')  # 1 GB
   ...:
CPU times: user 17 µs, sys: 11 µs, total: 28 µs
Wall time: 33.1 µs
```
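A quick way to confirm that the cost is only deferred — a minimal sketch (machine-dependent; the comments describe the expected pattern, not measurements from this thread):

```python
import time
import numpy

size = 10**9  # 1 GB of uint8, same shape as the examples above

t0 = time.perf_counter()
a = numpy.zeros(shape=(size,), dtype='u1')  # returns almost immediately
t1 = time.perf_counter()
a[:] = 1  # first write faults in every page, so the OS pays for them now
t2 = time.perf_counter()

print(f"zeros: {t1 - t0:.6f} s")  # typically microseconds
print(f"write: {t2 - t1:.6f} s")  # typically orders of magnitude longer
```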
I get slightly faster times for the uninitialized allocations, however. This is on a six-year-old desktop.
And perhaps the compiler also; this is on Fedora 30, GCC 9.1.1.
I would love it if I could get that performance :) I'm now very curious what the determining factors are here.
I'd have to check in the BIOS to see how the memory is actually set up. I don't think DDR4 is supported, so I'm not sure any more what is actually in there. It may be overclocked a bit. EDIT: G.SKILL · DDR3 · 8 GB · 1,866 MHz · 240-pin
My guess is that it is a combination of compiler and laptop. IIRC, laptops tend to have slow memory. When I bought my current setup I think common laptops were running at about 1/4 the speed. Power-saving settings can also have a big impact.
I don't think that the memory frequency is the determining factor here. I'm on a desktop:

```python
In [1]: import numpy as np

In [2]: np.__version__
Out[2]: '1.17.0'

In [3]: %%time
   ...: a = np.ones(int(1e9), dtype='u1')
   ...:
CPU times: user 128 ms, sys: 208 ms, total: 336 ms
Wall time: 335 ms
```

I think it may have something to do with the compiler or some Linux configuration. @charris if you watch …
@pentschev Yes.
I wonder if it may have something to do with disk backing of the files in memory. I'm running two SSDs in RAID 0 (yes, I live dangerously :)
(If it is slow) could there also be things about transparent huge pages being disabled? EDIT: There is also a note about madvise in that context.
@seberg I had checked that before, and:

```python
In [3]: %%time
   ...: a = np.ones(int(1e9), dtype='u1')
   ...:
CPU times: user 48.1 ms, sys: 93.5 ms, total: 142 ms
Wall time: 141 ms
```

So it looks like the madvise call at https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/alloc.c#L77-L84 is what makes the difference.
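For readers following along, those lines hint large buffers with `madvise(..., MADV_HUGEPAGE)`. A rough Python analogue of that hint, using ctypes — a sketch only: the constant value 14 for `MADV_HUGEPAGE` comes from the Linux headers and is an assumption here, as is the glibc soname:

```python
import ctypes
import mmap

libc = ctypes.CDLL("libc.so.6", use_errno=True)
MADV_HUGEPAGE = 14  # from the Linux headers; Linux-specific

# 1 GB anonymous mapping, page-aligned as madvise requires
buf = mmap.mmap(-1, 1 << 30)
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))

# Ask the kernel to back this range with transparent huge pages
ret = libc.madvise(ctypes.c_void_p(addr), ctypes.c_size_t(len(buf)), MADV_HUGEPAGE)
print("madvise succeeded" if ret == 0 else "madvise failed")
```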
Not really sure why I had to have …
I should mention that my memory has dual-channel access (two slots out of four used), which might also make a difference. EDIT: Dual-channel memory in two slots, memory is overclocked. Specs for the Asus Z87I-DELUXE motherboard.
@pentschev What does the BIOS say about your memory setup?
I only have a single memory stick installed, so no dual channel in that case. I do have access to dual-CPU Xeon @ 2.20 GHz machines (20+20 cores / 40+40 threads) and 8x32 GB @ 2133 MHz on each CPU, but it's even slower (I guess due to CPU frequency).

Without hugepage:

```python
In [3]: %%time
   ...: a = np.ones(int(1e9), dtype='u1')
   ...:
CPU times: user 68 ms, sys: 376 ms, total: 444 ms
Wall time: 443 ms
```

With hugepage:

```python
In [4]: %%time
   ...: a = np.ones(int(1e9), dtype='u1')
   ...:
CPU times: user 40 ms, sys: 208 ms, total: 248 ms
Wall time: 249 ms
```
And by with/without hugepage I mean the /sys/kernel/mm/transparent_hugepage setting discussed above.
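That is, the standard Linux sysfs knob. A small sketch for inspecting it (and, with root, changing it); the path is the usual Linux location, assumed rather than quoted from the thread:

```python
THP = "/sys/kernel/mm/transparent_hugepage/enabled"

# Read the current mode; the bracketed entry is active,
# e.g. "always [madvise] never" -> 'madvise'.
with open(THP) as f:
    modes = f.read().split()
print(next(m.strip("[]") for m in modes if m.startswith("[")))

# Switching modes requires root; equivalent to:
#   echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
# with open(THP, "w") as f:
#     f.write("madvise")
```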
After some more debugging, I found out that the use of hugepage is heavily dependent on the build. If I build it locally in my environment, I can actually enjoy faster allocation with madvise set, but the release packages from conda-forge or pip (apparently there's no anaconda package for 1.17 yet) don't enable that, and to me it seems that it happens because the following checks didn't pass at compile time:

https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/alloc.c#L28-L33

I'm wondering now: is this something we should enable with a runtime check, depending on what the target environment has available, rather than compile-time checks?

cc @jakirkham
That's a nice discovery.
I was reading more about … and I believe support could be detected at runtime rather than at compile time.

If this sounds like an acceptable solution, I can submit a PR. So please let me know if you agree/disagree with the proposed solution.
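As an illustration of what such a runtime check could look like — a sketch only, not the eventual patch; the 4.6 kernel threshold below is an assumption used for the example:

```python
import os
import re

def kernel_at_least(major, minor):
    """Parse the running kernel version from uname, e.g. '5.4.0-42-generic'."""
    m = re.match(r"(\d+)\.(\d+)", os.uname().release)
    return m is not None and (int(m.group(1)), int(m.group(2))) >= (major, minor)

# Decide when the process runs, instead of when the wheel was compiled:
use_madvise_hugepage = os.name == "posix" and kernel_at_least(4, 6)
print(use_madvise_hugepage)
```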
cc @jjhelmus from Anaconda
cc @hmaarrfk (who may also have interest in this)
Definitely wondering why NumPy was still slow even though that patch was accepted (I think in time for 1.16).
My 2 cents: I don't think you should warn people who are running CentOS 6 about anything related to CentOS 6. They know they are running an old OS, they have their reasons, and they know that new kernel features are what they are missing out on. A runtime check that helps packages give people on modern operating systems the features they deserve would be really appreciated!
I doubt this is the root cause. The processor is likely thrashing, invalidating its own cache and causing cache misses. Most OSes are not configured out of the box to work well on high-memory systems. Finally, this blog post shows that a large difference in performance can in fact be obtained by using the correct instructions for large memory copies.
I agree, I was just pointing it out as an option.

That's what I thought; I will work on a PR for that.
I was mentioning frequency as one of the causes of slower allocation, but certainly there will be others, as you duly noted.

Thanks for the link, it's really concise work. Perhaps some of those ideas can be integrated in the future to take full advantage of available hardware, since unfortunately distributed packages may not contain all the compiler optimizations one could hope for.
We (Anaconda) are working on getting 1.17.0 packages out. It may take a few additional days as we work out some issues around building with MKL as the BLAS backend.
There seem to be multiple BLAS-related build issues; BLIS and ATLAS also seemed to have problems. If it's due to something we changed/broke, perhaps with the introduction of …
@jjhelmus the ask here isn't "please get 1.17 packages out", it's "when building those packages, please be aware that it would be nice to build with support for these allocation speed improvements".
(Regardless, thanks for managing NumPy packaging.)
There's a way to package `manylinux20XX`, where `20XX` can be any year. Maybe the infrastructure should start to support this?
AIUI this is no longer the ask. Instead what is being requested is: let's detect the machine's capabilities at runtime and always use these allocation improvements if the machine supports them. IOW no need to do custom builds of NumPy for this functionality, as it will be there by default. 🙂
And just to complement @jakirkham's comment: I think we should try, to the extent possible, to ensure that optimizations are always built in, rather than transferring the responsibility to package maintainers. It can be difficult to find out about all the existing optimizations and how to properly enable them at compile time, and even more difficult to verify whether they were properly enabled, for package maintainers and users alike.
I would like to be able to allocate NumPy arrays quickly. Currently, calling `numpy.ones` runs at about 2-3 GB/s on consumer laptops (tested a 2018 MacBook Pro and a ThinkPad Carbon 4th gen with an i7). My general intuition is that I should be able to allocate memory more quickly than this. Is that true? Is there anything that can be done here?
Reproducing code example:
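The original snippet was lost in extraction; below is a minimal sketch of the kind of benchmark the report describes, assuming the same 1 GB `u1` allocation used in the timings above:

```python
import time
import numpy as np

size = int(1e9)  # 1 GB of uint8

t0 = time.perf_counter()
a = np.ones(size, dtype='u1')  # ones must actually touch every page
elapsed = time.perf_counter() - t0

# Effective initialization bandwidth; ~2-3 GB/s was reported on laptops
print(f"{size / elapsed / 1e9:.2f} GB/s")
```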
Numpy/Python version information: