Request 64-byte aligned memory instead of 16-byte aligned memory for large objects #15139
Conversation
```c
    void *_padding[8 - 4];
#else
    void *_padding[16 - 4];
#endif
```
Took this from here; it's necessary for ensuring that the GC header doesn't accidentally offset data from 64-byte alignment. Is there any value to keeping these values as `8 - 4` and `16 - 4`, or should I replace them with `4` and `12` respectively?
are these derived from some of the defines?
@tkelman Sorry, I missed your comment the first time around. The short answer is no; they're derived from the pointer size plus the number of other pointers in this struct. I've decided the best way to make this clear is to just add comments that explain the padding.
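To illustrate the derivation, here's a minimal self-contained sketch of the commented padding; the four leading field names and the `UINTPTR_MAX` test are stand-ins for illustration, not the exact code in src/gc.c:

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_BYTE_ALIGNMENT 64

/* Illustrative stand-in for the real big-object header in src/gc.c:
   the four leading fields are hypothetical; only the padding
   arithmetic is the point. */
typedef struct bigval_t {
    struct bigval_t *next;
    struct bigval_t **prev;
    size_t sz;
    uintptr_t age;
#if UINTPTR_MAX == UINT64_MAX
    /* 64-bit: a 64-byte cache line holds 8 pointer-sized slots,
       4 of which are used by the fields above */
    void *_padding[8 - 4];
#else
    /* 32-bit: a 64-byte cache line holds 16 pointer-sized slots,
       4 of which are used by the fields above */
    void *_padding[16 - 4];
#endif
    /* the object payload begins here, still 64-byte aligned */
} bigval_t;

/* C11 check that the header occupies a whole number of cache lines */
_Static_assert(sizeof(bigval_t) % CACHE_BYTE_ALIGNMENT == 0,
               "big-object header must preserve 64-byte alignment");
```

So `8 - 4` and `16 - 4` read as "pointer slots per cache line, minus the slots already consumed by the header fields", which is why keeping the subtraction is clearer than writing `4` and `12`.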
Force-pushed from 2cdcaec to 7433fd1.
Is this necessary, or should it be smaller, on 32-bit?
I believe this change applies equally to both 64-bit and 32-bit systems, since the cache line size is 64 bytes on both 32-bit and 64-bit x86 processors, and 32-bit processors that implement SSE include 128-bit-wide XMM registers (I think).
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels
The size we care about here is the size of the SIMD instructions, which does not really depend on the mode of the CPU but does depend on the available instruction set. In principle, for all currently released general-purpose CPUs, the alignment we care about is only 32 bytes (and this does matter a lot on Haswell with AVX2). However, given that 64 bytes is the size of the cache line and AVX-512 is hopefully on the horizon, providing 64-byte alignment is a little more future-proof.
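As an aside, a small standalone C sketch (POSIX-only, using `posix_memalign` rather than Julia's allocator) of why a single 64-byte-aligned request covers all of the vector widths mentioned above:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    void *p;
    /* 64-byte alignment subsumes the 16-byte (SSE), 32-byte (AVX/AVX2),
       and 64-byte (AVX-512) load alignments, and matches the x86
       cache-line size. */
    if (posix_memalign(&p, 64, 4096) != 0)
        return 1;
    printf("offset within a cache line: %u\n",
           (unsigned)((uintptr_t)p % 64));  /* prints 0 */
    free(p);
    return 0;
}
```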
Why can't this be architecture-dependent? The Blue Gene/Q has 128-byte cache lines, and I hear that some Nvidia GPUs also use 128 bytes. Aligning to a cache line also avoids false sharing; however, that doesn't really help if only large objects are aligned.
Here "large" is still relatively small (>= 2k), and aligning smaller things to a cache-line boundary seems somewhat wasteful since we have homogeneous memory pools, except for a couple of special cases where the size of the cell (tag + object) is exactly a cache line. I don't think memory on the GPU would be allocated by the GC anyway, but sure, if you have a practical need for Julia on an arch with a different cache-line size, we can make it a compile-time choice.
Also, aligning things to a cache line should be good as long as they are a multiple of your element size, even if you're not using 512-bit vectors, since I'm pretty sure most microarchitectures have a penalty for loads that cross cache-line boundaries.
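To make the split-load cost concrete, here's a small counting sketch (illustrative, not from the PR) that tallies how many consecutive 32-byte vector loads straddle a 64-byte line for each possible 16-byte-aligned base offset:

```c
#include <stdio.h>

int main(void) {
    /* A 32-byte load starting at byte `start` within a 64-byte line
       crosses into the next line iff start + 32 > 64. */
    for (int base = 0; base < 64; base += 16) {
        int splits = 0, total = 16;
        for (int i = 0; i < total; i++) {
            int start = (base + 32 * i) % 64;
            if (start + 32 > 64)
                splits++;
        }
        printf("base offset %2d: %2d/%d 32-byte loads cross a line\n",
               base, splits, total);
    }
    return 0;
}
```

Bases at offsets 16 and 48 (16-byte aligned but not 32-byte aligned) split half their loads, while 64-byte-aligned bases split none.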
LGTM (assuming you've studied the effect on array/SIMD benchmarks). There should probably be a test to check the alignment of a medium-sized array. A note about the effect of this on benchmarks: as @carnaval pointed out, this should mainly affect tight SIMD loops on medium-sized arrays (ones that fit in cache), where the cache-line split is significant compared to the computation and the loop is not memory-bandwidth limited.
Force-pushed from 7433fd1 to f772c9a.
Can we backport this?
I'll leave that up to @tkelman, but keep in mind I've only tested the performance of this against master. We'd probably want to also test it against v0.4 if we're thinking of backporting it.
Shall I merge this?
After adding a test?
Rebased and added the test (sorry that took so long, first day back in the office since my own personal flupocalypse). I'll merge this after CI passes if there are no other concerns.
Request 64-byte aligned memory instead of 16-byte aligned memory for large objects
Prompted by discussion with @yuyichao and @carnaval, this is essentially a revival of this older branch.
This should align the requested memory with cache lines, thus improving register loads (and by extension, SIMD performance). We'll still request 16-byte aligned memory unless the object/array is "large enough" (determined by what the GC considers "big" or, in the case of arrays, a comparison with `ARRAY_INLINE_NBYTES`).

This change also replaces a bunch of magic alignment numbers with named constants (`SMALL_BYTE_ALIGNMENT` for 16-byte aligned memory and `CACHE_BYTE_ALIGNMENT` for 64-byte aligned memory).

This is my first time poking around C code in a real-world context, so apologies in advance for any silly mistakes.
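A rough sketch of the decision described above; the two constant names come from the PR, but the cutoff value and the helper function are hypothetical, for illustration only:

```c
#include <stddef.h>

#define SMALL_BYTE_ALIGNMENT 16  /* default alignment, as named in the PR */
#define CACHE_BYTE_ALIGNMENT 64  /* cache-line alignment, as named in the PR */

/* Hypothetical cutoff standing in for the GC's "big" threshold and the
   ARRAY_INLINE_NBYTES comparison; the real values live in Julia's source. */
#define LARGE_OBJECT_CUTOFF 2048

/* Pick the alignment for an allocation of nbytes. */
static size_t choose_alignment(size_t nbytes) {
    return nbytes >= LARGE_OBJECT_CUTOFF ? CACHE_BYTE_ALIGNMENT
                                         : SMALL_BYTE_ALIGNMENT;
}
```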
`runbenchmarks(ALL, vs = "JuliaLang/julia:master")`