
Request 64-byte aligned memory instead of 16-byte aligned memory for large objects #15139

Merged: 2 commits from jr/alignment into master, Mar 9, 2016

Conversation

@jrevels (Member) commented Feb 18, 2016

Prompted by discussion with @yuyichao and @carnaval, this is essentially a revival of this older branch.

This should align the requested memory with cache lines, thus improving register loads (and by extension, SIMD performance). We'll still request 16-byte aligned memory unless the object/array is "large enough" (determined by what the GC considers "big" or, in the case of arrays, a comparison with ARRAY_INLINE_NBYTES).

This change also replaces a bunch of magic alignment numbers with named constants (SMALL_BYTE_ALIGNMENT for 16-byte aligned memory and CACHE_BYTE_ALIGNMENT for 64-byte aligned memory).
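
Roughly, the selection amounts to something like the following sketch; the constant names and values match the PR, but the threshold check itself is illustrative rather than the actual GC code:

```c
#include <stddef.h>

#define SMALL_BYTE_ALIGNMENT 16  /* historical default */
#define CACHE_BYTE_ALIGNMENT 64  /* one cache line */

/* Sketch: pick the alignment for an allocation of `nbytes`, given the
 * threshold at which the GC considers an object "big". */
static size_t alloc_alignment(size_t nbytes, size_t big_threshold)
{
    return nbytes >= big_threshold ? CACHE_BYTE_ALIGNMENT
                                   : SMALL_BYTE_ALIGNMENT;
}
```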

This is my first time poking around C code in a real-world context, so apologies in advance for any silly mistakes.

runbenchmarks(ALL, vs = "JuliaLang/julia:master")

#ifdef _P64
void *_padding[8 - 4];   // pad header to one 64-byte cache line (8 pointers x 8 bytes)
#else
void *_padding[16 - 4];  // pad header to one 64-byte cache line (16 pointers x 4 bytes)
#endif
@jrevels (Member, Author)

Took this from here; it's necessary for ensuring that the GC header doesn't accidentally offset data from 64-byte alignment. Is there any value in keeping these values as 8 - 4 and 16 - 4, or should I replace them with 4 and 12, respectively?

@tkelman (Contributor)

Are these derived from some of the defines?

@jrevels (Member, Author)

@tkelman Sorry, I missed your comment the first time around. The short answer is no; they're derived from the pointer size plus the number of other pointers in this struct. I've decided the best way to make this clear is to just add comments that explain the padding.
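
For illustration, here's a hedged stand-in (not the real GC header) showing why those expressions come out to exactly one cache line; `_P64` is Julia's pointers-are-8-bytes macro:

```c
typedef struct {
    void *a, *b, *c, *d;        /* the 4 existing pointer fields */
#ifdef _P64
    void *_padding[8 - 4];      /* 8 pointers x 8 bytes = 64 bytes */
#else
    void *_padding[16 - 4];     /* 16 pointers x 4 bytes = 64 bytes */
#endif
} header_t;

/* The header fills exactly one 64-byte cache line, so the data that
 * follows it stays cache-line aligned. */
_Static_assert(sizeof(header_t) == 64, "header must fill one cache line");
```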

@tkelman (Contributor) commented Feb 18, 2016

Is this necessary, or should it be smaller, on 32 bit?

@jrevels (Member, Author) commented Feb 18, 2016

I believe this change applies equally to both 64-bit and 32-bit systems, since the cache line size is standardized at 64 bytes for both 32-bit and 64-bit processors, and 32-bit processors that implement SSE include 128-bit-wide XMM registers (I think).

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@yuyichao (Contributor)

The size we care about here is the width of the SIMD instructions, which does not really depend on the mode of the CPU (32- vs 64-bit) but does depend on the available instruction set.

In principle, for all currently released general-purpose CPUs, the alignment we care about is only 32 bytes (and this does matter a lot on Haswell with AVX2). However, given that 64 bytes is the size of a cache line and AVX-512 is hopefully on the horizon, providing 64-byte alignment is a little more future-proof.
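
For illustration (not code from this PR): with AVX/AVX2, the aligned-load intrinsic requires 32-byte alignment, which cache-line alignment satisfies for free:

```c
#include <immintrin.h>

/* _mm256_load_pd faults if p is not 32-byte aligned; a 64-byte-aligned
 * pointer is automatically fine, and the load never splits a cache line. */
__m256d load4(const double *p)
{
    return _mm256_load_pd(p);
}
```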

@eschnett (Contributor)

Why can't this be architecture dependent?

The Blue Gene/Q has 128-byte cache lines. I hear that some Nvidia GPUs also use 128 bytes.

Aligning to a cache line also avoids false sharing. However, that doesn't really help if only large objects are aligned.
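
As a hedged aside, the classic false-sharing fix looks like this in C11: give each concurrently written datum its own cache line so writers don't contend:

```c
#include <stdalign.h>

typedef struct {
    alignas(64) long count;  /* each counter gets its own 64-byte line */
} padded_counter_t;          /* sizeof == 64: the compiler pads the rest */
```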

@carnaval (Contributor)

Here "large" is still relatively small (>= 2k) and aligning smaller things to cache line boundary seems somewhat wasteful since we have homogeneous memory pools, except for a couple special cases where the size of the cell (tag + object) is exactly a cache line.

I don't think memory on the GPU would be allocated by the GC anyway, but sure, if you have a practical need for Julia on an arch with a different cache line size, we can make it a compile-time choice.

@carnaval (Contributor)

Also, aligning things to a cache line should be good as long as they are a multiple of your element size, even if not using 512-bit vectors, since I'm pretty sure most microarchitectures have a penalty for loads that cross cache-line boundaries.
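
A small sketch of that point (illustrative, not from the PR): a load straddles a line exactly when its first and last bytes fall in different 64-byte blocks, which can't happen for element-multiple offsets from a 64-byte-aligned base:

```c
#include <stddef.h>
#include <stdint.h>

/* Does a load of `width` bytes starting at `addr` cross a 64-byte line? */
static int crosses_cache_line(uintptr_t addr, size_t width)
{
    return (addr / 64) != ((addr + width - 1) / 64);
}
```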

@yuyichao (Contributor)

LGTM (assuming you've studied the effect on array/SIMD benchmarks). There should probably be a test to check the alignment of a medium-sized array.

A note about the effect of this on benchmarks: as @carnaval pointed out, this should mainly affect tight SIMD loops on medium-sized arrays (ones that fit in cache), where the loop is not memory-bandwidth limited and the cache-line split is significant compared to the calculation.
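
A hedged sketch of what such a test could check, written against Julia's C API (`jl_alloc_array_1d` and `jl_array_data` are real API names; the pre-built `array_f64_type` argument is a hypothetical stand-in):

```c
#include <assert.h>
#include <stdint.h>
#include <julia.h>

/* Allocate a medium-sized array and check that its data pointer is
 * cache-line aligned. `array_f64_type` is assumed to be Array{Float64,1}. */
void test_medium_array_alignment(jl_value_t *array_f64_type)
{
    jl_array_t *a = jl_alloc_array_1d(array_f64_type, 1 << 12);
    assert(((uintptr_t)jl_array_data(a)) % 64 == 0);
}
```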

@ViralBShah (Member)

Can we backport this?

@jrevels (Member, Author) commented Feb 26, 2016

> Can we backport this?

I'll leave that up to @tkelman, but keep in mind I've only tested the performance of this against master. We'd probably want to also test it against v0.4 if we're thinking of backporting it.

@jrevels (Member, Author) commented Mar 3, 2016

Shall I merge this?

@yuyichao (Contributor) commented Mar 3, 2016

After adding a test?

@jrevels (Member, Author) commented Mar 9, 2016

Rebased and added the test (sorry that took so long, first day back in the office since my own personal flupocalypse). I'll merge this after CI passes if there are no other concerns.

jrevels added a commit that referenced this pull request Mar 9, 2016
Request 64-byte aligned memory instead of 16-byte aligned memory for large objects
@jrevels merged commit ad101c9 into master on Mar 9, 2016
@jrevels deleted the jr/alignment branch on March 9, 2016 at 22:18