Request 64-byte aligned memory instead of 16-byte aligned memory for large objects #15139
Conversation
```c
    void *_padding[8 - 4];
#else
    void *_padding[16 - 4];
#endif
```
Took this from here; it's necessary for ensuring that the GC header doesn't accidentally offset data from 64-byte alignment. Is there any value to keeping these values as `8 - 4` and `16 - 4`, or should I replace them with `4` and `12` respectively?
are these derived from some of the defines?
@tkelman Sorry, I missed your comment the first time around. The short answer is no; they're derived from the pointer size plus the number of other pointers in this struct. I've decided the best way to make this clear is to just add comments that explain the padding.
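To illustrate the derivation, here's a minimal self-contained sketch of the commented padding; the four leading field names and the `UINTPTR_MAX` test are stand-ins for illustration, not the exact code in src/gc.c:

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_BYTE_ALIGNMENT 64

/* Illustrative stand-in for the real big-object header in src/gc.c:
   the four leading fields are hypothetical; only the padding
   arithmetic is the point. */
typedef struct bigval_t {
    struct bigval_t *next;
    struct bigval_t **prev;
    size_t sz;
    uintptr_t age;
#if UINTPTR_MAX == UINT64_MAX
    /* 64-bit: a 64-byte cache line holds 8 pointer-sized slots,
       4 of which are used by the fields above */
    void *_padding[8 - 4];
#else
    /* 32-bit: a 64-byte cache line holds 16 pointer-sized slots,
       4 of which are used by the fields above */
    void *_padding[16 - 4];
#endif
    /* the object payload begins here, still 64-byte aligned */
} bigval_t;

/* C11 check that the header occupies a whole number of cache lines */
_Static_assert(sizeof(bigval_t) % CACHE_BYTE_ALIGNMENT == 0,
               "big-object header must preserve 64-byte alignment");
```

So `8 - 4` and `16 - 4` read as "pointer slots per cache line, minus the slots already consumed by the header fields", which is why keeping the subtraction is clearer than writing `4` and `12`.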
Force-pushed from 2cdcaec to 7433fd1.
Is this necessary, or should it be smaller, on 32-bit?
I believe this change applies equally to both 64-bit and 32-bit systems, since the cache line size is 64 bytes on both 32-bit and 64-bit x86 processors, and 32-bit processors that implement SSE include 128-bit-wide XMM registers (I think).
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels
The size we care about here is the size of the SIMD instructions, which does not really depend on the mode of the CPU but does depend on the available instruction set. In principle, for all currently released general-purpose CPUs, the alignment we care about is only 32 bytes (and this does matter a lot on Haswell with AVX2). However, given that 64 bytes is the size of the cache line and AVX-512 is hopefully on the horizon, providing 64-byte alignment is a little more future-proof.
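As an aside, a small standalone C sketch (POSIX-only, using `posix_memalign` rather than Julia's allocator) of why a single 64-byte-aligned request covers all of the vector widths mentioned above:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    void *p;
    /* 64-byte alignment subsumes the 16-byte (SSE), 32-byte (AVX/AVX2),
       and 64-byte (AVX-512) load alignments, and matches the x86
       cache-line size. */
    if (posix_memalign(&p, 64, 4096) != 0)
        return 1;
    printf("offset within a cache line: %u\n",
           (unsigned)((uintptr_t)p % 64));  /* prints 0 */
    free(p);
    return 0;
}
```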
Why can't this be architecture-dependent? The Blue Gene/Q has 128-byte cache lines, and I hear that some Nvidia GPUs also use 128 bytes. Aligning to a cache line also avoids false sharing; however, that doesn't really help if only large objects are aligned.
Here "large" is still relatively small (>= 2k), and aligning smaller things to a cache-line boundary seems somewhat wasteful since we have homogeneous memory pools, except for a couple of special cases where the size of the cell (tag + object) is exactly a cache line. I don't think memory on the GPU would be allocated by the GC anyway, but sure, if you have a practical need for Julia on an arch with a different cache-line size, we can make it a compile-time choice.
Also, aligning things to a cache line should be good as long as they are a multiple of your element size, even if you're not using 512-bit vectors, since I'm pretty sure most microarchitectures have a penalty for loads that cross cache-line boundaries.
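To make the split-load cost concrete, here's a small counting sketch (illustrative, not from the PR) that tallies how many consecutive 32-byte vector loads straddle a 64-byte line for each possible 16-byte-aligned base offset:

```c
#include <stdio.h>

int main(void) {
    /* A 32-byte load starting at byte `start` within a 64-byte line
       crosses into the next line iff start + 32 > 64. */
    for (int base = 0; base < 64; base += 16) {
        int splits = 0, total = 16;
        for (int i = 0; i < total; i++) {
            int start = (base + 32 * i) % 64;
            if (start + 32 > 64)
                splits++;
        }
        printf("base offset %2d: %2d/%d 32-byte loads cross a line\n",
               base, splits, total);
    }
    return 0;
}
```

Bases at offsets 16 and 48 (16-byte aligned but not 32-byte aligned) split half their loads, while 64-byte-aligned bases split none.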
LGTM (assuming you've studied the effect on array/SIMD benchmarks). There should probably be a test to check the alignment of a medium-sized array. A note about the effect of this on benchmarks: as @carnaval pointed out, this should mainly affect tight SIMD loops on medium-sized arrays (ones that fit in cache), where the cache-line split is significant compared to the computation and the loop is not memory-bandwidth limited.
Force-pushed from 7433fd1 to f772c9a.
Can we backport this?
I'll leave that up to @tkelman, but keep in mind I've only tested the performance of this against master. We'd probably want to also test it against v0.4 if we're thinking of backporting it.
Shall I merge this?
After adding a test?
Rebased and added the test (sorry that took so long, first day back in the office since my own personal flupocalypse). I'll merge this after CI passes if there are no other concerns.
Request 64-byte aligned memory instead of 16-byte aligned memory for large objects
Prompted by discussion with @yuyichao and @carnaval, this is essentially a revival of this older branch.
This should align the requested memory with cache lines, thus improving register loads (and by extension, SIMD performance). We'll still request 16-byte aligned memory unless the object/array is "large enough" (determined by what the GC considers "big" or, in the case of arrays, a comparison with `ARRAY_INLINE_NBYTES`).

This change also replaces a bunch of magic alignment numbers with named constants (`SMALL_BYTE_ALIGNMENT` for 16-byte aligned memory and `CACHE_BYTE_ALIGNMENT` for 64-byte aligned memory).

This is my first time poking around C code in a real-world context, so apologies in advance for any silly mistakes.
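A rough sketch of the decision described above; the two constant names come from the PR, but the cutoff value and the helper function are hypothetical, for illustration only:

```c
#include <stddef.h>

#define SMALL_BYTE_ALIGNMENT 16  /* default alignment, as named in the PR */
#define CACHE_BYTE_ALIGNMENT 64  /* cache-line alignment, as named in the PR */

/* Hypothetical cutoff standing in for the GC's "big" threshold and the
   ARRAY_INLINE_NBYTES comparison; the real values live in Julia's source. */
#define LARGE_OBJECT_CUTOFF 2048

/* Pick the alignment for an allocation of nbytes. */
static size_t choose_alignment(size_t nbytes) {
    return nbytes >= LARGE_OBJECT_CUTOFF ? CACHE_BYTE_ALIGNMENT
                                         : SMALL_BYTE_ALIGNMENT;
}
```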
`runbenchmarks(ALL, vs = "JuliaLang/julia:master")`