
Make hash table memory layout SIMD-independent #16

Merged

Conversation

@michaelwoerister commented Sep 20, 2021

This PR makes sure that the memory layout of the hash table does not depend on whether SIMD is available on a platform. Previously, the size of the metadata array was entry_count + GROUP_SIZE, where GROUP_SIZE depends on how many metadata entries we can examine at once, which in turn depends on whether SIMD is present.

This PR changes the layout to use the platform- and SIMD-independent REFERENCE_GROUP_SIZE, so that the metadata array always has the same size everywhere.

This should fix the issue reported in rust-lang/rust#89085.
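To make the mechanism concrete, here is a minimal sketch of the before/after (the cfg conditions and constant values are illustrative assumptions, not the crate's exact code):

```rust
// Before: GROUP_SIZE depended on which group-query implementation was
// compiled in, so the serialized table size differed between platforms.
#[cfg(all(target_arch = "x86_64", target_feature = "sse2", not(feature = "no_simd")))]
const GROUP_SIZE: usize = 16; // one SSE2 register's worth of metadata bytes

#[cfg(not(all(target_arch = "x86_64", target_feature = "sse2", not(feature = "no_simd"))))]
const GROUP_SIZE: usize = 8; // scalar fallback operates on u64 words

// After: the on-disk layout is derived from a fixed constant instead,
// so the metadata array has the same size on every platform.
const REFERENCE_GROUP_SIZE: usize = 16;

fn metadata_array_len(entry_count: usize) -> usize {
    entry_count + REFERENCE_GROUP_SIZE // previously: entry_count + GROUP_SIZE
}
```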

@michaelwoerister force-pushed the fix-simd-vs-no-simd-group-sizes branch 3 times, most recently from 739a8b9 to 9ff30c6 on September 20, 2021 at 10:20
@michaelwoerister force-pushed the fix-simd-vs-no-simd-group-sizes branch from 541bed1 to 91c66ce on September 20, 2021 at 11:31
@michaelwoerister changed the title from [WIP] Fix target-dependent encoding to Make hash table memory layout SIMD-independent on Sep 20, 2021
@michaelwoerister marked this pull request as ready for review on September 20, 2021 at 12:03
@shepmaster

> whether SIMD is available on a platform

I am just passing by here, but both x86_64 and Apple Silicon have SIMD instructions. Why would this cause the problem that was reported?

@michaelwoerister force-pushed the fix-simd-vs-no-simd-group-sizes branch from f62aaee to 920ce12 on September 20, 2021 at 12:28
@michaelwoerister

Right, SIMD might not be the correct term: it's SSE2 specifically that we are checking for.

@shepmaster

> SIMD might not be the correct term

I expect that non-x86 architectures will become more prevalent in the near- and middle-term, so it may be worth avoiding conflating the two ideas the next time that code is touched.

@hkratz commented Sep 20, 2021

> Right, SIMD might not be the correct term: it's SSE2 specifically that we are checking for.

I don't really understand. When cross-compiling to ARM, the target_arch is aarch64 and sse2 should not be in the target_feature list. Otherwise, a whole lot of other things would break as well, IMHO.

@michaelwoerister

The problem comes from cross-compiling the standard library. We use an x86_64 compiler (which has SSE2) to generate the crate metadata that ends up in the AArch64 standard library RLIBs. This crate metadata contains odht hash tables, which are then decoded by an AArch64 compiler (which does not have SSE2) when that standard library is used to compile something.
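Put differently (a hypothetical sketch; the constant values are assumptions): odht runs inside the compiler, so its GROUP_SIZE reflected the architecture of the rustc binary itself rather than the compilation target, and a table written by one compiler could fail a layout check in the other:

```rust
// x86_64 rustc (SSE2):    GROUP_SIZE == 16 => writes entry_count + 16 metadata bytes
// aarch64 rustc (scalar): GROUP_SIZE == 8  => expects entry_count + 8 metadata bytes
//
// Any size validation of this shape then rejects the cross-compiled table:
fn layout_matches(metadata_len: usize, entry_count: usize, group_size: usize) -> bool {
    metadata_len == entry_count + group_size
}
```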

@wesleywiser left a comment

LGTM and the rest of the code seems to use REFERENCE_GROUP_SIZE in the appropriate places as well.

@michaelwoerister

> I expect that non-x86 architectures will become more prevalent in the near- and middle-term, so it may be worth avoiding conflating the two ideas the next time that code is touched.

As the term SIMD is used in this context, it is actually correct. The crate has a no_simd feature which force-disables the use of any SIMD instructions. The fix in this PR also makes the table layout independent of whether any form of SIMD is used. It just happens that SSE2 is currently the only form of SIMD supported.

So my comment from before should actually have been something like: some platforms have a specialized SIMD-enabled implementation, and prior to this PR the hash table memory layout depended on whether that implementation was used, even though the memory layout is required to be the same on all platforms, independent of whether any form of SIMD is used.

@workingjubilee

Are these given 16-byte alignment when marshaled into memory?

@michaelwoerister

@workingjubilee Can you elaborate on what you mean?

@eddyb commented Sep 21, 2021

I'm guessing there's a risk of baking the SIMD alignment (which is 16 bytes for SSE2, as opposed to a maximum of 8 for anything not involving a SIMD vector type) into the shape of the on-disk data.

However, it looks like the problem was more with a count (like a "throughput factor"?), so I don't think it's likely that alignment factors into it. (OTOH, if there are SIMD loads straight from memory-mapped storage, they would all have to be unaligned ones, but that's a separate concern.)

EDIT: oh yeah, this is pretty straightforward and byte-oriented:

```rust
impl GroupQuery {
    #[inline]
    pub fn from(group: &[u8; GROUP_SIZE], h2: u8) -> GroupQuery {
        assert!(std::mem::size_of::<x86::__m128i>() == GROUP_SIZE);
        unsafe {
            let group = x86::_mm_loadu_si128(group as *const _ as *const x86::__m128i);
            // ... (rest of the quoted snippet truncated)
        }
    }
}
```

@michaelwoerister

Yeah, these are all unaligned loads.

@workingjubilee

Yeah, I was mostly just glancing over the code and wasn't sure what the alignments wound up as.

My understanding is that it's somewhat preferable to just always align to 16 bytes in memory if you expect to use SIMD sometimes, as you don't pay a huge penalty for such an over-alignment while in motion, but it doesn't matter when the data is at rest on disk because the first pulls from disk will be somewhat slow anyways. Higher alignments (e.g. 32 bytes for _mm256) don't truly matter because the unaligned load/store penalty evaporates on AVX with VEX prefixes, and other higher-than-16-byte-SIMD processors don't require alignment, but SSE, AltiVec, and Neon all prefer 16-byte alignment to greater or lesser degrees.
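For reference, here is the aligned-vs-unaligned distinction under discussion, sketched with the stable core::arch intrinsics (illustrative only; not the crate's code):

```rust
#[cfg(target_arch = "x86_64")]
fn load_group(bytes: &[u8; 16]) -> core::arch::x86_64::__m128i {
    use core::arch::x86_64::*;
    // _mm_load_si128 requires a 16-byte-aligned pointer and may fault on
    // an unaligned one; _mm_loadu_si128 accepts any alignment, which is
    // why the GroupQuery snippet above uses the latter.
    unsafe { _mm_loadu_si128(bytes.as_ptr() as *const __m128i) }
}
```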

@eddyb commented Sep 21, 2021

I suspect these "groups" are always at an offset multiple of 16 in a larger byte array - could that byte array be aligned as a whole? Seems like it shouldn't have much of an overhead.

I'm actually curious now what the impact would be; we can probably run perf with rustc depending on a patched version of odht.

@michaelwoerister

Unfortunately, a group can start at an arbitrary index, i.e. it will start at hash % table_size. That's just how SwissTable works.
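A hypothetical sketch of that indexing (not the crate's actual code) shows why groups cannot all be aligned:

```rust
const GROUP_SIZE: usize = 16;

// The probe sequence starts wherever the hash lands; this can be any
// index, not necessarily a multiple of GROUP_SIZE.
fn probe_start(hash: u64, slot_count: usize) -> usize {
    (hash as usize) % slot_count
}

// Groups are overlapping windows into the metadata array; the trailing
// padding bytes let a window that starts near the last slot read past
// it without going out of bounds.
fn first_group(metadata: &[u8], hash: u64, slot_count: usize) -> &[u8] {
    let start = probe_start(hash, slot_count);
    &metadata[start..start + GROUP_SIZE]
}
```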

@eddyb commented Sep 22, 2021

Ahh, my bad, I was expecting groups to be chunks, but I guess they're windows (of buckets); that makes sense now.

@michaelwoerister

Yes, "window" is a more fitting better term. I used groups because that's what hashbrown calls them (not sure about the original SwissTable implementation).

@workingjubilee commented Sep 23, 2021

Unfortunately, on some processors unaligned SIMD loads and stores are severely pessimized. Once upon a time, this included x86-64. For AArch64, the major "second target", however, the penalties are relatively light and narrow, applying only to:

- load operations that cross a cache-line (64-byte) boundary
- store operations that cross a 16-byte boundary

So that's something to take into account. Probably not a serious concern, fwiw, since it mostly means a pessimization to scalar at worst.
