Seed HashMaps thread-locally, straight from the OS. #31356
Conversation
r? @brson (rust_highfive has picked a reviewer for you, use r? to override)
cc @rust-lang/libs
I only had a few minutes to write this, and wanted to get the conversation started if there is one. The next step will be a patch that moves the rest of the private …
Can you expand on what this means and why it is? I don't see why changing the source of randomness would affect the behavior.
    let mut bytes = [0_u8; 16];
    internal_rand::fill_bytes(&mut bytes);

    let keys: (u64, u64) = unsafe { mem::transmute(bytes) };
Weird transmute. Is this really ok? Seems like the alignment of the u8 array is probably less than the u64 tuple.
Agreed, should start with keys, and transmute keys to bytes before passing them to internal_rand.
A transmute like this should work fine: it's reinterpreting the value, not the memory location. That is, `let out = transmute::<A, B>(x)` is/should be equivalent to (in C):

    B out;
    memcpy(&out, &x, sizeof(B));

It would definitely be problematic if it was `&[u8; 16]` to `&(u64, u64)`.

(If the above is not the case, it seems to me like it is an unnecessary footgun.)
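To make the value-vs-reference distinction concrete, here is a small sketch (not code from this PR); the commented-out line is the variant that would be alignment-unsafe:

    use std::mem;

    fn main() {
        let bytes = [0_u8; 16];

        // By value: the 16 bytes are copied into a new, properly aligned
        // (u64, u64), so the alignment of the source array doesn't matter.
        let keys: (u64, u64) = unsafe { mem::transmute::<[u8; 16], (u64, u64)>(bytes) };
        println!("{:?}", keys);

        // By reference: this would reinterpret the original array in place,
        // so the resulting &(u64, u64) could point at 1-byte-aligned memory.
        // let keys_ref: &(u64, u64) = unsafe { mem::transmute(&bytes) };
    }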
(Do we guarantee tuple layout?)
No, but there's no way for that to matter here. If the transmute succeeds then you've loaded up 128 bits with randomness, and that's all that matters.
I am intrigued by the by-val transmute issue... I don't think I see a lot of transmuting of non-pointers... I have no idea how it should behave!
Adding padding would make the tuple larger than 16 bytes, and `transmute` requires the source and destination types to have the same size, so the transmute would fail to compile. Also, note that this is in the standard library, and hence can rely on current behaviours of the compiler.
@gankro You shouldn't see many transmutes of pointers either, since those are covered by regular casts (and `&*` to create a reference).
Maybe a more natural way to write this is to create a `[u64; 2]`, take a slice of it as `&mut [u8]` (its byte representation), and pass that to `fill_bytes`?
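For reference, a sketch of that alternative (with a stand-in for the `fill_bytes` function under discussion): it starts from the `u64`s and only ever views them as bytes, which is the safe direction alignment-wise.

    use std::slice;

    // Stand-in for the internal OS randomness function in this PR.
    fn fill_bytes(buf: &mut [u8]) {
        for (i, b) in buf.iter_mut().enumerate() {
            *b = i as u8; // placeholder "randomness"
        }
    }

    fn main() {
        let mut keys = [0_u64; 2];
        {
            // View the two u64s as their 16-byte representation.
            let bytes: &mut [u8] =
                unsafe { slice::from_raw_parts_mut(keys.as_mut_ptr() as *mut u8, 16) };
            fill_bytes(bytes);
        }
        println!("{:016x} {:016x}", keys[0], keys[1]);
    }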
> That is, `let out = transmute::<A, B>(x)` is/should be equivalent to (in C) ... (memcpy)

It should be, but it seems to me LLVM could decide there are alignment-dependent instructions that would work better. Just handwaving.
> Maybe a more natural way to write this is to create a `[u64; 2]` and take a slice of it as `&mut [u8]` (its byte representation) and pass it to `fill_bytes`?

I did this originally, but it is significantly more fiddly.

> llvm could decide there are alignment-dependent instructions that would work better. just handwaving

I think that would mean that LLVM is explicitly disobeying our instructions and hence would be a miscompilation. (I had a look at the IR, and it does compile down to a `memcpy`.)
@brson today `HashMap::new` asks the thread rng for fresh keys on every call; with the PR it reads keys that were generated once per thread and cached in thread-local storage.
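Roughly, the two schemes look like this (a sketch with illustrative names, not the actual std internals):

    use std::cell::Cell;

    // Illustrative stand-ins for std internals.
    #[derive(Clone, Copy)]
    struct Keys(u64, u64);

    fn os_random_keys() -> Keys {
        // The real code fills 16 bytes straight from the OS.
        Keys(0x0123_4567_89ab_cdef, 0xfedc_ba98_7654_3210)
    }

    // Today: every map asks the (lazily initialised, periodically reseeded)
    // thread rng for fresh keys, so each map hashes differently.
    fn keys_today() -> Keys {
        os_random_keys() // stand-in for two thread_rng().gen::<u64>() calls
    }

    // With the PR: the keys are fetched once per thread, cached in TLS, and
    // every map created on that thread reuses them.
    thread_local!(static CACHED: Cell<Option<Keys>> = Cell::new(None));

    fn keys_with_pr() -> Keys {
        CACHED.with(|c| match c.get() {
            Some(k) => k,
            None => {
                let k = os_random_keys();
                c.set(Some(k));
                k
            }
        })
    }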
Today if you make two HashMaps and perform identical operations on them, you'll get different iteration orders. With this patch, they will iterate identically. Historically, programmers have been all too happy to latch onto any perceived determinism in HashMap iteration order. This is why JavaScript now specifies that an object is semantically a LinkedHashMap: people relied on this fact, and when it didn't hold, drop-downs got shuffled. Now, this is less of a risk for Rust, because the JS case relies on individual maps being really reliable, and that doesn't hold today or with this patch. However, one could imagine something to the effect of:

    fn test() {
        let mut results: Vec<Vec<u8>> = vec![];
        for vec in get_data_sets() {
            let mut map = HashMap::new();
            for k in vec {
                map.insert(k, compute(k));
            }
            results.push(map.values().cloned().collect());
        }
        assert!(results[0] == results[1]);
    }

Today this would be hopelessly broken, but with this patch it will "happen to work" if the input/output sequences line up. This is one of those situations where we can wag our finger and tell people not to do it, but at the end of the day it's externally observable, so it can and will be leveraged. Note that the following will still be hopelessly and obviously broken:

    fn test() {
        // I got this when I ran it the first time, shouldn't change!
        let expected = vec![1, 3, 5, 6, 11, 13];
        let mut map = HashMap::new();
        for k in get_test_input() {
            map.insert(k, compute(k));
        }
        let result: Vec<_> = map.values().cloned().collect();
        assert_eq!(result, expected);
    }

Which is the thing I would most worry about.
    pub use self::imp::fill_bytes;

    #[cfg(all(unix, not(target_os = "ios"), not(target_os = "openbsd")))]
Would it be possible to deduplicate this with the existing randomness support? At least in terms of an internal implementation detail it'd be good to not have to keep this in sync in two places.
As I mention above, I was/am going to remove the existing support from std since we only need what's in this file now. However, the `rand` crate will still need to exist for internal testing purposes; should it still be dedup'd with that? (It's definitely possible, it just means slightly more boilerplate in here.)
Ah, sorry for letting this slip through the cracks, but yeah, that seems fine to me!
Wow, those are some impressive numbers! In the past I don't remember them being quite so dramatic... If we're moving to thread-locals I wouldn't mind moving to full-blown process-globals. We've concluded in the past that there's no need in terms of DoS protection to have per-hashmap or even per-thread keys; a global one should be good enough.
Currently each hashmap has different keys, since each one retrieves new keys from the thread-local rng on creation. This patch changes that: the keys are fetched from the OS once per thread and then shared by every map created on that thread.
It'll probably depend on the platform (incl. Linux kernel version), since that'll change the performance characteristics of retrieving random numbers (the old thread rng requires retrieving kilobytes, while this just needs a few).
Yeah, definitely possible, but this implementation is particularly simple (don't have to worry about …)
It would be possible to address @gankro's concern by generating the actual keys with a per-thread PRNG seeded with the OS random data. EDIT: but isn't this what the old implementation was already doing?
The improvement from 130_000 ns to 3000 ns looks huge. Would it make sense to investigate why the existing thread rng is so slow?
Yes, it was, and that's why I was sure to point out in the PR description that this change has that downside. If we absolutely want to proactively help people avoid relying on hashmap order, a simpler and more efficient method would be to just increment one of the cached keys each time. (NB. this doesn't work quite so well with global rather than thread-local values: the locked increment/contention makes things slower.)
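A sketch of the incrementing variant (hypothetical names; roughly the direction the follow-up PR took): each new map gets the thread's cached keys, with one key bumped per creation so two maps on the same thread no longer share an order.

    use std::cell::Cell;

    thread_local! {
        // Seeded once per thread from OS randomness (seeding elided here).
        static KEYS: Cell<(u64, u64)> = Cell::new((1, 2));
    }

    // Hypothetical helper: hand out hasher keys for a new HashMap.
    fn next_keys() -> (u64, u64) {
        KEYS.with(|k| {
            let (k0, k1) = k.get();
            // Bump one key so successive maps on this thread differ, without
            // any cross-thread synchronisation.
            k.set((k0.wrapping_add(1), k1));
            (k0, k1)
        })
    }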
Maybe, but it's almost certainly fundamental to the algorithm (it requires an extensive initialisation). In any case, the thread_rng in std is only used here, and hence is being/can be removed.
If the update of the global cached keys is that simple, we could do atomic increments. There might be some contention, but it should still be way faster than locking. Another option would be to keep the thread-local approach and use a simpler (non-crypto-safe) RNG to have cheaper initialisation and key generation. In the repo for the …
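For comparison, a sketch of the process-global atomic variant (hypothetical names; per the measurements below, even the uncontended atomic bump costs a few nanoseconds per call):

    use std::sync::atomic::{AtomicU64, Ordering};

    // Process-global keys; in real code both would be seeded from the OS at
    // first use (seeding elided here).
    static K0: AtomicU64 = AtomicU64::new(1);
    static K1: u64 = 2;

    fn next_keys_global() -> (u64, u64) {
        // One relaxed fetch_add per map: no locking, but still a shared
        // read-modify-write that every thread touches.
        (K0.fetch_add(1, Ordering::Relaxed), K1)
    }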
This reduces how much rand is required in std, and makes creating hash maps noticeably faster. The first invocation of HashMap::new() in a thread goes from ~130_000 ns to ~1500 ns (i.e. 86x faster) and later invocations go from 10-20 ns to a more consistent 1.8-2.0 ns. The mean for many invocations in a loop, on a single thread, goes from ~77 ns to ~1.9 ns, and the standard deviation drops from 2800 ns to 33 ns (the *massive* variance of the old scheme comes from the occasional reseeding of the thread rng).

These new numbers are only slightly higher than creating the `RandomState` outside the loop and calling `with_state` in it (i.e. only measuring the non-random parts of hashmap initialisation): 1.3 and 18 ns respectively.

This new scheme has the slight downside of being consistent on a single thread, so users may unintentionally rely on the order being fixed (e.g. two hashmaps with the same contents).

Closes rust-lang#27243.
27f9142 to 3f84916
(Incidentally, my method of measuring had a lot of overhead due to the timers; I've updated the numbers.)
Yes, but... locking was never proposed? In the single threaded case (i.e. no contention), doing the atomic increment seems to be more than 2.8 times slower than the non-incrementing global version and both thread local ones (incrementing or not): it's about 5.4 ns / invocation.
One could say that the incrementing was exactly this sort of non-crypto-safe RNG (and is about as fast as you can get). There are two factors at play here: protection against attacker-driven hash collisions (DoS), and users relying on a particular iteration order. The first we are assuming (see e.g. #27243) is guaranteed by the combination of 128 bits of randomness plus the design of SipHash, so this discussion is all about the second.
I'm still a fan of this. @huonw is this ready to go?
The libs team discussed this during triage yesterday and the conclusion was: …
Once that's in place this should be good to go, thanks @huonw!
I suggest preserving nondeterministic iteration order only in debug builds, not release builds. However, there's no rush to change the team's decision. (It seems the only flag for conditional compilation in debug builds is `debug_assertions`.)
alex's solution sounds good. Making it conditional on debug_assertions inside libstd doesn't affect most users: almost all users will use a libstd compiled with debug_assertions off; until recently not even our buildbots would test such a configuration.
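A sketch of how that debug-only behaviour could look (hypothetical helper, not the actual libstd code): the cached keys are perturbed per map only when libstd itself was built with debug assertions.

    use std::cell::Cell;

    // Hypothetical: hand out the thread's cached keys for a new map.
    fn keys_for_new_map(cached: &Cell<(u64, u64)>) -> (u64, u64) {
        let (k0, k1) = cached.get();
        if cfg!(debug_assertions) {
            // Debug libstd: bump a key so iteration order stays
            // nondeterministic between maps on the same thread.
            cached.set((k0.wrapping_add(1), k1));
        }
        // Release libstd (what nearly all users get): reuse the cached keys.
        (k0, k1)
    }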
☔ The latest upstream changes (presumably #32635) made this pull request unmergeable. Please resolve the merge conflicts.
I've started to continue this in #33318, so I'm gonna close this in favor of that.
This is a rebase and extension of rust-lang#31356 where we cache the keys in thread local storage. This should give us a nice speed boost in creating hash maps along with mostly retaining the property that all maps have a nondeterministic iteration order. Closes rust-lang#27243
std: Cache HashMap keys in TLS

This is a rebase and extension of #31356 where we not only cache the keys in thread local storage but we also bump each key every time a new `HashMap` is created. This should give us a nice speed boost in creating hash maps along with retaining the property that all maps have a nondeterministic iteration order. Closes #27243
This reduces how much rand is required in std, and makes creating hash maps noticeably faster. The first invocation of HashMap::new() in a thread goes from ~130_000 ns to ~1500 ns (i.e. 86x faster) and later invocations (non-reseeding ones) go from 10-20 ns to a more consistent 1.8-2.0 ns.

The mean for many invocations in a loop, on a single thread, goes from ~78 ns to ~1.9 ns, and the standard deviation drops from 2800 ns to 33 ns (the massive variance of the old scheme comes from the occasional reseeding of the thread rng).

These new numbers are only slightly higher than creating the `RandomState` outside the loop and calling `with_state` in it (i.e. only measuring the non-random parts of hashmap initialisation): 1.3 and 18 ns respectively.

This new scheme has the slight downside of being consistent on a single thread, so users may unintentionally rely on the order being fixed (e.g. two hashmaps with the same contents).

Closes #27243.