random_get_pseudo_bytes() need not provide cryptographic strength entropy #372

ryao · 2014-07-12T00:24:48Z

Perf profiling of dd on a zvol revealed that my system spent 3.16% of
its time in random_get_pseudo_bytes(). No SPL consumers need
cryptographic strength entropy, so we can reduce our overhead by
changing the implementation to utilize a fast PRNG.

The Linux kernel did not export a suitable PRNG function until it
exported get_random_int() in Linux 3.10. While we could implement an
autotools check so that we use it when it is available or even try to
access the symbol on older kernels where it is not exported using the
fact that it is exported on newer ones as justification, we can instead
implement our own pseudo-random data generator. For this purpose, I have
written one based on a pseudo-random number generator proposed in a
paper by Sebastiano Vigna that itself was based on work by the late
George Marsaglia.

http://arxiv.org/pdf/1402.6246v2.pdf

Profiling the same benchmark with a variant of this patch that did not
disable interrupts showed that time spent in random_get_pseudo_bytes()
dropped to 0.06%. That is a factor of 50 improvement.

ryao · 2014-07-12T00:52:10Z

I decided to add the current thread's pid as an entropy source in order to avoid the situation where two threads operating in lockstep get the same numbers. I will update the documentation after performing another round of performance testing so that my figure for performance improvement is accurate.

behlendorf · 2014-07-16T20:13:56Z

@ryao Nice find on this. This looks like a great improvement and I've just got minor style nits. Are you happy with this patch? You mentioned you wanted to make a few updates.

ryao · 2014-07-16T20:57:19Z

@behlendorf It works well on my system, but there are a few minor things to fix in terms of comments and style. There is also a mistake where I left a variable static when splitting out the generator into its own function. I will push a corrected version later today.

ryao · 2014-07-16T21:30:51Z

@behlendorf I have pushed a revised patch. It addresses everything that we discussed. Sadly, my revised benchmark showed lower performance, but I think we spend so little time in this function that it is in the margin of error.

ryao · 2014-07-18T00:07:51Z

To be clear, this is still faster than the Linux function we wrap now, but the improvement is only a factor of 60, rather than the 158 I reported earlier. The previous figure also had far too many significant figures.

ryao · 2014-07-21T05:00:58Z

@behlendorf I have pushed what I expect will be the final revision. I decided to switch to per-cpu variables and turn on/off preemption. This should perfectly correspond to what was stated in the paper.

behlendorf · 2014-07-21T18:03:12Z

@ryao This is turning out nice! Just a few questions on the latest version.

behlendorf · 2014-07-21T19:02:01Z

@ryao OK, let me run this through testing. As long as nothing unexpected happens we can get it merged.

ryao · 2014-07-22T16:48:05Z

@behlendorf I have pushed a version that includes code to detect and correct the incredibly improbable case that get_random_bytes() returns a zero word. The code will set the word to ~0 - i where i is the CPU id and log the issue to dmesg. This should now be safe to merge.

I considered your suggestion to call get_random_bytes() additional times, but I think implicitly mapping CPU 0, 1 ... 4095 to -1, -2, ... -4096 as a fallback is a simple way of dealing with it. That way we won't worsen a situation where we are already entropy starved. It is tempting to hard code the initial seeds to avoid using get_random_bytes() altogether so that this would be the initial seed rather than a fallback for the initial seed, but I decided against it because that would have the undesirable effect of making GUID collisions more probable. It is also tempting to try to substitute jiffies and some hard coded increments, but that has the same problem.
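The fallback described above can be sketched in userspace C as follows. This is only an illustration under stated assumptions: the helper name `fallback_seed` is hypothetical, and the thread does not show the actual patch code, only the rule that a zero word becomes ~0 - i where i is the CPU id.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): replace a zero seed word
 * with ~0 - cpu, so that CPU 0, 1, ..., 4095 map to the 64-bit words
 * -1, -2, ..., -4096. A xorshift generator must not be seeded with
 * zero, because zero stays zero under xor and shift.
 */
static uint64_t
fallback_seed(uint64_t seed, unsigned int cpu)
{
	if (seed == 0)
		seed = ~UINT64_C(0) - cpu;

	/* cpu fits in 32 bits, so the fallback can never be zero itself. */
	assert(seed != 0);
	return (seed);
}
```

A non-zero word from the entropy source passes through unchanged, so the fallback costs no additional entropy.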

As a side note, it should be said that this has a side effect for GUID generation. Ordinarily, we should have 2^128 different possibilities for a GUID, but with this PRNG being used to generate the numbers, we have the paradoxical situation where we only have 2^64 - 1 possibilities for GUIDs. This is because we have a sequence of 2^64 - 1 64-bit numbers and picking one number from the sequence implicitly picks the next. That should be okay, but it is worth mentioning. If that is undesirable, we might want to change the GUID generation to use random_get_bytes() so that it continues using Linux's get_random_bytes().
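To make the GUID observation concrete, here is a minimal sketch, assuming a plain Marsaglia xorshift step with the (13, 7, 17) shift triple as a stand-in for the generator in the patch (the thread does not show the patch's exact algorithm). The names `xorshift64_next`, `guid128` and `guid_from_state` are hypothetical. Because the generator carries only 64 bits of state, both halves of a 128-bit value are fixed by that state.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in 64-bit xorshift step; the output is also the new state. */
static uint64_t
xorshift64_next(uint64_t *state)
{
	uint64_t x = *state;

	assert(x != 0);	/* zero is a fixed point and must never be used */
	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	*state = x;
	return (x);
}

/* Hypothetical 128-bit GUID built from two consecutive outputs. */
struct guid128 {
	uint64_t hi;
	uint64_t lo;
};

static struct guid128
guid_from_state(uint64_t state)
{
	struct guid128 g;

	g.hi = xorshift64_next(&state);
	g.lo = xorshift64_next(&state);	/* fully determined by g.hi */
	return (g);
}
```

In this sketch the low half is a function of the high half, so the number of distinct GUIDs is bounded by the number of non-zero 64-bit states rather than by 2^128.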

behlendorf · 2014-07-22T16:49:21Z

@ryao Thanks for refreshing this again. This looks good to me. Are you happy with the latest version?

behlendorf · 2014-07-22T17:17:16Z

Good point about the GUIDs. That should be acceptable behavior. I don't think it would be too unexpected since the caller did decide to use a PRNG. If it becomes a problem we can update the zfs code to use random_get_bytes().

ryao · 2014-07-22T17:36:55Z

@behlendorf I have revised this one more time to improve the comments and I am happy with the present code. The only lingering questions on my mind are if this merits the addition of my LLC's copyright notice and if the new code should be split out into a separate file (e.g. spl-random.c), but those are minor details.

behlendorf · 2014-07-22T17:47:46Z

@ryao OK, then I'll review it one more time and get it tested.

ryao · 2014-07-22T18:35:12Z

Just to document what was said on IRC here... it turns out that random_get_pseudo_bytes() does map to /dev/urandom on Solaris. That means that our present mapping is accurate, but it has high overhead. No SPL consumer needs cryptographic strength entropy, so it is safe to switch to a fast PRNG to reduce overhead. I have updated the commit's comments to reflect this. Only the comments and commit message have changed since my last revision.

sempervictus · 2015-03-02T00:50:26Z

Any chance this could get updated for current revision? This still has a lot of the code cleaned up last fall attached. Thanks

ryao · 2016-02-06T22:10:01Z

@sempervictus This was near the bottom of my priorities, but it is refreshed now.

ryao · 2016-02-07T02:07:17Z

I have pushed a patch into the pull request to replace the 64-bit xorshift generator with a newer 128-bit xorshift+ generator that I found in literature:

http://vigna.di.unimi.it/ftp/papers/xorshiftplus.pdf

The numbers produced by it pass more stringent tests for statistical randomness than those that existed when I familiarized myself with the topic of random number generation almost a decade ago. It also fixes a few problems:

- 128-bit GUIDs generated by this generator are now truly 128 bits, not 64 bits disguised as 128 bits.
- The sequences generated on two different CPUs will never overlap in practice because each CPU's seed is 2^64 numbers apart. With the previous code, the probability was ~0.00015% on a system with 8 CPUs assuming 2^32 numbers are produced on each. If they did overlap we would probably have been okay, but it is nice to eliminate the risk of overlap entirely.
- The entropy required from get_random_bytes() at system initialization is now O(1) rather than O(N). More precisely, it is 16 bytes rather than NR_CPUS * 8 bytes.
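For readers unfamiliar with the generator family, the following is a minimal sketch of a xorshift128+ step. It assumes the (23, 18, 5) shift constants, which are one published parameter set for xorshift128+; the thread does not show which constants the patch uses, and the function name `xorshift128plus_next` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * One xorshift128+ step over a 128-bit state held in s[0] and s[1].
 * The shift constants are an assumption for illustration only.
 */
static uint64_t
xorshift128plus_next(uint64_t s[2])
{
	uint64_t s1 = s[0];
	const uint64_t s0 = s[1];

	/* The all-zero state is a fixed point and must never be used. */
	assert(s0 != 0 || s1 != 0);

	s[0] = s0;
	s1 ^= s1 << 23;
	s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
	return (s[1] + s0);
}
```

Since the state is 128 bits wide, two consecutive 64-bit outputs are no longer pinned down by a single 64-bit word, which is what makes a GUID assembled from them depend on 128 bits of state.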

It is also very fast. The following website claims that it takes 1.12 ns per 64bit word on an Intel Core i7-4770 in a presumably single threaded benchmark. That is ~7.14GB/s per core.

http://xorshift.di.unimi.it/#speed
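The per-core figure quoted above is a unit conversion: one byte per nanosecond equals one GB/s (taking 1 GB as 10^9 bytes), so 8 bytes every 1.12 ns comes to roughly 7.14 GB/s. A small sketch, with the hypothetical helper name `throughput_gb_per_s`:

```c
#include <assert.h>

/*
 * Convert a per-word latency into throughput. Bytes per nanosecond is
 * numerically equal to GB/s when 1 GB is taken as 10^9 bytes.
 */
static double
throughput_gb_per_s(double ns_per_word, double bytes_per_word)
{
	assert(ns_per_word > 0.0);
	return (bytes_per_word / ns_per_word);
}
```

Calling `throughput_gb_per_s(1.12, 8.0)` yields approximately 7.14.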

The design of random_get_pseudo_bytes() should prevent it from obtaining that level of speed in practice, but it should still be very fast. It should also outperform Linux's get_random_bytes() by one or two orders of magnitude like the original xorshift generator I had proposed did in benchmarks.

Note that while this PR looks right to me, I have not yet subjected it to tests like I did with the original version. I will test it at some point if no one volunteers to do it before I do.

dweeezil · 2016-02-07T15:10:12Z

@ryao When I tried a simple "dd to a zvol" test as you mentioned in the original issue report, I noticed the hot call path (3.X% as well) was the vd == NULL case in vdev_mirror_map_alloc(). Was this the path you observed? It seems this case could easily get away with (the equivalent of) gethrtime() % 3 rather than relying on a proper PRNG. Of course it's still a very Good Idea to have as efficient a PRNG as possible for, say, GUID generation etc.

ryao · 2016-02-08T17:47:24Z

@dweeezil That is consistent with my recollection.

Something like gethrtime() % 3 could work and looks like less trouble on the surface, but I consider it to be more trouble than this per-cpu PRNG for the following reasons:

1. Switching to gethrtime() % 3 in hot code paths today requires diverging from illumos-gate and does nothing about potential future patches from illumos-gate that call our slow random_get_pseudo_bytes() in different hot code paths. Reimplementing random_get_pseudo_bytes() with a per-CPU PRNG avoids both of those things entirely, which means less work for us in the future.
2. Looking at the code that implements gethrtime(), I think it is unlikely to be faster than this per-CPU PRNG implementation of random_get_pseudo_bytes(). It would be best to go with something fast now so that there is no point in revisiting this from a performance perspective.
3. gethrtime() % 3 can vary in behavior from system to system based on kernel version, architecture and clock source. In comparison, this per-CPU PRNG is about 40 lines of code in random_get_pseudo_bytes() that should behave consistently across all systems regardless of kernel version, system architecture or machine clock source. It is unlikely that we would ever need to revisit this per-CPU PRNG while the same cannot be said for gethrtime() % 3.
4. gethrtime() uses CPU memory barriers and maybe atomic instructions depending on the clock source, so replacing random_get_pseudo_bytes() with gethrtime() in hot code paths could still require a future person working on NUMA scalability to reimplement it anyway, while this per-CPU PRNG would not by virtue of using neither CPU memory barriers nor atomic instructions. Note that I did not check various clock sources for the presence of atomic instructions.
5. I have heard of instances where poor quality pseudo-random numbers caused problems for HPC code in ways that took more than a year to identify and were remedied by switching to a higher quality source of pseudo-random numbers. While filesystems are different than HPC code, I do not think it is impossible for us to have instances where poor quality pseudo-random numbers can cause problems. Opting for a well studied PRNG algorithm that passes tests for statistical randomness over changing callers to use gethrtime() % 3 bypasses the need to think about both whether poor quality pseudo-random numbers can cause problems and the statistical quality of numbers from gethrtime() % 3.
6. gethrtime() calls getrawmonotonic(), which uses seqlocks. This is probably not a huge issue, but anyone using kgdb would never be able to step through a seqlock critical section, which is not a problem either now or with the per-CPU PRNG:

https://en.wikipedia.org/wiki/Seqlock

The only downside that I can see is that this code's memory requirement is O(N) where N is NR_CPUS, but that should not be a problem. At the high end (i.e. NR_CPUS == 4096), we are talking about 64KB of memory for seeds. At the low end (i.e. NR_CPUS == 1), we would be using 16 bytes of memory for the seed. Beyond the seeds, we should only use a few hundred bytes of code for text, especially since spl_rand_jump() is intended to be inlined into spl_random_init(), which should be removed during early boot. In either case, the memory requirements are minuscule compared to the rest of ZoL.
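A userspace sketch of the per-CPU layout and its memory cost follows. Everything here is an assumption for illustration: the names `SKETCH_NR_CPUS`, `sketch_seeds`, `sketch_rand_next` and `sketch_fill` are hypothetical, the shift constants are one published xorshift128+ parameter set rather than something shown in the thread, and a plain array indexed by CPU number stands in for kernel per-CPU variables (the kernel version would also disable preemption around the state update).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define	SKETCH_NR_CPUS	4096	/* high-end value quoted in the thread */

/* One 128-bit state per CPU: 16 bytes each, 64KB for 4096 CPUs. */
static uint64_t sketch_seeds[SKETCH_NR_CPUS][2];

/* xorshift128+ step; constants are an assumption for illustration. */
static uint64_t
sketch_rand_next(uint64_t s[2])
{
	uint64_t s1 = s[0];
	const uint64_t s0 = s[1];

	s[0] = s0;
	s1 ^= s1 << 23;
	s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
	return (s[1] + s0);
}

/* Fill buf with len pseudo-random bytes from the given CPU's state. */
static void
sketch_fill(unsigned int cpu, uint8_t *buf, size_t len)
{
	uint64_t *s;

	assert(cpu < SKETCH_NR_CPUS);
	s = sketch_seeds[cpu];
	assert(s[0] != 0 || s[1] != 0);	/* state must be seeded first */

	while (len > 0) {
		uint64_t word = sketch_rand_next(s);
		size_t n = len < sizeof (word) ? len : sizeof (word);

		memcpy(buf, &word, n);
		buf += n;
		len -= n;
	}
}
```

No locks or atomic operations appear in the fill path because each CPU only ever touches its own 16 bytes of state, which is the property the comparison with gethrtime() relies on.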

That said, I like the idea of improving our random_get_pseudo_bytes() implementation in a way that present/future code using it should never need to be examined again. As far as I can tell, this code does exactly that.

ryao · 2016-02-08T19:31:15Z

I merged the two into one commit with some typo fixes and an updated commit message.

behlendorf · 2016-02-08T19:53:11Z

@ryao if you open a dummy PR against ZFS which includes the following line in the commit message we can get some testing on this.

Requires-spl: refs/pull/372/head

ryao · 2016-02-08T20:15:05Z

@behlendorf I did a quick build test locally, which revealed that we had a C99ism (error: ‘for’ loop initial declarations are only allowed in C99 or C11 mode) from the CC0 code that needed a repush to fix.

I have opened openzfs/zfs#4321 with a dummy commit for the buildbot.

dweeezil · 2016-02-08T21:08:29Z

@ryao All your points regarding gethrtime() are well taken. Since my comments regarding the use of a PRNG in `vdev_mirror_map_alloc()` could be considered an issue hijack, I'll save any others for the OpenZFS list or a separate ZFS-specific issue.

ryao · 2016-02-08T21:39:41Z

@dweeezil This was originally motivated by the excessive CPU time taken in vdev_mirror_map_alloc(). When I originally looked at this, I felt that modifying either vdev_mirror_map_alloc() or spa_get_random(c) would be unnecessary if random_get_pseudo_bytes() were updated. It got bumped to a low priority by my then new job at ClusterHQ, but I want to finish what is needed for this to be merged.

…ropy

Perf profiling of dd on a zvol revealed that my system spent 3.16% of its time in random_get_pseudo_bytes(). No SPL consumers need cryptographic strength entropy, so we can reduce our overhead by changing the implementation to utilize a fast PRNG.

The Linux kernel did not export a suitable PRNG function until it exported get_random_int() in Linux 3.10. While we could implement an autotools check so that we use it when it is available or even try to access the symbol on older kernels where it is not exported using the fact that it is exported on newer ones as justification, we can instead implement our own pseudo-random data generator. For this purpose, I have written one based on a 128-bit pseudo-random number generator proposed in a paper by Sebastiano Vigna that itself was based on work by the late George Marsaglia.

http://vigna.di.unimi.it/ftp/papers/xorshiftplus.pdf

Profiling the same benchmark with an earlier variant of this patch that used a slightly different generator (roughly same number of instructions) by the same author showed that time spent in random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50 improvement. This particular generator algorithm is also well known to be fast:

http://xorshift.di.unimi.it/#speed

The benchmark numbers there state that it runs at 1.12ns/64-bits or 7.14 GBps of throughput on an Intel Core i7-4770 in what is presumably a single-threaded context. Using it in `random_get_pseudo_bytes()` in the manner I have will probably not reach that level of performance, but it should be fairly high and many times higher than the Linux `get_random_bytes()` function that we use now, which runs at 16.3 MB/s on my Intel Xeon E3-1276v3 processor when measured by using dd on /dev/urandom. Also, putting this generator's seed into per-CPU variables allows us to eliminate overhead from both spin locks and CPU memory barriers, which is NUMA friendly.

We could have alternatively modified consumers to use something like `gethrtime() % 3` as suggested by both Matthew Ahrens and Tim Chase, but that has a few potential problems that this approach avoids:

1. Switching to `gethrtime() % 3` in hot code paths today requires diverging from illumos-gate and does nothing about potential future patches from illumos-gate that call our slow `random_get_pseudo_bytes()` in different hot code paths. Reimplementing `random_get_pseudo_bytes()` with a per-CPU PRNG avoids both of those things entirely, which means less work for us in the future.
2. Looking at the code that implements `gethrtime()`, I think it is unlikely to be faster than this per-CPU PRNG implementation of `random_get_pseudo_bytes()`. It would be best to go with something fast now so that there is no point in revisiting this from a performance perspective.
3. `gethrtime() % 3` can vary in behavior from system to system based on kernel version, architecture and clock source. In comparison, this per-CPU PRNG is about 40 lines of code in `random_get_pseudo_bytes()` that should behave consistently across all systems regardless of kernel version, system architecture or machine clock source. It is unlikely that we would ever need to revisit this per-CPU PRNG while the same cannot be said for `gethrtime() % 3`.
4. `gethrtime()` uses CPU memory barriers and maybe atomic instructions depending on the clock source, so replacing `random_get_pseudo_bytes()` with `gethrtime()` in hot code paths could still require a future person working on NUMA scalability to reimplement it anyway, while this per-CPU PRNG would not by virtue of using neither CPU memory barriers nor atomic instructions. Note that I did not check various clock sources for the presence of atomic instructions. There is simply too much code to read and given the drawbacks versus this per-CPU PRNG, there is no point in being certain.
5. I have heard of instances where poor quality pseudo-random numbers caused problems for HPC code in ways that took more than a year to identify and were remedied by switching to a higher quality source of pseudo-random numbers. While filesystems are different than HPC code, I do not think it is impossible for us to have instances where poor quality pseudo-random numbers can cause problems. Opting for a well studied PRNG algorithm that passes tests for statistical randomness over changing callers to use `gethrtime() % 3` bypasses the need to think about both whether poor quality pseudo-random numbers can cause problems and the statistical quality of numbers from `gethrtime() % 3`.
6. `gethrtime()` calls `getrawmonotonic()`, which uses seqlocks. This is probably not a huge issue, but anyone using kgdb would never be able to step through a seqlock critical section, which is not a problem either now or with the per-CPU PRNG:

https://en.wikipedia.org/wiki/Seqlock

The only downside that I can see is that this code's memory requirement is O(N) where N is NR_CPUS, versus the current code and `gethrtime() % 3`, which are O(1), but that should not be a problem. The seeds will use 64KB of memory at the high end (i.e. `NR_CPUS == 4096`) and 16 bytes of memory at the low end (i.e. `NR_CPUS == 1`). Beyond the seeds, we should only use a few hundred bytes of code for text, especially since `spl_rand_jump()` should be inlined into `spl_random_init()`, which should be removed during early boot as part of "Freeing unused kernel memory". In either case, the memory requirements are minuscule.

Signed-off-by: Richard Yao <[email protected]>

ryao · 2016-02-08T22:29:10Z

I made a small cosmetic change to the patch so that "improbable seed" would be printed as 0x696D70726F6261626C65207365656400 on both big endian and little endian systems.
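One way to get identical output on big endian and little endian systems is to format the seed byte by byte rather than as two native 64-bit integers. The sketch below takes that approach; it is an assumption about how such output could be produced, not the code in the patch, and the helper name `format_seed_hex` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Write "0x" followed by two uppercase hex digits per byte, in memory
 * order. Reading the seed as bytes avoids any dependence on the host's
 * integer byte order.
 */
static void
format_seed_hex(const unsigned char *seed, size_t len, char *out,
    size_t out_len)
{
	size_t i;

	/* Room for "0x", two digits per byte, and the terminating NUL. */
	assert(out_len >= 2 + 2 * len + 1);

	memcpy(out, "0x", 2);
	for (i = 0; i < len; i++)
		snprintf(out + 2 + 2 * i, 3, "%02X", (unsigned int)seed[i]);
	out[2 + 2 * len] = '\0';
}
```

Passing the 16 bytes of the string "improbable seed" (15 characters plus the terminating NUL) produces 0x696D70726F6261626C65207365656400, matching the value quoted in the comment.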

ryao · 2016-02-17T02:17:06Z

This has passed the buildbot. I only had to kick it 3 times before the spurious failures stopped. :/

behlendorf · 2016-02-17T17:46:24Z

This LGTM, @dweeezil if you're also OK with this change I'll get it merged now that it's passing all the testing.

dweeezil · 2016-02-17T18:20:35Z

LGTM, including the rationale explained in the commit comment.

behlendorf added this to the 0.6.4 milestone Jul 16, 2014

behlendorf added the Bug label Jul 16, 2014

ryao changed the title ~~random_get_pseudo_bytes() should not provide cryptographic strength entr...~~ random_get_pseudo_bytes() need not provide cryptographic strength entropy Jul 22, 2014

behlendorf modified the milestones: 0.6.5, 0.6.4 Feb 5, 2015

kernelOfTruth mentioned this pull request Mar 15, 2015

After rsync of ~2TiB of data large amount of SUnreclaim (ARC), keeps on growing (slabtop) without limit - slowing down system to a halt openzfs/zfs#3157

Closed

ryao force-pushed the random_get_pseudo_bytes branch from 8a6998f to 3c00cf1 Compare February 6, 2016 22:09

ryao force-pushed the random_get_pseudo_bytes branch from 8e199ae to 3e4bef8 Compare February 7, 2016 01:46

ryao force-pushed the random_get_pseudo_bytes branch 3 times, most recently from 6800e59 to 21150e4 Compare February 7, 2016 01:55

ryao force-pushed the random_get_pseudo_bytes branch 3 times, most recently from a5ad06a to a2e1f4b Compare February 7, 2016 05:50

ryao force-pushed the random_get_pseudo_bytes branch from a2e1f4b to 5336d38 Compare February 8, 2016 19:30

ryao force-pushed the random_get_pseudo_bytes branch from 5336d38 to 632b008 Compare February 8, 2016 20:11

ryao added a commit to ryao/zfs that referenced this pull request Feb 8, 2016

Test openzfs/spl#372

9a5bc0d

Requires-spl: refs/pull/372/head Signed-off-by: Richard Yao <[email protected]>

ryao force-pushed the random_get_pseudo_bytes branch from 632b008 to 5c15937 Compare February 8, 2016 22:26

ryao added a commit to ryao/zfs that referenced this pull request Feb 10, 2016

Test openzfs/spl#372

9971fb7

Requires-spl: refs/pull/372/head Signed-off-by: Richard Yao <[email protected]>

ryao added a commit to ryao/zfs that referenced this pull request Feb 13, 2016

Test openzfs/spl#372

2f19c7c

Requires-spl: refs/pull/372/head Signed-off-by: Richard Yao <[email protected]>

ryao added a commit to ryao/zfs that referenced this pull request Feb 16, 2016

Test openzfs/spl#372

ab9d2bd

Requires-spl: refs/pull/372/head Signed-off-by: Richard Yao <[email protected]>

behlendorf closed this in 0b43696 Feb 17, 2016
