-
Notifications
You must be signed in to change notification settings - Fork 178
random_get_pseudo_bytes() need not provide cryptographic strength entropy #372
Conversation
I decided to add the current thread's pid as an entropy source in order to avoid the situation where two threads operating in lockstep get the same numbers. I will update the documentation after performing another round of performance testing so that my figure for performance improvement is accurate. |
@ryao Nice find on this. This looks like a great improvement and I've just got minor style nits. Are you happy with this patch, you mentioned you wanted to do a few updates. |
@behlendorf It works well on my system, but there are a few minor things to fix in terms of comments and style. There is also a mistake where I left a variable static when splitting out the generator into its own function. I will push a corrected version later today. |
@behlendorf I have pushed a revised patch. It addresses everything that we discussed. Sadly, my revised benchmark showed lower performance, but I think we spend so little time in this function that it is in the margin of error. |
To be clear, this is still faster than the Linux function we wrap now, but the improvement is only a factor of 60, rather than the 158 I reported earlier. The previous figure also had far too many significant figures. |
@behlendorf I have pushed what I expect will be the final revision. I decided to switch to per-cpu variables and turn on/off preemption. This should perfectly correspond to what was stated in the paper. |
@ryao This is turning out nice! Just a few questions on the latest version. |
@ryao OK, let me run this through testing. As long as nothing unexpected happens we can get it merged. |
@behlendorf I have pushed a version that includes code to detect and correct the incredibly improbable case that I considered your suggestion to call As a side note, it should be said that this has a side effect for GUID generation. Ordinarily, we should have |
@ryao Thanks for refreshing this again. This looks good to me, are you happy with the latest version. |
Good point about the GUIDs. That should be acceptable behavior. I don't think it would be too unexpected since the caller did decide to use a PRNG. If it becomes a problem we can update the zfs code to use |
@behlendorf I have revised this one more time to improve the comments and I am happy with the present code. The only lingering questions on my mind are if this merits the addition of my LLC's copyright notice and if the new code should be split out into a separate file (e.g. spl-random.c), but those are minor details. |
@ryao OK, then I'll review it one more time and get it tested. |
Just to document what was said on IRC here... it turns out that |
…ropy openzfs#372 Perf profiling of dd on a zvol revealed that my system spent 3.16% of its time in random_get_pseudo_bytes(). No SPL consumers need cryptographic strength entropy, so we can reduce our overhead by changing the implementation to utilize a fast PRNG. The Linux kernel did not export a suitable PRNG function until it exported get_random_int() in Linux 3.10. While we could implement an autotools check so that we use it when it is available or even try to access the symbol on older kernels where it is not exported using the fact that it is exported on newer ones as justification, we can instead implement our own pseudo-random data generator. For this purpose, I have written one based on a pseudo-random number generator proposed in a paper by Sebastiano Vigna that itself was based on work by the late George Marsaglia. http://arxiv.org/pdf/1402.6246v2.pdf Profiling the same benchmark with a variant of this patch that did not disable interrupts showed that time spent in random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50 improvement.
…ropy openzfs#372 Perf profiling of dd on a zvol revealed that my system spent 3.16% of its time in random_get_pseudo_bytes(). No SPL consumers need cryptographic strength entropy, so we can reduce our overhead by changing the implementation to utilize a fast PRNG. The Linux kernel did not export a suitable PRNG function until it exported get_random_int() in Linux 3.10. While we could implement an autotools check so that we use it when it is available or even try to access the symbol on older kernels where it is not exported using the fact that it is exported on newer ones as justification, we can instead implement our own pseudo-random data generator. For this purpose, I have written one based on a pseudo-random number generator proposed in a paper by Sebastiano Vigna that itself was based on work by the late George Marsaglia. http://arxiv.org/pdf/1402.6246v2.pdf Profiling the same benchmark with a variant of this patch that did not disable interrupts showed that time spent in random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50 improvement.
…ropy openzfs#372 Perf profiling of dd on a zvol revealed that my system spent 3.16% of its time in random_get_pseudo_bytes(). No SPL consumers need cryptographic strength entropy, so we can reduce our overhead by changing the implementation to utilize a fast PRNG. The Linux kernel did not export a suitable PRNG function until it exported get_random_int() in Linux 3.10. While we could implement an autotools check so that we use it when it is available or even try to access the symbol on older kernels where it is not exported using the fact that it is exported on newer ones as justification, we can instead implement our own pseudo-random data generator. For this purpose, I have written one based on a pseudo-random number generator proposed in a paper by Sebastiano Vigna that itself was based on work by the late George Marsaglia. http://arxiv.org/pdf/1402.6246v2.pdf Profiling the same benchmark with a variant of this patch that did not disable interrupts showed that time spent in random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50 improvement.
…ropy openzfs#372 Perf profiling of dd on a zvol revealed that my system spent 3.16% of its time in random_get_pseudo_bytes(). No SPL consumers need cryptographic strength entropy, so we can reduce our overhead by changing the implementation to utilize a fast PRNG. The Linux kernel did not export a suitable PRNG function until it exported get_random_int() in Linux 3.10. While we could implement an autotools check so that we use it when it is available or even try to access the symbol on older kernels where it is not exported using the fact that it is exported on newer ones as justification, we can instead implement our own pseudo-random data generator. For this purpose, I have written one based on a pseudo-random number generator proposed in a paper by Sebastiano Vigna that itself was based on work by the late George Marsaglia. http://arxiv.org/pdf/1402.6246v2.pdf Profiling the same benchmark with a variant of this patch that did not disable interrupts showed that time spent in random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50 improvement.
Any chance this could get updated for current revision? This still has a lot of the code cleaned up last fall attached. Thanks |
…ropy openzfs#372 Perf profiling of dd on a zvol revealed that my system spent 3.16% of its time in random_get_pseudo_bytes(). No SPL consumers need cryptographic strength entropy, so we can reduce our overhead by changing the implementation to utilize a fast PRNG. The Linux kernel did not export a suitable PRNG function until it exported get_random_int() in Linux 3.10. While we could implement an autotools check so that we use it when it is available or even try to access the symbol on older kernels where it is not exported using the fact that it is exported on newer ones as justification, we can instead implement our own pseudo-random data generator. For this purpose, I have written one based on a pseudo-random number generator proposed in a paper by Sebastiano Vigna that itself was based on work by the late George Marsaglia. http://arxiv.org/pdf/1402.6246v2.pdf Profiling the same benchmark with a variant of this patch that did not disable interrupts showed that time spent in random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50 improvement.
…ropy openzfs#372 Perf profiling of dd on a zvol revealed that my system spent 3.16% of its time in random_get_pseudo_bytes(). No SPL consumers need cryptographic strength entropy, so we can reduce our overhead by changing the implementation to utilize a fast PRNG. The Linux kernel did not export a suitable PRNG function until it exported get_random_int() in Linux 3.10. While we could implement an autotools check so that we use it when it is available or even try to access the symbol on older kernels where it is not exported using the fact that it is exported on newer ones as justification, we can instead implement our own pseudo-random data generator. For this purpose, I have written one based on a pseudo-random number generator proposed in a paper by Sebastiano Vigna that itself was based on work by the late George Marsaglia. http://arxiv.org/pdf/1402.6246v2.pdf Profiling the same benchmark with a variant of this patch that did not disable interrupts showed that time spent in random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50 improvement.
8a6998f
to
3c00cf1
Compare
@sempervictus This was near the bottom of my priorities, but it is refreshed now. |
8e199ae
to
3e4bef8
Compare
6800e59
to
21150e4
Compare
I have pushed a patch into the pull request to replace the 64-bit xorshift generator with a newer 128-bit xorshift+ generator that I found in literature: http://vigna.di.unimi.it/ftp/papers/xorshiftplus.pdf The numbers produced by it pass more stringent tests for statistical randomness than those that I existed when I familiarized myself with the topic of random number generation almost a decade ago. It also fixes a few problems:
It is also very fast. The following website claims that it takes 1.12 ns per 64bit word on an Intel Core i7-4770 in a presumably single threaded benchmark. That is ~7.14GB/s per core. http://xorshift.di.unimi.it/#speed The design of Note that while this PR looks right to me, I have not yet subjected it to tests like I did with the original version. I will test it at some point if no one volunteers to do it before I do. |
a5ad06a
to
a2e1f4b
Compare
@ryao When I tried a simple "dd to a zvol" test as you mentioned in the original issue report, I noticed the hot call path (3.X% as well) was the |
@dweeezil That is consistent with my recollection. Something like
https://en.wikipedia.org/wiki/Seqlock The only downside that I can see is that this code's memory requirement is O(N) where N is NR_CPUS, but that should not be a problem. At the high end (i.e. That said, I like the idea of improving our |
a2e1f4b
to
5336d38
Compare
I merged the two into one commit with some typo fixes and an updated commit message. |
@ryao if you open a dummy PR against ZFS which includes the following line in the commit message we can get some testing on this.
|
5336d38
to
632b008
Compare
Requires-spl: refs/pull/372/head Signed-off-by: Richard Yao <[email protected]>
@behlendorf I did a quick build test locally, which revealed that we had a C99ism ( I have opened openzfs/zfs#4321 with a dummy commit for the buildbot. |
@ryao All your points regarding |
@dweeezil This was originally motivated by the excessive CPU time taken in |
…ropy Perf profiling of dd on a zvol revealed that my system spent 3.16% of its time in random_get_pseudo_bytes(). No SPL consumers need cryptographic strength entropy, so we can reduce our overhead by changing the implementation to utilize a fast PRNG. The Linux kernel did not export a suitable PRNG function until it exported get_random_int() in Linux 3.10. While we could implement an autotools check so that we use it when it is available or even try to access the symbol on older kernels where it is not exported using the fact that it is exported on newer ones as justification, we can instead implement our own pseudo-random data generator. For this purpose, I have written one based on a 128-bit pseudo-random number generator proposed in a paper by Sebastiano Vigna that itself was based on work by the late George Marsaglia. http://vigna.di.unimi.it/ftp/papers/xorshiftplus.pdf Profiling the same benchmark with an earlier variant of this patch that used a slightly different generator (roughly same number of instructions) by the same author showed that time spent in random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50 improvement. This particular generator algorithm is also well known to be fast: http://xorshift.di.unimi.it/#speed The benchmark numbers there state that it runs at 1.12ns/64-bits or 7.14 GBps of throughput on an Intel Core i7-4770 in what is presumably a single-threaded context. Using it in `random_get_pseudo_bytes()` in the manner I have will probably not reach that level of performance, but it should be fairly high and many times higher than the Linux `get_random_bytes()` function that we use now, which runs at 16.3 MB/s on my Intel Xeon E3-1276v3 processor when measured by using dd on /dev/urandom. Also, putting this generator's seed into per-CPU variables allows us to eliminate overhead from both spin locks and CPU memory barriers, which is NUMA friendly. We could have alternatively modified consumers to use something like `gethrtime() % 3` as suggested by both Matthew Ahrens and Tim Chase, but that has a few potential problems that this approach avoids: 1. Switching to `gethrtime() % 3` in hot code paths today requires diverging from illumos-gate and does nothing about potential future patches from illumos-gate that call our slow `random_get_pseudo_bytes()` in different hot code paths. Reimplementing `random_get_pseudo_bytes()` with a per-CPU PRNG avoids both of those things entirely, which means less work for us in the future. 2. Looking at the code that implements `gethrtime()`, I think it is unlikely to be faster than this per-CPU PRNG implementation of `random_get_pseudo_bytes()`. It would be best to go with something fast now so that there is no point in revisiting this from a performance perspective. 3. `gethrtime() % 3` can vary in behavior from system to system based on kernel version, architecture and clock source. In comparison, this per-CPU PRNG is about ~40 lines of code in `random_get_pseudo_bytes()` that should behave consistently across all systems regardless of kernel version, system architecture or machine clock source. It is unlikely that we would ever need to revisit this per-CPU PRNG while the same cannot be said for `gethrtime() % 3`. 4. `gethrtime()` uses CPU memory barriers and maybe atomic instructions depending on the clock source, so replacing `random_get_pseudo_bytes()` with `gethrtime()` in hot code paths could still require a future person working on NUMA scalability to reimplement it anyway while this per-CPU PRNG would not by virtue of using neither CPU memory barriers nor atomic instructions. Note that I did not check various clock sources for the presence of atomic instructions. There is simply too much code to read and given the drawbacks versus this per-cpu PRNG, there is no point in being certain. 5. I have heard of instances where poor quality pseudo-random numbers caused problems for HPC code in ways that took more than a year to identify and were remedied by switching to a higher quality source of pseudo-random numbers. While filesystems are different than HPC code, I do not think it is impossible for us to have instances where poor quality pseudo-random numbers can cause problems. Opting for a well studied PRNG algorithm that passes tests for statistical randomness over changing callers to use `gethrtime() % 3` bypasses the need to think about both whether poor quality pseudo-random numbers can cause problems and the statistical quality of numbers from `gethrtime() % 3`. 6. `gethrtime()` calls `getrawmonotonic()`, which uses seqlocks. This is probably not a huge issue, but anyone using kgdb would never be able to step through a seqlock critical section, which is not a problem either now or with the per-CPU PRNG: https://en.wikipedia.org/wiki/Seqlock The only downside that I can see is that this code's memory requirement is O(N) where N is NR_CPUS, versus the current code and `gethrtime() % 3`, which are O(1), but that should not be a problem. The seeds will use 64KB of memory at the high end (i.e `NR_CPU == 4096`) and 16 bytes of memory at the low end (i.e. `NR_CPU == 1`). In either case, we should only use a few hundred bytes of code for text, especially since `spl_rand_jump()` should be inlined into `spl_random_init()`, which should be removed during early boot as part of "Freeing unused kernel memory". In either case, the memory requirements are minuscule. Signed-off-by: Richard Yao <[email protected]>
632b008
to
5c15937
Compare
I made a small cosmetic change to the patch so that "improbable seed" would be printed as |
Requires-spl: refs/pull/372/head Signed-off-by: Richard Yao <[email protected]>
Requires-spl: refs/pull/372/head Signed-off-by: Richard Yao <[email protected]>
Requires-spl: refs/pull/372/head Signed-off-by: Richard Yao <[email protected]>
This has passed the buildbot. I only had to kick it 3 times before the spurious failures stopped. :/ |
This LGTM, @dweeezil if you're also OK with this change I'll get it merged now that it's passing all the testing. |
LGTM, including the rationale explained in the commit comment. |
Perf profiling of dd on a zvol revealed that my system spent 3.16% of
its time in random_get_pseudo_bytes(). No SPL consumers need
cryptographic strength entropy, so we can reduce our overhead by
changing the implementation to utilize a fast PRNG.
The Linux kernel did not export a suitable PRNG function until it
exported get_random_int() in Linux 3.10. While we could implement an
autotools check so that we use it when it is available or even try to
access the symbol on older kernels where it is not exported using the
fact that it is exported on newer ones as justification, we can instead
implement our own pseudo-random data generator. For this purpose, I have
written one based on a pseudo-random number generator proposed in a
paper by Sebastiano Vigna that itself was based on work by the late
George Marsaglia.
http://arxiv.org/pdf/1402.6246v2.pdf
Profiling the same benchmark with a variant of this patch that did not
disable interrupts showed that time spent in random_get_pseudo_bytes()
dropped to 0.06%. That is a factor of 50 improvement.