-
Notifications
You must be signed in to change notification settings - Fork 679
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
30x performance improvement in with_nix_path
#1655
Conversation
src/lib.rs
Outdated
if self.len() >= PATH_MAX as usize { | ||
return Err(Errno::ENAMETOOLONG); | ||
} | ||
|
||
#[allow(clippy::uninit_assumed_init)] // u8s do nothing when dropped | ||
let mut buf: [u8; PATH_MAX as usize] = unsafe { MaybeUninit::uninit().assume_init() }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Rust documentation doesn't make it immediately clear that this is sound upon a quick read (and actually seems to imply it isn't), so I'll request more justification in your comment.
From https://doc.rust-lang.org/nightly/core/mem/union.MaybeUninit.html#safety:
It is up to the caller to guarantee that the MaybeUninit really is in an initialized state. Calling this (
.assume_init()
) when the content is not yet fully initialized causes immediate undefined behavior.
It seems like it'd be required to call assume_init()
after populating buf
, but that seems tricky to do without falling back to copy_nonoverlapping
/CStr::from_ptr
, which has performance implications for large paths (see #1583).
The nomicon has some helpful things to say, but even that example demonstrates the use of an array of MaybeUninit
types followed by a transmute
, rather than going directly to the desired type.
TL;DR: This ultimately seems reasonable to me, but I'd like to see a more in-depth justification and also get @asomers thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RalfJung hope you don't mind the ping. :) Can you comment on if creating an assume_init
array of u8
s is legal so long as you write them before reading? And if not, what's the prefered way to let people perform write operations on a MaybeUninit
slice when there's no API that supports it? I've actually had similar use cases before where there appears to be no solution besides marking the whole array as assume_init
. Here's an example where I want to use a RNG's fill_bytes
method:
let mut buf: [u8; 4096] = unsafe { MaybeUninit::uninit().assume_init() };
random.fill_bytes(&mut buf[0..used]);
file.write_all(&buf[0..used])?;
@rtzoeller Agreed, the assume_init
requirements are confusing and seem to contradict each other. For example, they say "The assume_init
is safe because the type we are claiming to have initialized here is a bunch of MaybeUninit
s, which do not require initialization." A u8
also doesn't require initialization, so does that make it too? Hopefully @RalfJung can answer. Otherwise, I'll continue digging into the commit history to see if I can find something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SUPERCILEX can you try the following implementation and report back on performance? It much more directly mirrors the example usage of MaybeUninit
, and in my (admittedly very synthetic) benchmarks seems to perform similarly to your original implementation (which surprised me).
fn with_nix_path<T, F>(&self, f: F) -> Result<T>
where
F: FnOnce(&CStr) -> T,
{
if self.len() >= PATH_MAX as usize {
return Err(Errno::ENAMETOOLONG);
}
// MaybeUninit usage modelled after
// https://doc.rust-lang.org/stable/std/mem/union.MaybeUninit.html#initializing-an-array-element-by-element
let mut buf: [MaybeUninit<u8>; PATH_MAX as usize] =
unsafe { MaybeUninit::uninit().assume_init() };
// copy_from_slice() doesn't work for MaybeUninit types
for (ptr, elt) in buf.iter_mut().zip(self) {
ptr.write(*elt);
}
buf[self.len()].write(0); // NUL terminate the string
let buf = unsafe { std::mem::transmute::<_, [u8; PATH_MAX as usize]>(buf) };
match CStr::from_bytes_with_nul(&buf[..=self.len()]) {
Ok(s) => Ok(f(s)),
Err(_) => Err(Errno::EINVAL),
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's definitely not kosher to call assume_init() anything that isn't actually initialized. This is the easy and safe way to implement that function without over-initializing, though it involves a heap allocation. Is it fast enough?
fn with_nix_path<T, F>(&self, f: F) -> Result<T>
where F: FnOnce(&CStr) -> T {
let mut v: Vec<u8> = Vec::from(self);
v.push(0);
match CStr::from_bytes_with_nul(&v) {
Ok(s) => Ok(f(s)),
Err(_) => Err(Errno::EINVAL)
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've spent a gazillion hours reading what feels like everything under the sun about uninitialized memory. 😁 Here's my executive summary:
- Simply creating a variable with uninitialized memory for a type where not all bit patterns are valid is UB.
bool
is a good example: in the Rust abstract machine,bool
can only ever be 0 or 1, so simply the act of declaring an uninitializedbool
results in an invalid Rust program. This means the optimizer is free to assume you cannot possibly be in the current state (since a valid Rust program would never be) and therefore you can delete all the code after that variable declaration, insert a panic, etc. - Creating a variable with uninitialized memory for a type where every bit pattern is valid is not UB. In LLVM IR, it will be represented as an
undef
(which is below apoison
which is below UB). That said, it means that the optimizer can pick whatever value makes it happy and does not need to be consistent about its choices. Here's a slightly simplified version of a popular example:Run it: https://rust.godbolt.org/z/cPEa5dnxs#![feature(bench_black_box)] use std::mem; fn always_returns_true(b: u8) -> bool { std::hint::black_box(b == 5) || std::hint::black_box(b != 5) } pub fn main() { let x: u8 = unsafe { mem::MaybeUninit::uninit().assume_init() }; println!("{}", always_returns_true(x)); }
So where does that leave us? Well theoretically speaking, my code has one piece of UB (more accurately, unspecified behavior) and one correct undef
. First, it creates a reference (buf[..self.len()]
) that points to uninitialized memory, but references must always point to initialized memory. Therefore, not a valid Rust program. Second, we have a bunch of u8's that are undef
. However, since we never read them before writing, we cannot observe their undef
behavior so we don't care.
Now what about from a pragmatic perspective? Neither cases have been specified: https://github.com/rust-lang/unsafe-code-guidelines/blob/master/active_discussion/validity.md. So really, people are already using references "incorrectly" (even in the stdlib IIRC) so it's unlikely that this model will break anytime soon.
That said, I figured out a way to rewrite this with no UB that I can tell. We use an initialized raw pointer. Raw pointers can be whatever the heck we want, it's only when they are dereferenced that the data they point to must be valid. The only part I'm still a little iffy on is buf.assume_init_ref()[..=self.len()]
: it creates a reference to an array that is partially initialized, then immediately creates another reference to the part of the array that is initialized. I think that's fine, but if we really wanted to go crazy, we could stick with raw pointers and create the slice directly: slice::from_raw_parts(buf, self.len() + 1)
(though this is ~1M instructions slower for whatever reason).
With the new implementation, we get even faster which is cool:
20,704,687 (15.32%) 4,384,493 (14.41%) 2,744,994 (13.52%) 134 ( 1.23%) 13 ( 0.04%) 265 ( 1.72%) 8 ( 0.15%) 0 45 ( 0.37%) 2,441,283 (12.30%) 1,160 ( 0.23%) 609,972 (26.51%) 4 ( 0.04%) => /home/asaveau/Desktop/nix/src/lib.rs:<[u8] as nix::NixPath>::with_nix_path (152,493x)
@rtzoeller your implementation benchmarks similarly in terms of time taken, but suffers significantly on the instruction count side:
180,403,130 (61.18%) 56,855,699 (68.57%) 39,502,833 (69.22%) 590 ( 4.24%) 5,281 (12.19%) 7,548 (32.57%) 12 ( 0.22%) 701 (14.96%) 1,069 ( 8.01%) 18,761,972 (51.87%) 305,759 (37.73%) 6,101,922 (78.29%) 1,513 (14.69%) => /home/asaveau/Desktop/nix/src/lib.rs:<[u8] as nix::NixPath>::with_nix_path (152,496x)
@asomers I'm very strongly opposed to creating an allocation per syscall. The whole point of this PR is to be completely zero-cost. Interestingly enough, allocating is not the worst on the instruction count front:
48,197,762 (29.65%) 13,229,886 (33.70%) 8,234,742 (31.94%) 135 ( 1.23%) 34 ( 0.11%) 9 ( 0.06%) 7 ( 0.13%) 0 0 6,253,608 (26.44%) 2,324 ( 0.46%) 1,524,930 (47.41%) 15 ( 0.20%) => /home/asaveau/Desktop/nix/src/lib.rs:<[u8] as nix::NixPath>::with_nix_path (152,493x)
Though when I ran the benchmarks from #1583, it performed significantly worse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating a variable with uninitialized memory for a type where every bit pattern is valid is not UB.
That is not correct -- the Rust Reference says this is UB, too. Indeed, undef
is not just "some unknown bit pattern", it is a value that is clearly distinct from any single 'regular' bit pattern, and u8
is well-defined for any regular bit pattern but not for undef
.
let mut buf: [u8; 4096] = unsafe { MaybeUninit::uninit().assume_init() };
is hence UB.
Here's an example where I want to use a RNG's fill_bytes method:
The current Write
trait does not adequately support writing into an uninitialized buffer; this RFC describes the planned changes to make it compatible. Other, similar APIs should probably use the same pattern.
When interfacing with C code that does in-place initialization, the suggested pattern is to use raw pointers only until data is actually initialized. (I currently can't dig into the details of what this particular PR does so I hope at least some of these remarks are useful... if you have a self-contained question then I can also try to answer that :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is not correct -- the Rust Reference says this is UB, too.
Isn't that still unspecified though? Or at least, I thought the validity discussions were about deciding whether or not it should be considered UB for Rust. From the LLVM perspective, an undef
u8 is defined (as far as I understand). For example, undefu8 | 3 = uuuuuu11
. Those u
s can be whatever and change whenever, but we know that no matter what LLVM chooses for undef, the lower two bits must be 1. That's what I got from these slides.
this RFC describes the planned changes to make it compatible.
That's freakin' awesome! Glad to see progress. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is indeed no UB on the LLVM level here, but you are not writing LLVM IR -- you are writing Rust code, so the relevant rules for UB are the ones for Rust, not LLVM.
And for now, Rust is pretty clear in this regard. It is true that there is discussion here for what the final rules should be (rust-lang/unsafe-code-guidelines#71), but the only reason we can have that discussion is that currently this is declared UB -- so we can allow more code later, depending on the outcome of the discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohhhh, that makes perfect sense. Thank you for all the feedback!
with_nix_path
with_nix_path
@SUPERCILEX the changes look good to me! Before merging, a few procedural things need to be addressed:
|
Sweet!
|
- Update nix with my PRs: nix-rust/nix#1656, nix-rust/nix#1655 - Use raw pointer manipulation to extract name cache values - Use a faster rand implementation with better statistical guarantees - Tune the task queue capacity and consumption rate and get rid of copies in drain (by using a VecDeque) Signed-off-by: Alex Saveau <[email protected]>
@SUPERCILEX apologies for the delay in getting back to this. My preference would be to avoid submitting bulk formatting changes for now - I'd prefer if those could wait until #1657 is submitted, which I'd like to hold off on for now. We use bors-ng for merging and validating PRs, and do not have it configured to squash changes. Consequently these changes will need to be manually squashed. I am not particularly worried about losing the history, as most of the intermediate changes were unsound. |
Signed-off-by: Alex Saveau <[email protected]>
Makes sense, done! |
bors r+ |
1655: 30x performance improvement in `with_nix_path` r=rtzoeller a=SUPERCILEX I've been digging into CPU instructions counts and found that `nix` accounted for an eye-watering 85% of my program's instruction counts (yes, I do a lot of I/O, I know). The fix is simple: don't initialize the stack memory since we're just going to overwrite it anyway. > Note: I also ran rustfmt in a separate commit, not sure if that's ok. ### Before ``` 650,398,225 (85.05%) 5,451,056 (17.31%) 627,714,969 (97.28%) 60 ( 0.54%) 22 ( 0.07%) 3,997 (20.98%) 10 ( 0.18%) 0 752 ( 5.98%) 627,716,244 (97.30%) 267,333 (34.62%) 914,814 (35.11%) 105 ( 0.73%) => /home/asaveau/Desktop/nix/src/lib.rs:<[u8] as nix::NixPath>::with_nix_path (152,469x) ``` ``` 1,677,159 ( 0.22%) 0 762,345 ( 0.12%) 14 ( 0.13%) 0 0 2 ( 0.04%) . . . . . . fn with_nix_path<T, F>(&self, f: F) -> Result<T> 2,287,035 ( 0.30%) 304,938 ( 0.97%) 152,469 ( 0.02%) 5 ( 0.04%) 0 0 0 0 0 304,938 ( 0.05%) . . . => ???:__rust_probestack (152,469x) . . . . . . . . . . . . . where . . . . . . . . . . . . . F: FnOnce(&CStr) -> T, . . . . . . . . . . . . . { 457,407 ( 0.06%) 152,469 ( 0.48%) 152,469 ( 0.02%) 0 0 93 ( 0.49%) 0 0 16 ( 0.13%) 0 0 152,469 ( 5.85%) 97 ( 0.67%) let mut buf = [0u8; PATH_MAX as usize]; 627,104,997 (82.00%) 304,938 ( 0.97%) 624,513,024 (96.78%) 1 ( 0.01%) 0 3,814 (20.02%) 1 ( 0.02%) 0 720 ( 5.72%) 625,122,900 (96.90%) 152,495 (19.75%) . . => ./string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms (152,469x) . . . . . . . . . . . . . 304,938 ( 0.04%) 0 0 0 0 0 0 0 0 152,469 ( 0.02%) 2 ( 0.00%) . . if self.len() >= PATH_MAX as usize { . . . . . . . . . . . . . return Err(Errno::ENAMETOOLONG); . . . . . . . . . . . . . } . . . . . . . . . . . . . 1,067,283 ( 0.14%) 152,469 ( 0.48%) 152,469 ( 0.02%) 1 ( 0.01%) 4 ( 0.01%) 0 1 ( 0.02%) 0 0 0 0 152,469 ( 5.85%) 1 ( 0.01%) buf[..self.len()].copy_from_slice(self); 3,202,541 ( 0.42%) 914,814 ( 2.91%) 609,876 ( 0.09%) 1 ( 0.01%) 2 ( 0.01%) 0 0 0 0 458,005 ( 0.07%) 114,562 (14.84%) 152,469 ( 5.85%) . => /rustc/21b4a9cfdcbb1e76f4b36b5c3cfd64d627285093/library/core/src/slice/mod.rs:core::slice::<impl [T]>::copy_from_slice (152,469x) 762,345 ( 0.10%) 304,938 ( 0.97%) 152,469 ( 0.02%) 0 6 ( 0.02%) 0 0 0 0 152,469 ( 0.02%) 6 ( 0.00%) 152,469 ( 5.85%) 1 ( 0.01%) match CStr::from_bytes_with_nul(&buf[..=self.len()]) { 8,350,190 ( 1.09%) 1,181,828 ( 3.75%) 1,067,283 ( 0.17%) 19 ( 0.17%) 2 ( 0.01%) 87 ( 0.46%) 3 ( 0.05%) 0 16 ( 0.13%) 1,220,525 ( 0.19%) 167 ( 0.02%) 152,469 ( 5.85%) 5 ( 0.03%) => /rustc/21b4a9cfdcbb1e76f4b36b5c3cfd64d627285093//library/std/src/ffi/c_str.rs:std::ffi::c_str::CStr::from_bytes_with_nul (152,469x) 609,876 ( 0.08%) 609,876 ( 1.94%) 0 0 1 ( 0.00%) . . . . . . . . Ok(s) => Ok(f(s)), . . . . . . . . . . . . . Err(_) => Err(Errno::EINVAL), . . . . . . . . . . . . . } 914,814 ( 0.12%) 762,345 ( 2.42%) . . . . . . . . . . . } . . . . . . . . . . . . . } ``` ### After ``` 21,462,416 (15.81%) 4,688,455 (15.27%) 2,896,847 (14.17%) 74 ( 0.64%) 11 ( 0.04%) 249 ( 1.58%) 8 ( 0.15%) 0 48 ( 0.38%) 2,593,200 (12.98%) 1,128 ( 0.22%) 762,305 (31.08%) 158 ( 1.32%) => /home/asaveau/Desktop/nix/src/lib.rs:<[u8] as nix::NixPath>::with_nix_path (152,461x) ``` ``` 1,067,227 ( 0.79%) 0 609,844 ( 2.98%) 1 ( 0.01%) 0 0 1 ( 0.02%) . . . . . . fn with_nix_path<T, F>(&self, f: F) -> Result<T> 2,286,915 ( 1.68%) 304,922 ( 0.99%) 152,461 ( 0.75%) 19 ( 0.16%) 0 0 0 0 0 304,922 ( 1.53%) . . . => ???:__rust_probestack (152,461x) . . . . . . . . . . . . . where . . . . . . . . . . . . . F: FnOnce(&CStr) -> T, . . . . . . . . . . . . . { 304,922 ( 0.22%) 0 0 0 0 0 0 0 0 152,461 ( 0.76%) 6 ( 0.00%) . . if self.len() >= PATH_MAX as usize { . . . . . . . . . . . . . return Err(Errno::ENAMETOOLONG); . . . . . . . . . . . . . } . . . . . . . . . . . . . . . . . . . . . . . . . . let mut buf = MaybeUninit::<[u8; PATH_MAX as usize]>::uninit(); . . . . . . . . . . . . . let buf_ptr = buf.as_mut_ptr() as *mut u8; . . . . . . . . . . . . . . . . . . . . . . . . . . unsafe { . . . . . . . . . . . . . ptr::copy_nonoverlapping(self.as_ptr(), buf_ptr, self.len()); . . . . . . . . . . . . . buf_ptr.add(self.len()).write(0); . . . . . . . . . . . . . } . . . . . . . . . . . . . 1,067,227 ( 0.79%) 304,922 ( 0.99%) 152,461 ( 0.75%) 1 ( 0.01%) 3 ( 0.01%) 0 1 ( 0.02%) 0 0 304,922 ( 1.53%) 4 ( 0.00%) 152,461 ( 6.22%) 1 ( 0.01%) match CStr::from_bytes_with_nul(unsafe { slice::from_raw_parts(buf_ptr, self.len() + 1) }) { 8,349,726 ( 6.15%) 1,181,764 ( 3.85%) 1,067,227 ( 5.22%) 18 ( 0.15%) 1 ( 0.00%) 83 ( 0.53%) 2 ( 0.04%) 0 16 ( 0.13%) 1,220,453 ( 6.11%) 158 ( 0.03%) 152,461 ( 6.22%) 4 ( 0.03%) => /rustc/21b4a9cfdcbb1e76f4b36b5c3cfd64d627285093//library/std/src/ffi/c_str.rs:std::ffi::c_str::CStr::from_bytes_with_nul (152,461x) 609,844 ( 0.45%) 609,844 ( 1.99%) . . . . . . . . . . . Ok(s) => Ok(f(s)), . . . . . . . . . . . . . Err(_) => Err(Errno::EINVAL), . . . . . . . . . . . . . } 762,305 ( 0.56%) 609,844 ( 1.99%) 0 1 ( 0.01%) 0 0 1 ( 0.02%) . . . . . . } . . . . . . . . . . . . . } ``` Co-authored-by: Alex Saveau <[email protected]>
Build failed: |
bors retry |
🔒 Permission denied Existing reviewers: click here to make SUPERCILEX a reviewer |
Guessing a retry is needed |
bors retry |
Reduce CString allocations in std as much as possible Currently, every operation involving paths in `fs` allocates memory to hold the path before sending it through the syscall. This PR instead uses a stack allocation (chosen size is somewhat arbitrary) when the path is short before falling back to heap allocations for long paths. Benchmarks show that the stack allocation is ~2x faster for short paths: ``` test sys::unix::fd::tests::bench_heap_path_alloc ... bench: 34 ns/iter (+/- 2) test sys::unix::fd::tests::bench_stack_path_alloc ... bench: 15 ns/iter (+/- 1) ``` For long paths, I couldn't find any measurable difference. --- I'd be surprised if I was the first to think of this, so I didn't fully flush out the PR. If this change is desirable, I'll make use of `run_with_cstr` across all platforms in every fs method (currently just unix open for testing). I also added an `impl From<FromBytesWithNulError>` which is presumably a no-no (or at least needs to be done in another PR). --- Also see nix-rust/nix#1655 with a bunch of discussion where I'm doing something similar.
Reduce CString allocations in std as much as possible Currently, every operation involving paths in `fs` allocates memory to hold the path before sending it through the syscall. This PR instead uses a stack allocation (chosen size is somewhat arbitrary) when the path is short before falling back to heap allocations for long paths. Benchmarks show that the stack allocation is ~2x faster for short paths: ``` test sys::unix::fd::tests::bench_heap_path_alloc ... bench: 34 ns/iter (+/- 2) test sys::unix::fd::tests::bench_stack_path_alloc ... bench: 15 ns/iter (+/- 1) ``` For long paths, I couldn't find any measurable difference. --- I'd be surprised if I was the first to think of this, so I didn't fully flush out the PR. If this change is desirable, I'll make use of `run_with_cstr` across all platforms in every fs method (currently just unix open for testing). I also added an `impl From<FromBytesWithNulError>` which is presumably a no-no (or at least needs to be done in another PR). --- Also see nix-rust/nix#1655 with a bunch of discussion where I'm doing something similar.
Reduce CString allocations in std as much as possible Currently, every operation involving paths in `fs` allocates memory to hold the path before sending it through the syscall. This PR instead uses a stack allocation (chosen size is somewhat arbitrary) when the path is short before falling back to heap allocations for long paths. Benchmarks show that the stack allocation is ~2x faster for short paths: ``` test sys::unix::fd::tests::bench_heap_path_alloc ... bench: 34 ns/iter (+/- 2) test sys::unix::fd::tests::bench_stack_path_alloc ... bench: 15 ns/iter (+/- 1) ``` For long paths, I couldn't find any measurable difference. --- I'd be surprised if I was the first to think of this, so I didn't fully flush out the PR. If this change is desirable, I'll make use of `run_with_cstr` across all platforms in every fs method (currently just unix open for testing). I also added an `impl From<FromBytesWithNulError>` which is presumably a no-no (or at least needs to be done in another PR). --- Also see nix-rust/nix#1655 with a bunch of discussion where I'm doing something similar.
I've been digging into CPU instructions counts and found that
nix
accounted for an eye-watering 85% of my program's instruction counts (yes, I do a lot of I/O, I know).The fix is simple: don't initialize the stack memory since we're just going to overwrite it anyway.
Before
After