Grow RawVec to fill the allocator bins tighter [please bench it] #101341
Conversation
(rust-highfive has picked a reviewer for you, use r? to override)
(force-pushed from b114349 to ee15ce6)
I'm curious to see benchmarks of this :)
(force-pushed from ee15ce6 to f07a5cc)
meta-comment: Maybe it'd be worth adjusting this by the size of the allocation too? For an allocation of a few dozen megabytes, it's not so obvious the rounding up to a power of two is necessary, as they might just get pages from the OS for the actual size, rather than bucketing.
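The size-dependent cutoff suggested here could be sketched as follows (a minimal sketch; `LARGE_ALLOC_BYTES` and `should_round_to_bin` are illustrative names and an assumed threshold, not anything from the PR):

```rust
// Hypothetical cutoff: skip bin-rounding for allocations big enough that
// the allocator likely serves them with whole pages from the OS anyway.
const LARGE_ALLOC_BYTES: usize = 4 * 1024 * 1024; // assumed 4 MiB cutoff

fn should_round_to_bin(alloc_size: usize) -> bool {
    alloc_size < LARGE_ALLOC_BYTES
}

fn main() {
    assert!(should_round_to_bin(1024)); // small: round up to a bin size
    assert!(!should_round_to_bin(64 * 1024 * 1024)); // tens of MiB: don't bother
    println!("ok");
}
```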
library/alloc/src/raw_vec.rs (outdated)

    // FIXME: maybe two pointers to be on the safe side? It could potentially be platform-dependent.
    let aligned_alloc_size = bin_size.saturating_sub(mem::align_of::<usize>());

    // Ignore cases where the value is actually smaller, due to subtracting fixed overhead
This makes me notice that the current approach is basically

    let new_cap = max![ len + additional, cap * 2, MIN_CAP ];

so I wonder if it might make sense to just make this another thing in that list, or even replace the cap * 2.
sizeof(usize).div_ceil(sizeof(T)) could be a constant like MIN_NON_ZERO_CAP is, so it's easy to subtract off even in length units instead of needing to switch to byte units here.
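The growth rule being discussed can be sketched like this (illustrative only: `MIN_NON_ZERO_CAP` here is a stand-in with an arbitrary value, and `grow_amortized` is not the actual RawVec code):

```rust
// Stand-in for RawVec's minimum-capacity constant (value is illustrative).
const MIN_NON_ZERO_CAP: usize = 8;

// new_cap = max(len + additional, cap * 2, MIN_NON_ZERO_CAP)
fn grow_amortized(len: usize, additional: usize, cap: usize) -> usize {
    let required = len.checked_add(additional).expect("capacity overflow");
    required.max(cap * 2).max(MIN_NON_ZERO_CAP)
}

fn main() {
    assert_eq!(grow_amortized(4, 1, 4), 8); // doubling dominates
    assert_eq!(grow_amortized(0, 1, 0), 8); // minimum capacity dominates
    assert_eq!(grow_amortized(100, 200, 100), 300); // requested size dominates
    println!("ok");
}
```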
> instead of needing to switch to byte units here

The byte units are used because what we're targeting is a byte size of (just under) a power of two. We can const-provide "usize per T" and "T per usize", but I don't see how to calculate (size_of::<T>() * cap).next_power_of_two() / size_of::<T>() without scaling cap into byte units instead of T units.
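The byte-unit calculation being described can be sketched as follows (a hedged illustration with an assumed name `bin_fitted_cap`, not the PR's code; it ignores allocator overhead and the zero-sized-type case):

```rust
use core::mem::size_of;

// Scale the capacity into bytes, round up to a power-of-two bin, and
// convert back to units of T. This can't be done purely in T units when
// size_of::<T>() is not itself a power of two.
fn bin_fitted_cap<T>(cap: usize) -> usize {
    let bytes = size_of::<T>() * cap;
    bytes.next_power_of_two() / size_of::<T>()
}

fn main() {
    // 24-byte elements: 3 * 24 = 72 bytes -> 128-byte bin -> 5 elements fit.
    assert_eq!(bin_fitted_cap::<[u64; 3]>(3), 5);
    // Power-of-two element sizes round-trip exactly: 3 * 8 = 24 -> 32 -> 4.
    assert_eq!(bin_fitted_cap::<u64>(3), 4);
    println!("ok");
}
```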
Oh, right! Good point.
library/alloc/src/raw_vec.rs (outdated)

    let bin_size = usize::MAX >> (alloc_size.leading_zeros() - 1);

    // Leave some room for allocators that add fixed overhead (usually one pointer-size)
    // FIXME: maybe two pointers to be on the safe side? It could potentially be platform-dependent.
Hmm, this is a good point. This change would massively backfire if it underestimates the overhead, as the underlying allocator would then need to massively overallocate basically every time, instead of only sometimes, as you'd get today depending on the original capacity.
So maybe this should be very conservative in the space it leaves for the allocator overhead?
Thinking about it again, though, is leaving any space at all pessimizing good allocators in favour of mediocre ones?
One big advantage of Rust's allocator APIs is that deallocation passes the size. So, for example, if an allocator allocates dedicated (maybe-large) pages for 32-byte allocations, it doesn't necessarily need any per-allocation overhead to do that, as it can tell by the deallocation size that it's a page of 32-byte allocations, and could have per-page overhead instead.
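The size-at-deallocation point can be illustrated with Rust's stable GlobalAlloc trait (a sketch under assumptions: `SizeClassed` is a hypothetical allocator that just forwards to System; a real one would route frees by size class):

```rust
use std::alloc::{GlobalAlloc, Layout, System};

// Hypothetical size-classed allocator: because dealloc receives the Layout,
// it could route a free by size alone, keeping per-page rather than
// per-allocation metadata. Here we only forward to System to keep the
// sketch runnable.
struct SizeClassed;

unsafe impl GlobalAlloc for SizeClassed {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // A real implementation would pick a size-class page here.
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // `layout.size()` is known at free time -- unlike C's `free(ptr)`,
        // no per-allocation header lookup is needed to recover it.
        System.dealloc(ptr, layout)
    }
}

fn main() {
    let layout = Layout::from_size_align(32, 8).unwrap();
    unsafe {
        let p = SizeClassed.alloc(layout);
        assert!(!p.is_null());
        SizeClassed.dealloc(p, layout);
    }
    println!("ok");
}
```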
Generally, I think that any allocator that uses more than just a single pointer is meh, and is not worth pessimizing for.
But yeah... unfortunately this optimization is assuming implementation details that are invisible. Hopefully benchmarks will tell if any platform degraded, and we could always use conditional compilation or something to tweak it.
There's another interesting case to consider: the Windows System allocator has to implement alignment on top of an allocation API that does not support custom alignment. This means it allocates align extra bytes.
Yes, Rust's allocation API allows the use of allocators which don't have to store metadata in allocations, but in practice, most allocators will want to support the C allocation API.
> alignment on top of allocation which does not support custom alignment. This means it allocates align extra bytes.
@CAD97 I don't understand.
rust/library/std/src/sys/windows/alloc.rs, lines 129 to 178 in dec6894:

    // Allocate a block of optionally zeroed memory for a given `layout`.
    // SAFETY: Returns a pointer satisfying the guarantees of `System` about allocated pointers,
    // or null if the operation fails. If this returns non-null `HEAP` will have been successfully
    // initialized.
    #[inline]
    unsafe fn allocate(layout: Layout, zeroed: bool) -> *mut u8 {
        let heap = init_or_get_process_heap();
        if heap.is_null() {
            // Allocation has failed, could not get the current process heap.
            return ptr::null_mut();
        }
        // Allocated memory will be either zeroed or uninitialized.
        let flags = if zeroed { HEAP_ZERO_MEMORY } else { 0 };
        if layout.align() <= MIN_ALIGN {
            // SAFETY: `heap` is a non-null handle returned by `GetProcessHeap`.
            // The returned pointer points to the start of an allocated block.
            unsafe { HeapAlloc(heap, flags, layout.size()) as *mut u8 }
        } else {
            // Allocate extra padding in order to be able to satisfy the alignment.
            let total = layout.align() + layout.size();
            // SAFETY: `heap` is a non-null handle returned by `GetProcessHeap`.
            let ptr = unsafe { HeapAlloc(heap, flags, total) as *mut u8 };
            if ptr.is_null() {
                // Allocation has failed.
                return ptr::null_mut();
            }
            // Create a correctly aligned pointer offset from the start of the allocated block,
            // and write a header before it.
            let offset = layout.align() - (ptr.addr() & (layout.align() - 1));
            // SAFETY: `MIN_ALIGN` <= `offset` <= `layout.align()` and the size of the allocated
            // block is `layout.align() + layout.size()`. `aligned` will thus be a correctly aligned
            // pointer inside the allocated block with at least `layout.size()` bytes after it and at
            // least `MIN_ALIGN` bytes of padding before it.
            let aligned = unsafe { ptr.add(offset) };
            // SAFETY: Because the size and alignment of a header is <= `MIN_ALIGN` and `aligned`
            // is aligned to at least `MIN_ALIGN` and has at least `MIN_ALIGN` bytes of padding before
            // it, it is safe to write a header directly before it.
            unsafe { ptr::write((aligned as *mut Header).sub(1), Header(ptr)) };
            // SAFETY: The returned pointer does not point to the start of an allocated block,
            // but there is a header readable directly before it containing the location of the start
            // of the block.
            aligned
        }
    }
Roughly:
fn win32::HeapAlloc(hHeap: win32::HANDLE, dwFlags: win32::DWORD, dwBytes: win32::SIZE_T) -> win32::LPVOID;
fn win32::HeapFree (hHeap: win32::HANDLE, dwFlags: win32::DWORD, lpMem: win32::LPVOID);
fn HeapAlignedAlloc(heap: win32::HANDLE, flags: win32::DWORD, size: usize, align: usize) -> *mut u8 {
if align <= win32::MEMORY_ALLOCATION_ALIGNMENT {
return win32::HeapAlloc(heap, flags, size) as *mut u8;
}
let ptr = win32::HeapAlloc(heap, flags, size + align);
let aligned = (ptr as *mut u8).add(1).align_to(align);
*(aligned as *mut win32::LPVOID).sub(1) = ptr;
aligned
}
fn HeapAlignedFree(heap: win32::HANDLE, flags: win32::DWORD, ptr: *mut u8, size: usize, align: usize) {
if align <= win32::MEMORY_ALLOCATION_ALIGNMENT {
win32::HeapFree(heap, flags, ptr as _)
} else {
win32::HeapFree(heap, flags, *(ptr as *mut win32::LPVOID).sub(1))
}
}
Windows doesn't provide a way to HeapAlloc at anything other than MEMORY_ALLOCATION_ALIGNMENT. (Windows does offer _aligned_malloc, _aligned_realloc, _aligned_recalloc, and _aligned_free, but strangely not _aligned_calloc.) std::alloc::System is documented to use HeapAlloc on Windows.
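The over-allocate-and-stash-a-header technique in the listing above boils down to this pointer arithmetic (a portable sketch with an illustrative name `align_with_header`, not the std implementation):

```rust
// Given a raw block address and a requested alignment (a power of two),
// compute the aligned address handed to the user. The offset is always
// in 1..=align, which guarantees room before `aligned` for a one-pointer
// header storing the original block address.
fn align_with_header(raw: usize, align: usize) -> (usize, usize) {
    let offset = align - (raw & (align - 1));
    (raw + offset, offset)
}

fn main() {
    let (aligned, offset) = align_with_header(0x1008, 64);
    assert_eq!(aligned % 64, 0); // user pointer is aligned
    assert!((1..=64).contains(&offset)); // header always fits before it
    println!("ok");
}
```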
> There's another interesting case to consider: the Windows System allocator has to implement alignment on top of allocation which does not support custom alignment. This means it allocates align extra bytes.

Wouldn't a memory allocator request much, much larger chunks of memory from the system at once, and then use them for fine-grained user allocations individually? So any system-level alignment issues would generally have much smaller overhead?
An actual new allocator implementation, yes, typically. I'm specifically referring to System or other Rust wrappers of allocators which don't provide an aligned allocation interface, which need to do this technique on a per-allocation basis.
While this may be only the Windows system allocator in practice, this is still the default allocator that most Rust programs compiled for Windows will be using, so it's an important case to consider.
I kind of understand, but not enough to come up with a solution. Worst case we could just have a different implementation for Windows maybe?
I got excited about this change, but I'm running out of time today, and will be busy over the weekend. I'm also not very familiar with working on changes here and I'm hitting some blockers for testing locally (other than compiling). If anyone feels like working on it in the meantime (or taking over altogether), please feel free. I'm very curious if this theory leads to any noticeable improvements and just wanted to show something for it. @scottmcm you seem to have all the good ideas already. :)
All types are aligned to a power of two; this is a hard requirement of Rust types. It's worth noting that allocations through Rust's Allocator API can report back the actual size provided. However, the current common allocators (system and jemalloc) don't actually return this information, because it's typically extra overhead to get that dynamic information from the allocator, if it's available at all.

It's also a bit more complicated than just requesting a well-fit size, because the allocator has to accept any layout for freeing between the requested and reported provided layouts. If the underlying allocator implementation doesn't support this, even if it would benefit from better bin fitting, the Rust allocator wrapper can't do such adjustments.
(force-pushed from f07a5cc to 420b7fd)
(title changed from "Grow RawVec to fill the allocator bins tighter." to "Grow RawVec to fill the allocator bins tighter")
(force-pushed from 8692d34 to 0969b20)
How will I bench it once I finally get it to even work? It's OK to point me to docs; I just failed to find them.
A usual rust-timer run is probably sufficient to see the performance impact. Whoever reviews will kick that off.
(force-pushed from 0969b20 to f7349ac)
(force-pushed from f7349ac to edea79e)
OK. The basic version passed; now I did some tweaks on top of just adding up the sizing (see commit message).
(force-pushed from 472fa44 to 0e47358)
(force-pushed from 96da587 to 4202060)
I spoke too soon, and even the basic tests didn't pass yet. Took me another hour to fix some corner cases. We'll see if everything is green now. The pain is real. Also, the code accumulated a lot of checks and conditions, so I'm growing pessimistic.
@Mark-Simulacrum Done, CI green, sanity test is there. Not sure how it works, and if I ruined this one run. Please let's bench it. In the meantime I'm trying to get the setup you linked to working, but I have some issues as I'm on NixOS.
@bors try

⌛ Trying commit 4202060 with merge c7479451ba3d0891c63ba17d918fad2fc55da9ba...

☀️ Try build successful - checks-actions

Queued c7479451ba3d0891c63ba17d918fad2fc55da9ba with parent e7cdd4c, future comparison URL.
Finished benchmarking commit (c7479451ba3d0891c63ba17d918fad2fc55da9ba): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never

Instruction count: this is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage): a less reliable metric that may be of interest but was not used to determine the overall result.

Cycles: a less reliable metric that may be of interest but was not used to determine the overall result.
Hey, the RSS metrics look OK, if I'm reading them right. If only I could optimize that code, maybe it would be a win... One thing I wonder now is if it's not too "tight": it will sometimes make very small allocations now, just enough to bump to the bin size, which can be really close. Some minimum increase might help.
OK, I pushed a simple change that should make it allocate more at once (on average). I'm still setting up things to be able to bench locally. @Mark-Simulacrum Please try again. Thank you.
@bors try @rust-timer queue

⌛ Trying commit d5690e4 with merge 2c45840544b665d8c8b835334cabfd27566ea36e...

☀️ Try build successful - checks-actions
Is it still in the queue, or did something bad happen to it?

Still in queue; it was stuck for some time today. You can see the status at perf.rust-lang.org/status.html.
One important observation: this optimization doesn't make much sense for bigger allocations.

However, I figured out how to bench it locally, and the results are not great. Generally a small slowdown across the board, with no trend of any memory savings. A handful of binaries seem to benefit (-8%), but some other handful suffer (+10%). I suspect that a lot of binaries actually need exactly 2^n-sized buffers, in which case my change just gets in the way by over-sizing their capacity so that, with overhead, it fits the next power-of-two-sized bin.
A warning: I have spent quite some time working on the
It was a fun time, but I think it's time to give up. I pushed the best I was able to get: a slight memory decrease in some cases (the majority), a slight increase in some others (a minority, but seemingly some of the most "significant" ones). On top of that, the instruction/cycle counts were slightly negative across the board. It seems rounding the capacity up to the bin size doesn't pay off overall.

On top of it, calculating all the sizes etc. is additional overhead, and it has to deal with overflow-checked math.
AFAIK, most if not all memory allocators use some combination of evenly sized bins. Typically these bins are power-of-two sized, sometimes supplemented with some 0b1100xxx... bins as well. Most of the time, every allocation is also prefixed with a pointer to some per-allocation metadata, adding a fixed overhead to every requested allocation.

This can be observed with: https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=d091d5b6fc0ea89e5e07438bdb740ac3

For let s = 24 the pointers are 32 bytes apart, but increasing s to 25 makes them 64 bytes apart, as the allocation falls into the next bin size up.
This made me think that the default way of growing collections in Rust (doubling their capacity) is degenerate in the most common cases. Most types are aligned to a power-of-two size, so doubling the capacity makes almost every allocation waste almost 50% of the space actually allocated for it.
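The waste argument can be checked with a little arithmetic (a sketch under assumptions: power-of-two bins and an 8-byte per-allocation header, as hypothesized above):

```rust
// If a Vec<u64> doubles its capacity to 32 elements, the request is
// 256 bytes. With an assumed 8-byte header, the allocator needs 264 bytes
// and must hand out the next power-of-two bin: 512 bytes, of which half
// goes unused by the payload.
fn bin_for(total: usize) -> usize {
    total.next_power_of_two()
}

fn main() {
    let request = 32 * 8; // 32 u64 elements = 256 bytes
    let header = 8; // assumed fixed allocator overhead
    let bin = bin_for(request + header);
    assert_eq!(bin, 512);
    assert_eq!(bin - request, 256); // ~50% of the bin is unused by the payload
    println!("ok");
}
```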
Edit: After a bit more experimenting: for larger sizes (> 64), the allocator returns pointers with 16 bytes of extra space between them. Maybe these are some skip lists, maybe something else. I crossed out the naive conclusion, which only holds for small allocations.
However, the point stands. Whatever the allocator is doing is most probably not optimal for us. By trying to adjust the capacity toward the optimal value, we can opportunistically save memory and make the allocator's work easier. By growing the capacity so that it fills the bin well, we can possibly avoid some needless allocations and lower the memory consumption.
TODOs:
- the growth is not simply 2*cap anymore