
Grow RawVec to fill the allocator bins tighter [please bench it] #101341

Closed
wants to merge 10 commits

Conversation


@dpc dpc commented Sep 2, 2022

AFAIK, most if not all memory allocators use some combination
of evenly sized bins. Typically these bins are power-of-two sized,
sometimes supplemented with some 0b1100xxx... bins as well.

Most of the time every allocation is also prefixed with a pointer to some
per-allocation metadata, adding a fixed overhead to every requested allocation.

This can be observed with:

fn main() {
    let s = 24;
    let v1: Vec<u8> = Vec::with_capacity(s);
    let v2: Vec<u8> = Vec::with_capacity(s);
    let v3: Vec<u8> = Vec::with_capacity(s);

    println!("{:?}", v1.as_ptr());
    println!("{:?}", v2.as_ptr());
    println!("{:?}", v3.as_ptr());
}

https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=d091d5b6fc0ea89e5e07438bdb740ac3

For let s = 24 the pointers are 32 bytes apart, but increasing s to 25 makes
them 64 bytes apart, as the allocation falls into the next bin up.

This made me think that the default way of growing collections in Rust
(doubling their capacity) is degenerate in the most common cases. Most types
are aligned to a power-of-two size, so doubling the
capacity makes almost every allocation waste almost 50% of the actually
allocated space.

Edit: After a bit more experimenting: for larger sizes (> 64),
the allocator returns pointers with 16 bytes of extra space between them.
Maybe these are some skip lists, maybe something else. I crossed
out the naive conclusion, which holds only for small allocations.
However, the point stands: whatever scheme the allocator is using, the
requested size is most probably not optimal. By growing the capacity to the
optimal value, we can opportunistically save memory and make the allocator's work easier.

By growing the capacity by trying to fill the bin well, we can possibly
avoid some needless allocations and lower the memory consumption.
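The proposed growth rule can be sketched roughly like this (a hypothetical standalone version for illustration, not the actual RawVec patch; the one-pointer header size is an assumption, and the numbers assume a 64-bit target):

```rust
use std::mem;

// Hypothetical sketch of the idea: instead of doubling the capacity,
// round the byte size up to a power of two, then leave room for an
// assumed pointer-sized allocator header before converting back to a
// capacity in elements. Not the actual RawVec code from this PR.
fn grow_to_bin<T>(required_cap: usize) -> usize {
    let elem = mem::size_of::<T>().max(1);
    let bytes = required_cap * elem;
    // Smallest power-of-two bin that holds the request.
    let bin = bytes.next_power_of_two();
    // Assumed fixed per-allocation overhead: one usize.
    let usable = bin.saturating_sub(mem::size_of::<usize>());
    // Never return less than what was asked for.
    (usable / elem).max(required_cap)
}

fn main() {
    // 20 bytes requested: the 32-byte bin minus an 8-byte header leaves 24.
    println!("{}", grow_to_bin::<u8>(20));
}
```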

TODOs:

  • get to compile & pass tests
  • do initial bench
  • don't do 2*cap anymore
  • put a cap on how big of an allocation this applies to(?); for bigger sizes the allocator might be doing something else than bins; what cutoff to use, though?

@rustbot rustbot added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Sep 2, 2022
@rust-highfive (Collaborator)

r? @Mark-Simulacrum

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 2, 2022

@dpc dpc force-pushed the grow-to-bin-size branch from b114349 to ee15ce6 Compare September 2, 2022 20:32
@leonardo-m

I'm curious to see benchmarks of this :)


@dpc dpc force-pushed the grow-to-bin-size branch from ee15ce6 to f07a5cc Compare September 2, 2022 21:25
@scottmcm scottmcm (Member) left a comment

meta-comment: Maybe it'd be worth adjusting this by the size of the allocation too? For an allocation of a few dozen megabytes, it's not so obvious the rounding up to a power of two is necessary, as they might just get pages from the OS for the actual size, rather than bucketing.

library/alloc/src/raw_vec.rs
// FIXME: maybe two pointers to be on the safe side? It could potentially be platform-dependent.
let aligned_alloc_size = bin_size.saturating_sub(mem::align_of::<usize>());

// Ignore cases where the value is actually smaller, due to subtracting the fixed overhead
Member:

This makes me notice that the current approach is basically

let new_cap = max![ len + additional, cap * 2, MIN_CAP ];

so I wonder if it might make sense to just make this another thing in that list, or even replacing the cap * 2.

sizeof(usize).div_ceil(sizeof(T)) could be a constant like MIN_NON_ZERO_CAP is, so it's easy to subtract off even in length units instead of needing to switch to byte units here.
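The growth rule described above, sketched as plain Rust (the MIN_NON_ZERO_CAP value here is illustrative only; the real constant in raw_vec.rs depends on size_of::<T>()):

```rust
// Sketch of the existing amortized growth rule: take the maximum of the
// required length, double the current capacity, and a minimum capacity.
// The minimum used here is illustrative, not the exact library constant.
const MIN_NON_ZERO_CAP: usize = 8;

fn amortized_cap(len: usize, additional: usize, cap: usize) -> usize {
    (len + additional).max(cap * 2).max(MIN_NON_ZERO_CAP)
}

fn main() {
    // Pushing one element onto a full 4-element vec doubles to 8.
    println!("{}", amortized_cap(4, 1, 4));
}
```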

Contributor:

instead of needing to switch to byte units here

The byte units are used because what we're targeting is a byte size of (just under) a power of two. We can const provide "usize per T" and "T per usize", but I don't see how to calculate (size_of::<T>() * cap).next_power_of_two() / size_of::<T>() without scaling cap into byte units instead of T units.
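The calculation in question, as a standalone sketch (a hypothetical helper showing why the round-up has to go through byte units):

```rust
use std::mem;

// The target is a power-of-two *byte* size, so the capacity has to be
// scaled into bytes, rounded, and scaled back; no per-T constant can
// express next_power_of_two in element units. Hypothetical helper.
fn round_cap_to_pot_bytes<T>(cap: usize) -> usize {
    let sz = mem::size_of::<T>().max(1);
    (sz * cap).next_power_of_two() / sz
}

fn main() {
    // 3 u64s = 24 bytes -> rounds to the 32-byte bin -> 4 elements.
    println!("{}", round_cap_to_pot_bytes::<u64>(3));
}
```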

Member:

Oh, right! Good point.

let bin_size = usize::MAX >> (alloc_size.leading_zeros() - 1);

// Leave some room for allocators that add fixed overhead (usually one pointer-size)
// FIXME: maybe two pointers to be on the safe side? It could potentially be platform-dependent.
Member:

Hmm, this is a good point. This change would massively backfire if it underestimates the overhead, as the underlying allocator would then need to massively overallocate basically every time, instead of the sometimes you'd get today depending on the original capacity.

So maybe this should be very conservative in the space it leaves for the allocator overhead?

Member:

Thinking about it again, though, is leaving any space at all pessimizing good allocators in favour of mediocre ones?

One big advantage of Rust's allocator APIs is that deallocation passes the size. So, for example, if an allocator allocates dedicated (maybe-large) pages for 32-byte allocations, it doesn't necessarily need any per-allocation overhead to do that, as it can tell by the deallocation size that it's a page of 32-byte allocations, and could have per-page overhead instead.
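This is visible even in the stable std::alloc API: deallocation takes the full Layout back, so an allocator can locate the right size class without reading a per-allocation header. A minimal round-trip:

```rust
use std::alloc::{alloc, dealloc, Layout};

// Rust's deallocation API receives the same layout (size and alignment)
// that was used for the allocation, so a size-class allocator can find
// the right bin on free without storing a header next to the block.
fn roundtrip(size: usize, align: usize) -> bool {
    let layout = Layout::from_size_align(size, align).unwrap();
    unsafe {
        let p = alloc(layout);
        if p.is_null() {
            return false;
        }
        p.write(0xAB); // touch the allocation
        // The caller, not allocator metadata, supplies the size on free.
        dealloc(p, layout);
    }
    true
}

fn main() {
    assert!(roundtrip(32, 8));
    println!("allocated and freed 32 bytes with an explicit layout");
}
```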

Contributor Author:

Generally, I think that any allocator that uses more than just a single pointer of overhead is meh, and not worth pessimizing for.

But yeah... unfortunately this optimization is assuming implementation details that are invisible. Hopefully benchmarks will tell if any platform degraded, and we could always use conditional compilation or something to tweak it.

Contributor:

There's another interesting case to consider: the Windows System allocator has to implement alignment on top of an allocation API which does not support custom alignment. This means it allocates align extra bytes.

Yes, Rust's allocation API allows the use of allocators which don't have to store metadata in allocations, but in practice, most allocators will want to support the C allocation API.

Contributor Author:

alignment on top of allocation which does not support custom alignment. This means it allocates align extra bytes.

@CAD97 I don't understand.

Contributor:

// Allocate a block of optionally zeroed memory for a given `layout`.
// SAFETY: Returns a pointer satisfying the guarantees of `System` about allocated pointers,
// or null if the operation fails. If this returns non-null `HEAP` will have been successfully
// initialized.
#[inline]
unsafe fn allocate(layout: Layout, zeroed: bool) -> *mut u8 {
    let heap = init_or_get_process_heap();
    if heap.is_null() {
        // Allocation has failed, could not get the current process heap.
        return ptr::null_mut();
    }

    // Allocated memory will be either zeroed or uninitialized.
    let flags = if zeroed { HEAP_ZERO_MEMORY } else { 0 };

    if layout.align() <= MIN_ALIGN {
        // SAFETY: `heap` is a non-null handle returned by `GetProcessHeap`.
        // The returned pointer points to the start of an allocated block.
        unsafe { HeapAlloc(heap, flags, layout.size()) as *mut u8 }
    } else {
        // Allocate extra padding in order to be able to satisfy the alignment.
        let total = layout.align() + layout.size();

        // SAFETY: `heap` is a non-null handle returned by `GetProcessHeap`.
        let ptr = unsafe { HeapAlloc(heap, flags, total) as *mut u8 };
        if ptr.is_null() {
            // Allocation has failed.
            return ptr::null_mut();
        }

        // Create a correctly aligned pointer offset from the start of the allocated block,
        // and write a header before it.
        let offset = layout.align() - (ptr.addr() & (layout.align() - 1));

        // SAFETY: `MIN_ALIGN` <= `offset` <= `layout.align()` and the size of the allocated
        // block is `layout.align() + layout.size()`. `aligned` will thus be a correctly aligned
        // pointer inside the allocated block with at least `layout.size()` bytes after it and at
        // least `MIN_ALIGN` bytes of padding before it.
        let aligned = unsafe { ptr.add(offset) };

        // SAFETY: Because the size and alignment of a header is <= `MIN_ALIGN` and `aligned`
        // is aligned to at least `MIN_ALIGN` and has at least `MIN_ALIGN` bytes of padding before
        // it, it is safe to write a header directly before it.
        unsafe { ptr::write((aligned as *mut Header).sub(1), Header(ptr)) };

        // SAFETY: The returned pointer does not point to the start of an allocated block,
        // but there is a header readable directly before it containing the location of the start
        // of the block.
        aligned
    }
}

Roughly:

fn win32::HeapAlloc(hHeap: win32::HANDLE, dwFlags: win32::DWORD, dwBytes: win32::SIZE_T) -> win32::LPVOID;
fn win32::HeapFree (hHeap: win32::HANDLE, dwFlags: win32::DWORD, lpMem: win32::LPVOID);

fn HeapAlignedAlloc(heap: win32::HANDLE, flags: win32::DWORD, size: usize, align: usize) -> *mut u8 {
    if align <= win32::MEMORY_ALLOCATION_ALIGNMENT {
        return win32::HeapAlloc(heap, flags, size) as *mut u8;
    }
    let ptr = win32::HeapAlloc(heap, flags, size + align);
    let aligned = (ptr as *mut u8).add(1).align_to(align);
    *(aligned as *mut win32::LPVOID).sub(1) = ptr;
    aligned
}

fn HeapAlignedFree(heap: win32::HANDLE, flags: win32::DWORD, ptr: *mut u8, size: usize, align: usize) {
    if align <= win32::MEMORY_ALLOCATION_ALIGNMENT {
        win32::HeapFree(heap, flags, ptr as _)
    } else {
        win32::HeapFree(heap, flags, *(ptr as *mut win32::LPVOID).sub(1))
    }
}

Windows doesn't provide a way to HeapAlloc at anything other than MEMORY_ALLOCATION_ALIGNMENT. (Windows does offer _aligned_malloc, _aligned_realloc, _aligned_recalloc, and _aligned_free, but strangely not _aligned_calloc.) std::alloc::System is documented to use HeapAlloc on Windows.

Contributor Author:

There's another interesting case to consider: the Windows System allocator has to implement alignment on top of an allocation API which does not support custom alignment. This means it allocates align extra bytes.

Wouldn't a memory allocator request much, much larger chunks of memory from the system at once, and then use them for fine-grained user allocations individually? So any system-level alignment issues would generally have much smaller overhead?

Contributor:

An actual new allocator implementation, yes, typically. I'm specifically referring to System or other Rust wrappers of allocators which don't provide an aligned allocation interface, which need to do this technique on a per-allocation basis.

While this may be only the Windows system allocator in practice, this is still the default allocator that most Rust programs compiled for Windows will be using, so it's an important case to consider.

Contributor Author:

I kind of understand, but not enough to come up with a solution. Worst case, we could just have a different implementation for Windows, maybe?


dpc commented Sep 2, 2022

I fixed the clz usage (I think) and pushed that.

I got excited about this change, but I'm running out of time today, and will be busy over the weekend. I'm also not very familiar with working on changes here and I'm hitting some blockers for testing locally (other than compiling). If anyone feels like working on it in the meantime (or taking over altogether), please feel free. I'm very curious if this theory leads to any noticeable improvements and just wanted to show something for it.

@scottmcm you seem to have all the good ideas already. :)



CAD97 commented Sep 3, 2022

Most types is aligned to the power of 2 size

All types are aligned to a power of two. This is a hard requirement of Rust types.


It's worth noting that allocations through Allocator return a layout for exactly this purpose. The "proper" way to do this imho isn't to do this logic in RawVec, but to instead use the extra space which the allocator reports. That way, you match the specifics of the allocator in use directly.

However, the current common allocators (system and jemalloc) don't actually return this information, because it's typically extra overhead to get that dynamic information from the allocator, if it's available at all. It's also a bit more complicated than just requesting a well-fit size, because the allocator has to accept any layout for freeing between the requested layout and the reported provided layout. If the underlying allocator implementation doesn't support this, even if it would benefit from better bin fitting, the Rust allocator wrapper can't do such adjustments.
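To make the "use the extra space the allocator reports" approach concrete, here is a hedged model: suppose a power-of-two binning allocator could report the size it actually provided (as the unstable Allocator API's returned slice length permits). The bin geometry below, including the 16-byte minimum bin, is hypothetical:

```rust
// Hypothetical model of a power-of-two binning allocator reporting the
// size it actually provided. If real allocators exposed this, RawVec
// could simply adopt the reported size as the capacity instead of
// guessing the bin geometry itself. The 16-byte minimum is an assumption.
fn provided_size(requested: usize) -> usize {
    requested.next_power_of_two().max(16)
}

fn main() {
    // Request 24 bytes; this model allocator hands back a 32-byte block,
    // and a capacity-aware Vec could use all 32 of them.
    println!("{}", provided_size(24));
}
```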

@dpc dpc force-pushed the grow-to-bin-size branch from f07a5cc to 420b7fd Compare September 3, 2022 20:55

@dpc dpc changed the title Grow RawVec to fill the allocator bins tighter. Grow RawVec to fill the allocator bins tighter Sep 3, 2022

@dpc dpc force-pushed the grow-to-bin-size branch from 8692d34 to 0969b20 Compare September 3, 2022 21:56

dpc commented Sep 3, 2022

How will I bench it once I finally get it to even work? It's OK to just point me to docs; I failed to find them.



CAD97 commented Sep 3, 2022

A usual rust-timer run is probably sufficient to see the performance impact. Whoever reviews will kick that off.

library/alloc/src/raw_vec.rs
@dpc dpc force-pushed the grow-to-bin-size branch from 0969b20 to f7349ac Compare September 3, 2022 22:58

@dpc dpc force-pushed the grow-to-bin-size branch from f7349ac to edea79e Compare September 3, 2022 23:18

dpc commented Sep 3, 2022

OK. The basic version passed; now I did some tweaks on top of just the rounded-up sizing (see commit message).


@dpc dpc force-pushed the grow-to-bin-size branch from 472fa44 to 0e47358 Compare September 4, 2022 00:05


@dpc dpc force-pushed the grow-to-bin-size branch from 96da587 to 4202060 Compare September 5, 2022 00:19

dpc commented Sep 5, 2022

I spoke too soon, and even the basic tests didn't pass yet. Took me another hour to fix some corner cases. We'll see if everything is green now. The pain is real.

Also, the code accumulated a lot of checks and if-conditions, so I'm growing pessimistic.


dpc commented Sep 5, 2022

@Mark-Simulacrum Done, CI green, sanity test is there. Not sure how it works and if I ruined this one run. Please let's bench it.

In the meantime I'm trying to get the setup you linked to working, but I have some issues as I'm on NixOS.

@Mark-Simulacrum (Member):

@bors try


bors commented Sep 5, 2022

⌛ Trying commit 4202060 with merge c7479451ba3d0891c63ba17d918fad2fc55da9ba...


bors commented Sep 5, 2022

☀️ Try build successful - checks-actions
Build commit: c7479451ba3d0891c63ba17d918fad2fc55da9ba (c7479451ba3d0891c63ba17d918fad2fc55da9ba)

@rust-timer (Collaborator):

Queued c7479451ba3d0891c63ba17d918fad2fc55da9ba with parent e7cdd4c, future comparison URL.

@rust-timer (Collaborator):

Finished benchmarking commit (c7479451ba3d0891c63ba17d918fad2fc55da9ba): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

|  | mean¹ | range | count² |
| --- | --- | --- | --- |
| Regressions ❌ (primary) | 2.0% | [0.4%, 13.9%] | 209 |
| Regressions ❌ (secondary) | 1.7% | [0.4%, 4.6%] | 138 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -1.1% | [-1.1%, -1.1%] | 1 |
| All ❌✅ (primary) | 2.0% | [0.4%, 13.9%] | 209 |

Max RSS (memory usage)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|  | mean¹ | range | count² |
| --- | --- | --- | --- |
| Regressions ❌ (primary) | 5.0% | [2.4%, 8.1%] | 6 |
| Regressions ❌ (secondary) | 3.6% | [0.9%, 5.7%] | 14 |
| Improvements ✅ (primary) | -2.3% | [-5.1%, -1.1%] | 74 |
| Improvements ✅ (secondary) | -2.6% | [-5.4%, -1.3%] | 16 |
| All ❌✅ (primary) | -1.8% | [-5.1%, 8.1%] | 80 |

Cycles

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|  | mean¹ | range | count² |
| --- | --- | --- | --- |
| Regressions ❌ (primary) | 2.8% | [1.4%, 10.5%] | 15 |
| Regressions ❌ (secondary) | 3.0% | [2.1%, 5.0%] | 7 |
| Improvements ✅ (primary) | -2.1% | [-2.2%, -2.0%] | 2 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 2.2% | [-2.2%, 10.5%] | 17 |

Footnotes

  1. the arithmetic mean of the percent change

  2. number of relevant changes

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Sep 5, 2022

dpc commented Sep 5, 2022

Hey, the RSS metrics look OK, if I'm reading them right. If only I could optimize that code, maybe it would be a win...

One thing I wonder now is if it's not too "tight". It will sometimes make very small allocations now, just enough to bump to the bin size, which can be really close. Some minimum increase might help.


dpc commented Sep 5, 2022

OK, I pushed a simple change that should make it allocate more at once (on average). I'm still setting up things to be able to bench locally.

@Mark-Simulacrum Please try again. Thank you.


Kobzol commented Sep 5, 2022

@bors try @rust-timer queue


bors commented Sep 5, 2022

⌛ Trying commit d5690e4 with merge 2c45840544b665d8c8b835334cabfd27566ea36e...


bors commented Sep 5, 2022

☀️ Try build successful - checks-actions
Build commit: 2c45840544b665d8c8b835334cabfd27566ea36e (2c45840544b665d8c8b835334cabfd27566ea36e)


dpc commented Sep 5, 2022

Is it still in a queue, or something bad happened to it?


Kobzol commented Sep 5, 2022

Still in queue, it was stuck for some time today. You can see the status at perf.rust-lang.org/status.html.


dpc commented Sep 5, 2022

One important observation: this optimization doesn't make much sense for bigger Ts, as it's unlikely that they can be packed tightly enough for this to make a difference. I've added a change to address that.

However, I figured out how to bench it locally, and the results are not great. Generally a small slowdown across the board, with no clear trend of memory savings. A handful of binaries seem to benefit (-8%), but another handful suffer (+10%).

I suspect that a lot of binaries actually need exactly 2^n-sized buffers, in which case my change just gets in the way by over-sizing their capacity to fit (with overhead) into the next power-of-two-sized bin.

@nnethercote (Contributor):

A warning: I have spent quite some time working on the RawVec growth code in the past. It is very hot and even tiny changes can have surprisingly large performance effects. The existing code has been tuned extensively, and numerous attempts at improvements have failed. That's not to say there is no room for improvements, but I think the low-hanging fruit has already been picked here.


dpc commented Sep 6, 2022

It was a fun time, but I think it's time to give up. I pushed the best I was able to get: a slight memory decrease in some cases (the majority), a slight increase in some others (the minority, but seemingly some of the most "significant" ones). On top of that, the instruction/cycle counts were slightly negative across the board.

Rounding capacity up to 2^n - sizeof::<usize>() does kind of save memory when you end up with something under that capacity. My intuition tells me that what ruins it is that a lot of software ends up with a lot of 2^n-sized buffers, and in that case rounding up leads to an almost 50% over-allocated buffer, which is the worst-case scenario. That would explain why some benches got -8% RSS, and some others +8%.

On top of that, calculating all the sizes is additional overhead, and it has to deal with overflow-checking math.
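The suspected failure mode can be made concrete with a small sketch (the one-pointer header is the same assumption as in the PR, and the bin geometry is hypothetical):

```rust
use std::mem;

// If a request of exactly 2^n bytes must also leave room for an assumed
// pointer-sized header, it no longer fits its natural bin and gets bumped
// to the next one, wasting almost half the block. Hypothetical sketch.
fn bin_for(bytes: usize) -> usize {
    let mut bin = bytes.next_power_of_two();
    if bin.saturating_sub(mem::size_of::<usize>()) < bytes {
        bin *= 2; // doesn't fit once header room is reserved
    }
    bin
}

fn main() {
    // 1000 bytes still fits the 1024-byte bin after the header...
    println!("{}", bin_for(1000));
    // ...but exactly 1024 bytes gets bumped to 2048: nearly 50% waste.
    println!("{}", bin_for(1024));
}
```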

Labels
perf-regression Performance regression. S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue.