Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce size of Vec #18

Open
overlookmotel opened this issue May 18, 2024 · 8 comments
Open

Reduce size of Vec #18

overlookmotel opened this issue May 18, 2024 · 8 comments
Assignees

Comments

@overlookmotel
Copy link

overlookmotel commented May 18, 2024

Vec is currently 32 bytes. It consists of 4 x 8-bytes fields:

  • Pointer to Vec's contents (NonNull<u8>)
  • Length (usize)
  • Capacity (usize)
  • Pointer to allocator (*const Allocator)

Reduce to 24 bytes

Make length and capacity fields u32.

This is pretty easy.

Reduce to 16 bytes

If arena allocator always created chunks aligned on 4 GiB boundaries, and stored chunk metadata in a header at start of chunks, the pointer field could do double-duty.

Pointer field would be a pointer to the Vec's contents. But also if you zero out the bottom 32 bits it would give you a pointer to the chunk metadata. The chunk metadata would contain a pointer back to the Allocator, with which you can grow/reallocate the Vec etc.

This would require replacing bumpalo with a custom allocator.

@overlookmotel
Copy link
Author

overlookmotel commented Aug 5, 2024

It's also possible to get Vec down to 16 bytes without aligning all allocator chunks on 4 GiB, if can accept:

  • A lower limit on len/capacity of Vec.
  • Extra overhead of 1 bitwise operation when getting capacity.
  • Extra overhead of several operations when resizing Vec.

As follows:

  • Allocator allocates all chunks with same alignment as size of the chunk.
  • Reserve top 5 bits of capacity field (5 bits can store a number 0 - 31).
  • Store the power of the chunk's alignment in those top 5 bits i.e. if chunk is aligned on 65536 (1 << 16), top 5 bits contain 16.
  • Max capacity of Vec is limited to 134 million (1 << 27).
  • Do not store pointer to Allocator as separate field, but deduce it from data pointer + the top 5 bits of capacity.
  • Each allocator chunk contains a pointer to Allocator in its footer metadata block.
struct AllocatorChunkFooter {
    allocator: *const Allocator,
    /* other fields */
}

struct Vec<T> {
    ptr: NonNull<T>,
    len: u32,
    capacity: u32,
}

const CHUNK_ALIGNMENT_BITS: u32 = 5;
const CAPACITY_MASK: u32 = u32::MAX >> CHUNK_ALIGNMENT_BITS;
const CHUNK_ALIGNMENT_SHIFT: u32 = 32 - CHUNK_ALIGNMENT_BITS;
const ALLOCATOR_FIELD_MASK: usize = usize::MAX - std::mem::size_of::<AllocatorChunkFooter>() + 1
    + std::mem::offset_of!(AllocatorChunkFooter, allocator);

impl<T> Vec<T> {
    pub fn len(&self) -> u32 { self.len }

    pub fn capacity(&self) -> u32 { self.capacity & CAPACITY_MASK }

    pub fn push(&mut self, value: T) {
        if self.len() == self.capacity() { self.grow_for_push(); }
        self.ptr.as_ptr().add(len).write(value);
        self.len += 1;
    }

    #[cold]
    fn grow_for_push(&mut self) {
        let ptr = self.ptr.as_ptr() as *const u8;
        let power = (self.capacity >> CHUNK_ALIGNMENT_SHIFT) as usize;
        let chunk_end_minus_one_addr = ptr as usize | ((1 << power) - 1);
        let allocator_ptr_addr = chunk_end_minus_one_addr & ALLOCATOR_FIELD_MASK;
        let allocator_ptr = ptr.add(allocator_ptr_addr - ptr as usize) as *const Allocator;
        let allocator: &Allocator = &*allocator_ptr;
        // Reallocate `Vec` using `allocator`
     }
}

Notes:

  • Only overhead for reading capacity is 1 x bitwise AND operation.
  • This does unfortunately affect Vec::push and other potentially-resizing operations.
  • The logic for locating pointer to Allocator is a bit complex, but it's in a cold path.
  • Logic for getting Allocator ptr is 100% cheap bitwise ops.
  • Individual arena chunks are limited to 2 GiB max (this corresponds to max alignment on Mac OS).
  • There is no limit on size of arena (arena can consist of any number of chunks).
  • If minimum arena chunk size is 64 KiB (1 << 16 bytes), could use top 4 bits of capacity instead of 5 bits.
  • It'd be even better if stored 32 - power in top 5 bits of capacity. Then chunk_end_minus_one_addr = ptr as usize | (usize::MAX >> inverted_power). This removes 1 operation: https://godbolt.org/z/Mx5WYv86r
  • If push() is more common than len(), could store power in top 5 bits of len field too. Then "is capacity full?" check in push() becomes just self.len == self.capacity (remove an AND operation), at the cost of moving that AND operation into len() instead.

@overlookmotel
Copy link
Author

Alternatively, could record power in bottom 3 bits of ptr (if all allocations are aligned at least on 8).

This would allow 8 sizes of allocator chunks: 16 MiB, 32 MiB, 64 MiB, 128 MiB, 256 MiB, 512 MiB, 1 GiB, 2 GiB.

That's probably sufficient. However, I think using bits in capacity is preferable, as capacity is less frequently accessed than ptr, which is read on every read/write of the Vec's elements. Clearing bottom 3 bits is only a single bitwise AND operation, but it's an extremely hot path.

@overlookmotel
Copy link
Author

overlookmotel commented Aug 19, 2024

If using bottom 3 bits of ptr to store power, could also multiply the stored power by 2, which would give a wider range of chunk sizes:

128KiB, 512 KiB, 2 MiB, 8 MiB, 32 MiB, 128 MiB, 512 MiB, 2 GiB

I still think it's better to use spare bits in capacity, but advantage of packing it into ptr is that then Box could also contain an implicit reference to its Allocator while remaining 8 bytes. This would allow us to make Box Clone, which in turn means all AST types could also be Clone.

There is also the option to use high bits of ptr. But these bits can possibly be put to better use: #91

@Boshen
Copy link
Member

Boshen commented Sep 10, 2024

Reduce to 24 bytes
Make length and capacity fields u32.

I forked https://github.com/oxc-project/oxc-bumpalo, changing usize to u32 is hard than it seems :-/

@overlookmotel
Copy link
Author

overlookmotel commented Sep 12, 2024

Yes, I realised we'd need to combine all the fields into a single struct:

// 24 bytes
struct ArenaVec<'a, T> {
    ptr: NonNull<T>,
    len: u32,
    capacity: u32,
    allocator: &'a Allocator,
}

Splitting it into Vec and RawVec (the way std, bumpalo and allocator-api2 do) won't work, because RawVec {ptr: NonNull<T>, cap: u32, allocator: &'a Allocator} is 24 bytes (inc 4 bytes padding). We need to fit len into those spare 4 bytes, but you can't do that with nested structs.

I don't think combining all the methods into 1 struct is so terrible - it just means moving a lot of the code about, rather than a fundamental change. And we can still break up the implementation into multiple files to make it manageable.

But yeah, it's not trivial.

@rzvxa
Copy link
Collaborator

rzvxa commented Sep 12, 2024

Related to this, Bumpalo is planning to remove the Vec types in favor of std types as soon as the API is stable. So it would be wise to start using our custom vectors if we wish to keep control over our types.

Here I tried to offer to implement a feature-gated Vec32 type fitzgen/bumpalo#256 and found out that they are not too keen on keeping their vector types(Which makes sense, In general, you want to use std::vec if it supports custom allocators)

@rzvxa

This comment was marked as resolved.

@overlookmotel
Copy link
Author

I doubt Allocator trait will be stabilized any time soon. It's such a large feature that it has it's own working group. With 72 open issues! https://github.com/rust-lang/wg-allocators/issues

I don't actually think it'll be too hard to copy std/bumpalo/allocator_api2's implementation and chop it about. We can probably skip implementing some of the more obscure methods and iterators to reduce the size of the task.

Also relevant:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants