Implement some form of escape analysis to allocate values on the stack #776

Open · yorickpeterse opened this issue Oct 31, 2024 · 14 comments
Labels: compiler, feature, performance

Description

Types defined using the class keyword are heap allocated. While #750 talks about giving users the option to explicitly opt in to stack allocating data, I feel we also need a mechanism for determining if we can do this automatically. That is, we need some form of escape analysis, such that we can stack allocate values when we determine it's safe to do so.

An area of the standard library that can really benefit from this is the std.iter module. Iterators are used heavily and each incurs a heap allocation. When iterators are used in tight loops, it's not uncommon for most of the time to be spent on the memory management they require.

The basic idea is simple: if an allocated value V doesn't escape the scope S it's allocated into (meaning lifetime(V) <= lifetime(S)), we can allocate it onto the stack instead.
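
To illustrate, here's a minimal sketch of the two cases in Rust, with Box standing in for Inko's heap allocated class instances (the User type is hypothetical):

struct User {
    name: String,
}

// `user` never leaves this scope, so lifetime(user) <= lifetime(scope)
// and the allocation could go on the stack instead.
fn stays_local() {
    let user = Box::new(User { name: "Alice".to_string() });
    println!("{}", user.name);
} // `user` is dropped here

// `user` outlives the scope by being returned, so it has to stay on the heap.
fn escapes() -> Box<User> {
    Box::new(User { name: "Alice".to_string() })
}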

The implementation, however, is a bit trickier. For example, we can only stack allocate values if they are allocated directly in their scope and not through a method call. This means that good inlining is crucial for escape analysis to have a meaningful impact. In addition, the implementation of escape analysis itself may prove a bit of a challenge.

Note that it's perfectly fine for borrows of stack allocated data to outlive the stack value, as dropping the stack value would still incur a runtime borrow check and produce a panic in such cases. In other words, such cases don't inhibit the ability to stack allocate.

Related issues

Related work

@yorickpeterse

A related and older idea is to bump allocate small objects using a custom allocator (#542). The idea here was that by using our own allocator we can avoid the cost of going through malloc and friends. While this may not perform better than e.g. jemalloc, glibc's allocator in particular is really not great at handling large numbers of allocations in rapid succession. This means a custom allocator might be worth using for small objects (e.g. up to 64 bytes) that we can't stack allocate.

@yorickpeterse

As for what constitutes an escape: a conservative approach would be to just consider any move (except for a drop) an escape. This does mean that moving stack value S₁ into stack value S₂ results in S₁ being heap allocated, but it might be good enough as a start.

@yorickpeterse

The way we generate MIR poses a challenge: we can't just look at move instructions. That is, code like this:

let a = User(name: 'Alice')

Results in something like this:

R₁ = allocate User
R₁.name = 'Alice'
R₂ = move R₁

This means that looking at just the move instructions might not be sufficient to accurately determine whether we can stack allocate the User.

@yorickpeterse

Assuming the moves all happen in the same scope, it doesn't really matter how many there are, since we can copy the data if needed. Whether copying actually happens depends largely on how LLVM decides to compile/optimize the code.

Things do get tricky if we borrow in between such moves, because the move would invalidate the borrow.

@yorickpeterse

Given the following instructions:

R₁ = allocate User(...)
R₂ = move R₁
R₃ = move R₂
...

R₁ should be stack allocated, but the decision to do that depends on R₂, R₃ and others not escaping the scope R₁ is defined in. So given a move chain of A -> B -> C -> ..., if all of those happen in the same scope or a nested scope, we can stack allocate A.

For this to work, for every register R that's assigned the result of an allocate, we need to track the registers it's moved to (recursively), and record the scopes those moves take place in.
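
A minimal sketch of that bookkeeping in Rust; the instruction set, the Escape case, and the scope nesting check are all simplified assumptions rather than the actual MIR:

use std::collections::{HashMap, HashSet};

type Register = usize;
type Scope = usize;

// A heavily simplified instruction set; the real MIR has many more kinds.
enum Instruction {
    Allocate { target: Register, scope: Scope },
    Move { target: Register, source: Register, scope: Scope },
    // Anything that lets a value escape: a return, a field store, an
    // argument to a non-inlined call, etc.
    Escape { source: Register },
}

// Returns the `allocate` targets whose entire move chain stays within the
// allocation scope (or a scope nested inside it) and never escapes.
fn stack_candidates(
    instructions: &[Instruction],
    same_or_nested: impl Fn(Scope, Scope) -> bool,
) -> HashSet<Register> {
    // Maps a register to the allocation it (transitively) holds, plus the
    // scope that allocation was created in.
    let mut alloc_of: HashMap<Register, (Register, Scope)> = HashMap::new();
    let mut candidates = HashSet::new();

    for ins in instructions {
        match ins {
            Instruction::Allocate { target, scope } => {
                alloc_of.insert(*target, (*target, *scope));
                candidates.insert(*target);
            }
            Instruction::Move { target, source, scope } => {
                if let Some((alloc, alloc_scope)) = alloc_of.get(source).copied() {
                    if same_or_nested(*scope, alloc_scope) {
                        // The chain continues in a valid scope: follow it.
                        alloc_of.insert(*target, (alloc, alloc_scope));
                    } else {
                        candidates.remove(&alloc);
                    }
                }
            }
            Instruction::Escape { source } => {
                if let Some((alloc, _)) = alloc_of.get(source).copied() {
                    candidates.remove(&alloc);
                }
            }
        }
    }

    candidates
}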

@yorickpeterse

For #677 I have a WIP parser that uses the commonly used strftime()/strptime() format. This parser allocates many Option[Int] instances, where the Int is a byte.

Stack allocating these values would likely greatly improve performance. However, I'm not sure relying on escape analysis alone is enough. That is, for it to work the methods performing the allocations need to be inlined, and given the complexity of the parser that's not a guarantee.

An alternative might be to not rely on escape analysis at all, but to instead look at the type definition. That is, if the type's size is e.g. less than or equal to 64 bytes and all its sub values are trivial to copy, we just always stack allocate it and copy it upon borrowing. This does mean we need some sort of mut T where T is on the stack but the borrow is an actual pointer, such that mutable methods can mutate the data in-place.
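
As a sketch, the decision would then look only at the definition, something along these lines (the TypeInfo fields and the 64 byte threshold are assumptions):

// A hypothetical type descriptor; the compiler's actual representation
// differs.
struct TypeInfo {
    size: usize,               // size in bytes, excluding any object header
    all_fields_copyable: bool, // all sub values are trivial to copy
}

// The 64 byte threshold is the example value from above, not a final choice.
const MAX_VALUE_TYPE_SIZE: usize = 64;

// Decide from the definition alone whether to always stack allocate a type
// and copy it upon borrowing.
fn promote_to_value_type(info: &TypeInfo) -> bool {
    info.size <= MAX_VALUE_TYPE_SIZE && info.all_fields_copyable
}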

@yorickpeterse

I've spent some time building a bump allocator for small objects based on Immix, as found on the bump-allocator branch. While a naive bump allocator can improve performance in allocation-heavy applications, the moment you throw in reuse/recycling of memory the improvements shrink greatly. This is because reusing memory requires additional bookkeeping, which in turn imposes a cost on both the allocation and deallocation paths.
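
For reference, the fast path of a bump allocator is little more than a cursor increment; a minimal sketch in Rust, where the block size and alignment handling are assumptions:

use std::alloc::{alloc, Layout};

// Block size is an assumption; Immix traditionally uses 32 KiB blocks.
const BLOCK_SIZE: usize = 32 * 1024;

struct Block {
    start: *mut u8,
    cursor: usize, // offset of the next free byte within the block
}

impl Block {
    fn new() -> Block {
        let layout = Layout::from_size_align(BLOCK_SIZE, 16).unwrap();
        Block { start: unsafe { alloc(layout) }, cursor: 0 }
    }

    /// The fast path: align the cursor (align must be a power of two),
    /// bump it, and return the old position. It's the reuse/recycling of
    /// blocks and lines that adds the extra bookkeeping.
    fn allocate(&mut self, size: usize, align: usize) -> Option<*mut u8> {
        let aligned = (self.cursor + align - 1) & !(align - 1);
        if aligned + size > BLOCK_SIZE {
            return None; // exhausted: fall back to a new or reusable block
        }
        self.cursor = aligned + size;
        Some(unsafe { self.start.add(aligned) })
    }
}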

For the work-in-progress DateTime.parse implementation the bump allocator reduces the runtime by about 30%, but for https://github.com/jinyus/related_post_gen there's little to no difference. A key issue there seems to be that the benchmark simply requires a lot of objects and therefore a lot of blocks, so most of the time is spent allocating those blocks. Increasing the block size helps a little, but the result is still outperformed by jemalloc (310 ms vs 325-330 ms for my bump allocator).

I'm now wondering if we could perhaps take a simpler approach: instead of relying on escape analysis, we promote types to value types based on their definition rather than their usage. For example, types such as Option[Int] could easily be turned into value types and passed around using copies, removing the need for runtime memory allocation entirely.

The downside is that "value type promotion" only works for simple types such as Option[Int] and not for more complex types such as Array[Int], while bump allocation could also help with the more complex types. Escape-analysis-driven stack allocation would also help here, but is likely to be overly conservative (as in, not promoting values to the stack) more often than desired.

@yorickpeterse

Another issue with value type promotion: if we just blindly promote a type to a value type (assuming it only stores other value types), we might break circular types (e.g. a linked list).

@yorickpeterse

For the bump allocation scheme, I think we can make the following changes:

  1. When finding the next hole to allocate into, simply find the next free line and stop there, instead of trying to find a hole consisting of multiple consecutive lines (see the sketch after this list). I think the total cost might be roughly the same, but at least the hole finding logic will be simpler.
  2. Finding reusable blocks whenever we exhaust the last block is too expensive, as we can end up with the pattern "find no reusable blocks -> allocate new block -> exhaust block -> find no reusable blocks -> repeat". A better approach may be to only do this every N block allocations.
  3. The mechanism for finding reusable lines using atomics is way too expensive, with a worst-case of 254 atomic loads. This needs to go, but I'm not sure yet how.
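
A sketch of the simpler hole finding from point 1, assuming Immix-style 128-byte lines and a plain boolean line map (the actual implementation tracks line state differently):

// Line and block sizes are assumptions matching traditional Immix.
const LINE_SIZE: usize = 128;
const BLOCK_SIZE: usize = 32 * 1024;
const LINES_PER_BLOCK: usize = BLOCK_SIZE / LINE_SIZE;

/// Rather than scanning for a hole spanning multiple consecutive free
/// lines, stop at the first free line and allocate into just that one,
/// keeping the scan trivial.
fn next_free_line(line_used: &[bool; LINES_PER_BLOCK], from: usize) -> Option<usize> {
    (from..LINES_PER_BLOCK).find(|&line| !line_used[line])
}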

@yorickpeterse

One option is to flip things around: values go on the stack by default, and a hypothetical box expression puts them on the heap (e.g. box [10, 20, 30] or box SomeThing.new(...)). class instances still have headers and allow for dynamic dispatch, and are still subject to single ownership. Casting class values to a trait isn't possible and requires explicit boxing first.

To make that safe, we'd have to check the borrow count upon moving a stack value, such that we don't invalidate any borrows. In theory we can elide such checks if we can statically determine no borrows are present at the time of the move, but this won't remove all of them. Moving box T values around incurs no borrow check. Stack values are specialized over their size and not their type, such that 15 different types of the same size being passed to a generic type results in only one copy of that type.
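
A sketch of that move-time check, using hypothetical names; in practice the compiler would emit this check rather than it living in a library type:

// The stack value carries a borrow count, and a move is only performed
// once that count is zero, mirroring the runtime check described above.
struct StackValue<T> {
    borrows: u32,
    value: T,
}

fn move_checked<T>(value: StackValue<T>) -> StackValue<T> {
    if value.borrows != 0 {
        panic!("can't move a value that is still borrowed");
    }
    value // now safe to relocate by copying it to the new location
}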

Implementation-wise this is much simpler than escape analysis and value type promotion, and we'd drastically reduce the pressure on the system allocator. This in turn is likely to improve performance significantly.

The downsides are:

  • More runtime borrow checks when moving stack data around
  • I'm not sure we can strictly guarantee our borrow checks always occur before a move, because LLVM optimizations might reorder things. We might have to insert some sort of compiler fence to prevent that from happening
  • A more complicated type system, as users now have to deal with the differences between User, box User, etc
  • Recursive types now explicitly require boxing, e.g. a linked list Node must be defined as class Node[T] { let @value: T, let @next: Option[box Node[T]] }

@yorickpeterse

The above idea won't work well for resizable types such as arrays: if you have an array of N stack values and you create a borrow of one of them, then resizing the array will invalidate that borrow. To make that safe as well, we'd have to check the borrow count of every value, which would be terrible performance-wise.

@yorickpeterse commented Nov 6, 2024

Enums could be promoted to stack-based value types if the following conditions are met:

  • The values they wrap (if any) are also stack types
  • The type is not a recursive type

The reason this would be safe is that enums themselves can't be modified in-place, i.e. you can't change the constructor tag or the constructor's arguments. Technically you can modify those in-place, but for value types the modification would apply to a copy.

Essentially the idea here is that in addition to introducing an inline keyword to flag a type as an immutable value type, we'd infer this for enum types, a bit like a hypothetical class enum inline? Foo { ... } definition where inline? roughly translates to "inline whenever possible, and fall back to a heap value otherwise".

For types such as Option[Int] and Option[Float] this would bring big improvements, as we'd avoid having to heap allocate the data.

A challenge here is specialization: if we specialize over such types we can remove the need for object headers, as generics can use static dispatch. This, however, may result in more copies of methods that operate on different types with the same memory layout. We'd also need a form of auto-boxing when such values are cast to a trait, though we could also just straight up disallow casting enum values to traits (this isn't really useful anyway, I think).

If we keep the headers this won't be necessary, but then e.g. Option[Int] is 32 bytes (16 bytes for the header, 8 for the tag, and 8 for the value) instead of 16 bytes (tag plus value). This might not be that big of a deal, given enums are likely to be short-lived.
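
The arithmetic, made explicit with repr(C) structs in Rust (the header fields are hypothetical; only their combined 16 byte size matches the above):

#[repr(C)]
struct Header {
    type_id: u64,
    references: u64,
}

#[repr(C)]
struct HeapOption {
    header: Header, // 16 bytes
    tag: u64,       // 8 bytes: which constructor (Some or None)
    value: i64,     // 8 bytes: the wrapped Int
}

#[repr(C)]
struct InlineOption {
    tag: u64,   // 8 bytes
    value: i64, // 8 bytes
}

fn main() {
    assert_eq!(std::mem::size_of::<HeapOption>(), 32);
    assert_eq!(std::mem::size_of::<InlineOption>(), 16);
}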

For Array this requires a change to Array.get_unchecked and Array.get_unchecked_mut: these methods read from a pointer and then use ref/mut to increment the borrow count. For stack values this would result in two copies: one from the dereference, and one from the ref/mut expression (which would copy the target). Perhaps we'd need to extend ref/mut to support operating on a pointer, and in that case just do nothing if the pointer points to a stack value.

@yorickpeterse

I started work on supporting stack/inline types in #778. For my WIP implementation of DateTime.parse (#677), the use of such inline types improves performance by 45-50%, as it no longer needs to allocate Option[Int] values all over the place.

@yorickpeterse

#778 is merged, so the bulk of the work is done. I do think it's still worth implementing stack allocation on top of this: if a heap allocated type is used in a way where we can trivially determine it should go on the stack, we can then do just that automatically.
