-
Notifications
You must be signed in to change notification settings - Fork 223
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SSA optimization is ineffective when an unconstrained function has loops #4535
Comments
mem2reg is one of the only passes that does work well between multiple blocks (note the constant The presence of the loop here is preventing the rest of the code from being removed. |
@sirasistant how common are small loops like this? I'm also not certain how much unrolling them will help since they will use non-constant values in the loop which means instructions from each iteration will still remain in the code. This example benefits greatly from loop unrolling mostly because there are no inputs to main so the loop can be removed entirely. If a loop is normally 5 iterations which contain 4 instructions each, unrolling even that small loop to get 20 instructions is more questionable. |
Oh I completely made up the size of the loop, it was just to show a functionally equal example so I used a loop that runs only one time. |
@sirasistant if you're just looking at why the loop isn't being unrolled that'd be the loop unrolling pass which is currently completely skipped for unconstrained code. |
I know, it's fine for it not to be unrolled (even in the case the bounds are known at compile time, unrolled loops can lead to bigger bytecode), I meant what optimization steps are not acting due to the presence of the loop itself |
I'm looking at the SSA and it seems that mem2reg is not identifying that the loads are known:
without loop
But the presence of the loop doesn't change that the loaded values are known in both cases |
I guess the ideal SSA mem2reg could output in this case given that It's not its job to remove loops with empty bodies, is something like this:
|
I've looked into this a bit further and it is indeed an issue with the mem2reg pass. The issue is with the core algorithm used though. Loops are just inherently tricky. The algorithm sees the blocks as follows: b0(): ; predecessors = []
b1(): ; predecessors = [b0, b2]
b2(): ; predecessors = [b1]
b3(): ; predecessors = [b1] The algorithm works as follows:
Take aways: fixing this requires changing the algorithm itself to handle loops better. One possible option is to have an entirely separate pass to go through each block and find where each I'll also note that identifying all pointer aliases correctly has been proven to be undecidable (Landi, 1992) so while we can improve this pass, we'll never be able to perfectly remove every reference. |
This might be related, poseidon is a bit unusable right now in brillig: This code: unconstrained fn main() -> pub Field {
dep::std::hash::poseidon2::Poseidon2::hash([0; 2], 2)
} Generates this huge SSA after optimization
Whereas on acir
|
@sirasistant I haven't looked into that SSA in-depth but what makes you think it is underoptimized? Comparing brillig to acir generation isn't the best comparison since Acir will always have all loops unrolled and if blocks flattened. With no blocks remaining many optimizations like mem2reg become trivial. So it's generally expected that Acir will be much simpler and smaller especially if loop bounds are small or if it uses all constant values. The only thing that jumps out to me from your example are some loops with small bounds (2 & 3). We could look into unrolling loops in brillig if they have small, known bounds. |
It's just that the SSA seemed quite large after optimization, I only saw a raduction in instructions of ~500 after inlining to ~300 after all the optimization pipeline, with all values in the program being constant. We are going to switch from pedersen to poseidon in aztec, let's see if the faster execution of poseidon2 compensates the more complex SSA generated (pedersen hash is a single opcode) |
Interesting, I tried forcing trying to unroll loops in brillig to test how it'd look like with unrolled loops and it seems like it didn't help much reducing instruction count:
It seems to take advantage of unrolled loops in brillig we could be missing an optimization step for non-flattened control flow that would
|
The last point should be done already by the simplify_cfg pass which does currently apply to brillig code as well.
The second point should be automatic by the function printer, since blocks are actually never removed in general. So it must still be reachable.
This is one area where brillig opts can improve I think. We used to do this automatically in more passes but it was removed for bugs (IIRC) and we just rely on a check during flattening currently. We can try adding this to another pass (simplify_cfg? A different pass?) |
struct EnumEmulation {
a: Option<Field>,
b: Option<Field>,
c: Option<Field>,
}
unconstrained fn main() -> pub Field {
let mut emulated_enum = EnumEmulation { a: Option::some(1), b: Option::none(), c: Option::none() };
for _ in 0..1 {
assert_eq(emulated_enum.a.unwrap(), 1);
}
emulated_enum.a = Option::some(2);
emulated_enum.a.unwrap()
} On current master, this program now compiles down to
|
Ah misread and though this was improved (but still suboptimal), looks like it's unchanged 😅 |
Yeah I think for this to improve mem2reg needs to do more than one pass of the code to gather information about loops ): |
An extreme example is |
Another extreme example, 15% of the runtime in this test is spent evaluating std::compat::is_bn254(), which should be known at compile time (but contains a loop) |
We're currently not inlining any functions in brillig no matter how small afaik. This along with the fact that we don't perform any optimizations during brillig-gen means that these checks make their way into runtime code whereas they should be optimized out. As well as @sirasistant It might be worth looking into a pass which automatically inlines brilllig functions which contain a number of instructions below a certain limit (maybe 5-10 or so?) to get some of these benefits of constant folding without increasing the bytecode size) There's also some overhead associated with performing a brillig function call so that could be used to make sure inlining always reduces the bytecode size. |
Do you mean unroll? we are always inlining in brillig currently unless it's part of a recursive loop |
For this function: unconstrained fn main() {
assert(std::compat::is_bn254());
} This is the final SSA after all optimizations:
|
Ah, my bad. You're right, I misread the inlining pass when checking this. That said, we should be able to unroll loops in any brillig function without arguments or foreign calls safely. |
We definitely can try. I think the main concern before was just increasing code size. Unsure what a reasonable limit is but I'm sure we can think of some arbitrary measure. |
This shouldn't increase code size as in the case where there are no runtime arguments/foreign call results/etc SSA optimizations should be able to simplify the function down into a constant return value. |
To benefit from this however I think we'll need to run these functions which can be simplified down to a constant value through the entire SSA pipeline first before we can inline them into other functions. |
# Description ## Problem\* Resolves <!-- Link to GitHub Issue --> ## Summary\* This PR changes the function `std::compat::is_bn254` to run in a `comptime` context. To do this I've had to implement a few compiler builtins within the interpreter. I've had to mangle `std::compat::is_bn254` a bit to get around restrictions on when a function can be called in a `comptime` context however. This works around around the loop unrolling issue experienced in #4535. ## Additional Context ## Documentation\* Check one: - [x] No documentation needed. - [ ] Documentation included in this PR. - [ ] **[For Experimental Features]** Documentation to be submitted in a separate PR. # PR Checklist\* - [x] I have tested the changes locally. - [ ] I have formatted the changes with [Prettier](https://prettier.io/) and/or `cargo fmt` on default settings.
# Description ## Problem\* Resolves <!-- Link to GitHub Issue --> ## Summary\* This PR changes the function `std::compat::is_bn254` to run in a `comptime` context. To do this I've had to implement a few compiler builtins within the interpreter. I've had to mangle `std::compat::is_bn254` a bit to get around restrictions on when a function can be called in a `comptime` context however. This works around around the loop unrolling issue experienced in #4535. ## Additional Context ## Documentation\* Check one: - [x] No documentation needed. - [ ] Documentation included in this PR. - [ ] **[For Experimental Features]** Documentation to be submitted in a separate PR. # PR Checklist\* - [x] I have tested the changes locally. - [ ] I have formatted the changes with [Prettier](https://prettier.io/) and/or `cargo fmt` on default settings.
…s-diff report (#5747) # Description ## Problem\* Resolves <!-- Link to GitHub Issue --> ## Summary\* Successful example report: <img width="964" alt="Screenshot 2024-08-19 at 9 31 35 AM" src="https://github.com/user-attachments/assets/ccbddccd-a38a-4853-bcc2-1ac6bcb1fe36"> I originally just had this PR as a draft to test and close, but we can just keep the PR now to add a size regression test for issue #4535: ``` struct EnumEmulation { a: Option<Field>, b: Option<Field>, c: Option<Field>, } unconstrained fn main() -> pub Field { let mut emulated_enum = EnumEmulation { a: Option::some(1), b: Option::none(), c: Option::none() }; assert_eq(emulated_enum.a.unwrap(), 1); emulated_enum.a = Option::some(2); emulated_enum.a.unwrap() } ``` This PR also provides a quicker way of updating the noir-gates-diff commit as the original PR (#5745) will first search for a report on master where a Brillig report does not exist. On this branch we have a reference report on `mv/brillig-opcode-report`. I think we could merge `mv/brillig-opcode-report` into master and then any more commit hash updates for the `noir-gates-diff` repo can be made on this PR. ## Additional Context ## Documentation\* Check one: - [x] No documentation needed. - [ ] Documentation included in this PR. - [ ] **[For Experimental Features]** Documentation to be submitted in a separate PR. # PR Checklist\* - [x] I have tested the changes locally. - [x] I have formatted the changes with [Prettier](https://prettier.io/) and/or `cargo fmt` on default settings. --------- Co-authored-by: Tom French <[email protected]>
Was looking at this a bit today in standup and talking with @vezenovm and we noticed the brillig can be improved a bit if we manually make a read only copy before the loop: struct EnumEmulation {
a: Option<Field>,
b: Option<Field>,
c: Option<Field>,
}
unconstrained fn main() -> pub Field {
let mut emulated_enum = EnumEmulation { a: Option::some(1), b: Option::none(), c: Option::none() };
let my_copy = emulated_enum;
for _ in 0..1 {
assert_eq(my_copy.a.unwrap(), 1);
}
emulated_enum.a = Option::some(2);
emulated_enum.a.unwrap()
} With this the new SSA would be:
Which is 4 fewer instructions within the loop compared to Tom's example. I could see this being important especially for larger loops. It should also be a fairly easy optimization for users to apply even if it's difficult for the compiler to identify this automatically. Another thing we noticed is these instructions at the end:
Where we're loading a value only to immediately store it again. In mem2reg when we load from a reference whose value is unknown, that value will stay unknown afterward, but this doesn't need to be the case! We now know that the reference loads to Another check we could add though is for With all these optimizations the (theoretical) SSA would be:
If we can identify that
|
That sounds great!
we can probably update the unroller to delete empty loops like this one |
I get to this point in #5865 if we do a copy outside the loop and read the copy in the loop:
Just catching an edge case for arrays w/ inc_rc, but the optimization looks promising. |
…that store (#5910) # Description ## Problem\* Working towards #4535 ## Summary\* Small optimization found while working in mem2reg. Now that we have a small per function state we can see which stores/loads that are leftover can be removed. A place we are potentially missing stores that can be removed is in return blocks. If we have a store inside of a return block that is not loaded from inside that block, we can safely remove that store (given it is not a reference param). ## Additional Context ## Documentation\* Check one: - [X] No documentation needed. - [ ] Documentation included in this PR. - [ ] **[For Experimental Features]** Documentation to be submitted in a separate PR. # PR Checklist\* - [X] I have tested the changes locally. - [X] I have formatted the changes with [Prettier](https://prettier.io/) and/or `cargo fmt` on default settings. --------- Co-authored-by: Tom French <[email protected]> Co-authored-by: TomAFrench <[email protected]>
… mem2reg (#5935) # Description ## Problem\* Partially resolves #4535 Replaces #5865 ## Summary\* When we see a load we mark the address of that load as being a known value of the load result. When we reach a store instuction, if that store value has a known value which is equal to the address of the store we can remove that store. We also check whether the last instruction was an `inc_rc` or a `dec_rc`. If it was we do not remove the store. ## Additional Context ## Documentation\* Check one: - [X] No documentation needed. - [ ] Documentation included in this PR. - [ ] **[For Experimental Features]** Documentation to be submitted in a separate PR. # PR Checklist\* - [X] I have tested the changes locally. - [X] I have formatted the changes with [Prettier](https://prettier.io/) and/or `cargo fmt` on default settings.
…on (#5925) # Description ## Problem\* Partially resolves #4535. This does not necessarily make SSA "aware of loops" but handles the situation that we in the issue where we have multiple known stores and loads that we are unable to resolve across the function due to loops. ## Summary\* The per function state and the logic run over it has been expanded. We store more information in `last_loads`, a new `load_results` map to track unused loads, and a utility map `calls_reference_input` to avoid deleting stores that are directly passed into an entry point with a mutable reference. To count the remaining stores, I reference count both the results of the loads and the address of the loads. If the results of a load has a counter equal to 0 I remove that load instruction. I then keep track of each removed address with its own reference counter. If the removed loads counter is equal to the counter of the address of the loads, we know we have removed all loads to an address and thus can safely remove the store to that address. We also added a check that store we want to remove is not used as an argument into a function call. I have added a unit test `remove_unused_loads_and_stores` inside of `mem2reg.rs`. The `brillig_loop_size_regression` test: ```noir unconstrained fn main() -> pub Field { let mut emulated_enum = EnumEmulation { a: Option::some(1), b: Option::none(), c: Option::none() }; for _ in 0..1 { assert_eq(emulated_enum.a.unwrap(), 1); } emulated_enum.a = Option::some(2); emulated_enum.a.unwrap() } ``` now compiles to the optimal SSA after a single run of mem2reg: ``` After Mem2Reg: brillig fn main f0 { b0(): v1 = allocate v2 = allocate v3 = allocate v4 = allocate v5 = allocate v6 = allocate jmp b1(u32 0) b1(v0: u32): v8 = eq v0, u32 0 jmpif v8 then: b3, else: b2 b3(): v11 = add v0, u32 1 jmp b1(v11) b2(): return Field 2 } ``` ## Additional Context There is most likely other ways we can utilize this per function state. We can continue to iterate upon it in follow-ups. ## Documentation\* Check one: - [X] No documentation needed. - [ ] Documentation included in this PR. - [ ] **[For Experimental Features]** Documentation to be submitted in a separate PR. # PR Checklist\* - [X] I have tested the changes locally. - [X] I have formatted the changes with [Prettier](https://prettier.io/) and/or `cargo fmt` on default settings. --------- Co-authored-by: Tom French <[email protected]> Co-authored-by: jfecher <[email protected]>
Problem
This code:
is functionally the same as
However, their optimized SSAs are completely different
vs
I think mem2reg is failing to promote to reg values that are used across multiple blocks
Happy Case
Ideally, the two programs should generate the same optimized SSA
Project Impact
Nice to have
Impact Context
This generates a big blowup in bytecode size in aztec's public functions. See AztecProtocol/aztec-packages#5153
Workaround
Yes
Workaround Description
We can reduce the amount of unused references by using traits instead of emulating enums via options, this is the case for the context enum, that we could transform into a context trait.
Additional Context
No response
Would you like to submit a PR for this Issue?
None
Support Needs
No response
The text was updated successfully, but these errors were encountered: