Inline singleton splats #36169
Conversation
As noted in #36087 and #29114, splatting integers currently carries an unexpected performance penalty. For tuples and SimpleVectors, we have special-purpose inliners that simply inline the tuple/SimpleVector into the call being splatted. However, for everything else we'd have to run the iteration protocol to find out which values to substitute. This change does just that, limited to the case of length-1 (and empty) iterables.

Benchmark:

```
f(x) = (x...,)
@code_typed f(1)
@benchmark f(1)
```

Before:

```
julia> @code_typed f(1)
CodeInfo(
1 ─ %1 = Core._apply_iterate(Base.iterate, Core.tuple, x)::Tuple{Int64}
└──      return %1
) => Tuple{Int64}

julia> @benchmark f(1)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     209.357 ns (0.00% GC)
  median time:      213.404 ns (0.00% GC)
  mean time:        218.674 ns (0.16% GC)
  maximum time:     1.922 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     540
```

After:

```
julia> @code_typed f(1)
CodeInfo(
1 ─ %1 = invoke Base.iterate(_2::Int64)::Tuple{Int64,Nothing}
│   %2 = (getfield)(%1, 1)::Int64
│   %3 = (getfield)(%1, 2)::Nothing
│        invoke Base.iterate(_2::Int64, %3::Nothing)::Nothing
│   %5 = Core.tuple(%2)::Tuple{Int64}
└──      return %5
) => Tuple{Int64}

julia> @benchmark f(1)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.044 ns (0.00% GC)
  median time:      3.047 ns (0.00% GC)
  mean time:        3.049 ns (0.00% GC)
  maximum time:     7.700 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
```

Obviously this isn't 100% optimal yet, because the `iterate` calls themselves don't get inlined, but it's a lot better. Inlining the `iterate` calls is left for a follow-up commit.
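For intuition, the rewrite performed for a length-1 iterable corresponds roughly to the hand-written expansion below. This is only an illustrative sketch at the source level (the actual transformation happens on typed IR inside the optimizer), and `f_expanded` is a hypothetical name:

```julia
# What the optimizer effectively turns
#     f(x) = (x...,)
# into for a length-1 iterable, written out by hand.
function f_expanded(x::Int)
    y  = Base.iterate(x)       # for an Int this yields (x, nothing)
    v1 = y[1]                  # the single element
    st = y[2]                  # the iteration state
    @assert Base.iterate(x, st) === nothing  # iteration is known to terminate here
    return (v1,)               # the splat collapses to a plain 1-tuple
end

f_expanded(1)  # == (1,)
```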
I know my way around the optimizer too little to review in detail, but IIUC …
Yes, we don't currently have any infrastructure to forward sideband information from inference to the optimizer (other than the cache for method lookups that we are using here). I'd like to improve that in the future, but for now this is the best we can do.
Thanks for confirming. (And by all means, that does not mean I disapprove of this PR in its current form in any way.)
This change attempts to solve the generalized problem encountered in #36169. In short, we do a whole bunch of analysis during inference to figure out the final type of an expression, but sometimes we may need intermediate results that were computed along the way. So far we don't really have a great place to put those results, so we end up having to re-compute them during the optimization phase. That's what #36169 did, but it is clearly not a scalable solution. I encountered the exact same issue while working on a new AD compiler plugin that needs to do a whole bunch of work during inference to determine what to do (e.g. call a primitive, recurse, or increase the derivative level), and the optimizations need access to this information.

This PR adds an additional `info` field to CodeInfo and IRCode that can be used to forward this kind of information. As a proof of concept, it forwards method match info from inference to inlining (we already cache these, so there's little performance gain from this per se; it's more to exercise the infrastructure). The plan is to do an alternative fix to #36169 on top of this as the next step, but I figured I'd open it up for discussion first.
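Conceptually, the new field is a per-statement side table that travels with the IR: inference records what it learned about statement `i`, and the optimizer reads it back instead of redoing the analysis. A minimal sketch of the idea, using made-up type and field names rather than the actual compiler API:

```julia
# Hypothetical illustration of a per-statement info side table.
# `MiniIR` and `MethodMatchNote` exist only for this sketch.
struct MethodMatchNote
    matches::Vector{Method}   # what inference resolved for this call site
    fully_covered::Bool       # whether the match set is known to be complete
end

struct MiniIR
    stmts::Vector{Any}        # the statements themselves
    info::Vector{Any}         # info[i] describes stmts[i]; `nothing` if unknown
end

# Inference side: record what was learned while computing the statement's type.
record_info!(ir::MiniIR, idx::Int, note) = (ir.info[idx] = note)

# Optimizer side: consult the note instead of re-running method matching.
function inlineable_target(ir::MiniIR, idx::Int)
    note = ir.info[idx]
    note isa MethodMatchNote && note.fully_covered && length(note.matches) == 1 ?
        note.matches[1] : nothing
end
```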
This supersedes #36169. Rather than re-implementing the iteration analysis as done there, this uses the new stmtinfo infrastructure to propagate all the analysis done during inference all the way to inlining. As a result, it applies not only to splats of singletons, but also to splats of any other short iterable that inference can analyze. E.g.:

```
f(x) = (x...,)
@code_typed f(1=>2)
@benchmark f(1=>2)
```

Before:

```
julia> @code_typed f(1=>2)
CodeInfo(
1 ─ %1 = Core._apply_iterate(Base.iterate, Core.tuple, x)::Tuple{Int64,Int64}
└──      return %1
) => Tuple{Int64,Int64}

julia> @benchmark f(1=>2)
BenchmarkTools.Trial:
  memory estimate:  96 bytes
  allocs estimate:  3
  --------------
  minimum time:     242.659 ns (0.00% GC)
  median time:      246.904 ns (0.00% GC)
  mean time:        255.390 ns (1.08% GC)
  maximum time:     4.415 μs (93.94% GC)
  --------------
  samples:          10000
  evals/sample:     405
```

After:

```
julia> @code_typed f(1=>2)
CodeInfo(
1 ─ %1 = Base.getfield(x, 1)::Int64
│   %2 = Base.getfield(x, 2)::Int64
│   %3 = Core.tuple(%1, %2)::Tuple{Int64,Int64}
└──      return %3
) => Tuple{Int64,Int64}

julia> @benchmark f(1=>2)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.701 ns (0.00% GC)
  median time:      1.925 ns (0.00% GC)
  mean time:        1.904 ns (0.00% GC)
  maximum time:     6.941 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
```

I also implemented the TODO I had left in #36169 to inline the `iterate` calls themselves, which gives another 3x improvement over the solution in that PR:

```
julia> @code_typed f(1)
CodeInfo(
1 ─ %1 = Core.tuple(x)::Tuple{Int64}
└──      return %1
) => Tuple{Int64}

julia> @benchmark f(1)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.696 ns (0.00% GC)
  median time:      1.699 ns (0.00% GC)
  mean time:        1.702 ns (0.00% GC)
  maximum time:     5.389 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
```

Fixes #36087
Fixes #29114
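A quick way to check whether a given splat is being fully inlined on a build with this change is to inspect the typed IR and the allocation count. This is an illustrative check, not part of the PR; `g` is a hypothetical example function and the expected result assumes the optimization applies:

```julia
using InteractiveUtils  # for @code_typed outside the REPL

g(p) = (p..., 0)        # splat of a short iterable plus a trailing literal

# The typed IR should show plain getfield/tuple calls rather than a
# Core._apply_iterate call once the splat has been inlined.
display(@code_typed g(1 => 2))

g(1 => 2)                    # compile first so the measurement excludes compilation
@show @allocated g(1 => 2)   # expected to be 0 on a build with this change
```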
Superseded by #36684