Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Excessive LLVM time in egal codegen of large struct #54109

Closed
Keno opened this issue Apr 17, 2024 · 4 comments · Fixed by #54121
Closed

Excessive LLVM time in egal codegen of large struct #54109

Keno opened this issue Apr 17, 2024 · 4 comments · Fixed by #54121
Labels
compiler:codegen Generation of LLVM IR and native code compiler:llvm For issues that relate to LLVM performance Must go faster

Comments

@Keno
Copy link
Member

Keno commented Apr 17, 2024

This is similar to #44998, in that LLVM's SLPVectorizer is involved, but I think it's easier to solve by tweaking the codegen for egal:

struct DefaultOr{T}
   x::T
   default::Bool
end

@eval struct Torture
    $((Expr(:(::), Symbol("x$i"), DefaultOr{Float64}) for i = 1:897)...)
end

egal_any(x::Torture, y::Any) = x === y
julia> @time code_llvm(egal_any, Tuple{Torture, Any})
 22.034327 seconds (5.48 M allocations: 206.847 MiB, 0.40% gc time, 88.69% compilation time: <1% of which was recompilation)
@vtjnash
Copy link
Member

vtjnash commented Apr 17, 2024

I think Oscar was proposing making this code more branch-y, which should help defeat the vectorizer. All those undef padding bits otherwise get in the way of doing simple loops over the bits

@gbaraldi
Copy link
Member

Does the padding stop us from emmiting a memcpy?

@Keno
Copy link
Member Author

Keno commented Apr 17, 2024

Yes, the padding is forcing us to emit this unrolled. I think a reasonable implementation here would be to RLE the padding bit pattern and then emit the compare as a sequence of loops with an early out between each block. That should allow the loop vectorizer to emit the correct target-specific comparison sequence for each bit pattern as well as giving it license to early out the loop, without forcing that semantically.

@gbaraldi
Copy link
Member

We should probably vendor the expand memcmp code llvm has. Not sure if there is anything that we can annotate the loop to say, hey we don't care if you early/late exit this

@giordano giordano added performance Must go faster compiler:llvm For issues that relate to LLVM compiler:codegen Generation of LLVM IR and native code labels Apr 17, 2024
Keno added a commit that referenced this issue Apr 17, 2024
The strategy here is to look at (data, padding) pairs and RLE
them into loops, so that repeated adjacent patterns use a loop
rather than getting unrolled. On the test case from #54109,
this makes compilation essentially instant, while also being
faster at runtime (turns out LLVM spends a massive amount of time
AND the answer is bad).

There's some obvious further enhancements possible here:
1. The `memcmp` constant is small. LLVM has a pass to inline these
   with better code. However, we don't have it turned on. We should
   consider vendoring it, though we may want to add some shorcutting
   to it to avoid having it iterate through each function.
2. This only does one level of sequence matching. It could be recursed
   to turn things into nested loops.

However, this solves the immediate issue, so hopefully it's a useful
start. Fixes #54109.
Keno added a commit that referenced this issue Apr 17, 2024
The strategy here is to look at (data, padding) pairs and RLE
them into loops, so that repeated adjacent patterns use a loop
rather than getting unrolled. On the test case from #54109,
this makes compilation essentially instant, while also being
faster at runtime (turns out LLVM spends a massive amount of time
AND the answer is bad).

There's some obvious further enhancements possible here:
1. The `memcmp` constant is small. LLVM has a pass to inline these
   with better code. However, we don't have it turned on. We should
   consider vendoring it, though we may want to add some shorcutting
   to it to avoid having it iterate through each function.
2. This only does one level of sequence matching. It could be recursed
   to turn things into nested loops.

However, this solves the immediate issue, so hopefully it's a useful
start. Fixes #54109.
Keno added a commit that referenced this issue Apr 24, 2024
The strategy here is to look at (data, padding) pairs and RLE
them into loops, so that repeated adjacent patterns use a loop
rather than getting unrolled. On the test case from #54109,
this makes compilation essentially instant, while also being
faster at runtime (turns out LLVM spends a massive amount of time
AND the answer is bad).

There's some obvious further enhancements possible here:
1. The `memcmp` constant is small. LLVM has a pass to inline these
   with better code. However, we don't have it turned on. We should
   consider vendoring it, though we may want to add some shorcutting
   to it to avoid having it iterate through each function.
2. This only does one level of sequence matching. It could be recursed
   to turn things into nested loops.

However, this solves the immediate issue, so hopefully it's a useful
start. Fixes #54109.
Keno added a commit that referenced this issue Apr 24, 2024
The strategy here is to look at (data, padding) pairs and RLE
them into loops, so that repeated adjacent patterns use a loop
rather than getting unrolled. On the test case from #54109,
this makes compilation essentially instant, while also being
faster at runtime (turns out LLVM spends a massive amount of time
AND the answer is bad).

There's some obvious further enhancements possible here:
1. The `memcmp` constant is small. LLVM has a pass to inline these
   with better code. However, we don't have it turned on. We should
   consider vendoring it, though we may want to add some shorcutting
   to it to avoid having it iterate through each function.
2. This only does one level of sequence matching. It could be recursed
   to turn things into nested loops.

However, this solves the immediate issue, so hopefully it's a useful
start. Fixes #54109.
Keno added a commit that referenced this issue Apr 24, 2024
The strategy here is to look at (data, padding) pairs and RLE
them into loops, so that repeated adjacent patterns use a loop
rather than getting unrolled. On the test case from #54109,
this makes compilation essentially instant, while also being
faster at runtime (turns out LLVM spends a massive amount of time
AND the answer is bad).

There's some obvious further enhancements possible here:
1. The `memcmp` constant is small. LLVM has a pass to inline these
   with better code. However, we don't have it turned on. We should
   consider vendoring it, though we may want to add some shorcutting
   to it to avoid having it iterate through each function.
2. This only does one level of sequence matching. It could be recursed
   to turn things into nested loops.

However, this solves the immediate issue, so hopefully it's a useful
start. Fixes #54109.
Keno added a commit that referenced this issue Apr 24, 2024
The strategy here is to look at (data, padding) pairs and RLE
them into loops, so that repeated adjacent patterns use a loop
rather than getting unrolled. On the test case from #54109,
this makes compilation essentially instant, while also being
faster at runtime (turns out LLVM spends a massive amount of time
AND the answer is bad).

There's some obvious further enhancements possible here:
1. The `memcmp` constant is small. LLVM has a pass to inline these
   with better code. However, we don't have it turned on. We should
   consider vendoring it, though we may want to add some shorcutting
   to it to avoid having it iterate through each function.
2. This only does one level of sequence matching. It could be recursed
   to turn things into nested loops.

However, this solves the immediate issue, so hopefully it's a useful
start. Fixes #54109.
Keno added a commit that referenced this issue Apr 25, 2024
The strategy here is to look at (data, padding) pairs and RLE
them into loops, so that repeated adjacent patterns use a loop
rather than getting unrolled. On the test case from #54109,
this makes compilation essentially instant, while also being
faster at runtime (turns out LLVM spends a massive amount of time
AND the answer is bad).

There's some obvious further enhancements possible here:
1. The `memcmp` constant is small. LLVM has a pass to inline these
   with better code. However, we don't have it turned on. We should
   consider vendoring it, though we may want to add some shorcutting
   to it to avoid having it iterate through each function.
2. This only does one level of sequence matching. It could be recursed
   to turn things into nested loops.

However, this solves the immediate issue, so hopefully it's a useful
start. Fixes #54109.
Keno added a commit that referenced this issue Apr 25, 2024
The strategy here is to look at (data, padding) pairs and RLE
them into loops, so that repeated adjacent patterns use a loop
rather than getting unrolled. On the test case from #54109,
this makes compilation essentially instant, while also being
faster at runtime (turns out LLVM spends a massive amount of time
AND the answer is bad).

There's some obvious further enhancements possible here:
1. The `memcmp` constant is small. LLVM has a pass to inline these
   with better code. However, we don't have it turned on. We should
   consider vendoring it, though we may want to add some shorcutting
   to it to avoid having it iterate through each function.
2. This only does one level of sequence matching. It could be recursed
   to turn things into nested loops.

However, this solves the immediate issue, so hopefully it's a useful
start. Fixes #54109.
@Keno Keno closed this as completed in 50833c8 Apr 25, 2024
vtjnash added a commit that referenced this issue Feb 18, 2025
This reverts a portion of commit 50833c8.

This algorithm is not able to handle simple cases where there is any
internal padding, such as the example of:
```
struct LotsBytes
    a::Int8
    b::NTuple{256,Int}
    c::Int
end
```

Unfortunately fixing it is a bit of a large project right now, so
reverting now to fix correctness while working on that.

Fixes #55513 (indirectly, by removing broken code)
Maybe reopens #54109, although the latency issue it proposes to fix
doesn't occur on master even with this revert (just the mediocre looking
IR result output returns)
vtjnash added a commit that referenced this issue Feb 19, 2025
This reverts a portion of commit
50833c8.

This algorithm is not able to handle simple cases where there is any
internal padding, such as the example of:
```
struct LotsBytes
    a::Int8
    b::NTuple{256,Int}
    c::Int
end
```

Unfortunately fixing it is a bit of a large project right now, so
reverting now to fix correctness while working on that.

Fixes #55513 (indirectly, by removing broken code) Maybe reopens #54109,
although the latency issue it proposes to fix doesn't occur on master
even with this revert (just the mediocre looking IR result output
returns)
KristofferC pushed a commit that referenced this issue Feb 21, 2025
This reverts a portion of commit
50833c8.

This algorithm is not able to handle simple cases where there is any
internal padding, such as the example of:
```
struct LotsBytes
    a::Int8
    b::NTuple{256,Int}
    c::Int
end
```

Unfortunately fixing it is a bit of a large project right now, so
reverting now to fix correctness while working on that.

Fixes #55513 (indirectly, by removing broken code) Maybe reopens #54109,
although the latency issue it proposes to fix doesn't occur on master
even with this revert (just the mediocre looking IR result output
returns)

(cherry picked from commit a65c2cf)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler:codegen Generation of LLVM IR and native code compiler:llvm For issues that relate to LLVM performance Must go faster
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants