-
-
Notifications
You must be signed in to change notification settings - Fork 852
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed improvements to resize kernel (w/ SIMD) #1513
Merged
Merged
Changes from all commits
Commits
Show all changes
8 commits
Select commit
Hold shift + click to select a range
42632c7
Add initial FMA resize kernel convolve implementation
Sergio0694 3f7deb5
Improved loading of factors using permutation
Sergio0694 874e951
Switch from FMA to AVX2 instructions
Sergio0694 941e173
Revert to FMA, codegen improvements
Sergio0694 493d04a
Add unrolled FMA loop
Sergio0694 407c2d9
Add missing indexing update
Sergio0694 a7ca1b0
Workaround for incorrect codegen on .NET 5
Sergio0694 e2211c3
Update image threshold for resize tests
Sergio0694 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to what I learned from @saucecontrol, moving permutes out from an operation (dependency) chain and running them in a separate sequence might help performance.
Thinking it further:
Would make the code more tricky, but maybe we can try to process two
Vector256<float>
-s in the loop body, so we can run 4 permutes in a row, then do 4+4 FMA-s.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was an issue with using locals here (I documented that in the comments here), where the JIT was picking the wrong instruction for the FMA operation and adding extra unnecessary memory copies. Doing this inline instead picked the right one that directly loaded the first argument from memory, which resulted in much better asseembly. I'm worried that moving things around will make the codegen worse again there. Also from what we discussed on Discord, there's usually 2 ports to perform FMA multiplications, so it might not be beneficial to do more than 2 in the same loop? I mean, other than the general marginal improvements just due to more unrolling, possibly.
I think @saucecontrol is doing only two ops per iteration as well in his own lib because of this? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, 2 is the max number of FMAs that can be scheduled at once, but it's a pipelined instruction, so you can get more benefit from scheduling more sequentially. I had an unroll by 4 in MagicScaler previously, but it wasn't a ton faster so I dropped it to reduce complexity.
The way I get around having to shuffle/permute the weights in the inner loop is by pre-duplicating them in my kernel map. So my inner loop is just 2 reads of pixel values and 2 FMAs (with the weight reads contained in the FMA instruction). That approach also has the benefit of being backward compatible with
Vector<T>
, allowing AVX processing on netfx.