Zip: Handle preferred memory layout of inhomogenous inputs better #809
They were previously pub, just because we didn't have the pub(crate) feature yet.
Benchmarks after this PR (the results are somewhat noisy, so I don't have a comparable before-after). The ones that improved in this PR were all "ff" cases that improved to match "cc":
The rest should have been unchanged. Benchmarks contributed by @nilgoyette and some extra cases added by me.
test slice_split_zip_cc ... bench: 1,422,302 ns/iter (+/- 389,440)
test slice_split_zip_ff ... bench: 1,438,825 ns/iter (+/- 696,813)
test slice_zip_cc ... bench: 1,464,805 ns/iter (+/- 525,807)
test slice_zip_ff ... bench: 1,469,397 ns/iter (+/- 318,191)
test zip_cc ... bench: 760,087 ns/iter (+/- 140,810)
test zip_ff ... bench: 758,991 ns/iter (+/- 236,212)
test zip_indexed_cc ... bench: 4,733,670 ns/iter (+/- 959,597)
test zip_indexed_ff ... bench: 4,151,390 ns/iter (+/- 923,228)
test zip_mut_with_cc ... bench: 725,328 ns/iter (+/- 120,358)
test zip_mut_with_ff ... bench: 718,780 ns/iter (+/- 170,944)
The reason the index benchmarks are much slower than reported in #749 is that here the index values are |
A big thank you for this PR! I tried doing it several months ago and abandoned it, so I'm quite happy to have nerd sniped you ;) I ran the benchmarks and, as you wrote, they are noisy, so it's kind of hard to spot the tiny changes. I ran them 5-6 times each and picked the best results. Having said this, it looks like the 'cc' cases are a little slower than they were (almost nothing, and probably within the noise range), but the 'ff' cases are now equal to them. This is great news for us because we are stuck with loading and writing images in Fortran order.
Just a note, there are now 2 benchmarks named |
Thanks for running benchmarks! Reducing the overhead of Zip would be interesting. I'll try to deduplicate the benchmarks and maybe remove some of the mixed benchmarks here. I don't want to run them anyway, even if they are useful for comparison and perspective. Since you ran all benchmarks, I'll share my tip of only compiling and running what you need, which we use to cope with building rust:
My development machine has changed, and the new one is very flaky at benchmarks. I can maybe see the point of criterion now; I used to have a setup that made for stable and reproducible benchmarks before.
Using split tests the performance of Zip under parallelization, so that we can see whether there are benefits to splitting arrays better.
Using the index shows more directly the overhead of indexed Zip.
Support both unroll over c- and f-layout preferred axis in Zip inner loop (the fallback when inputs are not all contiguous and same layout). Keep a tendency score when building the Zip, so that we know if the inputs are tending to be c- or f- layout. This improves performance on the just added zip_indexed_ff benchmark, so that it seems to match its (already fast) cc counterpart.
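As a rough illustration of the tendency idea in the commit message above, the sketch below sums one vote per input and uses the sign of the total to decide which axis gets the innermost loop. It is a standalone sketch, not ndarray's implementation: `Lean`, `tendency`, `View2` and `dot_sum` are hypothetical names, and the +1/-1/0 voting scheme is an assumption.

```rust
/// Hypothetical per-input vote: c-leaning, f-leaning, or neutral
/// (for example an index producer, which has no memory layout).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Lean {
    C,
    F,
    Neutral,
}

/// Sum the votes of all inputs (assumed scheme: C = +1, F = -1, Neutral = 0).
/// A negative total means the inputs tend towards f-order.
pub fn tendency(inputs: &[Lean]) -> i32 {
    inputs
        .iter()
        .map(|lean| match lean {
            Lean::C => 1,
            Lean::F => -1,
            Lean::Neutral => 0,
        })
        .sum()
}

/// A 2-D strided view over a flat buffer (strides counted in elements).
pub struct View2<'a> {
    pub data: &'a [f64],
    pub shape: [usize; 2],
    pub strides: [usize; 2],
}

impl<'a> View2<'a> {
    fn get(&self, i: usize, j: usize) -> f64 {
        self.data[i * self.strides[0] + j * self.strides[1]]
    }
}

/// Sum of elementwise products of two views of equal shape.
/// The innermost loop runs over the axis that the inputs prefer.
pub fn dot_sum(a: &View2, b: &View2, prefer_f: bool) -> f64 {
    assert_eq!(a.shape, b.shape);
    let [rows, cols] = a.shape;
    let mut acc = 0.0;
    if prefer_f {
        // f-order: axis 0 has the small stride, so it goes innermost.
        for j in 0..cols {
            for i in 0..rows {
                acc += a.get(i, j) * b.get(i, j);
            }
        }
    } else {
        // c-order (or no preference): the last axis goes innermost.
        for i in 0..rows {
            for j in 0..cols {
                acc += a.get(i, j) * b.get(i, j);
            }
        }
    }
    acc
}
```

Calling `dot_sum(&a, &b, tendency(&votes) < 0)` gives the same result for either loop order; only the memory access pattern changes, which is where any speedup for f-order inputs would come from.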
cd21da6 to 47b3654
Clippy is inexplicably emitting a lint that we are allowing, and it reproduces locally. It doesn't touch the code in the PR, so it's a separate issue.
For example, when we use Zip::from(a).and(b), the Zip will examine the inputs and try to determine if they are all contiguous (and in the same way). It can now also determine what tendency the inputs have, to further guide which axis should be used for the innermost loop, even if not all the inputs are contiguous. This helps, for example, with indexed Zip on f-order producers. The index producer has no bias in either direction, so all the other inputs will determine the layout preference.
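To make the "contiguous versus merely tending" distinction concrete, here is a minimal standalone sketch that classifies a 2-D input from its shape and strides. The names `Kind`, `classify` and `all_contiguous_same_way` are hypothetical, and the rule that a unit stride on the last (or first) axis signals a c- (or f-) preference is an assumption for illustration, not a statement about ndarray's internals. It also assumes both axes are longer than 1.

```rust
/// Hypothetical classification of one 2-D input.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Kind {
    CContig,
    FContig,
    CPrefer,
    FPrefer,
    NoPreference,
}

/// Classify an input from its shape and strides (strides in elements).
/// Assumes both axes have length greater than 1.
pub fn classify(shape: [usize; 2], strides: [isize; 2]) -> Kind {
    let [rows, cols] = shape;
    if strides == [cols as isize, 1] {
        Kind::CContig
    } else if strides == [1, rows as isize] {
        Kind::FContig
    } else if strides[1] == 1 {
        // Not contiguous, but the last axis is still the cheap one to walk.
        Kind::CPrefer
    } else if strides[0] == 1 {
        // Not contiguous, but the first axis is still the cheap one to walk.
        Kind::FPrefer
    } else {
        Kind::NoPreference
    }
}

/// True if every input is contiguous in the same order, so that a single
/// flat loop over memory is possible.
pub fn all_contiguous_same_way(kinds: &[Kind]) -> bool {
    !kinds.is_empty()
        && (kinds.iter().all(|&k| k == Kind::CContig)
            || kinds.iter().all(|&k| k == Kind::FContig))
}
```

For instance, taking the first two rows of a 3x4 f-order array leaves strides `[1, 3]` with shape `[2, 4]`: no longer contiguous, but still classified as `FPrefer` under the assumed rule, so it can still steer the choice of inner loop axis.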
The improved layout preference also affects parallelism, because in some cases we can better choose which axis to split along to preserve locality better.
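One way a layout preference could guide splitting is sketched below: split along the axis with the largest stride, which is the first axis for c-order and the last axis for f-order, so that each half covers a compact region of memory. The functions `split_axis` and `split_shape` are hypothetical helpers written for this illustration, and the sketch assumes a non-empty shape.

```rust
/// Pick the axis to split along: the first axis longer than 1 when c-order
/// is preferred, the last such axis when f-order is preferred.
pub fn split_axis(shape: &[usize], prefer_f: bool) -> usize {
    if prefer_f {
        shape
            .iter()
            .rposition(|&len| len > 1)
            .unwrap_or(shape.len().saturating_sub(1))
    } else {
        shape.iter().position(|&len| len > 1).unwrap_or(0)
    }
}

/// Split a shape into two halves along the chosen axis.
/// Assumes `shape` is non-empty.
pub fn split_shape(shape: &[usize], prefer_f: bool) -> (Vec<usize>, Vec<usize>) {
    let axis = split_axis(shape, prefer_f);
    let mid = shape[axis] / 2;
    let mut left = shape.to_vec();
    let mut right = shape.to_vec();
    left[axis] = mid;
    right[axis] = shape[axis] - mid;
    (left, right)
}
```

With a 4x6 f-order array, `split_shape(&[4, 6], true)` yields two 4x3 halves, each made of whole columns, so each worker touches one contiguous block of memory.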
The Layout type was improved to make this possible. It now has flags for C/F-contig and for C/F-preference. The new layout bits are visible in the array debug output.

Fixes #749
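A four-flag layout type of the kind described above might look roughly like the following. This is a hedged sketch only: the bit values, the constant and method names, the choice that contiguity also sets the matching preference bit, and the `Debug` format are all assumptions made for illustration and are not taken from ndarray's source.

```rust
use std::fmt;

/// Hypothetical layout bit set with two contiguity and two preference flags.
#[derive(Clone, Copy, PartialEq, Eq)]
pub struct Layout(u32);

impl Layout {
    pub const C_CONTIG: u32 = 0b0001;
    pub const F_CONTIG: u32 = 0b0010;
    pub const C_PREFER: u32 = 0b0100;
    pub const F_PREFER: u32 = 0b1000;

    /// Assumption: a contiguous layout also carries the matching preference.
    pub fn c() -> Layout {
        Layout(Self::C_CONTIG | Self::C_PREFER)
    }
    pub fn f() -> Layout {
        Layout(Self::F_CONTIG | Self::F_PREFER)
    }
    pub fn c_prefer() -> Layout {
        Layout(Self::C_PREFER)
    }
    pub fn f_prefer() -> Layout {
        Layout(Self::F_PREFER)
    }

    /// True if the given flag is set.
    pub fn is(self, flag: u32) -> bool {
        self.0 & flag != 0
    }

    /// Keep only the flags that both layouts share.
    pub fn intersect(self, other: Layout) -> Layout {
        Layout(self.0 & other.0)
    }
}

impl fmt::Debug for Layout {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        const NAMES: [(u32, &str); 4] = [
            (Layout::C_CONTIG, "C_CONTIG"),
            (Layout::F_CONTIG, "F_CONTIG"),
            (Layout::C_PREFER, "C_PREFER"),
            (Layout::F_PREFER, "F_PREFER"),
        ];
        // Collect the names of all set flags, then print them with the raw bits.
        let names: Vec<&str> = NAMES
            .iter()
            .filter(|(bit, _)| self.0 & *bit != 0)
            .map(|(_, name)| *name)
            .collect();
        write!(f, "Layout({}, {:#x})", names.join("+"), self.0)
    }
}
```

Intersecting a contiguous c-order layout with a c-preferring one, `Layout::c().intersect(Layout::c_prefer())`, leaves only the preference flag set, which matches the idea that a group of inputs can keep a shared tendency even when they are not all contiguous.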