-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add an overflow check in the Iter::next() impl for Range<_> to help with vectorization. #43595
Conversation
This helps with vectorization in some cases, such as (0..u16::MAX).collect::<Vec<u16>>(), as LLVM is able to change the loop condition to use equality instead of less than
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @BurntSushi (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information. |
A slight downside to this change is that it makes the function slightly slower in non-optimized builds due to the extra check. |
If the problem here is that external iteration is slow, as it needs to recheck things, should vec's SpecExtend switch to using the new |
@scottmcm I wasn't aware of I suppose using for_each may also be helpful for performance in non-optimized builds, depending on how closures work before optimizations. The change suggested in your pr (#43351) may also help a bit on this front, given how swap does a bunch of stuff that's not needed for the small types used in ranges, but I haven't done any tests on that yet. If we want somewhat fast iteration over ranges in non-optimized builds I think having some specializations on ranges primitive integers that avoids all the indirection would make sense. Anyhow, all this may be better suited for a later PR. As for this PR, I still think it makes sense to have this change in .next for now though as it may help with vectorization in other cases as well, and I think a 20(+)x speedup in release mode is worth a 10-20% slowdown in non-optimized builds (that has a fair bit of potential for improvement later anyhow.) I might have a look at the other suggested improvements to vector extends in particular later, maybe we can also find out what is the cause of the remaining difference between this and the unsafe implementation. It's going to be a few days until I get home and can put in some larger amount of work on this anyhow. |
I did a small test by creating a new Iterator type with the relevant parts of the Range struct, and using != as the condition does indeed do the same job as an overflow check but avoids having an extra check in non-optimized mode. This was provided the struct implemented TrustedLen like Range so that the specialization for those types is used. (Interestingly, when using <, not implementing TrustedLen made the collect function twice as fast.) So that could be used for Range::next() for a for_each or other specialization where one can be sure start hasn't been set higher than end (which is possible as the fields are public). |
r? @aturon |
@bors: r+ Thanks for this PR! |
📌 Commit 4bb9a8b has been approved by |
Add an overflow check in the Iter::next() impl for Range<_> to help with vectorization. This helps with vectorization in some cases, such as (0..u16::MAX).collect::<Vec<u16>>(), as LLVM is able to change the loop condition to use equality instead of less than and should help with #43124. (See also my [last comment](#43124 (comment)) there.) This PR makes collect on ranges of u16, i16, i8, and u8 **significantly** faster (at least on x86-64 and i686), and pretty close, though not quite equivalent to a [manual unsafe implementation](https://is.gd/nkoecB). 32 ( and 64-bit values on x86-64) bit values were already vectorized without this change, and they still are. This PR doesn't seem to help with 64-bit values on i686, as they still don't vectorize well compared to doing a manual loop. I'm a bit unsure if this was the best way of implementing this, I tried to do it with as little changes as possible and avoided changing the step trait and the behavior in RangeFrom (I'll leave that for others like #43127 to discuss wider changes to the trait). I tried simply changing the comparison to `self.start != self.end` though that made the compiler segfault when compiling stage0, so I went with this method instead for now. As for `next_back()`, reverse ranges seem to optimise properly already.
☀️ Test successful - status-appveyor, status-travis |
This helps with vectorization in some cases, such as (0..u16::MAX).collect::<Vec>(),
as LLVM is able to change the loop condition to use equality instead of less than and should help with #43124. (See also my last comment there.) This PR makes collect on ranges of u16, i16, i8, and u8 significantly faster (at least on x86-64 and i686), and pretty close, though not quite equivalent to a manual unsafe implementation. 32 ( and 64-bit values on x86-64) bit values were already vectorized without this change, and they still are. This PR doesn't seem to help with 64-bit values on i686, as they still don't vectorize well compared to doing a manual loop.
I'm a bit unsure if this was the best way of implementing this, I tried to do it with as little changes as possible and avoided changing the step trait and the behavior in RangeFrom (I'll leave that for others like #43127 to discuss wider changes to the trait). I tried simply changing the comparison to
self.start != self.end
though that made the compiler segfault when compiling stage0, so I went with this method instead for now.As for
next_back()
, reverse ranges seem to optimise properly already.