Performance regressions porting Jetscii from inline assembly to intrinsics #401
Is it possible to look at and compare the codegen?
Initial digging showed that I was making calls to the intrinsics:

```
0x100024cca <+26>: callq 0x100023a80 ; core::coresimd::x86::sse42::_mm_cmpestrm::h56dc3784ab1ef3fc (.llvm.8661840455491943110)
```

@alexcrichton points out that if you see calls, it probably means the target feature business is mixed up: the caller doesn't have the target feature enabled. I added the appropriate attributes.
The next bit of comparison shows that my trait implementation functions are no longer being inlined at their call sites. Adding `#[inline]` attributes addressed this.
Rewriting the callsite to (a) be unsafe and (b) add the target feature attribute also worked.
Of note is that I forgot to transform all callsites at first and got this wonderful LLVM error:
I reran my benchmarks with a slightly less loaded machine and the intrinsics appear to be a net win at the moment:
I do have some outstanding questions / comments:
This is a known Rust bug. Making a function `#[inline(always)]` would normally force inlining. However, functions cannot be inlined across incompatible target features without introducing all sorts of weird undefined behavior. So if you attempt to call an `#[inline(always)]` function from a caller with incompatible target features, the inline request cannot be honored.
How were you benchmarking?
No, I was simply running the benchmark. Rolling back to the version before I tweaked everything for inlining, and running with that flag, restored performance. This is both good (yay, I don't have to add as much cruft!) and bad (how am I going to explain to every user and downstream user that they need to compile with "magic flags"?).
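For reference, a sketch of what building with such "magic flags" looks like (standard cargo/rustc invocations; the feature string is the one discussed in this thread):

```shell
# Enable SSE 4.2 for every crate in the build:
RUSTFLAGS="-C target-feature=+sse4.2" cargo bench

# Or persist it in the project's .cargo/config:
# [build]
# rustflags = ["-C", "target-feature=+sse4.2"]
```

The downside discussed above is exactly this: every downstream consumer must remember to set the flag, or they silently get the unoptimized (or non-inlined) code.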
@shepmaster
So does that mean that using `-C target-feature` at compile time avoids the inlining problem?
Let's leave that out for now and try to come up first with a version of the code that uses the intrinsics, yet has no problems with inlining and needs no special compile-time flags.
@shepmaster if possible, can you open a dedicated issue for the LLVM error you got? That's definitely a bug in rustc and/or stdsimd; we want to shield users from those at all costs!
Yes, it does.
Done in #404
So first I'd like to explain why so many changes are required. Before, if you compiled with `-C target-feature=+sse4.2`, the code could look like this:

```rust
// crate root
#[cfg(all(any(target_arch = "x86", target_arch = "x86_64"), not(target_feature = "sse4.2")))]
compile_fail!("jetscii x86/x86_64 requires a target with SSE4.2 enabled. Please compile these targets with RUSTFLAGS=\"-C target-feature=+sse4.2\"");

#[inline(always)]
#[target_feature = "sse4.2"]
unsafe fn foo_impl() { /* call intrinsics here */ }

// This is always safe because of the compile fail above:
#[inline(always)] fn foo() { unsafe { foo_impl() } }
```

Now, if you want to make runtime feature detection work, you need something like:

```rust
mod myalgo {
    cfg_if! {
        if #[cfg(any(target_arch = "x86", target_arch = "x86_64"))] {
            mod x86;
            pub use x86::myalgo;
        } else if #[cfg(target_arch = "arm")] {
            mod arm;
            pub use arm::myalgo;
        } else {
            mod fallback;
            pub use fallback::myalgo;
        }
    }
}
```

and then at the architecture level:

```rust
mod x86 {
    mod sse42;
    #[path = "../fallback.rs"] // reuse fallback here
    mod fallback;

    #[inline]
    fn myalgo() {
        if is_x86_feature_detected!("sse4.2") {
            unsafe { sse42::myalgo() }
        } else {
            fallback::myalgo()
        }
    }
}
```

Obviously, this is quite a bit more code than before[0].

[0]: the number of times one writes this pattern grows with the number of algorithms and feature levels you want to support.
The key to how I do things is this type:

```rust
/// A builder for SSSE3 empowered vectors.
///
/// This builder represents a receipt that the SSSE3 target feature is enabled
/// on the currently running CPU. Namely, the only way to get a value of this
/// type is if the SSSE3 feature is enabled.
///
/// This type can then be used to build vector types that use SSSE3 features
/// safely.
#[derive(Clone, Copy, Debug)]
pub struct SSSE3VectorBuilder(());
```

The only way for a consumer outside of this module to get a value with type `SSSE3VectorBuilder` is through this constructor:

```rust
/// Create a new SSSE3 vector builder.
///
/// If the SSSE3 feature is not enabled for the current target, then
/// return `None`.
pub fn new() -> Option<SSSE3VectorBuilder> {
    if is_x86_feature_detected!("ssse3") {
        Some(SSSE3VectorBuilder(()))
    } else {
        None
    }
}
```

And in turn, the only way for this constructor to return a non-`None` value is if the SSSE3 feature is detected at runtime.

In that same module, I defined my own vector type (using a macro so that things work on Rust 1.12):

```rust
// We define our union with a macro so that our code continues to compile on
// Rust 1.12.
macro_rules! defunion {
    () => {
        /// A u8x16 is a 128-bit vector with 16 single-byte lanes.
        ///
        /// It provides a safe API that uses only SSE2 or SSSE3 instructions.
        /// The only way for callers to construct a value of this type is
        /// through the SSSE3VectorBuilder type, and the only way to get a
        /// SSSE3VectorBuilder is if the `ssse3` target feature is enabled.
        ///
        /// Note that generally speaking, all uses of this type should get
        /// inlined, otherwise you probably have a performance bug.
        #[derive(Clone, Copy)]
        #[allow(non_camel_case_types)]
        pub union u8x16 {
            vector: __m128i,
            bytes: [u8; 16],
        }
    }
}

defunion!();
```

In particular, the only way for a consumer to get a `u8x16` is through methods on `SSSE3VectorBuilder`. Here is one of its methods:

```rust
#[inline]
pub fn ne(self, other: u8x16) -> u8x16 {
    // Safe because we know SSSE3 is enabled.
    unsafe {
        let boolv = _mm_cmpeq_epi8(self.vector, other.vector);
        let ones = _mm_set1_epi8(0xFF as u8 as i8);
        u8x16 { vector: _mm_andnot_si128(boolv, ones) }
    }
}
```

This might seem like a lot of ceremony, but these vectors are used in a fairly complex SIMD algorithm called Teddy, and this was the only way I could figure out how to "isolate" the `unsafe`. I don't mean to suggest you should adopt this approach for jetscii unless you feel like it will work well for you, but rather, to plant the seed of using the type system in some way to control `unsafe`.

One potential downside to all of this is that it can be easy to introduce performance bugs, and to some extent, it's not clear whether I'm relying on LLVM bugs to get performance correct here. In particular, I only use `#[target_feature]` in a couple of places and rely on inlining everywhere else.

(To be clear, compile time flags aren't really an option to me. IMO, people should be focusing on using runtime detection as much as possible, to make their optimizations apply with the lowest possible friction.)
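As a side note, one nice property of such a "receipt" type is that it is zero-sized, so threading it through an algorithm costs nothing at runtime. Here is a minimal, hedged sketch of the same idea (the `Ssse3Token` name is hypothetical; on non-x86 targets the check compiles away and the constructor always returns `None`):

```rust
// A zero-sized "receipt": the only way to obtain one is to pass the
// runtime feature check, mirroring the SSSE3VectorBuilder idea above.
#[derive(Clone, Copy, Debug)]
pub struct Ssse3Token(());

impl Ssse3Token {
    pub fn new() -> Option<Ssse3Token> {
        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
        {
            if is_x86_feature_detected!("ssse3") {
                return Some(Ssse3Token(()));
            }
        }
        None
    }
}

fn main() {
    // The token occupies no space; it exists purely at the type level.
    println!("token size = {}", std::mem::size_of::<Ssse3Token>());
    match Ssse3Token::new() {
        Some(_token) => println!("ssse3 path available"),
        None => println!("fallback path"),
    }
}
```

Which branch `main` takes depends on the machine, but the type-level guarantee is the same everywhere: safe code outside the module simply cannot conjure a token without the runtime check succeeding.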
Oh sorry, this is not what I meant. What I meant is that before porting the code to stdsimd, it wasn't doing any runtime feature detection.
I think that all those methods should be `unsafe`.
Yes. I was just pointing out that the code did not use this before, and that the minimal path towards using the intrinsics doesn't require it either.
Oh I see. I thought @shepmaster was doing a CPU ID check before though. Now that I look at the code, I don't see any CPU ID check, so yeah, I see what you mean now.
The first problem is mildly annoying, but I could easily live with it. The second problem is really what drove me to my current solution. But yes, I think it is interesting to question whether those methods should have an `unsafe` marker.
Ah I see, yeah, that's true if you were using the compile-time flags.
Basically all these methods are inside the same crate and LLVM is just doing inlining as usual. Inlining functions into other functions that extend their target feature set is ok, so inlining non-ssse3 functions into ssse3 functions is perfectly fine. Problems can only arise when LLVM does not inline an intermediary function that does not have the ssse3 target feature attribute. In that case, ssse3 functions won't be inlineable into it either.

I personally think that code should express intent. In this case, your intent is clearly for all that code to be compiled with the ssse3 feature enabled. This requires you to make those functions `unsafe`, which is unfortunate. Currently there is no way to express this intent in Rust, and I think this is a major ergonomic issue with the current `#[target_feature]` design.
This seemed very worrisome to me at first (bad LLVM!) but looking more into this I think that's ok. I'd guess that what's happening here is that LLVM is inlining methods without target features into functions that do have them. I suspect that if you didn't tag the top method with `#[target_feature]`, none of this inlining would happen.

This actually seems like it could be a really cool trick one day...

```rust
impl SSSE3VectorBuilder {
    #[target_feature(enable = "ssse3")]
    pub fn run<F, R>(&self, f: F) -> R
        where F: FnOnce() -> R
    { f() }
}
```

Somehow you could probably assert to the compiler that the closure is compiled with the ssse3 feature enabled.

I'd probably shy away from it for now, but I could see it being a possibility!
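To make the shape of that speculative API concrete, here is a feature-free sketch: a token type whose `run` method takes a closure. In the real trick, `run` would carry `#[target_feature(enable = "ssse3")]` so the closure body would be compiled and inlined inside the feature-enabled frame; the `Token` name and the pass-through body are made up so the sketch runs anywhere:

```rust
struct Token(());

impl Token {
    // Pass-through stand-in for the speculative #[target_feature] version:
    // in that version, the closure would be inlined into a frame that has
    // the target feature enabled.
    fn run<F, R>(&self, f: F) -> R
    where
        F: FnOnce() -> R,
    {
        f()
    }
}

fn main() {
    let t = Token(());
    let sum: i32 = t.run(|| (1..=4).sum());
    println!("{}", sum); // 10
}
```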
Oh yes, indeed! I experimented with that when I was writing the code to check my understanding, and that was indeed the case. Performance regressed and none of the intrinsics were inlined (as expected). @gnzlbg I basically agree with you about intent. I just happen to come down on the side of "I'd rather obscure intent a little in favor of isolating `unsafe`."
I think we should pursue a minimal step towards what @hsivonen proposed in his RFC. Currently, we require that `#[target_feature]` functions be `unsafe fn`. A minimal step would be to allow safe `#[target_feature]` functions and restrict who may call them:

```rust
#[target_feature = "sse2"] unsafe fn foo() { }
#[target_feature = "sse2"] fn bar() { }

fn meow() {
    foo(); // ERROR (unsafe block required)
    unsafe { foo() }; // OK
    bar(); // ERROR (meow is not sse2)
    unsafe { bar() }; // ERROR (meow is not sse2)
}

#[target_feature = "sse2"]
fn bark() {
    foo(); // ERROR (unsafe block required)
    unsafe { foo() }; // OK
    bar(); // OK (bark is sse2)
    unsafe { bar() }; // OK (bark is sse2)
}

#[target_feature = "avx"] // avx != sse2, see [0]
fn moo() {
    foo(); // ERROR (unsafe block required)
    unsafe { foo() }; // OK
    bar(); // ERROR (moo is not sse2)
    unsafe { bar() }; // ERROR (moo is not sse2)
}
```

This minimal step would probably solve most of @BurntSushi's pain points. An incremental improvement over this would be to allow calling these "safe" target feature functions from functions that do not have the same target feature by using an `unsafe` block:

```rust
#[target_feature = "sse2"] unsafe fn foo() { }
#[target_feature = "sse2"] fn bar() { }

fn meow() {
    foo(); // ERROR (unsafe block required)
    unsafe { foo() }; // OK
    bar(); // ERROR (meow is not sse2)
    unsafe { bar() }; // OK - was an error before
}

#[target_feature = "sse2"]
fn bark() {
    foo(); // ERROR (unsafe block required)
    unsafe { foo() }; // OK
    bar(); // OK (bark is sse2)
    unsafe { bar() }; // OK (bark is sse2)
}

#[target_feature = "avx"]
fn moo() {
    foo(); // ERROR (unsafe block required)
    unsafe { foo() }; // OK
    bar(); // ERROR (moo is not sse2)
    unsafe { bar() }; // OK - was an error before
}
```

These two improvements would already deliver a lot of bang-per-buck. They would allow everybody to write safe `#[target_feature]` functions and call them safely from matching contexts. For things like trait methods and function pointers, things would look like this:

```rust
trait Foo { fn foo(); }
struct Fooish();
impl Foo for Fooish {
    #[target_feature = "sse2"] fn foo() { }
    // ^ ERROR: #[target_feature] on trait method impl requires
    // unsafe fn but Foo::foo is safe
}

trait Bar { unsafe fn bar(); }
struct Barish();
impl Bar for Barish {
    #[target_feature = "sse2"] unsafe fn bar() { } // OK
}

#[target_feature] fn meow() {}

static x: fn () -> () = meow;
// ^ ERROR: meow can only be assigned to unsafe fn pointers due to
// #[target_feature] but function pointer x with type fn()->() is safe.
static y: unsafe fn () -> () = meow; // OK
```

[0]: to keep things minimal and the discussion focused I'd rather stay away from feature hierarchies initially.
I am really happy that the two of you ironed out my (evidently poorly worded) concern:

```rust
fn driver() {
    let a = alpha();
    let b = beta();
}

#[inline]
#[target_feature(...)]
fn alpha() { /* some intrinsic */ }

#[inline]
#[target_feature(...)]
fn beta() { /* some intrinsic */ }
```

In this code, `alpha` and `beta` cannot be inlined into `driver`, because `driver` does not share their target feature.
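One way to resolve that concern is to keep the feature-gated body behind a runtime check, so nothing needs to be inlined across the feature boundary. This is only a sketch, not the thread's actual fix: the function names are invented and a wrapping multiply stands in for real intrinsic work.

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.2")]
unsafe fn alpha_sse42(x: u32) -> u32 {
    // Stand-in for intrinsic-using work; the whole hot loop would
    // live inside this feature-enabled function.
    x.wrapping_mul(3)
}

fn alpha_fallback(x: u32) -> u32 {
    x.wrapping_mul(3)
}

fn driver(x: u32) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.2") {
            // Safe to call: we just proved the feature is present.
            return unsafe { alpha_sse42(x) };
        }
    }
    alpha_fallback(x)
}

fn main() {
    println!("{}", driver(7)); // 21 on either path
}
```

The point is that inlining now only needs to happen *inside* `alpha_sse42`, where the feature attribute is consistent, rather than from `alpha_sse42` into a feature-less `driver`.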
I concur 100%
@BurntSushi I wanted the best of both worlds, so I'm pursuing something that uses the compile-time settings when it can and runtime detection when it cannot:

```rust
pub struct Searcher<F>
where
    F: Fn(u8) -> bool,
{
    // Include this implementation only when compiling for x86_64 as
    // that's the only platform that we support.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    fast: Fast,

    // If we are *guaranteed* to have SSE 4.2, then there's no reason
    // to have this implementation.
    #[cfg(not(target_feature = "sse4.2"))]
    fallback: Fallback<F>,

    // Since we might not use the fallback implementation, we add this
    // to avoid unused type parameters.
    _fallback: PhantomData<F>,
}

impl<F> Searcher<F>
where
    F: Fn(u8) -> bool,
{
    pub /* const */ fn new(bytes: &[u8], fallback: F) -> Self {
        Searcher {
            #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
            fast: Fast::new(bytes),
            #[cfg(not(target_feature = "sse4.2"))]
            fallback: Fallback::new(fallback),
            _fallback: PhantomData,
        }
    }

    #[inline]
    pub fn find(&self, haystack: &[u8]) -> Option<usize> {
        // If we can tell at compile time that we have support,
        // call the optimized code directly.
        #[cfg(target_feature = "sse4.2")] {
            unsafe { self.fast.find(haystack) }
        }

        // If we can tell at compile time that we will *never* have
        // support, call the fallback directly.
        #[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))] {
            self.fallback.find(haystack)
        }

        // Otherwise, we will be run on a machine with or without
        // support, so we perform runtime detection.
        #[cfg(all(any(target_arch = "x86", target_arch = "x86_64"),
                  not(target_feature = "sse4.2")))] {
            if is_x86_feature_detected!("sse4.2") {
                unsafe { self.fast.find(haystack) }
            } else {
                self.fallback.find(haystack)
            }
        }
    }
}
```
To be clear, I'm not complaining about any extra code I have to write. Please note that Jetscii was written 3 years ago using inline assembly and the entire reason that I wrote the Cupid crate was to work towards having runtime detection, I just... never got around to it 😸 . I am only attempting to provide feedback based on porting the algorithms expressed within to stdsimd.
In a magical world, I'd love it if ...
@gnzlbg I'd be on board with making ...
Gotcha. Please don't misunderstand my words as criticism towards jetscii. I was just pointing out that the way the intrinsics are currently designed basically forces crates to do feature detection either at compile time or at run time, and that's something that very few crates in the Rust (and probably C and C++) ecosystems are currently doing, because it wasn't required before.
Those extensions would be nice to have as well, and I guess that if you add them you are probably really close to where the discussion in @hsivonen's RFC ended up. I think that these would be harder though. There is currently an open bug about this function (playground):

```rust
#[target_feature(enable = "avx")]
unsafe fn foo() -> bool {
    #[cfg(target_feature = "avx")] { true }
    #[cfg(not(target_feature = "avx"))] { false }
}
```

where per RFC 2045 it should always return `true`.
But this is where we currently are today. What I was suggesting is to add a way in which the compiler asserts this for you, and reliably produces an error if the assertion fails. We could use a different attribute for that.
Why would that be better than the "Minimal target feature unsafe" RFC?
What is the right path to report "poor" assembly? A tight loop of code is generating this:

```
+0xc0 movdqu    (%rsi,%r10), %xmm1   ;; 0.3%
+0xc6 movl      $1, %eax             ;; 23.1%
+0xcb movl      $16, %edx            ;; 0.4%
+0xd0 pcmpestri $0, %xmm1, %xmm0     ;; 0.2%
+0xd6 jb        "find+0x130"         ;; 64.4%
+0xd8 incq      %rbx                 ;; 11.2%
+0xdb addq      $16, %r10            ;; 0.3%
+0xdf cmpq      %r11, %rbx           ;; 0.2%
+0xe2 jb        "find+0xc0"
```

The two initial `movl` instructions load loop-invariant constants, so it's surprising to see so much time attributed to them.
I've personally found that instruction profiling is typically off-by-one, so the 64% is probably attributable to the `pcmpestri` just before the jump.
@shepmaster the best path is probably to use rust.godbolt.org to create a minimal example that shows the issue. Once you have that, save that link. Then, grab the LLVM-IR for the same example.

We can then take a look, clean up the IR, and answer the question: is rustc emitting sensible LLVM-IR? And also check, for the same example, which LLVM-IR clang emits. If the LLVM-IR is not getting optimized properly, we can file an LLVM bug. If Rust is emitting poor LLVM-IR, that would actually be good news, since that is something that we might be able to fix quickly on Rust's end.

An alternative is for you to provide a minimal example on rust.godbolt.org and I can take over the investigation from there if you want.
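For completeness, a sketch of one way to dump that LLVM-IR from a cargo project (standard cargo/rustc flags; output paths will vary by crate name and hash):

```shell
# Emit optimized LLVM-IR for the crate being investigated:
cargo rustc --release -- --emit=llvm-ir

# The IR lands under target/release/deps/ as a .ll file:
ls target/release/deps/*.ll
```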
I believe the outstanding issues here have since been resolved, so I'm closing this.
I ported Jetscii to use stdsimd with the belief that it will be stabilized sooner 😜. There's a stdsimd branch in case you are interested in following along at home. The initial port is roughly 60% of the original speed.

Takeaways:

- `#[target_feature]` (and/or `-C target-feature`)?