-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve constant-timeness on PowerPC #772
Improve constant-timeness on PowerPC #772
Conversation
Good news, bad news! That indeed fixes the 32-bit scalar code on Fedora 32 GCC 10.1.1 PPC64LE! Bad news: I'm reasonably confident that that ternary operator is there because for i386 cmov was generally slower than branching until core2, and the ternary was needed to get branchless behaviour there. I don't currently have a ready to go i386 testing setup-- but I'll set one up if you'll work with me on this. PPC64LE also is non-constant time on:
I didn't see a clear way to fix that. (also I think this highlights a greater problem that we can't trust statements like this to be constant time. I think we have a lot of them, they're not currently an issue through luck of the optimizer). I'm happy to test PPC64LE btw (btw: luke-jr, grubles, bluematt, and td-linux on IRC also have ppc64 systems-- probably other people too, it's probably easier to get developers to test on PPC64LE right now, though i386 is still a relevant embedded/old system target). |
Aside, the 64-bit scalar code is also full of the "pointless" expressions-- on ppc64 w/ this compiler they don't compile into something valgrind complains about. ... and also I think they're there for the same reason they are there in the 32-bit code. (aside, wow travis takes an absurdly long time to run) |
Hooray.
Yeah, given that this syntax is literally redundant, I suspected it was there for a reason. I had a closer look now. It seems to me that the only "bad" instruction is Changing this single line to
Cool, let's do this.
Hm, yeah, that one seems harder. Have you tried a shift or a conditional operator (haha) or a conditional operator with a reversed condition (hahaha)?
Yeah but sadly that's still state of the art for this stuff. |
When we've more fully characterized the situation we can also take our pleadings to the compiler gods and maybe get some advice that will result in something more stable (or get an unintended 'optimization' turned off). |
Great news, valgrind still passes with this on i386 debian 8 (gcc 4.9.2), I've tried a mixture of compiler options (Os,O2, and mtune i486/pentium3/pentium4). I'm going to go and also try on a more modern x86 OS too and some other systems. Assuming it all goes good I think we should remove all the useless ternaries but we should also perhaps put in an effort to figure out why they were there: I thought they were there to prevent branches, but it may be that we just copied the behaviour from openssl and assumed it was preventing branches. |
I believe it very well may have just been copied from OpenSSL. I vaguely remember looking at code like this: https://github.com/openssl/openssl/blob/OpenSSL_0_9_7-stable/crypto/bn/bn_asm.c#L468. |
FWIW w/ default current master emits a branch on x86 alpine linux (GCC 9.3.0) on the same ECDH sign line as on PPC64LE. |
Okay, so for the "sign = 2 * (u_last > 0) - 1;" on both Fedora PPC64 LE GCC 10.1.1 and Alpine i386 GCC 9.3.0 which emit a branch there, the issue is avoided by using "sign = -(u_last <= 0)|1;" There are a bunch of candidate formulations, what worries me is that there is nothing about the original code that is obviously wrong. I'm concerned that the codebase has many other cases exposed to randomly putting out branches whenever a code change makes the compiler think it would be expensive to emit a shift or multiply at that location. :-/ (e.g. I can't see a lesson we can learn here about what not to do). The alternative "sign = -(u_last < 0)|1;" is not an equivalent expression but also passes the tests and I think seems more natural. So assuming someone reviews the surrounding code and it's okay to have 0 end up with a sign of 1 instead of -1, and that this doesn't compile into anything awful on x86_64, I thinks that is probably what it should use. We should probably figure out which of the alternatives are faster/cleaner and then just try out a complete suite of fixes (including getting rid of all the pointless ternaries, since we can't find any place where they help and we've found a place where they hurt) on as many targets as we can. |
Oh, the better fix is #741, which just needs about one more ACK. I just wanted to point out that we reviewed the surrounding code already and I had not remembered earlier that it just removes the variable entirely. Sorry for wasting some of your time. |
Oh, sounds fine to me! I'm don't mind chasing some blind alleys so long as things improve. |
I'm sure it was copied from OpenSSL, the code says so: The code was written in this 21-years old commit openssl/openssl@fb81ac5 by @dot-asm. I'm sure I have no clue what I did 20 years ago but maybe we're lucky. @dot-asm, do you remember the reason for the
Yeah, and I believe it's getting worse over time with more advanced optimizations. Sometimes a big hammers like in #728 just does it but this would not acceptable here for performance reasons. |
I've now tested a bunch of older compilers on x86 and actually find the newer ones seem to be generally more reliable-- e.g. you don't have to go far back with clang before its spewing branches everywhere-- e.g. managing to turn scalar_cmov into a ton of branches. GCC 2.95 does too, but it looks like anything ECGS or later only rarely has issues (like we're discussing in this PR). Unfortunately, I think the story is really just that the compiler behaviour isn't stable. When we wrote a bunch of this library we were in an island of good behaviour by compilers, but it might have just been luck. Once we've fixed what we can for more important platforms, I'll setup a test to run it on a metric boatload of targets so we can get a better idea of what historical stability has been like. Unfortunately the intersection of interesting-to-target and can-run-valgrind isn't perfect. |
Are we aware of further issues on a more important platform? There's #711, which is also a constant-time PR that needs review. It would be neat to have set of compilers and targets, where we can say we consider it a real bug for -O2 and above and maybe -Os. One issue is that people may use their own compiler flags for optimizations, maybe we should at least emit a warning in the Makefile if CFLAGS is defined. |
I don't really feel that responsible for the ?1:0 per se, because it was itself copied from elsewhere. And as already implied, most likely reason for it is uncertainty in what a buggy compiler might do. At least it's known that historically there were workarounds for compiler quirks/bugs and this just might be one of them. Arguably obsolete by now. |
Ok, thanks for the quick reply! By the way, do you know where you got this code from? (I don't think it's relevant then, I'm just curious.) I'll update the PR then to remove the ternaries also in the other file. |
41aec59
to
10a4b2b
Compare
This prevents GCC from generating branches on PowerPC in certain cases. Fixes bitcoin-core#771.
10a4b2b
to
5b19633
Compare
Okay, ready for review. |
Hmmm... You know, it might happen that I'm actually responsible for it after all... Note that it's in BN_MULT_HIGH On side note, I don't see any difference in generated code on PPC... Out of curiosity, which compiler are you looking at? I'm looking at gcc 9.3... |
We see the ternary causing branches on GCC 10.1.1 (Fedora 32) on PPC64LE. ... but not always, it does it in some cases and not others (in particular, it looks like its doing it when its switching on a 32-bit variable but not a 64-bit one?) |
9.3 doesn't and instead actually does something cool. It performs 64-bit subtraction of originally 32-bit values and then shift by 63. Oh well... I apologize for distraction. Cheers:-) |
Here's a godbolt playground: Here GCC 9 (and other versions) generate branches, notice the jump labels in the output. If you remove the |
@real-or-random Your update is clean on the PPC host in O2 for both 32 and 64 bit scalar, hurrah... so now that this passes, I attempt -Os on PPC64LE and I am filled with sadness because -Os does something to the binary that makes valgrind unusable (makes it think that all zero initialized stack constructed variables are uninitilized or something along those lines), presumably a valgrind bug. I will figure it out later after testing this master + this PR on a bunch of configurations. |
Preliminary report:
|
Very nice.
That sounds reasonable. I think we could even merge this now but if you want, let's see if it does not make things worse there either.
See #612 for context.
Hm I believe this is the same issue as in I propose we should get rid of |
secp256k1_int_cmov doesn't do anything the other cmovs don't do-- so if it's trouble then why won't the other ones (eventually) be trouble when the compiler gods are more angry? In the current use-- it could be a memczero just fine, so I'd be fine with that change and if you want to add that to this PR that would be okay with me. Though I wouldn't be too surprised if some future change needs an actual cmov. If you wanted to get this merged right away that would be okay with me. I do think it would be best to at least get tests with current compilers on arm7 and arm8-- just to guard against the unlikely case that it makes something important worse. I'll try to do that in the next 24 hours. There is probably going to need to be some document that lists what compilers the software has been tested on or something. I'm not sure. The valgrind test is easy and has so far shown to be pretty reliable, but we're hamstrung by the fact that it can't be run on embedded targets which are the most vulnerable to sidechannel attacks. |
I agree but I don't have a much better idea currently as a quick fix. The only problematic compiler seems clang and clang seems easy to convince to use a real cmov by the ternary operator but then other compilers insert branches and I'd like to avoid compiler-conditional compilation at the moment. We may want to do this if clang starts "optimizing" the cmov functions in the arithmetic code...
Great!
Yep. If I had a lot of time, I'd setup a fork of the compiler explorer in order to get a shitload of compilers and targets (it has docker images for the compilers), and then look automatically for branch instructions in a given list of functions. We could even make it look for variable-time instructions etc. |
Pushed but I kept the function for now and applied the same "fix" as in |
/* Access flag with a volatile-qualified lvalue. | ||
This prevents clang from figuring out (after inlining) that flag can | ||
take only be 0 or 1, which leads to variable time code. */ | ||
volatile int vflag = flag; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can make that volatile unsigned int
and then remove the casting a few lines below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I don't see why this is better. I prefer the current form because then it's just copy/paste from above. (I know duplicated code is to be avoided but duplicated and modified code is even worse.)
Sadly that doesn't work well because branches on non-secret inputs are fine. (or rather, it works fine to do it by hand, it didn't seem to be something we could automate) |
Yeah it would certainly not be trivial, also with declassifications etc. I don't know. It would envision it rather as a thing where you audit the branches once and then it's some kind of CI thing that will tell you if a new PR will introduce new branches. So it's not perfect, but at least any changes in branching behavior get your attention (like with test coverage tools etc.) |
I confirm the latest commit fixes AMD64 Fedora 32 Clang 10.0.0 -Os. And the following also pass: arm7 Arch GCC 9.3.0 -O3 I just found an arm8 host, testing will be forthcoming. Also: dear lord ecdsa_verify: min 3088us / avg 3089us / max 3090us (i.MX53 no bignum, no asm) perhaps I should have looked harder for a faster device to test on... |
Okay, we can add the following as passes: armv8 Ubuntu GCC 9.3.0 -O3 As having no variable time bugs with this PR. However, there are spurious failures, which I'll open an issue for. |
AMD64 ICC 19.1.2.254
Your volatile tricks do not fool ICC. |
That's because that function does not use this ugly hack yet. Sad. secp256k1/src/scalar_8x32_impl.h Lines 721 to 725 in 5e1c885
Ok, that escalated more quickly than I expected. What do you think? Should we handle this in this PR, in another PR (that maybe looks into the cmovs in more detail) or not at all (because ICC)? |
@real-or-random Your call! I ACK this PR as is, though I'd like to continue this effort in another PR if we go the route of merging this.
Lol! hah. I called it and I didn't even notice. I hope you appreciate my motivation for testing test weird stuff is mostly because they act as proxies for things we can't so easily test (e.g. future compilers; or how current compilers would behave after some future PRs move code around in the library) or expose errors that by luck cancel each other out on common developers systems (after all, we've already fixed most of the easy ones) . ICC is a moderately important compiler in industry-- I think it results in the fastest libsecp256k1 binary that I've tested so far (4% faster than GCC) -- but I also think it's a lot less critical than gcc/clang. We shouldn't hold up improvements just waiting on more improvements. I've got a feeling we're going to be chasing these things for a while. I think I had a little batching preference before because its easier to test a bunch of changes at once when I'm testing across a bunch of systems. But I've already tested everything in this PR, as is, against a sufficient set of targets. |
I agree. Let's get this merged then. By the way, if you write your ACKs in the form "ACK commit-id", then they'll be included in the merge commit by our GitHub merge script https://github.com/bitcoin-core/bitcoin-maintainer-tools/blob/master/github-merge.py. |
@elichai I'm trying to press the "re-request review" button but it does not work, so I'm mentioning you here instead. :) |
I reviewed this and tested the heck out of it. It improves some systems and doesn't appear to make any I tested worse. I also did limited testing for performance (I only tested gcc 10+x86_64+noasm) and this PR didn't cause any measurable performance loss. ACK 67a429f |
locally with clang-10 on these lines: In master I get: 0010379e 48 85 c9 TEST param_4,param_4
001037a1 74 07 JZ LAB_001037aa
001037a6 f7 d8 NEG EAX
001037a8 21 01 AND dword ptr [param_4],EAX ghidra disassembly: if (local_230 != (uint *)0x0) {
*local_230 = *local_230 & -uVar6;
} but with this PR I get: 0010379e 48 85 c9 TEST param_4,param_4
001037a1 74 15 JZ LAB_001037b8
001037b1 83 c0 ff ADD EAX,-0x1
001037b4 23 01 AND EAX,dword ptr [param_4]
001037b6 89 01 MOV dword ptr [param_4],EAX ghidra disassembly: if (local_230 != (uint *)0x0) {
*local_230 = uVar11 - 1 & *local_230;
} both are constant time but weirdly different |
@elichai Well ok but is this an issue? And are you okay with my comment above on casting early? |
Nope, no issues, just wanted to report my findings. tACK 67a429f |
It's not exactly my business and it might be not as relevant, but in this time and age one can do better than being left at compiler's mercy to interpret the condition in question as branch or not. I mean the fact that it does it one way today is no guarantee that it won't do it otherwise tomorrow, with different flag or not. The "secret" keyword is |
The library uses it, on 64-bit. Most of the code we were discussing was the 32-bit code, while the 64-bit code is similar but uses larger limbs and works into a 128-bit int. Unfortunately it isn't a silver bullet either because we still need carries there too, just far fewer of them. [But thanks for the comment, if we hadn't been aware of that it would be a major revelation and much appreciated!] |
What I tried to hint at is that one can use 128-bit additions to avoid handling carries with conditionals. For example muladd could look like this:
This can be naturally extended to 32-bit platforms by replacing |
Summary: * Remove redundant "? 1 : 0" after comparisons in scalar code This prevents GCC from generating branches on PowerPC in certain cases. Fixes #771. * Suppress a harmless variable-time optimization by clang in _int_cmov Follow up on 52a03512c1d800603b5c923c1a28bdba12dadb30 This is a backport of libsecp256k1 [[bitcoin-core/secp256k1#772 | PR772]] Depends on D7590 Test Plan: ninja check-secp256k1 Reviewers: #bitcoin_abc, majcosta Reviewed By: #bitcoin_abc, majcosta Differential Revision: https://reviews.bitcoinabc.org/D7597
Summary: * Remove redundant "? 1 : 0" after comparisons in scalar code This prevents GCC from generating branches on PowerPC in certain cases. Fixes #771. * Suppress a harmless variable-time optimization by clang in _int_cmov Follow up on 52a03512c1d800603b5c923c1a28bdba12dadb30 This is a backport of libsecp256k1 [[bitcoin-core/secp256k1#772 | PR772]] Depends on D7590 Test Plan: ninja check-secp256k1 Reviewers: #bitcoin_abc, majcosta Reviewed By: #bitcoin_abc, majcosta Differential Revision: https://reviews.bitcoinabc.org/D7597
Attempt at resolving #771 .
This surprisingly seems to improve the situation at least for the compilers available on godbolt.