Improve constant-timeness on PowerPC #772

real-or-random · 2020-07-24T22:38:45Z

Attempt at resolving #771 .

This surprisingly seems to improve the situation at least for the compilers available on godbolt.

gmaxwell · 2020-07-24T23:03:07Z

Good news, bad news! That indeed fixes the 32-bit scalar code on Fedora 32 GCC 10.1.1 PPC64LE!

Bad news: I'm reasonably confident that that ternary operator is there because for i386 cmov was generally slower than branching until core2, and the ternary was needed to get branchless behaviour there. I don't currently have a ready to go i386 testing setup-- but I'll set one up if you'll work with me on this.

PPC64LE also is non-constant time on:

     sign = 2 * (u_last > 0) - 1;

I didn't see a clear way to fix that.

(also I think this highlights a greater problem that we can't trust statements like this to be constant time. I think we have a lot of them, they're not currently an issue through luck of the optimizer).

I'm happy to test PPC64LE btw (btw: luke-jr, grubles, bluematt, and td-linux on IRC also have ppc64 systems-- probably other people too, it's probably easier to get developers to test on PPC64LE right now, though i386 is still a relevant embedded/old system target).

gmaxwell · 2020-07-24T23:05:36Z

Aside, the 64-bit scalar code is also full of the "pointless" expressions-- on ppc64 w/ this compiler they don't compile into something valgrind complains about. ... and also I think they're there for the same reason they are there in the 32-bit code.

(aside, wow travis takes an absurdly long time to run)

real-or-random · 2020-07-25T10:56:50Z

Good news, bad news! That indeed fixes the 32-bit scalar code on Fedora 10 PPC64LE!

Hooray.

Bad news: I'm reasonably confident that that ternary operator is there because for i386 cmov was generally slower than branching until core2, and the ternary was needed to get branchless behaviour there.

Yeah, given that this syntax is literally redundant, I suspected it was there for a reason.

I had a closer look now. It seems to me that the only "bad" instruction is c2 += (c1 < th) ? 1 : 0; in muladd, and this is compiled to a branch only in the first of three subsequent muladds. It seems the compiler overdoes reordering here, and may get it wrong somehow as a a result of this.

Changing this single line to c2 += (c1 < th) or even c2 += (c1 >= th) ? 0 : 1; seems to do the trick. Of course all of this is somewhat fragile but I'm still somewhat confident that the latter won't change a lot for i386. I think we should document then, why these conditional operators are there.

I don't currently have a ready to go i386 testing setup-- but I'll set one up if you'll work with me on this.

Cool, let's do this.

PPC64LE also is non-constant time on:
     sign = 2 * (u_last > 0) - 1;
I didn't see a clear way to fix that.

Hm, yeah, that one seems harder. Have you tried a shift or a conditional operator (haha) or a conditional operator with a reversed condition (hahaha)?

(also I think this highlights a greater problem that we can't trust statements like this to be constant time. I think we have a lot of them, they're not currently an issue through luck of the optimizer).

Yeah but sadly that's still state of the art for this stuff.

gmaxwell · 2020-07-25T11:00:32Z

When we've more fully characterized the situation we can also take our pleadings to the compiler gods and maybe get some advice that will result in something more stable (or get an unintended 'optimization' turned off).

gmaxwell · 2020-07-26T01:45:11Z

Great news, valgrind still passes with this on i386 debian 8 (gcc 4.9.2), I've tried a mixture of compiler options (Os,O2, and mtune i486/pentium3/pentium4). I'm going to go and also try on a more modern x86 OS too and some other systems.

Assuming it all goes good I think we should remove all the useless ternaries but we should also perhaps put in an effort to figure out why they were there: I thought they were there to prevent branches, but it may be that we just copied the behaviour from openssl and assumed it was preventing branches.

sipa · 2020-07-26T01:48:24Z

I believe it very well may have just been copied from OpenSSL. I vaguely remember looking at code like this: https://github.com/openssl/openssl/blob/OpenSSL_0_9_7-stable/crypto/bn/bn_asm.c#L468.

gmaxwell · 2020-07-26T02:05:09Z

FWIW w/ default current master emits a branch on x86 alpine linux (GCC 9.3.0) on the same ECDH sign line as on PPC64LE.

gmaxwell · 2020-07-26T02:34:55Z

Okay, so for the "sign = 2 * (u_last > 0) - 1;" on both Fedora PPC64 LE GCC 10.1.1 and Alpine i386 GCC 9.3.0 which emit a branch there, the issue is avoided by using "sign = -(u_last <= 0)|1;"

There are a bunch of candidate formulations, what worries me is that there is nothing about the original code that is obviously wrong. I'm concerned that the codebase has many other cases exposed to randomly putting out branches whenever a code change makes the compiler think it would be expensive to emit a shift or multiply at that location. :-/ (e.g. I can't see a lesson we can learn here about what not to do).

The alternative "sign = -(u_last < 0)|1;" is not an equivalent expression but also passes the tests and I think seems more natural. So assuming someone reviews the surrounding code and it's okay to have 0 end up with a sign of 1 instead of -1, and that this doesn't compile into anything awful on x86_64, I thinks that is probably what it should use.

We should probably figure out which of the alternatives are faster/cleaner and then just try out a complete suite of fixes (including getting rid of all the pointless ternaries, since we can't find any place where they help and we've found a place where they hurt) on as many targets as we can.

real-or-random · 2020-07-26T09:13:14Z

"sign = 2 * (u_last > 0) - 1;"

Oh, the better fix is #741, which just needs about one more ACK. I just wanted to point out that we reviewed the surrounding code already and I had not remembered earlier that it just removes the variable entirely. Sorry for wasting some of your time.

gmaxwell · 2020-07-26T09:26:29Z

Oh, sounds fine to me! I'm don't mind chasing some blind alleys so long as things improve.

real-or-random · 2020-07-26T09:32:04Z

I believe it very well may have just been copied from OpenSSL. I vaguely remember looking at code like this: https://github.com/openssl/openssl/blob/OpenSSL_0_9_7-stable/crypto/bn/bn_asm.c#L468.

I'm sure it was copied from OpenSSL, the code says so:
https://github.com/bitcoin-core/secp256k1/blob/master/src/scalar_8x32_impl.h#L263

The code was written in this 21-years old commit openssl/openssl@fb81ac5 by @dot-asm. I'm sure I have no clue what I did 20 years ago but maybe we're lucky. @dot-asm, do you remember the reason for the ?1:0 in the macros? I could believe that you just were not aware that C guarantees that the result of a comparison is 0 or 1, or that you were concerned that reviewers are not aware. Or do you think there was a deeper reason?

There are a bunch of candidate formulations, what worries me is that there is nothing about the original code that is obviously wrong. I'm concerned that the codebase has many other cases exposed to randomly putting out branches whenever a code change makes the compiler think it would be expensive to emit a shift or multiply at that location. :-/ (e.g. I can't see a lesson we can learn here about what not to do).

Yeah, and I believe it's getting worse over time with more advanced optimizations. Sometimes a big hammers like in #728 just does it but this would not acceptable here for performance reasons.

gmaxwell · 2020-07-26T11:17:35Z

I've now tested a bunch of older compilers on x86 and actually find the newer ones seem to be generally more reliable-- e.g. you don't have to go far back with clang before its spewing branches everywhere-- e.g. managing to turn scalar_cmov into a ton of branches. GCC 2.95 does too, but it looks like anything ECGS or later only rarely has issues (like we're discussing in this PR). Unfortunately, I think the story is really just that the compiler behaviour isn't stable. When we wrote a bunch of this library we were in an island of good behaviour by compilers, but it might have just been luck.

Once we've fixed what we can for more important platforms, I'll setup a test to run it on a metric boatload of targets so we can get a better idea of what historical stability has been like. Unfortunately the intersection of interesting-to-target and can-run-valgrind isn't perfect.

real-or-random · 2020-07-26T11:33:32Z

Once we've fixed what we can for more important platforms, I'll setup a test to run it on a metric boatload of targets so we can get a better idea of what historical stability has been like. Unfortunately the intersection of interesting-to-target and can-run-valgrind isn't perfect.

Are we aware of further issues on a more important platform?

There's #711, which is also a constant-time PR that needs review.

It would be neat to have set of compilers and targets, where we can say we consider it a real bug for -O2 and above and maybe -Os. One issue is that people may use their own compiler flags for optimizations, maybe we should at least emit a warning in the Makefile if CFLAGS is defined.

dot-asm · 2020-07-26T12:12:14Z

@dot-asm, do you remember the reason for the ?1:0 in the macros?

I don't really feel that responsible for the ?1:0 per se, because it was itself copied from elsewhere. And as already implied, most likely reason for it is uncertainty in what a buggy compiler might do. At least it's known that historically there were workarounds for compiler quirks/bugs and this just might be one of them. Arguably obsolete by now.

real-or-random · 2020-07-26T12:45:02Z

@dot-asm, do you remember the reason for the ?1:0 in the macros?

I don't really feel that responsible for the ?1:0 per se, because it was itself copied from elsewhere. And as already implied, most likely reason for it is uncertainty in what a buggy compiler might do. At least it's known that historically there were workarounds for compiler quirks/bugs and this just might be one of them. Arguably obsolete by now.

Ok, thanks for the quick reply! By the way, do you know where you got this code from? (I don't think it's relevant then, I'm just curious.)

I'll update the PR then to remove the ternaries also in the other file.

This prevents GCC from generating branches on PowerPC in certain cases. Fixes bitcoin-core#771.

real-or-random · 2020-07-26T13:00:40Z

Okay, ready for review.

dot-asm · 2020-07-26T14:19:38Z

By the way, do you know where you got this code from?

Hmmm... You know, it might happen that I'm actually responsible for it after all... Note that it's in BN_MULT_HIGH ~~option~~section and I do remember myself looking for a way to get rid of branch on Alpha where the comparison should map into single instruction... It's not unlike that after cross-checking different compilers on different platforms (by modelling on non-Alpha obviously) a conclusion was drawn that explicit 1:0 is more likely to give desirable outcome across platforms. In other words it's not necessarily about specific bug in specific compiler, it might happen it's more about looking for specific outcome across multiple compilers... And now the circle seems to close, as this PR suggests that most like outcome is achieved if you don't have 1:0...

On side note, I don't see any difference in generated code on PPC... Out of curiosity, which compiler are you looking at? I'm looking at gcc 9.3...

gmaxwell · 2020-07-26T14:28:31Z

We see the ternary causing branches on GCC 10.1.1 (Fedora 32) on PPC64LE. ... but not always, it does it in some cases and not others (in particular, it looks like its doing it when its switching on a 32-bit variable but not a 64-bit one?)

dot-asm · 2020-07-26T14:40:09Z

We see the ternary causing branches on GCC 10.1.1 ... when switching on 32-bit variable

9.3 doesn't and instead actually does something cool. It performs 64-bit subtraction of originally 32-bit values and then shift by 63. Oh well... I apologize for distraction. Cheers:-)

real-or-random · 2020-07-26T14:44:36Z

@dot-asm

Here's a godbolt playground:
https://godbolt.org/z/3fv385

Here GCC 9 (and other versions) generate branches, notice the jump labels in the output. If you remove the ? 1 : 0 in line 287, the branches disappear.

gmaxwell · 2020-07-26T14:51:24Z

@real-or-random Your update is clean on the PPC host in O2 for both 32 and 64 bit scalar, hurrah... so now that this passes, I attempt -Os on PPC64LE and I am filled with sadness because -Os does something to the binary that makes valgrind unusable (makes it think that all zero initialized stack constructed variables are uninitilized or something along those lines), presumably a valgrind bug.

I will figure it out later after testing this master + this PR on a bunch of configurations.

gmaxwell · 2020-07-27T02:05:02Z

Preliminary report:

Tested with and without endomorphism, tested 32/64/asm scalar+group (as many as applicable)

i386 Alpine Linux GCC 9.3 -O3
i386 Alpine Linux GCC 9.3 -O2
i386 Alpine Linux GCC 9.3 -Os
i386 Alpine Linux Clang 10.0 -O3
i386 Alpine Linux Clang 10.0 -O2
i386 Alpine Linux Clang 10.0 -Os
i386 Debian8 GCC 4.9.2 -O3
i386 Debian8 GCC 4.9.2 -O2
i386 Debian8 GCC 4.9.2 -Os
*FAIL* i386 Debian8 Clang 3.5.0-10 -O3 *FAIL* (branches everywhere)
*FAIL* i386 Debian8 Clang 3.5.0-10 -O2 *FAIL* (branches everywhere)
*FAIL* i386 Debian8 Clang 3.5.0-10 -Os *FAIL* (branches everywhere)
*FAIL* i386 Debian8 GCC 2.95.3 -O2 *FAIL*  (I tested something this old because it's similar vintage to the ?1:0 code, and also still in a little use)
*FAIL* i386 Debian8 GCC 2.95.3 -Os *FAIL*  (I tested something this old because it's similar vintage to the ?1:0 code, and also still in a little use)
*FAIL* i386 Debian8 TCC 0.9.27 *FAIL* (similar behavior to GCC 2.95.3; but wow TCC compiles fast)
arm7 Debian8 GCC 4.9.2 -O3
arm7 Debian8 GCC 4.9.2 -O2
arm7 Debian8 GCC 4.9.2 -Os
*FAIL* arm7 Debian8 Clang 3.5.10-10 -O3 *FAIL* (like on i386; also ASM totally fails to compile)
*FAIL* arm7 Debian8 Clang 3.5.10-10 -O2 *FAIL* (like on i386; also ASM totally fails to compile)
*FAIL* arm7 Debian8 Clang 3.5.10-10 -Os *FAIL* (like on i386; also ASM totally fails to compile)
PPC64LE Fedora 32 GCC 10.1.1 -O3
PPC64LE Fedora 32 GCC 10.1.1 -O2
*FAIL* PPC64LE Fedora 32 GCC 10.1.1 -Os *FAIL*  (looks like GCC -Os breaks valgrind completely, failure isn't CT specific)
PPC64LE Fedora 32 Clang 10.0.0-2 -O3
PPC64LE Fedora 32 Clang 10.0.0-2 -O2
*FAIL* PPC64LE Fedora 32 Clang 10.0.0-2 -Os *FAIL* (This is due to a longstanding cross platform clang miscompliation bug[1], not a CT issue)
AMD64 Fedora 32 GCC 10.1.1 -O3
AMD64 Fedora 32 GCC 10.1.1 -O2
AMD64 Fedora 32 GCC 10.1.1 -Os
*FAIL* AMD64 Fedora 32 Clang 10.0.0 -Os *FAIL* (emits a branch at secp256k1_int_cmov (util.h:205))
AMD64 Fedora 32 Clang 10.0.0 -O3
AMD64 Fedora 32 Clang 10.0.0 -O2

(I also tested the PR on BE sparc, and i386 haiku but can't valgrind in either or those places)

This PR is a strict reduction in failures: everything that fails now failed before and some things that pass now failed before.

[1] Clang sometimes replaces struct assignments with memcpys without considering the possibility that the assigned structs could overlap (e.g. by being an assign to self.).  Clang's authors argue that this is a valgrind FP because the compiler is allowed to use "undefined behavior", which is true but they are being foolish in this case:  Memcpy is implemented by libc, which they have no visibility or control over and libc can change out from under them to make UB like this unsafe.  (In fact, it has done so before-- it used to be safe with overlaps, now its is only safe for overlaps that are assign to self).

TODO:
 Must test: ARM8, More recent clang on ARM7. More recent GCC on ARM7. MIPS (common embedded platform and valgrind works there)
 Fix ASM compile failure on clang w/ ARM7. (src/asm/field_10x26_arm.s:118:2: error: instruction requires: armv6t2, etc)
 Figure out what is the earliest version of clang that even kinda works.
 ICC, Maybe: Open64, Compcert
 ARM compiler? (used to be extremely popular in embedded, not sure if it is anymore)
 Is it possible to test MSVC using wine and animal sacrifice?
 Fix clang 10 AMD64 (or at least report it to clang developers)

real-or-random · 2020-07-27T09:38:21Z

Very nice.

TODO:
Must test: ARM8, More recent clang on ARM7. More recent GCC on ARM7. MIPS (common embedded platform and valgrind works there)

That sounds reasonable. I think we could even merge this now but if you want, let's see if it does not make things worse there either.

Fix ASM compile failure on clang w/ ARM7. (src/asm/field_10x26_arm.s:118:2: error: instruction requires: armv6t2, etc)

See #612 for context.

Fix clang 10 AMD64 (or at least report it to clang developers)

Hm I believe this is the same issue as in memczero. I'm just surprised that you can only see this for -Os, here it branches on O2/O3 too:

https://godbolt.org/z/57Ya89

I propose we should get rid of secp256k1_int_cmov and use memczero instead, that's simpler. I suspect it costs half of a microsecond but it's only used in pubkey recovery. 🤷 (Anyway, it's hard to write obscure code to annoy the optimizer and at the same time expect performance. If we want performance, we should consider having asm implementations for all the cmov functions at least on x86_64... )

gmaxwell · 2020-07-27T10:39:44Z

secp256k1_int_cmov doesn't do anything the other cmovs don't do-- so if it's trouble then why won't the other ones (eventually) be trouble when the compiler gods are more angry? In the current use-- it could be a memczero just fine, so I'd be fine with that change and if you want to add that to this PR that would be okay with me. Though I wouldn't be too surprised if some future change needs an actual cmov.

If you wanted to get this merged right away that would be okay with me. I do think it would be best to at least get tests with current compilers on arm7 and arm8-- just to guard against the unlikely case that it makes something important worse. I'll try to do that in the next 24 hours.

There is probably going to need to be some document that lists what compilers the software has been tested on or something. I'm not sure. The valgrind test is easy and has so far shown to be pretty reliable, but we're hamstrung by the fact that it can't be run on embedded targets which are the most vulnerable to sidechannel attacks.

real-or-random · 2020-07-27T12:04:43Z

secp256k1_int_cmov doesn't do anything the other cmovs don't do-- so if it's trouble then why won't the other ones (eventually) be trouble when the compiler gods are more angry? In the current use-- it could be a memczero just fine, so I'd be fine with that change and if you want to add that to this PR that would be okay with me. Though I wouldn't be too surprised if some future change needs an actual cmov.

I agree but I don't have a much better idea currently as a quick fix. The only problematic compiler seems clang and clang seems easy to convince to use a real cmov by the ternary operator but then other compilers insert branches and I'd like to avoid compiler-conditional compilation at the moment. We may want to do this if clang starts "optimizing" the cmov functions in the arithmetic code...

If you wanted to get this merged right away that would be okay with me. I do think it would be best to at least get tests with current compilers on arm7 and arm8-- just to guard against the unlikely case that it makes something important worse. I'll try to do that in the next 24 hours.

Great!

There is probably going to need to be some document that lists what compilers the software has been tested on or something. I'm not sure. The valgrind test is easy and has so far shown to be pretty reliable, but we're hamstrung by the fact that it can't be run on embedded targets which are the most vulnerable to sidechannel attacks.

Yep. If I had a lot of time, I'd setup a fork of the compiler explorer in order to get a shitload of compilers and targets (it has docker images for the compilers), and then look automatically for branch instructions in a given list of functions. We could even make it look for variable-time instructions etc.

Follow up on 52a0351

real-or-random · 2020-07-27T12:39:07Z

secp256k1_int_cmov doesn't do anything the other cmovs don't do-- so if it's trouble then why won't the other ones (eventually) be trouble when the compiler gods are more angry? In the current use-- it could be a memczero just fine, so I'd be fine with that change and if you want to add that to this PR that would be okay with me. Though I wouldn't be too surprised if some future change needs an actual cmov.

I agree but I don't have a much better idea currently as a quick fix. The only problematic compiler seems clang and clang seems easy to convince to use a real cmov by the ternary operator but then other compilers insert branches and I'd like to avoid compiler-conditional compilation at the moment. We may want to do this if clang starts "optimizing" the cmov functions in the arithmetic code...

Pushed but I kept the function for now and applied the same "fix" as in memczero. I felt it's somewhat inconsistent to remove it given that we have also _scalar_cmov which is also used only to overwrite with zeros if signing fails (and could be replaced memczero too but that's another discussion then).

elichai · 2020-07-27T13:10:25Z

src/util.h

+    /* Access flag with a volatile-qualified lvalue.
+       This prevents clang from figuring out (after inlining) that flag can
+       take only be 0 or 1, which leads to variable time code. */
+    volatile int vflag = flag;


you can make that volatile unsigned int and then remove the casting a few lines below?

Hm, I don't see why this is better. I prefer the current form because then it's just copy/paste from above. (I know duplicated code is to be avoided but duplicated and modified code is even worse.)

gmaxwell · 2020-07-27T17:03:39Z

and then look automatically for branch instructions in a given list of functions.

Sadly that doesn't work well because branches on non-secret inputs are fine. (or rather, it works fine to do it by hand, it didn't seem to be something we could automate)

real-or-random · 2020-07-27T17:15:28Z

and then look automatically for branch instructions in a given list of functions.

Sadly that doesn't work well because branches on non-secret inputs are fine. (or rather, it works fine to do it by hand, it didn't seem to be something we could automate)

Yeah it would certainly not be trivial, also with declassifications etc. I don't know. It would envision it rather as a thing where you audit the branches once and then it's some kind of CI thing that will tell you if a new PR will introduce new branches. So it's not perfect, but at least any changes in branching behavior get your attention (like with test coverage tools etc.)

gmaxwell · 2020-07-28T01:31:09Z

I confirm the latest commit fixes

AMD64 Fedora 32 Clang 10.0.0 -Os.

And the following also pass:

arm7 Arch GCC 9.3.0 -O3
arm7 Arch GCC 9.3.0 -O2
arm7 Arch GCC 9.3.0 -Os
arm7 Arch Clang 10.0.1 -O3
arm7 Arch Clang 10.0.1 -O2
arm7 Arch Clang 10.0.1 -Os

I just found an arm8 host, testing will be forthcoming.

Also: dear lord ecdsa_verify: min 3088us / avg 3089us / max 3090us (i.MX53 no bignum, no asm) perhaps I should have looked harder for a faster device to test on...

gmaxwell · 2020-07-28T04:17:03Z

Okay, we can add the following as passes:

armv8 Ubuntu GCC 9.3.0 -O3
armv8 Ubuntu GCC 9.3.0 -O2
armv8 Ubuntu GCC 9.3.0 -Os
armv8 Ubuntu Clang 10.0.0-4 -O3
armv8 Ubuntu Clang 10.0.0-4 -O2
armv8 Ubuntu Clang 10.0.0-4 -Os

As having no variable time bugs with this PR.

However, there are spurious failures, which I'll open an issue for.

gmaxwell · 2020-07-28T06:50:21Z

AMD64 ICC 19.1.2.254

==407113== Conditional jump or move depends on uninitialised value(s)
==407113==    at 0x485FB93: secp256k1_ec_pubkey_create (secp256k1.c:568)
==407113==    by 0x401490: main (valgrind_ctime_test.c:56)
==407113== 
==407113== Conditional jump or move depends on uninitialised value(s)
==407113==    at 0x48674AA: secp256k1_scalar_is_zero (secp256k1.c:521)
==407113==    by 0x48674AA: secp256k1_scalar_set_b32_seckey (scalar_impl.h:61)
==407113==    by 0x48674AA: secp256k1_scalar_cmov (secp256k1.c:496)
==407113==    by 0x48674AA: secp256k1_ecdsa_sign_inner (secp256k1.c:488)
==407113==    by 0x48670BE: secp256k1_ecdsa_sign_recoverable (main_impl.h:132)
==407113==    by 0x401821: main (valgrind_ctime_test.c:81)

Your volatile tricks do not fool ICC.

real-or-random · 2020-07-28T09:20:23Z

AMD64 ICC 19.1.2.254
==407113==    by 0x48674AA: secp256k1_scalar_cmov (secp256k1.c:496)
Your volatile tricks do not fool ICC.

That's because that function does not use this ugly hack yet. Sad.

secp256k1/src/scalar_8x32_impl.h

Lines 721 to 725 in 5e1c885

    
           static SECP256K1_INLINE void secp256k1_scalar_cmov(secp256k1_scalar *r, const secp256k1_scalar *a, int flag) { 
        
               uint32_t mask0, mask1; 
        
               VG_CHECK_VERIFY(r->d, sizeof(r->d)); 
        
               mask0 = flag + ~((uint32_t)0); 
        
               mask1 = ~mask0;

secp256k1_int_cmov doesn't do anything the other cmovs don't do-- so if it's trouble then why won't the other ones (eventually) be trouble when the compiler gods are more angry?

Ok, that escalated more quickly than I expected.

What do you think? Should we handle this in this PR, in another PR (that maybe looks into the cmovs in more detail) or not at all (because ICC)?

gmaxwell · 2020-07-28T09:42:53Z

@real-or-random Your call! I ACK this PR as is, though I'd like to continue this effort in another PR if we go the route of merging this.

That's because that function does not use this ugly hack yet. Sad.

Lol! hah. I called it and I didn't even notice.

I hope you appreciate my motivation for testing test weird stuff is mostly because they act as proxies for things we can't so easily test (e.g. future compilers; or how current compilers would behave after some future PRs move code around in the library) or expose errors that by luck cancel each other out on common developers systems (after all, we've already fixed most of the easy ones) .

ICC is a moderately important compiler in industry-- I think it results in the fastest libsecp256k1 binary that I've tested so far (4% faster than GCC) -- but I also think it's a lot less critical than gcc/clang. We shouldn't hold up improvements just waiting on more improvements. I've got a feeling we're going to be chasing these things for a while.

I think I had a little batching preference before because its easier to test a bunch of changes at once when I'm testing across a bunch of systems. But I've already tested everything in this PR, as is, against a sufficient set of targets.

real-or-random · 2020-07-28T11:02:37Z

I think I had a little batching preference before because its easier to test a bunch of changes at once when I'm testing across a bunch of systems. But I've already tested everything in this PR, as is, against a sufficient set of targets.

I agree. Let's get this merged then. By the way, if you write your ACKs in the form "ACK commit-id", then they'll be included in the merge commit by our GitHub merge script https://github.com/bitcoin-core/bitcoin-maintainer-tools/blob/master/github-merge.py.

real-or-random · 2020-07-28T11:04:23Z

@elichai I'm trying to press the "re-request review" button but it does not work, so I'm mentioning you here instead. :)

gmaxwell · 2020-07-28T11:23:20Z

I reviewed this and tested the heck out of it. It improves some systems and doesn't appear to make any I tested worse. I also did limited testing for performance (I only tested gcc 10+x86_64+noasm) and this PR didn't cause any measurable performance loss.

ACK 67a429f

elichai · 2020-07-28T13:41:51Z

locally with clang-10 on these lines:
https://github.com/bitcoin-core/secp256k1/blob/40412b1930483fe5d2ff184d071d2b5125f84970/src/secp256k1.c#L519:L522

In master I get:

        0010379e 48 85 c9        TEST       param_4,param_4
        001037a1 74 07           JZ         LAB_001037aa
        001037a6 f7 d8           NEG        EAX
        001037a8 21 01           AND        dword ptr [param_4],EAX

ghidra disassembly:

  if (local_230 != (uint *)0x0) {
    *local_230 = *local_230 & -uVar6;
  }

but with this PR I get:

        0010379e 48 85 c9        TEST       param_4,param_4
        001037a1 74 15           JZ         LAB_001037b8
        001037b1 83 c0 ff        ADD        EAX,-0x1
        001037b4 23 01           AND        EAX,dword ptr [param_4]
        001037b6 89 01           MOV        dword ptr [param_4],EAX

ghidra disassembly:

  if (local_230 != (uint *)0x0) {
    *local_230 = uVar11 - 1 & *local_230;
  }

both are constant time but weirdly different

real-or-random · 2020-07-28T13:47:31Z

@elichai Well ok but is this an issue?

And are you okay with my comment above on casting early?
#772 (comment)

elichai · 2020-07-28T13:53:09Z

Nope, no issues, just wanted to report my findings.
I did a bunch of local testing with compilers and flags and nothing got worse on my arch + compilers
so if it improves the situation on other targets(gmaxwell says it does) then that's great :)

tACK 67a429f

dot-asm · 2020-08-02T21:12:59Z

It's not exactly my business and it might be not as relevant, but in this time and age one can do better than being left at compiler's mercy to interpret the condition in question as branch or not. I mean the fact that it does it one way today is no guarantee that it won't do it otherwise tomorrow, with different flag or not. The "secret" keyword is __int128, the type supported by majority of contemporary 64-bit compilers, gcc, clang, icc, xlc (for Linux)...

gmaxwell · 2020-08-03T02:12:55Z

The library uses it, on 64-bit. Most of the code we were discussing was the 32-bit code, while the 64-bit code is similar but uses larger limbs and works into a 128-bit int. Unfortunately it isn't a silver bullet either because we still need carries there too, just far fewer of them.

[But thanks for the comment, if we hadn't been aware of that it would be a major revelation and much appreciated!]

dot-asm · 2020-08-03T08:26:28Z

What I tried to hint at is that one can use 128-bit additions to avoid handling carries with conditionals. For example muladd could look like this:

uint128_t t = (uint128_t)a * b + c0;
c0 = (uint64_t)t;
t = (t>>64) + c1;
c1 = (uint64_t)t;
c2 += (uint64_t)(t>>64);

This can be naturally extended to 32-bit platforms by replacing ~~uint64_t~~ 64 with ~~uint32_t~~ 32 and uint128_t with uint64_t. And I kind of assumed that it was like that on 32-bit platforms...

sipa · 2020-08-07T01:24:31Z

@dot-asm Of course... I don't understand why I never considered that, as the exact same code was already using doubly-wide types for the multiplication...

See #786.

Summary: * Remove redundant "? 1 : 0" after comparisons in scalar code This prevents GCC from generating branches on PowerPC in certain cases. Fixes #771. * Suppress a harmless variable-time optimization by clang in _int_cmov Follow up on 52a03512c1d800603b5c923c1a28bdba12dadb30 This is a backport of libsecp256k1 [[bitcoin-core/secp256k1#772 | PR772]] Depends on D7590 Test Plan: ninja check-secp256k1 Reviewers: #bitcoin_abc, majcosta Reviewed By: #bitcoin_abc, majcosta Differential Revision: https://reviews.bitcoinabc.org/D7597

real-or-random force-pushed the 202007-ct-power9 branch from 41aec59 to 10a4b2b Compare July 26, 2020 12:59

real-or-random marked this pull request as ready for review July 26, 2020 12:59

Remove redundant "? 1 : 0" after comparisons in scalar code

5b19633

This prevents GCC from generating branches on PowerPC in certain cases. Fixes bitcoin-core#771.

real-or-random force-pushed the 202007-ct-power9 branch from 10a4b2b to 5b19633 Compare July 26, 2020 13:00

Suppress a harmless variable-time optimization by clang in _int_cmov

67a429f

Follow up on 52a0351

elichai reviewed Jul 27, 2020

View reviewed changes

real-or-random mentioned this pull request Jul 28, 2020

secp256k1_scalar_cmov not constant-time on ICC #777

Open

real-or-random merged commit 214cb3c into bitcoin-core:master Jul 28, 2020

sipa mentioned this pull request Aug 7, 2020

Use double-wide types for additions in scalar code #786

Closed

Improve constant-timeness on PowerPC #772

Improve constant-timeness on PowerPC #772

Conversation

real-or-random commented Jul 24, 2020 • edited Loading

gmaxwell commented Jul 24, 2020 • edited Loading

gmaxwell commented Jul 24, 2020 • edited Loading

real-or-random commented Jul 25, 2020

gmaxwell commented Jul 25, 2020

gmaxwell commented Jul 26, 2020

sipa commented Jul 26, 2020

gmaxwell commented Jul 26, 2020

gmaxwell commented Jul 26, 2020

real-or-random commented Jul 26, 2020

gmaxwell commented Jul 26, 2020

real-or-random commented Jul 26, 2020

gmaxwell commented Jul 26, 2020 • edited Loading

real-or-random commented Jul 26, 2020

dot-asm commented Jul 26, 2020

real-or-random commented Jul 26, 2020

real-or-random commented Jul 26, 2020

dot-asm commented Jul 26, 2020 • edited Loading

gmaxwell commented Jul 26, 2020 • edited Loading

dot-asm commented Jul 26, 2020

real-or-random commented Jul 26, 2020

gmaxwell commented Jul 26, 2020 • edited Loading

gmaxwell commented Jul 27, 2020 • edited Loading

real-or-random commented Jul 27, 2020

gmaxwell commented Jul 27, 2020

real-or-random commented Jul 27, 2020 • edited Loading

real-or-random commented Jul 27, 2020

elichai Jul 27, 2020

Choose a reason for hiding this comment

real-or-random Jul 27, 2020 • edited Loading

Choose a reason for hiding this comment

gmaxwell commented Jul 27, 2020

real-or-random commented Jul 27, 2020 • edited Loading

gmaxwell commented Jul 28, 2020 • edited Loading

gmaxwell commented Jul 28, 2020 • edited Loading

gmaxwell commented Jul 28, 2020

real-or-random commented Jul 28, 2020

gmaxwell commented Jul 28, 2020

real-or-random commented Jul 28, 2020

real-or-random commented Jul 28, 2020

gmaxwell commented Jul 28, 2020 • edited Loading

elichai commented Jul 28, 2020

real-or-random commented Jul 28, 2020

elichai commented Jul 28, 2020 • edited Loading

dot-asm commented Aug 2, 2020

gmaxwell commented Aug 3, 2020

dot-asm commented Aug 3, 2020 • edited Loading

sipa commented Aug 7, 2020

real-or-random commented Jul 24, 2020 •

edited

Loading

gmaxwell commented Jul 24, 2020 •

edited

Loading

gmaxwell commented Jul 24, 2020 •

edited

Loading

gmaxwell commented Jul 26, 2020 •

edited

Loading

dot-asm commented Jul 26, 2020 •

edited

Loading

gmaxwell commented Jul 26, 2020 •

edited

Loading

gmaxwell commented Jul 26, 2020 •

edited

Loading

gmaxwell commented Jul 27, 2020 •

edited

Loading

real-or-random commented Jul 27, 2020 •

edited

Loading

real-or-random Jul 27, 2020 •

edited

Loading

real-or-random commented Jul 27, 2020 •

edited

Loading

gmaxwell commented Jul 28, 2020 •

edited

Loading

gmaxwell commented Jul 28, 2020 •

edited

Loading

gmaxwell commented Jul 28, 2020 •

edited

Loading

elichai commented Jul 28, 2020 •

edited

Loading

dot-asm commented Aug 3, 2020 •

edited

Loading