Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport 2.16: range-based constant-flow base64 #4819

Merged
merged 15 commits into from
Oct 27, 2021

Conversation

gilles-peskine-arm
Copy link
Contributor

@gilles-peskine-arm gilles-peskine-arm commented Jul 28, 2021

Fix #4814: in Mbed TLS 2.26.0 and 2.16.10, we made the base64 code constant-flow by changing table lookup into constant-time table lookup (look up every item and or them together). This had a significant performance cost. This pull request uses a different approach: instead of doing a constant-flow table lookup, do a range-based or. Since base64 has only 5 ranges, as opposed to 64/128 table entries (encoding/decoding), this turns out to be a significant performance improvement. There is also a slight gain in code size (but still a loss compared to the original non-constant-time table approach).

Version 2.16.9 2.16.11 this PR
load_roots on 552 certs (x86_64) 11ms 428 ms 21 ms
size base64.o (thumb) 1147 1287 1187

This PR also introduces a new sample program which I used for benchmarking. Since it's there I propose to keep it, but extra features (including nontrivial amounts of documentation, or CI integration) are out of scope.

I started with a 2.16 patch because this all started from code I'd written in the Mbed Crypto days. This will need to be ported to all branches - 2.2x, development TODO.

Base64 decoding uses equality comparison tests for characters that don't
leak information about the content of the data other than its length, such
as whitespace. Do this with '=' as well, since it only reveals information
about the length. This way the table lookup can focus on character validity
and decoding value.

Signed-off-by: Gilles Peskine <[email protected]>
Instead of doing constant-flow table lookup, which requires 128 memory loads
for each lookup into a 128-entry table, do a range-based calculation, which
requires more CPU instructions per range but there are only 5 ranges.

Experimentally, this is ~12x faster on my PC (based on
programs/x509/load_roots). The code is slightly smaller, too.

Signed-off-by: Gilles Peskine <[email protected]>
Document what each local variable does when it isn't obvious from the name.
Don't reuse a variable for different purposes.

This commit has very little impact on the generated code (same code size on
a sample Thumb build), although it does fix a theoretical bug that 2^32
spaces inside a line would be ignored instead of treated as an error.

Signed-off-by: Gilles Peskine <[email protected]>
Instead of doing constant-flow table lookup, which requires 64 memory loads
for each lookup into a 64-entry table, do a range-based calculation, which
requires more CPU instructions per range but there are only 5 ranges.

I expect a significant performance gain (although smaller than for decoding
since the encoding table is half the size), but I haven't measured. Code
size is slightly smaller.

Signed-off-by: Gilles Peskine <[email protected]>
@gilles-peskine-arm gilles-peskine-arm added bug needs-review Every commit must be reviewed by at least two team members, needs-backports Backports are missing or are pending review and approval. component-crypto Crypto primitives and low-level interfaces needs-reviewer This PR needs someone to pick it up for review labels Jul 28, 2021
@gilles-peskine-arm gilles-peskine-arm requested a review from mpg July 28, 2021 12:50
struct mbedtls_timing_hr_time timer;
unsigned long ms;

if( argc == 0 )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if( argc == 0 )
if( argc < 2 )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, when testing the program I didn't manage to get the usage string printed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a couple of places in the code where same comparison is made before printing USAGE. They should be at least subject of deeper look because they can be quite legit.

Copy link
Contributor

@mpg mpg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking very good. It's very close to the previous version, as the main different is the constant-time table-lookup function was replaced with a function that does mostly the same thing but with a different implementation, so I think the risk of introducing issues is pretty low. I've reviewed the other changes carefully and couldn't find any place that would leak secret information, all the information that's leaked is the position of special chars in the input and the length of the output, none of which are sensitive.

Regarding testing, considering there's a fairly large amount of information that's acceptable to leak, it probably doesn't make sense to try making the whole decode function testable with valgrind/memsan, but I'm thinking perhaps dec_value could be made STATIC_TESTABLE and tested with an input marked as TEST_CF_SECRETin the test suite? (Obviously that would be for the forward ports, as we don't have those testing facilities in 2.16.)

Other than that, I left a couple of mostly minor comments.

library/base64.c Outdated Show resolved Hide resolved
library/base64.c Outdated Show resolved Hide resolved
library/base64.c Show resolved Hide resolved
library/base64.c Show resolved Hide resolved
TRodziewicz
TRodziewicz previously approved these changes Jul 30, 2021
Copy link
Contributor

@TRodziewicz TRodziewicz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - for me it looks like constant-flow/time code.

programs/x509/load_roots.c Outdated Show resolved Hide resolved
n was used for two different purposes. Give it a different name the second
time. This does not seem to change the generated code when compiling with
optimization for size or performance.

Signed-off-by: Gilles Peskine <[email protected]>
To test c <= high, instead of testing the sign of (high + 1) - c, negate the
sign of high - c (as we're doing for c - low). This is a little easier to
read and shaves 2 instructions off the arm thumb build with
arm-none-eabi-gcc 7.3.1.

Signed-off-by: Gilles Peskine <[email protected]>
I had originally thought to support directories with
mbedtls_x509_crt_parse_path but it would have complicated the code more than
I cared for. Remove a remnant of the original project in the documentation.

Signed-off-by: Gilles Peskine <[email protected]>
@gilles-peskine-arm
Copy link
Contributor Author

Regarding testing, considering there's a fairly large amount of information that's acceptable to leak, it probably doesn't make sense to try making the whole decode function testable with valgrind/memsan, but I'm thinking perhaps dec_value could be made STATIC_TESTABLE and tested with an input marked as TEST_CF_SECRETin the test suite? (Obviously that would be for the forward ports, as we don't have those testing facilities in 2.16.)

Yes, I think enc_char and dec_value should use this testing. There's some delicate logic in mbedtls_base64_decode which would be hard to test because it relies heavily on the fact that it's ok to test the equality of an input character against a non-digit constant. My earlier version at https://github.com/gilles-peskine-arm/mbedtls/tree/base64-no-table-2.16.9/library/base64.c made this simpler by first doing a “character type” calculation in constant time, and then basing the logic in mbedtls_base64_decode on this non-sensitive partial information, but this had a measurable cost in code size and performance.

@gilles-peskine-arm gilles-peskine-arm requested a review from mpg July 30, 2021 13:57
mpg
mpg previously approved these changes Aug 2, 2021
Copy link
Contributor

@mpg mpg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my feedback, looks all good to me now!

@mpg mpg requested a review from TRodziewicz August 2, 2021 07:42
Copy link
Contributor

@davidhorstmann-arm davidhorstmann-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed one typo - sorry I did not get this on the first round.

Changes
* Improve the performance of base64 constant-flow code. The result is still
slower than the original non-constant-flow implementation, but much faster
than the previous constant-flow implemenation. Fixes #4814.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo nit: "implemenation" -> "implementation"

@akien-mga
Copy link

akien-mga commented Oct 24, 2021

Hello! This seems to have been stalled for two months due to two issues in some comments. Is there any hope to see this finalized and merged for the next 2.16.x release? Alternatively, is this in a good enough state that we could use our own backport of this PR downstream (specifically the Godot game engine which uses mbedtls 2.16 and is affected by this regression)?

@gilles-peskine-arm
Copy link
Contributor Author

@akien-mga Thanks for the ping! This seems to have gotten lost in people successively taking over then going on vacation and coming back to other tasks. We're bringing back to the top of the queue. In the meantime, the fix in this pull request is approved except for some documentation issues and has passed our CI, so you can safely go ahead and pick it up.

Signed-off-by: Gilles Peskine <[email protected]>
Signed-off-by: Gilles Peskine <[email protected]>
@gilles-peskine-arm gilles-peskine-arm dismissed stale reviews from mpg and TRodziewicz via 8e82c78 October 25, 2021 19:14
@gilles-peskine-arm gilles-peskine-arm added needs-review Every commit must be reviewed by at least two team members, and removed needs-work labels Oct 25, 2021
Copy link
Contributor

@mpg mpg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mpg
Copy link
Contributor

mpg commented Oct 26, 2021

I checked that all of the CI failures in the pr-head job are instances of #5012 and the pr-merge job passes, so the pr-head failures can be ignored.

Copy link
Contributor

@davidhorstmann-arm davidhorstmann-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mpg mpg added approved Design and code approved - may be waiting for CI or backports and removed needs-review Every commit must be reviewed by at least two team members, labels Oct 27, 2021
@mpg mpg merged commit 70227d2 into Mbed-TLS:mbedtls-2.16 Oct 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Design and code approved - may be waiting for CI or backports bug component-crypto Crypto primitives and low-level interfaces needs-backports Backports are missing or are pending review and approval.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Huge certificate parsing speed regression between mbedtls 2.16.9 and 2.16.10 (constant time base64).
6 participants