utf8: AVX2 implementation of Valid #58
The code looks really clean, nice work so far 🙌
First bug found by the Go1.18 fuzzing system:
This is a direct shift and lift operation. Lots of opportunities to refactor.
Not used.
Also fix errors in some of the tables.
I don't feel bad if we don't reuse the stdlib symbols; taking dependencies on unexported APIs always has a higher maintenance cost.
As an experiment, commit 4a7bb03 shows what it would look like to call the stdlib directly as opposed to re-implementing it. It's slightly slower on the current benchmarks, but the easier maintenance is probably worth it.
	"github.com/segmentio/asm/ascii"
)

func FuzzValid(f *testing.F) {
🔥
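For readers unfamiliar with differential fuzzing, here is a minimal sketch of the kind of check a target like FuzzValid can run: compare the implementation under test against the standard library on every input. The names candidateValid and checkAgainstStdlib are hypothetical and do not come from this PR; candidateValid is a plain scalar stand-in so the sketch runs without the assembly.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// candidateValid is a hypothetical stand-in for the implementation under
// test. It walks the input rune by rune and rejects any invalid encoding.
func candidateValid(p []byte) bool {
	for len(p) > 0 {
		r, size := utf8.DecodeRune(p)
		// DecodeRune reports an invalid encoding as (RuneError, 1).
		// A correctly encoded U+FFFD comes back with size 3 instead.
		if r == utf8.RuneError && size == 1 {
			return false
		}
		p = p[size:]
	}
	return true
}

// checkAgainstStdlib returns an error when the candidate disagrees with
// utf8.Valid. Inside a real fuzz target it would be wired up roughly as:
//
//	f.Fuzz(func(t *testing.T, p []byte) {
//		if err := checkAgainstStdlib(p); err != nil {
//			t.Fatal(err)
//		}
//	})
func checkAgainstStdlib(p []byte) error {
	got, want := candidateValid(p), utf8.Valid(p)
	if got != want {
		return fmt.Errorf("candidate(%q) = %v, utf8.Valid = %v", p, got, want)
	}
	return nil
}

func main() {
	seeds := [][]byte{
		nil,
		[]byte("hello"),
		[]byte("h\u00e9llo \u20ac"),
		{0xc3},             // truncated 2-byte sequence
		{0xed, 0xa0, 0x80}, // encoded surrogate, invalid in UTF-8
		{0xef, 0xbf, 0xbd}, // U+FFFD encoded correctly
	}
	for _, s := range seeds {
		if err := checkAgainstStdlib(s); err != nil {
			fmt.Println("mismatch:", err)
			return
		}
	}
	fmt.Println("all seeds agree")
}
```

The seeds cover the cases that tend to trip hand-written validators (truncation, surrogates, and a legitimately encoded replacement character); the fuzzer then mutates from there.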
// Prepare intermediate vector for push operations
VPERM2I128 $0x03, Y8, Y11, Y8

// Check errors on the high nibble of the previous byte
VPALIGNR $0x0f, Y8, Y11, Y10
VPSRLW $0x04, Y10, Y12
VPAND Y12, Y6, Y12
VPSHUFB Y12, Y3, Y12

// Check errors on the low nibble of the previous byte
VPAND Y10, Y6, Y10
VPSHUFB Y10, Y4, Y10
VPAND Y10, Y12, Y12

// Check errors on the high nibble on the current byte
VPSRLW $0x04, Y11, Y10
VPAND Y10, Y6, Y10
VPSHUFB Y10, Y5, Y10
VPAND Y10, Y12, Y12

// Find 3 bytes continuations
VPALIGNR $0x0e, Y8, Y11, Y10
VPSUBUSB Y2, Y10, Y10

// Find 4 bytes continuations
VPALIGNR $0x0d, Y8, Y11, Y8
VPSUBUSB Y1, Y8, Y8

// Combine them to have all continuations
VPOR Y10, Y8, Y8

// Perform a byte-sized signed comparison with zero to turn any non-zero bytes into 0xFF.
VXORPS Y10, Y10, Y10
VPCMPGTB Y10, Y8, Y8

// Find bytes that are continuations by looking at their most significant bit.
VPAND Y7, Y8, Y8

// Find mismatches between expected and actual continuation bytes
VPXOR Y8, Y12, Y8

// Store result in sticky error
VPOR Y9, Y8, Y9

// Prepare for next iteration
VPSUBUSB Y0, Y11, Y10
VMOVDQU Y11, Y8
You may be able to improve performance here by allocating registers yourself. It looks like avo's register allocator has introduced false data dependencies, and allocating registers yourself (and using more of them) might let you eliminate the dependencies.
Difference validating inputs using AVX with leftover bytes, between the memory scratch and fully in vector registers:
🚢
This branch is a Go implementation of the Keiser-Lemire "Validating UTF-8 In Less Than One Instruction Per Byte" paper. For inputs under 32 bytes or on machines without AVX2 support, a re-implementation of the stdlib algorithm is used.
For incomplete blocks of 32 bytes, this version still uses the vector registers.
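The dispatch described above can be sketched roughly as follows. This is not the code from the branch: hasAVX2, validVector and validScalar are hypothetical names, the feature flag is hard-coded, and both paths defer to utf8.Valid so the example runs on any machine.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// hasAVX2 is a hypothetical flag. A real build would set it once at
// startup from a CPU feature check.
var hasAVX2 = false

// blockSize is the width of one AVX2 register in bytes.
const blockSize = 32

// validVector is a placeholder for the assembly routine. It defers to the
// standard library here so that the sketch stays runnable.
func validVector(p []byte) bool { return utf8.Valid(p) }

// validScalar stands in for the scalar fallback, assumed for this sketch
// to behave like utf8.Valid.
func validScalar(p []byte) bool { return utf8.Valid(p) }

// valid takes the scalar path for short inputs or when AVX2 is missing,
// and the vector path otherwise.
func valid(p []byte) bool {
	if len(p) < blockSize || !hasAVX2 {
		return validScalar(p)
	}
	return validVector(p)
}

func main() {
	fmt.Println(valid([]byte("h\u00e9llo"))) // short input, scalar path
	fmt.Println(valid([]byte{0xc3}))         // truncated sequence
}
```

Keeping the length check first means short inputs never pay for the feature test, which matters when the overhead for inputs under 32 bytes is the thing being measured.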
This code exposes two functions: Valid([]byte) bool and Validate([]byte) (bool, bool). Valid is a drop-in replacement for the standard library's utf8.Valid. Validate is a more precise function that also returns whether the input was valid ASCII. For small strings, ascii.Valid is used as a first pass, then stdlib's utf8.Valid is used. This is possibly responsible for the overhead we are seeing for inputs < 32 bytes.
Current results:
This is my first time writing Go assembly, so I'd appreciate any kind of feedback!
ns/op, for arrays up to 400 bytes (lower is better):
ns/op, for arrays up to 64MiB (lower is better):
Machine: specs
Code used to generate graphs: plot.py
Todo
- Understand why the low overhead algorithm is slower than stdlib. Not understood, but after iterating on the code, the low overhead algorithm is as fast as the standard library one on an Intel CPU (not AMD, somehow).
- utf8.first and acceptRanges tables.
- Cover profile for generated asm code. I don't think that's possible.
Further work