utf8: AVX2 implementation of Valid #58

pelletier · 2021-10-14T01:44:09Z

This branch is a Go implementation of the Keiser-Lemire "Validating UTF-8 In Less Than One Instruction Per
Byte" paper. For inputs under 32 bytes or on machines without AVX2 support, a re-implementation of the stdlib algorithm is used.

For incomplete blocks of 32 bytes, this version still uses the vector registers.

This code exposes two functions: Valid([]byte) bool and Validate([]byte) (bool, bool). Valid is a drop-in replacement for the standard library's utf8.Valid. Validate is a more precise function that also returns whether the input was valid ASCII. For small inputs, ascii.Valid is used as a first pass, then the stdlib's utf8.Valid is used. This is possibly responsible for the overhead we are seeing for inputs < 32 bytes.
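To make the small-input path described above concrete, here is a minimal pure-Go sketch. The names `isASCII` and `validateSmall` are hypothetical and are not the package's actual API. `isASCII` is a plain-loop stand-in for the `ascii.Valid` first pass, assuming that pass only checks that every byte is below 0x80.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isASCII reports whether p contains only bytes below 0x80.
// Hypothetical stand-in for the ascii.Valid first pass.
func isASCII(p []byte) bool {
	for _, b := range p {
		if b >= utf8.RuneSelf {
			return false
		}
	}
	return true
}

// validateSmall is a hypothetical name for the small-input path.
// It returns (valid UTF-8, valid ASCII).
func validateSmall(p []byte) (bool, bool) {
	if isASCII(p) {
		// ASCII is a strict subset of UTF-8, so no second pass is needed.
		return true, true
	}
	// Non-ASCII bytes are present: fall back to the standard library.
	return utf8.Valid(p), false
}

func main() {
	for _, s := range []string{"hello", "日本語", "\xc60"} {
		u, a := validateSmall([]byte(s))
		fmt.Printf("%q utf8=%t ascii=%t\n", s, u, a)
	}
}
```

In this sketch a non-ASCII input is scanned twice (once by each pass), which would be consistent with the overhead seen on the 10Japan benchmark, though that is only a guess.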

Current results:

goos: darwin
goarch: amd64
pkg: github.com/segmentio/asm/utf8
cpu: Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz

name                    time/op
Valid/1kValid/AVX-8       80.0ns ± 2%
Valid/1kValid/Stdlib-8     733ns ± 2%
Valid/1MValid/AVX-8       76.8µs ± 2%
Valid/1MValid/Stdlib-8     751µs ± 1%
Valid/10ASCII/Stdlib-8    4.07ns ± 0%
Valid/10ASCII/AVX-8       7.70ns ± 2%
Valid/10Japan/AVX-8       28.6ns ± 1%
Valid/10Japan/Stdlib-8    27.0ns ± 1%

name                    speed
Valid/1kValid/AVX-8     12.8GB/s ± 2%
Valid/1kValid/Stdlib-8  1.40GB/s ± 2%
Valid/1MValid/AVX-8     13.7GB/s ± 2%
Valid/1MValid/Stdlib-8  1.40GB/s ± 1%
Valid/10ASCII/Stdlib-8  2.46GB/s ± 0%
Valid/10ASCII/AVX-8     1.30GB/s ± 2%
Valid/10Japan/AVX-8     1.05GB/s ± 1%
Valid/10Japan/Stdlib-8  1.11GB/s ± 1%

This is my first time writing Go assembly, so I'd appreciate any kind of feedback!

[Graph: ns/op, for arrays up to 400 bytes (lower is better)]

[Graph: ns/op, for arrays up to 64MiB (lower is better)]

Machine: specs
Code used to generate graphs: plot.py

Todo

Generate code with AVO.
Check AVX2 support.
Use lower overhead algorithm for < 32B.
~~Understand why the low overhead algorithm is slower than stdlib.~~ Not understood, but after iterating on the code, the low overhead algorithm is as fast as the standard library one on an Intel CPU (not AMD, somehow).
Make the test suite faster.
Also return whether the input was ASCII only.
Fix table generation (see 3716cfd)
Reuse stdlib's utf8.first and acceptRanges tables.
~~Cover profile for generated asm code~~. I don't think that's possible.

Further work


achille-roussel

The code looks really clean, nice work so far 🙌


pelletier · 2021-11-24T05:03:04Z

First bug found by the Go1.18 fuzzing system:

[tpelletier@thinkpad utf8]$ gotip test -run _ -fuzz ./
warning: starting with empty corpus
fuzz: elapsed: 0s, execs: 0 (0/sec), new interesting: 0 (total: 0)
fuzz: minimizing 1863-byte failing input file
fuzz: elapsed: 0s, minimizing
--- FAIL: FuzzValid (0.41s)
    --- FAIL: FuzzValid (0.00s)
        valid_fuzz_test.go:16: Valid("0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000\xc60") = true; want false
    
    Failing input written to testdata/fuzz/FuzzValid/10d8eaee7858193ed8118cacee74232872e061aa7a7a768ba0792bf7bbb22b72
    To re-run:
    go test -run=FuzzValid/10d8eaee7858193ed8118cacee74232872e061aa7a7a768ba0792bf7bbb22b72
FAIL
exit status 1
FAIL	github.com/segmentio/asm/utf8	0.416s


achille-roussel · 2022-01-04T16:07:27Z

I don't feel bad if we don't reuse the stdlib symbols; taking dependencies on unexported APIs always has a high maintenance cost.

pelletier · 2022-01-04T16:26:37Z

As an experiment, commit 4a7bb03 shows what it would look like to call the stdlib directly as opposed to re-implementing it. It's slightly slower on the current benchmarks, but the easier maintenance is probably worth it.


chriso · 2022-01-04T21:04:47Z

utf8/valid_go18_test.go

+	"github.com/segmentio/asm/ascii"
+)
+
+func FuzzValid(f *testing.F) {


chriso · 2022-01-05T02:04:48Z

utf8/valid_amd64.s

+	// Prepare intermediate vector for push operations
+	VPERM2I128 $0x03, Y8, Y11, Y8
+
+	// Check errors on the high nibble of the previous byte
+	VPALIGNR $0x0f, Y8, Y11, Y10
+	VPSRLW   $0x04, Y10, Y12
+	VPAND    Y12, Y6, Y12
+	VPSHUFB  Y12, Y3, Y12
+
+	// Check errors on the low nibble of the previous byte
+	VPAND   Y10, Y6, Y10
+	VPSHUFB Y10, Y4, Y10
+	VPAND   Y10, Y12, Y12
+
+	// Check errors on the high nibble on the current byte
+	VPSRLW  $0x04, Y11, Y10
+	VPAND   Y10, Y6, Y10
+	VPSHUFB Y10, Y5, Y10
+	VPAND   Y10, Y12, Y12
+
+	// Find 3 bytes continuations
+	VPALIGNR $0x0e, Y8, Y11, Y10
+	VPSUBUSB Y2, Y10, Y10
+
+	// Find 4 bytes continuations
+	VPALIGNR $0x0d, Y8, Y11, Y8
+	VPSUBUSB Y1, Y8, Y8
+
+	// Combine them to have all continuations
+	VPOR Y10, Y8, Y8
+
+	// Perform a byte-sized signed comparison with zero to turn any non-zero bytes into 0xFF.
+	VXORPS   Y10, Y10, Y10
+	VPCMPGTB Y10, Y8, Y8
+
+	// Find bytes that are continuations by looking at their most significant bit.
+	VPAND Y7, Y8, Y8
+
+	// Find mismatches between expected and actual continuation bytes
+	VPXOR Y8, Y12, Y8
+
+	// Store result in sticky error
+	VPOR Y9, Y8, Y9
+
+	// Prepare for next iteration
+	VPSUBUSB Y0, Y11, Y10
+	VMOVDQU  Y11, Y8


You may be able to improve performance here by allocating registers yourself. It looks like avo's register allocator has introduced false data dependencies, and allocating registers yourself (and using more of them) might let you eliminate the dependencies.
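For readers following the "3 bytes continuations" and "4 bytes continuations" steps in the assembly quoted above, here is a scalar Go model of that part only. It assumes the constants loaded in Y2 and Y1 are `0xE0-1` and `0xF0-1` in every byte, as in the Keiser-Lemire paper. The snippet does not show those registers being initialized, so treat the constants as an assumption. `satSub` and `mustBeContinuation23` are hypothetical names.

```go
package main

import "fmt"

// satSub models VPSUBUSB on one byte: unsigned subtraction clamped at 0.
func satSub(a, b byte) byte {
	if a < b {
		return 0
	}
	return a - b
}

// mustBeContinuation23 returns, for each position i, 0x80 when the byte at i
// must be the third or fourth byte of a multi-byte sequence, and 0 otherwise.
// Bytes before the start of the input are treated as zero.
func mustBeContinuation23(in []byte) []byte {
	out := make([]byte, len(in))
	for i := range in {
		var prev2, prev3 byte
		if i >= 2 {
			prev2 = in[i-2] // models VPALIGNR $0x0e
		}
		if i >= 3 {
			prev3 = in[i-3] // models VPALIGNR $0x0d
		}
		third := satSub(prev2, 0xE0-1)  // non-zero iff prev2 >= 0xE0
		fourth := satSub(prev3, 0xF0-1) // non-zero iff prev3 >= 0xF0
		// VPOR, then compare-greater-than-zero, then mask with 0x80.
		if third|fourth != 0 {
			out[i] = 0x80
		}
	}
	return out
}

func main() {
	fmt.Printf("% x\n", mustBeContinuation23([]byte("a€b"))) // 61 e2 82 ac 62
	fmt.Printf("% x\n", mustBeContinuation23([]byte("😀")))   // f0 9f 98 80
}
```

For `"a€b"` only index 3 (the last byte of the euro sign) is marked. For the emoji, indexes 2 and 3 are marked. The second byte of each sequence is not covered here, because the nibble table lookups handle it.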

pelletier · 2022-01-06T03:05:58Z

Difference validating inputs using AVX with leftover bytes, between the memory scratch and fully in vector registers:

benchstat out-old.txt out-new.txt
name                 old time/op    new time/op     delta
Valid/tail300/AVX-8    32.4ns ± 2%     28.0ns ± 2%  -13.74%  (p=0.008 n=5+5)
Valid/tail316/AVX-8    32.6ns ± 0%     28.1ns ± 0%  -13.74%  (p=0.008 n=5+5)

name                 old speed      new speed       delta
Valid/tail300/AVX-8  9.26GB/s ± 2%  10.73GB/s ± 2%  +15.93%  (p=0.008 n=5+5)
Valid/tail316/AVX-8  9.69GB/s ± 0%  11.24GB/s ± 0%  +15.93%  (p=0.008 n=5+5)
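For context on the "memory scratch" variant being compared above, the idea can be sketched in Go as copying the leftover bytes into a zeroed 32-byte buffer so that the last block can be processed like any other. This is a rough illustration rather than the generated assembly, and `lastBlock` is a hypothetical name.

```go
package main

import "fmt"

// lastBlock copies the trailing len(p)%32 bytes of p into a zeroed 32-byte
// scratch buffer. The zero padding is plain ASCII (NUL), so it cannot turn
// a valid tail into an invalid one.
func lastBlock(p []byte) [32]byte {
	var scratch [32]byte
	tail := p[len(p)-len(p)%32:]
	copy(scratch[:], tail)
	return scratch
}

func main() {
	p := make([]byte, 300)
	for i := range p {
		p[i] = 'a'
	}
	block := lastBlock(p)
	fmt.Printf("tail bytes: %d, block: %q\n", len(p)%32, block[:])
}
```

The register-only variant avoids this round trip through memory, which is presumably where the improvement in the numbers above comes from.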

pelletier · 2022-01-06T03:14:24Z

Nice to see the duration variations between multiples of 32 being dampened:

achille-roussel

🚢


pelletier mentioned this pull request Oct 15, 2021

Decode: validate UTF-8 pelletier/go-toml#629

Merged

28 tasks

achille-roussel reviewed Oct 29, 2021


achille-roussel reviewed Oct 29, 2021


pelletier force-pushed the pelletier/utf8-valid branch from 858c621 to 45cfe78 Compare November 23, 2021 17:05

achille-roussel reviewed Nov 23, 2021


achille-roussel reviewed Dec 1, 2021


pelletier and others added 23 commits January 3, 2022 13:26

utf8: AVX2 implementation of valid

8781351

Rewrite with Avo

1bc60e4

This is a direct shift and lift operation. Lots of opportunities to refactor.

Add missing build file

60ebe9c

Scratch space should be 32 bytes

e6101cf

Remove setting the Unroll field

b43dff7

Not used.

Slow implementation passes all the tests

4a869a0

Do some benchmarking

4025b79

Set cutoff to 128 bytes

70ac67a

Generate tables from human description

f23287e

Also fix errors in some of the tables.

Check for AVX2 support

0841d55

Add missing file

d3f76ab

Default to stdlib for non-amd64 platforms

41fa615

Remove copy for remaining bytes

710631a

Add +build to default

740f972

More tests

fc4dbc6

Revert table generation

83df272

Fix table generation

ba68474

Found some bugs!

c08cbf9

Add fuzzing harness for go1.18

5ffccde

VZEROUPPER before returning

350ebaa

https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake/41349852#41349852 Co-authored-by: Achille <[email protected]>

Regen

5bab350

Fix continuationMaskData comment

3ce819f

Fix condition before jumping to the stdlib implm

cc2e8be

Use the actual stdlib

4a7bb03

Don't run ascii.Valid

71f4c92

achille-roussel reviewed Jan 4, 2022


Rename packages and return values

aaba3a8

achille-roussel reviewed Jan 4, 2022


chriso reviewed Jan 4, 2022


pelletier added 4 commits January 4, 2022 20:33

Change signature of Validate

9a2449b

Remove unused imports

ed0e17b

Move MSB mask loading out of hot path

0814fd8

Reuse intermediate vector for pushN operations

e7c8a9e

chriso reviewed Jan 5, 2022


pelletier added 4 commits January 5, 2022 20:56

Load last block near page boundary

aa912a4

Fix invalid offset for large tail load

e4da37d

Add tail benchmarks

0f3e231

Fix API change in fuzzer

473f5ac

pelletier added 3 commits January 7, 2022 22:15

Add valid command line

f802637

Remove prompt

e603701

Add test for page boundary loads

f02a719

achille-roussel approved these changes Jan 9, 2022


pelletier added 2 commits January 8, 2022 22:32

Move cmd/valid to utf8/cmd/valid

1835aa3

Perform non-AVX2 fallback in go

a9b7485

pelletier changed the title ~~utf8: AVX2 implementation of valid~~ utf8: AVX2 implementation of Valid Jan 9, 2022

pelletier merged commit 0ec6ead into main Jan 11, 2022

pelletier deleted the pelletier/utf8-valid branch January 11, 2022 02:00

pelletier mentioned this pull request Jan 11, 2022

Try manually assigning registers #68

Open
