Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In #1829 we removed the AVX512 optimizations for the AND/OR kernels since the autovectorized code was just as good, but there are some AVX512 instructions that could have a big benefit and that the compiler would not be able to use automatically. One of those extensions is the compressstore instruction, which essentially implements most of the filter kernel in a single instruction.
I recently experimented with these and found that, while our current filters are extremely good at extreme selectivities thanks to all the optimizations that @tustvold did, for selectivities between 5% and 99% the AVX512 version would be faster; for a random selectivity of 50% it was nearly 10x faster.
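To illustrate the idea, here is a rough sketch of such a kernel, not arrow-rs code: the function name, the assumption that the input length is a multiple of 16, and the absence of null handling and a remainder loop are all simplifications of mine. It uses the AVX512F intrinsics exposed in core::arch::x86_64.

```rust
// Sketch only: filter a &[i32] through a selection bitmap with
// _mm512_mask_compressstoreu_epi32. Assumes values.len() is a multiple of 16
// and ignores validity buffers.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx512f")]
unsafe fn filter_i32_avx512(values: &[i32], filter_bits: &[u8], out: &mut Vec<i32>) {
    use std::arch::x86_64::*;

    // Worst case every value is selected, so reserving values.len() up front
    // lets compressstore write directly into the spare capacity.
    out.reserve(values.len());
    let mut written = out.len();
    let mut dst = out.as_mut_ptr().add(written);

    for (chunk, bits) in values.chunks_exact(16).zip(filter_bits.chunks_exact(2)) {
        // 16 selection bits form the write mask (Arrow bitmaps are LSB-first).
        let mask: __mmask16 = u16::from_le_bytes([bits[0], bits[1]]);
        let v = _mm512_loadu_si512(chunk.as_ptr().cast());
        // The single instruction doing the heavy lifting: contiguously store
        // only the lanes whose mask bit is set.
        _mm512_mask_compressstoreu_epi32(dst.cast(), mask, v);
        let n = mask.count_ones() as usize;
        dst = dst.add(n);
        written += n;
    }
    out.set_len(written);
}
```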
Describe the solution you'd like
There are a few open questions about how best to integrate these functions into the filter kernels. They don't fit that well into the existing strategies, since they would be specific to primitive arrays, and there might be different selectivity cutoffs for falling back to one of the existing strategies.
We would also need to decide whether to statically dispatch to these kernels, based on target-cpu or target-feature, or to use runtime feature detection.
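The runtime-detection variant could look roughly like the sketch below, combined with a selectivity cutoff. The names filter_i32_fallback and the 5%/99% bounds are placeholders of mine, not existing arrow-rs APIs; the AVX512 kernel is the one sketched above.

```rust
// Sketch of runtime dispatch: use the AVX-512 path when the CPU supports it
// and the selectivity is in the range where it wins, otherwise fall back.
fn filter_i32(values: &[i32], filter_bits: &[u8], selected: usize, out: &mut Vec<i32>) {
    #[cfg(target_arch = "x86_64")]
    {
        let selectivity = selected as f64 / values.len().max(1) as f64;
        // std's runtime feature detection; no target-cpu/target-feature build
        // flags are needed for this branch to be taken.
        if is_x86_feature_detected!("avx512f") && (0.05..=0.99).contains(&selectivity) {
            // SAFETY: avx512f support was just verified at runtime.
            return unsafe { filter_i32_avx512(values, filter_bits, out) };
        }
    }
    filter_i32_fallback(values, filter_bits, out)
}

// Plain scalar stand-in for the existing filter strategies.
fn filter_i32_fallback(values: &[i32], filter_bits: &[u8], out: &mut Vec<i32>) {
    for (i, &v) in values.iter().enumerate() {
        if filter_bits[i / 8] & (1 << (i % 8)) != 0 {
            out.push(v);
        }
    }
}
```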
The 8-bit and 16-bit versions of these instructions are also only available since the icelake generation, which makes testing a bit more difficult.
Describe alternatives you've considered
There is a discussion in the portable-simd project about portable alternatives to these instructions, but that would require quite some work in LLVM, since there are no portable LLVM intrinsics yet, only the x86/AVX512 implementations.
Additional context
Benchmark results for filtering i32 running on a tigerlake machine at 3GHz: