Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In #1829 we removed the AVX512 optimizations for the AND/OR kernels since the autovectorized code was just as good, but there are some AVX512 instructions that could have a big benefit and that the compiler would not be able to use automatically. One of those extensions is the compressstore instruction, which essentially implements most of the filter kernel in a single instruction.
I recently experimented with these and found that, while our current filters are extremely good at extreme selectivities thanks to all the optimizations that @tustvold did, for selectivities between 5% and 99% the AVX512 version would be faster; for a random selectivity of 50% it was nearly 10x faster.
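To illustrate the idea, here is a rough sketch of such a kernel, not arrow-rs code: the function name, the assumption that the input length is a multiple of 16, and the absence of null handling and a remainder loop are all simplifications of mine. It uses the AVX512F intrinsics exposed in core::arch::x86_64.

```rust
// Sketch only: filter a &[i32] through a selection bitmap with
// _mm512_mask_compressstoreu_epi32. Assumes values.len() is a multiple of 16
// and ignores validity buffers.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx512f")]
unsafe fn filter_i32_avx512(values: &[i32], filter_bits: &[u8], out: &mut Vec<i32>) {
    use std::arch::x86_64::*;

    // Worst case every value is selected, so reserving values.len() up front
    // lets compressstore write directly into the spare capacity.
    out.reserve(values.len());
    let mut written = out.len();
    let mut dst = out.as_mut_ptr().add(written);

    for (chunk, bits) in values.chunks_exact(16).zip(filter_bits.chunks_exact(2)) {
        // 16 selection bits form the write mask (Arrow bitmaps are LSB-first).
        let mask: __mmask16 = u16::from_le_bytes([bits[0], bits[1]]);
        let v = _mm512_loadu_si512(chunk.as_ptr().cast());
        // The single instruction doing the heavy lifting: contiguously store
        // only the lanes whose mask bit is set.
        _mm512_mask_compressstoreu_epi32(dst.cast(), mask, v);
        let n = mask.count_ones() as usize;
        dst = dst.add(n);
        written += n;
    }
    out.set_len(written);
}
```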
Describe the solution you'd like
There are a few open questions about how best to integrate these functions into the filter kernels. They don't fit that well into the existing strategies, since they would be specific to primitive arrays, and there might be different selectivity cutoffs for falling back to one of the existing strategies.
We would also need to decide whether to statically dispatch to these kernels, based on target-cpu or target-feature, or to use runtime feature detection.
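The runtime-detection variant could look roughly like the sketch below, combined with a selectivity cutoff. The names filter_i32_fallback and the 5%/99% bounds are placeholders of mine, not existing arrow-rs APIs; the AVX512 kernel is the one sketched above.

```rust
// Sketch of runtime dispatch: use the AVX-512 path when the CPU supports it
// and the selectivity is in the range where it wins, otherwise fall back.
fn filter_i32(values: &[i32], filter_bits: &[u8], selected: usize, out: &mut Vec<i32>) {
    #[cfg(target_arch = "x86_64")]
    {
        let selectivity = selected as f64 / values.len().max(1) as f64;
        // std's runtime feature detection; no target-cpu/target-feature build
        // flags are needed for this branch to be taken.
        if is_x86_feature_detected!("avx512f") && (0.05..=0.99).contains(&selectivity) {
            // SAFETY: avx512f support was just verified at runtime.
            return unsafe { filter_i32_avx512(values, filter_bits, out) };
        }
    }
    filter_i32_fallback(values, filter_bits, out)
}

// Plain scalar stand-in for the existing filter strategies.
fn filter_i32_fallback(values: &[i32], filter_bits: &[u8], out: &mut Vec<i32>) {
    for (i, &v) in values.iter().enumerate() {
        if filter_bits[i / 8] & (1 << (i % 8)) != 0 {
            out.push(v);
        }
    }
}
```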
The 8-bit and 16-bit versions of these instructions are also only available since the icelake generation, which makes testing a bit more difficult.
Describe alternatives you've considered
There is a discussion in the portable-simd project about portable alternatives to these instructions, but that would require quite some work in LLVM, since there are no portable LLVM intrinsics yet, only the x86/AVX512 implementations.
Additional context
Benchmark results for filtering i32 running on a tigerlake machine at 3GHz: