Speed up `filter_bytes` #6699

Dandandan · 2024-11-06T22:40:31Z

Which issue does this PR close?

MutableBuffer doesn't always generate optimal code compared to using Vec. I think the main performance difference might be in MutableBuffer::push (but not sure if that explains all the difference).

filter context string (kept 1/2)
                        time:   [385.65 µs 394.58 µs 405.71 µs]
                        change: [-52.835% -51.497% -50.056%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  13 (13.00%) high mild
  3 (3.00%) high severe

Benchmarking filter context string high selectivity (kept 1023/1024): Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.9s, enable flat sampling, or reduce sample count to 60.
filter context string high selectivity (kept 1023/1024)
                        time:   [1.1102 ms 1.1526 ms 1.1964 ms]
                        change: [-11.666% -8.8390% -5.7644%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context string low selectivity (kept 1/1024)
                        time:   [979.52 ns 987.11 ns 994.64 ns]
                        change: [-19.093% -18.289% -17.438%] (p = 0.00 < 0.05)
                        Performance has improved.

Closes #.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

jayzhan211

👍

alamb · 2024-11-07T22:51:24Z

I also ran the benchmarks and see the same improvements 👍

++ critcmp master speedup_filter
group                                                                         master                                 speedup_filter
-----                                                                         ------                                 --------------
filter context mixed string view (kept 1/2)                                   1.00    674.8±6.62µs        ? ?/sec    1.01    679.6±6.81µs        ? ?/sec
filter context mixed string view high selectivity (kept 1023/1024)            1.00   1196.4±2.54µs        ? ?/sec    1.02   1217.7±3.17µs        ? ?/sec
filter context mixed string view low selectivity (kept 1/1024)                1.01   1172.9±0.95ns        ? ?/sec    1.00   1164.3±2.62ns        ? ?/sec
filter context short string view (kept 1/2)                                   1.00    442.1±2.81µs        ? ?/sec    1.01    447.8±4.74µs        ? ?/sec
filter context short string view high selectivity (kept 1023/1024)            1.00    783.1±1.14µs        ? ?/sec    1.01    790.9±3.73µs        ? ?/sec
filter context short string view low selectivity (kept 1/1024)                1.00    936.5±1.46ns        ? ?/sec    1.01    947.4±5.04ns        ? ?/sec
filter context string (kept 1/2)                                              4.86      3.8±0.07ms        ? ?/sec    1.00   774.1±21.31µs        ? ?/sec
filter context string dictionary (kept 1/2)                                   1.00    244.2±1.43µs        ? ?/sec    1.00    243.6±1.24µs        ? ?/sec
filter context string dictionary high selectivity (kept 1023/1024)            1.00    248.9±1.50µs        ? ?/sec    1.00    248.1±0.92µs        ? ?/sec
filter context string dictionary low selectivity (kept 1/1024)                1.00    202.0±0.50µs        ? ?/sec    1.00    201.5±0.74µs        ? ?/sec
filter context string dictionary w NULLs (kept 1/2)                           1.00    341.8±0.75µs        ? ?/sec    1.00    342.5±0.93µs        ? ?/sec
filter context string dictionary w NULLs high selectivity (kept 1023/1024)    1.00    466.3±0.75µs        ? ?/sec    1.01    469.3±1.23µs        ? ?/sec
filter context string dictionary w NULLs low selectivity (kept 1/1024)        1.00    101.1±0.22µs        ? ?/sec    1.00    100.8±0.23µs        ? ?/sec
filter context string high selectivity (kept 1023/1024)                       1.63      2.5±0.06ms        ? ?/sec    1.00  1523.2±31.32µs        ? ?/sec
filter context string low selectivity (kept 1/1024)                           1.23   1993.4±3.93ns        ? ?/sec    1.00   1614.9±4.07ns        ? ?/sec

alamb · 2024-11-07T22:51:49Z

Thanks @Dandandan and @jayzhan211

Dandandan · 2024-11-08T14:33:42Z

I also ran the benchmarks and see the same improvements 👍

++ critcmp master speedup_filter
group                                                                         master                                 speedup_filter
-----                                                                         ------                                 --------------
filter context mixed string view (kept 1/2)                                   1.00    674.8±6.62µs        ? ?/sec    1.01    679.6±6.81µs        ? ?/sec
filter context mixed string view high selectivity (kept 1023/1024)            1.00   1196.4±2.54µs        ? ?/sec    1.02   1217.7±3.17µs        ? ?/sec
filter context mixed string view low selectivity (kept 1/1024)                1.01   1172.9±0.95ns        ? ?/sec    1.00   1164.3±2.62ns        ? ?/sec
filter context short string view (kept 1/2)                                   1.00    442.1±2.81µs        ? ?/sec    1.01    447.8±4.74µs        ? ?/sec
filter context short string view high selectivity (kept 1023/1024)            1.00    783.1±1.14µs        ? ?/sec    1.01    790.9±3.73µs        ? ?/sec
filter context short string view low selectivity (kept 1/1024)                1.00    936.5±1.46ns        ? ?/sec    1.01    947.4±5.04ns        ? ?/sec
filter context string (kept 1/2)                                              4.86      3.8±0.07ms        ? ?/sec    1.00   774.1±21.31µs        ? ?/sec
filter context string dictionary (kept 1/2)                                   1.00    244.2±1.43µs        ? ?/sec    1.00    243.6±1.24µs        ? ?/sec
filter context string dictionary high selectivity (kept 1023/1024)            1.00    248.9±1.50µs        ? ?/sec    1.00    248.1±0.92µs        ? ?/sec
filter context string dictionary low selectivity (kept 1/1024)                1.00    202.0±0.50µs        ? ?/sec    1.00    201.5±0.74µs        ? ?/sec
filter context string dictionary w NULLs (kept 1/2)                           1.00    341.8±0.75µs        ? ?/sec    1.00    342.5±0.93µs        ? ?/sec
filter context string dictionary w NULLs high selectivity (kept 1023/1024)    1.00    466.3±0.75µs        ? ?/sec    1.01    469.3±1.23µs        ? ?/sec
filter context string dictionary w NULLs low selectivity (kept 1/1024)        1.00    101.1±0.22µs        ? ?/sec    1.00    100.8±0.23µs        ? ?/sec
filter context string high selectivity (kept 1023/1024)                       1.63      2.5±0.06ms        ? ?/sec    1.00  1523.2±31.32µs        ? ?/sec
filter context string low selectivity (kept 1/1024)                           1.23   1993.4±3.93ns        ? ?/sec    1.00   1614.9±4.07ns        ? ?/sec

Wow that even seems a much bigger improvement 🤔

alamb · 2024-11-08T14:40:40Z

Wow that even seems a much bigger improvement 🤔

Will rerun and see if I can reprodice

alamb · 2024-11-08T15:47:39Z

Results of rerunning

++ critcmp master speedup_filter
group                                                                         master                                 speedup_filter
-----                                                                         ------                                 --------------
filter context mixed string view (kept 1/2)                                   1.00    676.9±1.46µs        ? ?/sec    1.00    680.2±3.73µs        ? ?/sec
filter context mixed string view high selectivity (kept 1023/1024)            1.00   1195.0±3.18µs        ? ?/sec    1.02   1217.4±3.71µs        ? ?/sec
filter context mixed string view low selectivity (kept 1/1024)                1.01   1167.2±2.60ns        ? ?/sec    1.00   1156.4±1.15ns        ? ?/sec
filter context short string view (kept 1/2)                                   1.00    441.9±3.60µs        ? ?/sec    1.02    451.0±2.40µs        ? ?/sec
filter context short string view high selectivity (kept 1023/1024)            1.00    780.6±2.07µs        ? ?/sec    1.01    790.6±1.75µs        ? ?/sec
filter context short string view low selectivity (kept 1/1024)                1.01    930.2±2.89ns        ? ?/sec    1.00    918.1±4.38ns        ? ?/sec
filter context string (kept 1/2)                                              4.89      3.7±0.09ms        ? ?/sec    1.00   757.8±12.84µs        ? ?/sec
filter context string dictionary (kept 1/2)                                   1.02    247.1±1.64µs        ? ?/sec    1.00    242.5±0.56µs        ? ?/sec
filter context string dictionary high selectivity (kept 1023/1024)            1.02    252.3±1.74µs        ? ?/sec    1.00    248.4±1.13µs        ? ?/sec
filter context string dictionary low selectivity (kept 1/1024)                1.00    201.8±0.46µs        ? ?/sec    1.00    200.9±0.48µs        ? ?/sec
filter context string dictionary w NULLs (kept 1/2)                           1.00    341.7±0.77µs        ? ?/sec    1.00    341.9±0.52µs        ? ?/sec
filter context string dictionary w NULLs high selectivity (kept 1023/1024)    1.00    466.6±1.22µs        ? ?/sec    1.01    469.1±3.93µs        ? ?/sec
filter context string dictionary w NULLs low selectivity (kept 1/1024)        1.00    101.0±0.09µs        ? ?/sec    1.00    101.0±0.38µs        ? ?/sec
filter context string high selectivity (kept 1023/1024)                       1.67      2.5±0.07ms        ? ?/sec    1.00  1491.5±32.49µs        ? ?/sec
filter context string low selectivity (kept 1/1024)                           1.23   1989.6±5.27ns        ? ?/sec    1.00   1622.9±4.10ns        ? ?/sec

Dandandan · 2024-11-08T15:58:20Z

Interesting, so it seems to have even bigger effect on other machines (maybe when branch misses have more impact or it can generate even better code) - my result is from M1 pro

alamb · 2024-11-08T16:13:20Z

Interesting, so it seems to have even bigger effect on other machines (maybe when branch misses have more impact or it can generate even better code) - my result is from M1 pro

This is what I am running on:

processor       : 15
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) CPU @ 3.10GHz
stepping        : 7
microcode       : 0xffffffff
cpu MHz         : 3100.310
cache size      : 25344 KB
physical id     : 0
siblings        : 16
core id         : 7
cpu cores       : 8
apicid          : 15
initial apicid  : 15
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid ts\
c_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle \
avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities
bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs taa mmio_stale_data retbleed eibrs_pbrsb bhi
bogomips        : 6200.62
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

Dandandan · 2024-11-09T19:09:14Z

Ok, thanks for sharing!

github-actions bot added the arrow Changes to the arrow crate label Nov 6, 2024

Dandandan changed the title ~~Speed up~~ Speed up filter_bytes Nov 6, 2024

Dandandan force-pushed the speedup_filter branch from 8c423c0 to cf11d70 Compare November 6, 2024 22:54

Use vec

afc3191

Dandandan force-pushed the speedup_filter branch from cf11d70 to afc3191 Compare November 6, 2024 22:55

jayzhan211 approved these changes Nov 7, 2024

View reviewed changes

Use extend, fix capacity

d711fbc

alamb merged commit e9bf8aa into apache:master Nov 7, 2024
26 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up `filter_bytes` #6699

Speed up `filter_bytes` #6699

Dandandan commented Nov 6, 2024 •

edited

Loading

jayzhan211 left a comment

alamb commented Nov 7, 2024

alamb commented Nov 7, 2024

Dandandan commented Nov 8, 2024

alamb commented Nov 8, 2024

alamb commented Nov 8, 2024

Dandandan commented Nov 8, 2024 •

edited

Loading

alamb commented Nov 8, 2024

Dandandan commented Nov 9, 2024

Speed up filter_bytes #6699

Speed up filter_bytes #6699

Conversation

Dandandan commented Nov 6, 2024 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

jayzhan211 left a comment

Choose a reason for hiding this comment

alamb commented Nov 7, 2024

alamb commented Nov 7, 2024

Dandandan commented Nov 8, 2024

alamb commented Nov 8, 2024

alamb commented Nov 8, 2024

Dandandan commented Nov 8, 2024 • edited Loading

alamb commented Nov 8, 2024

Dandandan commented Nov 9, 2024

Speed up `filter_bytes` #6699

Speed up `filter_bytes` #6699

Dandandan commented Nov 6, 2024 •

edited

Loading

Dandandan commented Nov 8, 2024 •

edited

Loading