Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve substr() performance by avoiding using owned string #13688

Merged
merged 1 commit into from
Dec 9, 2024

Conversation

richox
Copy link
Contributor

@richox richox commented Dec 8, 2024

Which issue does this PR close?

Closes #13687 .

Rationale for this change

What changes are included in this PR?

Are these changes tested?

tested with: cargo bench --bench substr
the result shows that substr_string and substr_large_string are running much faster in all cases.

   Compiling datafusion-functions v43.0.0 (/Volumes/Workspace/blaze-init/arrow-datafusion/datafusion/functions)
    Finished `bench` profile [optimized] target(s) in 2m 15s
     Running benches/substr.rs (target/release/deps/substr-a86cfe081ca878bf)
Gnuplot not found, using plotters backend
SHORTER THAN 12/substr_string_view [size=1024, strlen=12]
                        time:   [17.906 µs 18.033 µs 18.166 µs]
                        change: [-5.6189% -4.0418% -2.4317%] (p = 0.00 < 0.05)
                        Performance has improved.
SHORTER THAN 12/substr_string [size=1024, strlen=12]
                        time:   [16.450 µs 16.554 µs 16.670 µs]
                        change: [-71.933% -71.626% -71.291%] (p = 0.00 < 0.05)
                        Performance has improved.
SHORTER THAN 12/substr_large_string [size=1024, strlen=12]
                        time:   [18.539 µs 18.803 µs 19.144 µs]
                        change: [-70.591% -69.829% -69.005%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

LONGER THAN 12/substr_string_view [size=1024, count=64, strlen=128]
                        time:   [28.407 µs 28.605 µs 28.818 µs]
                        change: [-2.8033% -0.9698% +0.5511%] (p = 0.33 > 0.05)
                        No change in performance detected.
LONGER THAN 12/substr_string [size=1024, count=64, strlen=128]
                        time:   [29.680 µs 29.874 µs 30.084 µs]
                        change: [-64.029% -63.622% -63.202%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 10 measurements (30.00%)
  2 (20.00%) low mild
  1 (10.00%) high mild
LONGER THAN 12/substr_large_string [size=1024, count=64, strlen=128]
                        time:   [29.907 µs 30.260 µs 30.683 µs]
                        change: [-62.955% -61.999% -61.035%] (p = 0.00 < 0.05)
                        Performance has improved.

SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=1024, count=6, strlen=128]
                        time:   [26.295 µs 26.483 µs 26.682 µs]
                        change: [-5.6813% -4.7257% -3.7561%] (p = 0.00 < 0.05)
                        Performance has improved.
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=1024, count=6, strlen=128]
                        time:   [29.560 µs 29.838 µs 30.140 µs]
                        change: [-59.939% -59.336% -58.767%] (p = 0.00 < 0.05)
                        Performance has improved.
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=1024, count=6, strlen=128]
                        time:   [30.994 µs 31.199 µs 31.413 µs]
                        change: [-57.768% -57.255% -56.796%] (p = 0.00 < 0.05)
                        Performance has improved.

SHORTER THAN 12/substr_string_view [size=4096, strlen=12]
                        time:   [71.298 µs 71.768 µs 72.296 µs]
                        change: [-3.9430% -2.9409% -1.9043%] (p = 0.00 < 0.05)
                        Performance has improved.
SHORTER THAN 12/substr_string [size=4096, strlen=12]
                        time:   [64.237 µs 64.725 µs 65.242 µs]
                        change: [-72.971% -71.944% -71.285%] (p = 0.00 < 0.05)
                        Performance has improved.
SHORTER THAN 12/substr_large_string [size=4096, strlen=12]
                        time:   [71.706 µs 72.637 µs 74.080 µs]
                        change: [-70.310% -69.729% -69.034%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

LONGER THAN 12/substr_string_view [size=4096, count=64, strlen=128]
                        time:   [111.94 µs 113.43 µs 115.24 µs]
                        change: [-2.1479% -0.4786% +1.4386%] (p = 0.64 > 0.05)
                        No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) high mild
  1 (10.00%) high severe
LONGER THAN 12/substr_string [size=4096, count=64, strlen=128]
                        time:   [114.55 µs 116.05 µs 118.04 µs]
                        change: [-64.142% -63.506% -62.753%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
LONGER THAN 12/substr_large_string [size=4096, count=64, strlen=128]
                        time:   [116.93 µs 117.71 µs 118.67 µs]
                        change: [-62.124% -61.458% -60.871%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=4096, count=6, strlen=128]
                        time:   [104.15 µs 105.34 µs 106.74 µs]
                        change: [-5.7933% -4.3114% -2.7804%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=4096, count=6, strlen=128]
                        time:   [114.42 µs 115.86 µs 117.83 µs]
                        change: [-60.701% -59.738% -58.852%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=4096, count=6, strlen=128]
                        time:   [122.53 µs 123.66 µs 125.01 µs]
                        change: [-56.827% -56.179% -55.534%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high mild

Are there any user-facing changes?

no.

@richox richox changed the title Improve substr() performance by avoiding using owned string WIP: Improve substr() performance by avoiding using owned string Dec 8, 2024
@jonahgao
Copy link
Member

jonahgao commented Dec 8, 2024

CI failure should be caused by #13686

@Dandandan
Copy link
Contributor

I wonder if we can see perf change on TPC-H query 22 (which includes some substring expressions).

@richox
Copy link
Contributor Author

richox commented Dec 8, 2024

CI failure should be caused by #13686

it seems cargo test still fails on unrelated cases.

@richox richox changed the title WIP: Improve substr() performance by avoiding using owned string Improve substr() performance by avoiding using owned string Dec 8, 2024
richox pushed a commit to blaze-init/arrow-datafusion that referenced this pull request Dec 8, 2024
@jonahgao
Copy link
Member

jonahgao commented Dec 8, 2024

it seems cargo test still fails on unrelated cases.

I think merging the main branch can fix it.

Copy link
Member

@jonahgao jonahgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @richox .

TPC-H query 22 uses string_view_substr, seems not affected by this PR. cc @Dandandan

@Dandandan
Copy link
Contributor

LGTM, thanks @richox .

TPC-H query 22 uses string_view_substr, seems not affected by this PR. cc @Dandandan

Ah makes sense

@alamb alamb merged commit b73734f into apache:main Dec 9, 2024
25 checks passed
@alamb
Copy link
Contributor

alamb commented Dec 9, 2024

Thanks @richox @jonahgao and @Dandandan

zhuliquan pushed a commit to zhuliquan/datafusion that referenced this pull request Dec 11, 2024
zhuliquan pushed a commit to zhuliquan/datafusion that referenced this pull request Dec 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve substr() performance by avoiding using owned string
4 participants