Implement StringPairs for efficient pairs(::AbstractString) #51631

jakobnissen · 2023-10-07T15:44:48Z

The default pairs will iterate keys and values separately. For strings, this represents double work, since both these iterations will need to determine valid string indices.
The introduced StringPairs type will, whenever possible, only compute valid indices once.
Currently, this is only optimised for String and SubString{String}, and not for AbstractString, nor is it optimised when reversed.
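
To illustrate the idea (a minimal sketch, not the code in this PR): the wrapper below calls `iterate` on the string once per character and reuses the index it was called with as the key, so the valid index is only computed once. `OncePairs` is a hypothetical name used only for this example, and the sketch assumes the string's iteration state is its integer index, as it is for `String`.

```julia
# Hypothetical illustration type; NOT the type introduced by this PR.
struct OncePairs{S<:AbstractString}
    s::S
end

# The number of characters is not known without scanning the string.
Base.IteratorSize(::Type{<:OncePairs}) = Base.SizeUnknown()
Base.eltype(::Type{OncePairs{S}}) where {S} = Pair{Int,eltype(S)}

function Base.iterate(p::OncePairs, i::Int=firstindex(p.s))
    y = iterate(p.s, i)           # decodes one character and finds the next valid index
    y === nothing && return nothing
    c, next = y
    return (i => c), next         # the state `i` doubles as the key, so no second index scan
end

s = "aβ∃d"
# Yields the same index => character pairs as pairs(s).
collect(OncePairs(s)) == collect(pairs(s))
```

The default approach, by contrast, pairs up `keys(s)` and `values(s)`, and each of those has to walk the valid indices independently.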

Simple benchmark:

using BenchmarkTools
a = lpad("March", 20)
@btime lstrip($a)

Master: 71.6 ns
This PR: 23.9 ns
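
For context on why `lstrip` works as a proxy benchmark for `pairs`: a function of this shape walks `pairs(s)` until the predicate first fails. The sketch below is a simplified stand-in (`my_lstrip` is a hypothetical name), and Base's actual implementation may differ in details.

```julia
# Simplified stand-in for an lstrip-like function; hypothetical, for illustration only.
function my_lstrip(f, s::AbstractString)
    for (i, c) in pairs(s)
        # Stop at the first character that should be kept and return the rest.
        f(c) || return SubString(s, i, lastindex(s))
    end
    # Every character was stripped: return an empty substring.
    return SubString(s, 1, 0)
end

my_lstrip(isspace, "  αb ") == lstrip("  αb ")
```

Since the loop body is cheap, the cost of producing each `(index, character)` pair dominates, which is why speeding up `pairs` shows up directly in the `lstrip` timing.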

Closes #51624

StefanKarpinski

This looks good to me and is a nice optimization.

base/strings/util.jl

oscardssmith · 2023-10-11T16:15:43Z

Is this a better solution than #51671?

jakobnissen · 2023-10-11T16:22:41Z

See my comment here: #51631 (comment)
They both improve timings, and both PRs together are faster than either one.

jakobnissen · 2023-10-13T13:24:12Z

Is there anything else that needs to be done here? FWIW, I profiled lstrip and didn't see the SubString constructor take any significant time (as suggested by this comment). Most of the time was spent in string iteration.

DilumAluthge · 2023-10-16T02:18:07Z

@vtjnash Could you give this another review?

vtjnash · 2023-10-23T15:23:17Z

It sounds like someone still needs to investigate and fix the actual performance issue (#51631 (comment))

jakobnissen · 2023-10-23T18:00:12Z

Okay, so I did two benchmarks. The first simply measures the performance of pairs(::String), which is what this PR is really about. This is the benchmark:

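using BenchmarkTools

# Consume every (index, char) pair so the loop cannot be optimised away.
function time_pairs(s::String)
    n = 0
    for (i, c) in pairs(s)
        n = xor(n, reinterpret(UInt32, c) * i)
    end
    n
end

# Each source is a text file in the working directory.
for source in ["en", "dk", "cn"]
    src = read(source, String)
    short = first(src, 20)  # first 20 characters
    println(source, " long:  ", @belapsed time_pairs($src))
    println(source, " short: ", @belapsed time_pairs($short))
end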

With three different sources: English (all ASCII), Danish (mostly ASCII) and Chinese (little ASCII). Results:

            master     #51671     #51671 & this
en, short   53.8 ns    37.3 ns    19.9 ns
en, long    21.7 us    17.4 us    11.8 us
da, short   57.3 ns    26.1 ns    19.6 ns
da, long    28.9 us    13.1 us     9.3 us
cn, short   116  ns    112  ns    52.9 ns
cn, long    16.9 us    18.3 us     9.7 us

And here, for a simple lstrip benchmark:

bar(s) = sum(ncodeunits(lstrip(c -> reinterpret(UInt32, c) >>> 24 > 10, s)) for i in 1:10000000)

This takes ~950 ms on #51671, and ~640 ms on #51671 + this PR.
Profiling on #51671 + this PR shows approximately:

91% of time in lstrip
Of that, approximately 18% in SubString
44% in iterate
I don't know what the rest is. Most of it shows up as "overhead" of lstrip in benchmarking. The benchmark does not allocate.

So, this PR is a clear win.

jakobnissen · 2023-10-26T15:16:43Z

Bump. What needs to be done to get this through?

jakobnissen · 2023-10-29T08:38:28Z

I've added "needs nanosoldier" because the noinline in string iteration may have unintended side effects. For example, it makes iteration of Cyrillic 2x slower and Chinese 1.6x slower. On the other hand, it allows iterate to inline itself, so it will be faster for ASCII in many contexts, just like the recent changes to nextind. So let's check its impact.
If the impact is unacceptable, I'll close this PR, because I don't know how to add the algorithmic improvement in this PR without causing regressions.

vtjnash · 2024-04-16T20:22:10Z

@nanosoldier runbenchmarks("shootout" && "misc" && "io" && "string" && "strings", vs=":master")

vtjnash · 2024-04-16T20:23:40Z

It may still be worthwhile for someone to figure out why this isn't handled in the compiler already (past analysis suggested it was a problem with the way it used the SubString constructor #51631 (comment)), but might as well take the win now

nanosoldier · 2024-04-16T20:40:30Z

Your benchmark job has completed, but no benchmarks were actually executed. Perhaps your tag predicate contains misspelled tags? cc @

vtjnash · 2024-04-16T20:44:31Z

@nanosoldier runbenchmarks("shootout" || "misc" || "io" || "string" || "strings", vs=":master")

nanosoldier · 2024-04-16T21:12:11Z

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

jakobnissen · 2024-05-25T11:47:13Z

@DilumAluthge thanks for keeping this up to date. This is good to go now.

jakobnissen · 2024-06-03T06:39:20Z

Bump

jakobnissen added the performance ("Must go faster") and strings ("Strings!") labels Oct 7, 2023

jakobnissen force-pushed the string_pairs branch from 9fe3653 to 42170bf October 11, 2023 12:06

StefanKarpinski approved these changes Oct 11, 2023

jakobnissen added the merge me ("PR is reviewed. Merge when all tests are passing") label Oct 11, 2023

vtjnash reviewed Oct 11, 2023

vtjnash reviewed Oct 11, 2023

vtjnash reviewed Oct 11, 2023

jakobnissen added and removed the merge me label Oct 11, 2023

DilumAluthge requested a review from vtjnash October 16, 2023 02:08

DilumAluthge removed the merge me label Oct 16, 2023

jakobnissen force-pushed the string_pairs branch from f08ac97 to 5f83be4 October 28, 2023 10:48

Generalize to IterableStatePairs

8d2699c

jakobnissen force-pushed the string_pairs branch from 5f83be4 to 8d2699c October 28, 2023 10:54

jakobnissen added the needs nanosoldier run ("This PR should have benchmarks run on it") label Oct 29, 2023

Merge branch 'master' into string_pairs

95ee015

vtjnash added the merge me label and removed the needs nanosoldier run label Apr 16, 2024

DilumAluthge and others added 3 commits May 17, 2024 19:54

Merge branch 'master' into string_pairs

504dfcb

Merge branch 'master' into string_pairs

bb5d5af

Refine iterator type

1b273e5

jakobnissen mentioned this pull request May 25, 2024

Add a contributing guide in the README JuliaLang/StyledStrings.jl#63

Closed

jakobnissen added 2 commits May 25, 2024 11:26

Style nit

9ca5509

Refine type

1d30dcc

DilumAluthge merged commit 5897a92 into JuliaLang:master Jun 3, 2024
7 checks passed

DilumAluthge removed the merge me label Jun 3, 2024

jakobnissen deleted the string_pairs branch June 4, 2024 10:37
