Improve performance of readuntil #20621
base/io.jl
Outdated
m[l] = c
end
if i >= l && m == t
i = c == t[i] ? i + 1 : 1
Need to advance 'i' here by a whole character to handle Unicode. I think it'll be easiest to do that by using start/next/done instead of 1/+1/l
In the original code, i represents a character number and not an index into the string.
I think this proposed algorithm would fail with a terminator like "aab" matching against "aaab" (or in general, any string where the first character is repeated anywhere else in the string).
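To make the failure concrete, here is a minimal sketch of the reset-to-1 matching step from the diff above. It assumes Julia 1.x (for `codeunits`), and `naive_find_end` is a hypothetical helper name used only for illustration; it is not code from this PR.

```julia
# Hypothetical helper: scan `data` byte by byte and return the position where
# `term` ends, or `nothing`. On any mismatch the match index resets to 1,
# mirroring `i = c == t[i] ? i + 1 : 1` from the reviewed line.
function naive_find_end(data::AbstractString, term::AbstractString)
    d = codeunits(data)
    t = codeunits(term)
    i = 1
    for (pos, c) in enumerate(d)
        i = c == t[i] ? i + 1 : 1
        if i > length(t)
            return pos
        end
    end
    return nothing
end

naive_find_end("xaab", "aab")  # finds the terminator ending at byte 4
naive_find_end("aaab", "aab")  # returns nothing: the match is missed
```

On "aaab" the third 'a' mismatches 'b' and the index resets to 1, discarding the fact that the last two bytes read ("aa") are already a valid prefix of the terminator.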
Benchmarks for the latest version. Note that the original benchmarks weren't rewinding the stream correctly, so I also updated the description's benchmarks.
julia> using BenchmarkTools
julia> str = String([rand('A':'Z', 50000); '1']);
julia> io = IOBuffer(str);
julia> goal = str[end-1000:end];
julia> b1 = @benchmark readuntil_old(seekstart($io), $goal);
julia> b2 = @benchmark readuntil_new(seekstart($io), $goal);
julia> judge(median(b2),median(b1))
BenchmarkTools.TrialJudgement:
time: -95.66% => improvement (5.00% tolerance)
memory: +5.27% => regression (1.00% tolerance)
julia> goal = str[end:end];
julia> b1 = @benchmark readuntil_old(seekstart($io), $goal);
julia> b2 = @benchmark readuntil_new(seekstart($io), $goal);
julia> judge(median(b2),median(b1))
BenchmarkTools.TrialJudgement:
time: -16.59% => improvement (5.00% tolerance)
memory: +0.00% => invariant (1.00% tolerance)
julia> goal = str;
julia> b1 = @benchmark readuntil_old(seekstart($io), $goal);
julia> b2 = @benchmark readuntil_new(seekstart($io), $goal);
julia> judge(median(b2),median(b1))
BenchmarkTools.TrialJudgement:
time: +0.27% => invariant (5.00% tolerance)
memory: +42.69% => regression (1.00% tolerance)
The memory regression is caused by the new array
Heavily inspired by omus and #20621
Travis failure appears unrelated.
I have a Unicode test I want to add but currently that results in:
Error During Test
Test threw an exception of type Base.UVError
Expression: readuntil(io(t), s) == m
read: network is down (ENETDOWN)
Wat?
It looks like the I/O producers "File" and "PipeEndpoint" choke with Unicode input in test/read.jl
a55cbf7 to 4237a46
Looking over my original test I realized it potentially could be offensive. I've used a different example to avoid any potential issues.
Caches backtracking information as it is needed. Uses a SparseVector, which has a lower memory footprint than a Vector but is more performant than a Dict.
Skip testing the I/O producers "File" and "PipeEndpoint" when working with Unicode.
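The commit messages above don't show what the cached backtracking information looks like. As a rough sketch only, assuming it plays the role of the classic Knuth-Morris-Pratt failure table, the idea can be illustrated as follows. This version builds the whole table eagerly in a plain Vector rather than filling a SparseVector on demand as the commit describes, and `failure_table` and `readuntil_kmp` are hypothetical names, not functions from this PR. It assumes Julia 1.x.

```julia
# Failure table: fail[q] is the length of the longest proper prefix of t[1:q]
# that is also a suffix of t[1:q].
function failure_table(t::AbstractVector{UInt8})
    fail = zeros(Int, length(t))
    k = 0
    for q in 2:length(t)
        while k > 0 && t[k + 1] != t[q]
            k = fail[k]
        end
        if t[k + 1] == t[q]
            k += 1
        end
        fail[q] = k
    end
    return fail
end

# Hypothetical name: read bytes from `io` up to and including `term`,
# or to the end of the stream if `term` never appears.
function readuntil_kmp(io::IO, term::AbstractString)
    t = codeunits(term)
    isempty(t) && return ""
    fail = failure_table(t)
    out = UInt8[]
    k = 0  # number of terminator bytes currently matched
    while !eof(io)
        c = read(io, UInt8)
        push!(out, c)
        # On a mismatch, fall back to the longest prefix that still matches
        # instead of restarting from the first byte.
        while k > 0 && t[k + 1] != c
            k = fail[k]
        end
        if t[k + 1] == c
            k += 1
        end
        k == length(t) && break
    end
    return String(out)
end

readuntil_kmp(IOBuffer("aaab"), "aab")  # reads the full "aaab"
```

Because matching is done on bytes, a multi-byte UTF-8 terminator needs no special handling in this sketch, and each input byte is read exactly once regardless of how often the terminator's prefix repeats.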
4237a46 to e2b5902
makes it possible to use readuntil with any array (indexable) object and optimizes a few more cases
e2b5902 to 938793d
@omus PTAL at my latest updates to your code :)
Looks really solid. I'll try to dig up some of my old benchmarks and give this a spin.
To be slightly more fair to the current algorithm, I was using the following modified version:
function readuntil_old(s::IO, r::Vector{UInt8})
    l = sizeof(r)
    if l == 0
        return ""
    end
    out = Base.StringVector(0)
    m = Array{UInt8}(l) # last part of stream to match
    i = 0
    while !eof(s)
        i += 1
        c = read(s, UInt8)
        push!(out, c)
        if i <= l
            m[i] = c
        else
            # shift to last part of s
            for j = 2:l
                m[j-1] = m[j]
            end
            m[l] = c
        end
        if i >= l && m == r
            break
        end
    end
    return String(out)
end
@time let g2 = Vector{UInt8}(goal); for i in 1:100; readuntil_old(seekstart(io), g2); end; end
I also added the worst case to my tests, where the new algo in this PR really shines:
str = ("A" ^ 50000) * "B";
goal = ("A" ^ 5000) * "Z";
I think you need to remove the
The only benchmark where the old algorithm was slightly faster is in this case:
goal = "A" ^ 50000;
str = "A" ^ 50000;
I'll make a PR to BaseBenchmarks.
Nanosoldier is now ready to run the
@nanosoldier
@nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan
please squash when merging
The new changes look great!
Was looking into readuntil and found a way to improve performance. Below are some of the benchmarks I used for comparison:
I'll add these benchmarks to BaseBenchmarks.jl