Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix inconsistencies of findnext and findprev functions #40120

Open
wants to merge 30 commits into
base: master
Choose a base branch
from

Conversation

bicycle1885
Copy link
Member

@bicycle1885 bicycle1885 commented Mar 20, 2021

This pull request fixes several inconsistent behaviors . The proposed changes are breaking in some edge cases.

There are mainly three behavioral changes:

Fixes #39940, #36768, #40006, and #40244.

Empty strings

The first change will be best depicted by the following example.

master:

julia> [findnext("", "abc", i) for i in 1:4]  # This is fine.
4-element Vector{UnitRange{Int64}}:
 1:0
 2:1
 3:2
 4:3

julia> [findprev("", "abc", i) for i in 0:3]  # Why duplicated 1:0 ?
4-element Vector{UnitRange{Int64}}:
 1:0
 1:0
 2:1
 3:2

proposed:

julia> [findnext("", "abc", i) for i in 1:4]  # no changes
4-element Vector{UnitRange{Int64}}:
 1:0
 2:1
 3:2
 4:3

julia> [findprev("", "abc", i) for i in 0:3]  # now consistent
4-element Vector{UnitRange{Int64}}:
 1:0
 2:1
 3:2
 4:3

This also changes findlast("", b) if b is not an empty string.

julia> findlast("", "abc")  # master
3:2

julia> findlast("", "abc")  # proposed
4:3

Bounds check

The second change makes it possible to specify any position larger than the last index of the second string:

master:

julia> findnext("a", "abc", 4)

julia> findnext("a", "abc", 5)
ERROR: BoundsError: attempt to access 3-codeunit String at index [5]

julia> findnext("a", "abc", 6)
ERROR: BoundsError: attempt to access 3-codeunit String at index [6]

proposed:

julia> findnext("a", "abc", 4)

julia> findnext("a", "abc", 5)

julia> findnext("a", "abc", 6)

I'm not sure why the current behavior is so strict, but the proposed behavior is more consistent with the following generic method of findnext:

julia/base/array.jl

Lines 1855 to 1866 in 6913f9c

function findnext(testf::Function, A, start)
i = oftype(first(keys(A)), start)
l = last(keys(A))
i > l && return nothing
while true
testf(A[i]) && return i
i == l && break
# nextind(A, l) can throw/overflow
i = nextind(A, i)
end
return nothing
end

findprev is also changed in the symmetric way.

master:

julia> findprev("a", "abc", 0)

julia> findprev("a", "abc", -1)
ERROR: BoundsError: attempt to access 3-codeunit String at index [-1]

julia> findprev("a", "abc", -2)
ERROR: BoundsError: attempt to access 3-codeunit String at index [-2]

proposed:

julia> findprev("a", "abc", 0)

julia> findprev("a", "abc", -1)

julia> findprev("a", "abc", -2)

Index alignment

The third change allows starting a search in the middle of a character in multibyte strings.

master:

julia> findnext("β", "αβ", 2)
ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'α', [3]=>'β'

proposed:

julia> findnext("β", "αβ", 2)
3:3

This change simplify some code that uses findnext or findprev. For example, findall can be implemented in an obvious way:

function findall(a, b)
    result = UnitRange{Int}[]
    r = findfirst(a, b)
    while r !== nothing
        push!(result, r)
        r = findnext(a, b, first(r) + 1)  # no need to align the index with `nextind` 
    end
    return result
end

In my opinion, the index check is a kind of implementation detail and should be hidden from the user.

Performance

I also refactored the current code almost from scratch because it was difficult to fix inconsistencies without that. So, I quickly checked the performance degradation and discovered that the performance was slightly improved for string-string search.

This is a benchmark script.

using BenchmarkTools

b = "abracadabra"
for a in ['c', "cad"]
    @show a
    @btime findfirst($a, $b)
    @btime findlast($a, $b)
end
println()

b = "Julia is a high-level, high-performance dynamic language for technical computing.  The main homepage for Julia can be found at https://julialang.org/.  This is the GitHub repository of Julia source code, including instructions for compiling and installing Julia, below."
for a in ['H', "instructions"]
    @show a
    @btime findfirst($a, $b)
    @btime findlast($a, $b)
end

master:

a = 'c'
  9.456 ns (0 allocations: 0 bytes)
  10.872 ns (0 allocations: 0 bytes)
a = "cad"
  27.580 ns (2 allocations: 128 bytes)
  38.288 ns (2 allocations: 128 bytes)

a = 'H'
  11.613 ns (0 allocations: 0 bytes)
  10.871 ns (0 allocations: 0 bytes)
a = "instructions"
  89.698 ns (2 allocations: 128 bytes)
  55.495 ns (2 allocations: 128 bytes)

proposed:

a = 'c'
  7.572 ns (0 allocations: 0 bytes)
  10.169 ns (0 allocations: 0 bytes)
a = "cad"
  17.066 ns (0 allocations: 0 bytes)
  16.152 ns (0 allocations: 0 bytes)

a = 'H'
  9.457 ns (0 allocations: 0 bytes)
  10.489 ns (0 allocations: 0 bytes)
a = "instructions"
  74.781 ns (0 allocations: 0 bytes)
  28.898 ns (0 allocations: 0 bytes)

Breaking changes

I found several questionable behaviors while working on this pull request; some of them seem to be intentional because they are explicitly tested. One is about splitting a string with an empty string, which is already filed at #40117. Another is the following two test cases:

@test findprev(pattern, A, length(A)+1) == findlast(pattern, A)
@test findprev(pattern, A, length(A)+2) == findlast(pattern, A)

This suggests findprev should be more permissive in the right bound than findnext should be in the left bound (findnext("a", "abc", 0) throws BoundsError). This is also expected in other places such as

@test findprev("éé", "éé", lastindex("ééé")) == 1:3
. This seems intended but is not consistent. So, I've changed the behavior in the current proposal so that findprev("a", "abc", 4) (for example) throws an BoundsError exception, which is consistent with findnext and other methods of the findprev function. Technically speaking, this is a breaking change, and thus I'm not perfectly sure if this is acceptable.

@bicycle1885 bicycle1885 added search & find The find* family of functions strings "Strings!" labels Mar 20, 2021
Copy link
Member

@aviatesk aviatesk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some styling nits.

base/strings/search.jl Show resolved Hide resolved
base/strings/search.jl Outdated Show resolved Hide resolved
base/strings/search.jl Show resolved Hide resolved
@oscardssmith oscardssmith added the triage This should be discussed on a triage call label Mar 21, 2021
@oscardssmith
Copy link
Member

This looks great! Adding a triage label since it technically is breaking.

@oscardssmith
Copy link
Member

This might be kind of dumb, but could we do the string versions of this by reinterpreting to Array{UInt8} and using those methods? Since the indexing is the same, I think it might just work (and let us reduce a bunch of code).

@codecov-io
Copy link

Codecov Report

Merging #40120 (6913f9c) into master (561819b) will increase coverage by 0.05%.
The diff coverage is 95.36%.

❗ Current head 6913f9c differs from pull request most recent head 7f81540. Consider uploading reports for the commit 7f81540 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master   #40120      +/-   ##
==========================================
+ Coverage   87.33%   87.39%   +0.05%     
==========================================
  Files         390      390              
  Lines       75796    76077     +281     
==========================================
+ Hits        66197    66487     +290     
+ Misses       9599     9590       -9     
Impacted Files Coverage Δ
base/client.jl 62.59% <ø> (ø)
base/loading.jl 82.94% <ø> (-0.11%) ⬇️
stdlib/InteractiveUtils/src/editless.jl 10.58% <ø> (ø)
stdlib/Serialization/src/Serialization.jl 91.77% <75.00%> (-0.16%) ⬇️
base/compiler/ssair/passes.jl 89.65% <78.94%> (-0.20%) ⬇️
base/compiler/abstractinterpretation.jl 89.12% <79.41%> (+0.48%) ⬆️
base/compiler/inferencestate.jl 84.82% <88.88%> (+0.26%) ⬆️
stdlib/LinearAlgebra/src/dense.jl 99.42% <98.11%> (-0.18%) ⬇️
stdlib/LinearAlgebra/src/triangular.jl 96.83% <99.07%> (+0.30%) ⬆️
base/compiler/ssair/inlining.jl 89.52% <100.00%> (+0.01%) ⬆️
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update efad4e3...7f81540. Read the comment docs.

@bicycle1885
Copy link
Member Author

bicycle1885 commented Mar 21, 2021

could we do the string versions of this by reinterpreting to Array{UInt8} and using those methods?

In fact, the original code does that by wrapping a string s with unsafe_wrap(Vector{UInt8}, s). The problem is that it allocates an object for each string and the cost (~5 ns) is not negligible especially for short strings. I guess the performance improvement of this pull request is largely thanks to avoiding these object allocations. Also, considering that the amount of code has been reduced by the proposed change, the effect of code reduction seems limited.

@oscardssmith
Copy link
Member

triage says ARGHHHHHHH

@JeffBezanson
Copy link
Member

findprev("", "abc", 1) === 2:1 doesn't entirely make sense to me (and some others) --- how does going "previous" give a higher index?

@bicycle1885
Copy link
Member Author

bicycle1885 commented Mar 26, 2021

I think r = findprev(a, b, i) should return the rightmost range that satisfies a == b[r] && last(r) <= i (this constraint is documented). If we stick to this rule, findprev("", "abc", 1) === 2:1 is the only consistent behavior.

findprev(pattern::AbstractString, string::AbstractString, start::Integer)
Find the previous occurrence of `pattern` in `string` starting at position `start`.
The return value is a range of indices where the matching sequence is found, such that
`s[findprev(x, s, i)] == x`:
`findprev("substring", string, i)` == `start:stop` such that
`string[start:stop] == "substring"` and `stop <= i`, or `nothing` if unmatched.

That is, the third argument of findprev is the upper bound of the last index of a returned range, and the first index is not relevant. This is symmetric with the fact that the third argument of findnext is the lower bound of the first index of a returned range, and the last index is not relevant. Do you think findnext("", "abc", 1) === 1:0 is also strange?

@longemen3000
Copy link
Contributor

bump

@StefanKarpinski
Copy link
Member

Need to actually discuss this on a triage call.

@Seelengrab
Copy link
Contributor

The examples, fixes & consistency improvements look good, so I'd be in favor, but I haven't looked at the code in detail. Would need a rebase & some squashing though.

@oscardssmith do you happen to remember what there was to AAAAAAARGH about?

@oscardssmith
Copy link
Member

basically the problem is that there doesn't seem to be a good api that is fully consistent.

@brenhinkeller brenhinkeller removed the triage This should be discussed on a triage call label Nov 21, 2022
@brenhinkeller brenhinkeller added this to the 2.0 milestone Nov 21, 2022
@Moelf
Copy link
Contributor

Moelf commented May 25, 2023

I found several questionable behaviors while working on this pull request; some of them seem to be intentional because they are explicitly tested. ... Another is the following two test cases:

yeah, we added those tests here:

and I remember we added those tests because some code rely on that behavior and earlier version of that PR broke those code, thus we added a test.

@oscardssmith
Copy link
Member

findprev("", "abc", 1) === 2:1 doesn't entirely make sense to me (and some others) --- how does going "previous" give a higher index?

Triage was re-looking at this and the answer here is that for findfirst the start argument provides a lower bound on the first index, while symetrically, findprev should have the invariant that the start argument (possibly should be renamed) provides an upper bound on the last index.

@LilithHafner LilithHafner added the needs pkgeval Tests for all registered packages should be run with this change label May 25, 2023
@oscardssmith
Copy link
Member

after reviewing this with fresh eyes (and having found more bugs in the current implimentation) triage now believes that this is completely correct and should be merged (assuming pkgeval doesn't hate it).

One thing not made clear in the rationalle above is that we should have
master:

julia> findnext("a", "abc", -1)
ERROR: BoundsError: attempt to access 3-codeunit String at index [-1]
julia> findprev("a", "abc", 100)
ERROR: BoundsError: attempt to access 3-codeunit String at index [100]

proposed:

julia> findnext("a", "abc", -1)
1:1
julia> findprev("a", "abc", 100)
1:1

We also should make findnext and findprev for Vector have the same semantics as currently Vector{UInt8} behaves like String while all other Vectors don't let you go out of bounds at all.

@LilithHafner
Copy link
Member

IIUC, in all cases,
findprev(predicate, collection, start) should return the last valid index less than or equal to start for which the predicate returns true.
findnext(predicate, collection, start) should return the first valid index greater than or equal to start for which the predicate returns true.

@Moelf
Copy link
Contributor

Moelf commented May 26, 2023

We also should make findnext and findprev for Vector have the same semantics as currently Vector{UInt8} behaves like String while all other Vectors don't let you go out of bounds at all.

I wonder if people are really comfortable with Vector{UInt8} being different than other Vector{X}, maybe the solution is to have this behavior for Base.CodeUnits{UInt8, <:AbstractString}? (which means we will deprecate what we added in #37283)

currently, if user is handed a bytes = Vector{UInt8}, maybe from I/O, they just do codeunits(String(bytes)), which isn't too bad IMO. ❤️ to our String can hold malformed UTF

@oscardssmith
Copy link
Member

this proposal is to make the semantics of all vectors consistent with the semantics for strings.

@Moelf
Copy link
Contributor

Moelf commented May 26, 2023

but didn't you say

while all other Vectors don't let you go out of bounds at all.

are you saying:

current

julia> findprev(==(UInt8(1)), UInt8[1,2,3], 4)

julia> findprev(==(1), UInt8[1,2,3], 4)
ERROR: BoundsError: attempt to access 3-element Vector{UInt8} at index [4]
Stacktrace:
 [1] getindex
   @ ./essentials.jl:13 [inlined]
 [2] findprev(testf::Base.Fix2{typeof(==), Int64}, A::Vector{UInt8}, start::Int64)
   @ Base ./array.jl:2253
 [3] top-level scope
   @ REPL[23]:1

proposed?

julia> findprev(==(UInt8(1)), UInt8[1,2,3], 4)
1
julia> findprev(==(1), UInt8[1,2,3], 4)
1

@oscardssmith
Copy link
Member

yeah. sorry if that was unclear. I think @LilithHafner's description of triage intent is exactly correct. for all collection types, if there is a matching value before the index, findprev will return it, and will otherwise return nothing. (and likewise for findnext/after)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking This change will break code needs pkgeval Tests for all registered packages should be run with this change search & find The find* family of functions strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

findprev/findlast with empty search string