Improve SubString nextind/prevind #24255

Closed
wants to merge 6 commits into from

bkamins
Member

@bkamins bkamins commented Oct 21, 2017

This PR introduces the following changes:

  • improve performance of nextind and prevind with nchar for SubString{String} (now it does not use generic prevind/nextind but a specialized method for String)
  • make prevind and nextind work consistently with any type of SubString:
    • if the original string (string field) is String or DirectIndexString then prevind/nextind simply shifts the output by offset as in this case prevind can return negative values;
    • if the original string is not String then we have to handle the restriction that prevind in this case should not return negative values (i.e. the minimum value of prevind is 0);
  • fix a bug that allowed nextind to return a non-positive value, e.g. in nextind(SubString("1234234", 3), -6). The general contract is that nextind must always return a positive index, and this is now checked. A similar fix is implemented for prevind returning something beyond endof (e.g. prevind(SubString("12345678", 1, 1), 10)).

(the first change improves performance of new functionality in 0.7, the second and third change fix wrong behavior that was present in 0.6)

@nalimilan The only area I do not touch is the behavior of prevind and nextind for DirectIndexString, as I understand they are to be removed anyway (and they do not currently work consistently with the rest of the string functionality).

The test code for this is a bit complex - I can separate different cases (but the test code will be more verbose) if more clarity is required.

@nalimilan
Member

I don't understand the changes you made for the 2-argument methods. Didn't you completely remove the generic SubString fallback? That doesn't sound consistent with the explanation in your second bullet.

@bkamins
Member Author

bkamins commented Oct 21, 2017

I do not have to define nextind/prevind for a generic SubString{X} where X is something other than String, because this will be handled by nextind(<:AbstractString) and prevind(<:AbstractString), as SubString <: AbstractString.

Actually I had to do the opposite: in nextind(s::SubString{String}, i::Integer) and prevind(s::SubString{String}, i::Integer) I added {String} to make them fall back to the default nextind and prevind when SubString wraps something other than String.

@bkamins
Member Author

bkamins commented Oct 21, 2017

To see this you can look at dispatch of prevind(SubString("12345678", 1, 1), 10) for example before and after the change.

The problem is that using offset in strings/types.jl is a nice optimization, but it does not take into account the discontinuities of the prevind and nextind functions at the boundaries of the strings.
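The effect of that discontinuity can be sketched with plain functions. This is a minimal model, not the code from the PR: `old_prevind` imitates the 0.6-era String rule discussed in this thread, a substring is represented by a hypothetical (parent, offset, last index) triple instead of a real SubString, and `lastindex` is the current name for `endof`.

```julia
# Model of the 0.6-era String rule (an assumption for illustration;
# current Base.prevind throws BoundsError for most out-of-range i).
function old_prevind(s::String, i::Integer)
    e = lastindex(s)
    i > e && return e        # any i past the end jumps to the last index
    i <= 1 && return i - 1   # String returned i-1 here, e.g. prevind("1234567", 0) == -1
    return prevind(s, i)     # in range: defer to Base
end

# Old SubString approach: shift into the parent, step back, shift out again.
shifted_prevind(str, offset, i) = old_prevind(str, i + offset) - offset

# Approach of this PR: make the jump at the substring's own end first.
function bounded_prevind(str, offset, sub_last, i)
    i > sub_last && return sub_last
    return old_prevind(str, i + offset) - offset
end

# Models SubString("12345678", 1, 1): offset 0, last index 1.
shifted_prevind("12345678", 0, 10)      # 8, which is beyond the substring
bounded_prevind("12345678", 0, 1, 10)   # 1
```

The shifted version jumps at the parent's boundary (8), while the bounded version jumps at the substring's boundary (1).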

let strs = Any["∀α>β:α+1>β", GenericString("∀α>β:α+1>β")]
for i in 1:2
let str = "∀α>β:α+1>β", strs = Any[str, GenericString(str),
SubString(str, 1), SubString(str, 1, 4),
Member

Incorrect indentation.

Member Author

those multibyte characters - on my terminal it looked correct :(.

@test nextind(SubString("1234567", 3), 3, 2) == 5
@test nextind(SubString("1234567", 3), 4, 2) == 6

@test prevind(SubString(GenericString("1234567"), 1, 3), 10) == 3
Member

Can't these be merged with the previous tests using a loop?

Member Author

this is what I have done previously, but the discontinuity at 1 and endof(s) made it problematic (the test code would be very messy).

Member

I must be missing something. These lines are identical to those above except for the GenericString part, so what's the problem?

Member Author

I have put into the loop all similar cases and left the differences below.

@nalimilan
Member

OK. Can you add commit messages explaining changes like you did in the PR? It would be useful to mention what you just said. Better keep two separate commits as it makes changes clearer.

@bkamins
Member Author

bkamins commented Oct 21, 2017

Maybe I will add comments in the source code? The changes are complex and I could not think of good brief commit messages. OK?

function nextind(s::SubString{String}, i::Integer)
# handle the case when i+s.offset is greater than 1
Member

"smaller"?

Member Author

rewritten, hopefully it is clearer now

i < 1 && return 1
nextind(s.string, i+s.offset)-s.offset
end

function prevind(s::SubString{String}, i::Integer)
e = endof(s)
# handle the case when i+s.offset is greater than endof(s)
Member

Than endof(s.string)?

Member Author

rewritten, hopefully it is clearer now

i > e && return e
prevind(s.string, i+s.offset)-s.offset
end

function nextind(s::SubString{String}, i::Integer, nchar::Integer)
# nextind for nchar=1 must return nonnegative value
Member

I don't understand why this would apply only to "nchar=1".

Member Author

rewritten, hopefully it is clearer now

@nalimilan
Member

Commit messages don't need to be brief. I'd even advocate the contrary.

@bkamins
Member Author

bkamins commented Oct 21, 2017

OK - I have reorganized the commits.

# make sure that value not greater than endof(s) is returned
j = Int(i)
j > e && return e
# the transofrmation below is valid if j<=endof(s)+1
Member

"transformation". Same below.

Member Author

fixed

end
end

@test prevind(SubString("1234567", 1, 3), 0) == -1
Member

Why do we return -1 in that case?

Member Author

because prevind("1234567", 0) == -1. It is an inconsistency between prevind for String and for AbstractString.

@bkamins
Member Author

bkamins commented Oct 21, 2017

CI e87ec73 produces a strange StackOverflow due to recursion in tests (when serialize is called; I do not understand why it is called at all in this case).
So 25ac0a0 provides an additional function that I hope solves this problem (although it is unrelated to the core of this PR). In short, currently SubString can be nested, e.g.:

x = "123"
y = SubString(x, 1, 2)
z1 = SubString{typeof(y)}(y, 1, 2)
z2 = SubString(y, 1, 2)

produced:

julia> typeof(z1)
SubString{SubString{String}}

julia> typeof(z2)
SubString{String}

I assume that the nested form of z1 should not be allowed and that z2 is correct.

@@ -46,6 +46,8 @@ function SubString(s::SubString, i::Int, j::Int)
j <= endof(s) || throw(BoundsError(s, j))
SubString(s.string, s.offset + i, s.offset + j)
end
SubString{T}(s::T, i::Int, j::Int) where {T<:SubString} = SubString(s, i, j)
Member

Remove this method if it's not needed (since it didn't fix the problem).

Member Author

OK - I will move it to a separate PR

@ararslan ararslan added the strings "Strings!" label Oct 21, 2017
@nalimilan
Member

This looks good in terms of implementation, but can you tell me more about the logic behind the rules themselves? Is there a strong reason to fix incorrect indices? If the user passed an out of bounds index, we could decide that the only guarantee we make is that an out of bounds index will be returned, without specifying the exact rule? Overall I'd be tempted to choose the rule that is the simplest and the most efficient to implement: it would be too bad to lose a bit of performance because of our handling of invalid cases.

There's also the slightly weird discrepancy between String and GenericString. Could anything be done about it?

@bkamins
Member Author

bkamins commented Oct 22, 2017

The key element of the logic in this PR is the following:

Currently we allow out of range indices to nextind/prevind and for any s<:AbstractString the general rule was:

  • if i<start(s), then nextind(s, i) should return start(s);
  • if i>endof(s), then prevind(s, i) should return endof(s);

In both cases there is a jump to these two values no matter how low or high i respectively is.

With SubString the tricky part is that it has to make this jump at its own boundaries, not at the boundaries of the parent string. This PR makes SubString obey the rules given above.
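The two rules can be written down as a minimal model. The names `jump_nextind` and `jump_prevind` are hypothetical, `lastindex` is the current name for `endof`, and the branches marked as assumptions are filled in only to make the functions total. This illustrates the contract described above, not the current behavior of Base.

```julia
# Hypothetical model of the contract described above.
function jump_nextind(s::AbstractString, i::Integer)
    i < 1 && return 1                   # rule 1: any i before the start jumps to the first index
    i > ncodeunits(s) && return i + 1   # assumption: past the end, just step forward
    return nextind(s, i)                # in range: defer to Base
end

function jump_prevind(s::AbstractString, i::Integer)
    e = lastindex(s)
    i > e && return e                   # rule 2: any i past the end jumps to the last index
    i <= 1 && return 0                  # assumption: generic lower bound of 0
    return prevind(s, i)                # in range: defer to Base
end

s = "∀α>β"              # characters start at code units 1, 4, 6 and 7
jump_nextind(s, -100)   # 1, no matter how low i is
jump_nextind(s, 1)      # 4
jump_prevind(s, 100)    # 7, no matter how high i is
jump_prevind(s, 7)      # 6
```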

String and GenericString discrepancy is due to the fact that they return different values for out-of-bounds indices (so as you say - if we do not specify what should be returned - different types can return different things and tests have to take this into account).

Some time ago I tried to unify it but the only fix would be to make prevind return 0 always on index i<=1 and nextind return endof(s)+1 always on index i>=endof(s). However, this rule produced strange errors on CI (apparently some legacy code depended on other behavior of nextind and prevind).
I can go back to it if we find it valuable, but I guess this should be a different PR.

PS. I am still working on making tests pass because of String and GenericString discrepancy :(.

@bkamins
Member Author

bkamins commented Oct 22, 2017

I have rewritten the offending part of the tests to check the contract only (negative or greater than endof(s) value without testing for the exact number) - now these tests are the same for String and GenericString.

@nalimilan
Member

Thanks for the summary. You shouldn't be afraid of putting these details in commit messages, that can't hurt. It would be interesting indeed if you can come up with a PR unifying behaviors. The code which relies on these implementation details would better be fixed anyway.

@bkamins
Member Author

bkamins commented Oct 22, 2017

PR passing the tests finally.

The key difficulty was that it is natural for nextind(s, endof(s)) to return sizeof(s)+1, as this would be a next index if the string was larger. Unfortunately sizeof(s) does not have to be defined in general. But I am putting it on a list to think of.

@bkamins
Member Author

bkamins commented Oct 22, 2017

rebased

@nalimilan
Member

The key difficulty was that it is natural for nextind(s, endof(s)) to return sizeof(s)+1, as this would be a next index if the string was larger. Unfortunately sizeof(s) does not have to be defined in general. But I am putting it on a list to think of.

I wouldn't care too much about what is "natural". An out of bounds index is invalid, we don't really care whether it's the first index after the end of the string or not, anyway you can't use it for anything. We just have to ensure any invalid index gives the same result e.g. when passed to nextind/prevind as you do here.

@bkamins
Member Author

bkamins commented Nov 21, 2017

The CI failure seems to be unrelated. Should it be merged? (Part of this PR fixes a bug in nextind/prevind that is present in 0.6.)

@stevengj
Member

stevengj commented Nov 21, 2017

I wouldn't care too much about what is "natural". An out of bounds index is invalid, we don't really care whether it's the first index after the end of the string or not, anyway you can't use it for anything.

I don't know, there might well be code out there that relies on nextind(s, endof(s)) giving 1 + the number of code units (not necessarily sizeof(s)+1!). For example, the proposed ncodeunits function in #24613 relies on this.

Is that behavior changed by this PR?

@bkamins
Member Author

bkamins commented Nov 21, 2017

@stevengj The change in the area you are asking about is from:

nextind(s::SubString, i::Integer) = nextind(s.string, i+s.offset)-s.offset

to

function nextind(s::SubString{String}, i::Integer)
    # make sure that nonnegative value is returned
    j = Int(i)
    j < 1 && return 1
    # the transformation below is valid if j>=0
    nextind(s.string, j+s.offset)-s.offset
end

It is needed because the old code could produce negative values of nextind, e.g:

julia> s = SubString("1232342342342", 10, 9)
""

julia> nextind(s, -10)
-8

In the new code nextind(s, 0) always returns 1. In the old code we had:

julia> s = SubString("∀∀∀", 2, 1)
""

julia> nextind(s, endof(s))
3

So the question is (especially in relation to #24613; I do not know the details of that PR): what is the desired behavior of nextind(s, endof(s)) if s is an empty SubString created in such a special way as in the example above? Do we want to always return 1, or do we want to look at the underlying String and act accordingly?
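Both results quoted above can be reproduced with a small stand-alone model. This is a sketch under stated assumptions: `old_nextind` imitates the 0.6-era String rule (any i below 1 jumps to 1), a substring is represented only by its parent and offset, and the offsets 9 and 1 are assumed to be what SubString stored for the two examples.

```julia
# Model of the 0.6-era String rule (an assumption for illustration;
# current Base.nextind throws BoundsError for most out-of-range i).
function old_nextind(s::String, i::Integer)
    i < 1 && return 1                   # any i before the start jumps to 1
    i > ncodeunits(s) && return i + 1   # assumption: past the end, just step forward
    return nextind(s, i)                # in range: defer to Base
end

# Old SubString method: shift into the parent, step, shift back.
shifted_nextind(str, offset, i) = old_nextind(str, i + offset) - offset

# Method from this PR: make the jump at the substring's own start first.
function bounded_nextind(str, offset, i)
    i < 1 && return 1
    return old_nextind(str, i + offset) - offset
end

# Models SubString("1232342342342", 10, 9): empty, assumed offset 9.
shifted_nextind("1232342342342", 9, -10)   # -8
bounded_nextind("1232342342342", 9, -10)   # 1

# Models SubString("∀∀∀", 2, 1): empty, assumed offset 1, so endof(s) == 0.
shifted_nextind("∀∀∀", 1, 0)   # 3
bounded_nextind("∀∀∀", 1, 0)   # 1
```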

@stevengj
Member

stevengj commented Nov 21, 2017

Returning 1 seems like the desired behavior to me, thanks. (I think in general we want nextind(s, endof(s)) to return 1 plus the number of code units in s. If s is an empty substring, then the number of code units is zero.)

@StefanKarpinski
Member

In general – and I think we should spell this out explicitly – it seems like strings should behave as if there is a 1-code-unit character before the string at index start(s) - 1 and an infinite-code-unit character after the end of the string at index ncodeunits(s) + 1. Retrieving characters at these positions is invalid, but they should otherwise be treated as the starts of characters for index arithmetic. We've generally assumed that start(s) == 1 so maybe we should just make that official; we should also make the code unit model official and spell out which functions (like reverseind) assume that if you reorder code points the code units associated with a particular code point remain the same.

@nalimilan
Member

@StefanKarpinski And how about merging? :-)

@StefanKarpinski
Member

👍 Needs a conflict resolution that wasn't immediately obvious to me.

@bkamins
Member Author

bkamins commented Nov 22, 2017

@StefanKarpinski regarding codeunit arithmetics. +1 for the proposal to set start(s)=1 as people assume it in the code and it is better to fix it.

Regarding:

start(s) - 1 and an infinite-code-unit character after the end of the string at index ncodeunits(s) + 1. Retrieving characters at these positions is invalid, but they should otherwise be treated as the starts of characters for index arithmetic.

How should prevind(s, -2) behave? Currently, if s::String, it will return -3. Should it return 0?
Similarly with e.g. nextind(s, ncodeunits(s)+10): now it will return ncodeunits(s)+11 for s::String. Should it always return ncodeunits(s)+1?

@bkamins
Member Author

bkamins commented Nov 22, 2017

@StefanKarpinski conflict resolved (nothing required changing - git got confused by merge of thisind)

@StefanKarpinski
Member

Another model we could use here is that there are an infinite number of single-code-unit characters before and after the end of the string. That way prevind^k(s, nextind^k(s, i)) will always give back i. But I'm not entirely sure what the use case for k > 1 is in this kind of arithmetic.

@StefanKarpinski
Member

StefanKarpinski commented Nov 22, 2017

I guess another benefit of the infinite 1-code-unit chars model is that prevind(s, i) ≤ i and nextind(s, i) ≥ i are both always true, whereas, (as you point out) one could get violations of that when start(s) ≤ i ≤ ncodeunits(s) is not true in the other model.
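A minimal sketch of the infinite 1-code-unit characters model follows. The names `padded_nextind`, `padded_prevind` and `apply_k` are hypothetical, and the round trip is only checked for indices that are character starts in this model (starts of real characters, plus every integer outside 1:ncodeunits(s)).

```julia
# Hypothetical model: imaginary single-code-unit characters on both sides of s.
function padded_nextind(s::AbstractString, i::Integer)
    (i < 1 || i > ncodeunits(s)) && return i + 1       # outside the string: step by one
    return nextind(s, i)                               # inside: defer to Base
end

function padded_prevind(s::AbstractString, i::Integer)
    (i <= 1 || i > ncodeunits(s) + 1) && return i - 1  # outside the string: step by one
    return prevind(s, i)                               # inside: defer to Base
end

# Apply f(s, .) to i repeatedly, k times.
function apply_k(f, s, i, k)
    for _ in 1:k
        i = f(s, i)
    end
    return i
end

s = "∀α"   # code units 1:5, characters start at 1 and 4
apply_k(padded_nextind, s, -2, 4)   # 4
apply_k(padded_prevind, s, 4, 4)    # -2
```

In this model both inequalities are strict for every integer i, since each step moves by at least one code unit.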

@bkamins
Member Author

bkamins commented Nov 22, 2017

@StefanKarpinski

1-code-unit chars

I am OK also with such an approach, but it also requires fixing prevind (this time for AbstractString as now it always returns 0 if I remember correctly).

In general now prevind and nextind are not 100% consistent between String and AbstractString - any arithmetic we choose should then be propagated accordingly.

@StefanKarpinski
Member

I plan on taking another pass through all of this stuff to make it more consistent and try to document and test things more thoroughly. Long overdue but now happening in the eleventh hour :)

@bkamins
Member Author

bkamins commented Nov 23, 2017

CI failure seems unrelated.

@StefanKarpinski
Member

StefanKarpinski commented Nov 24, 2017

Having thought about this for a while, I think the most conservative approach is probably best for now: allow returning and accepting start(s) - 1 and ncodeunits(s) + 1 as string indices for arithmetic, but throw an error if any other out-of-bounds index is passed to functions accepting indices into s. Since this is compatible with both of the above views, we can change to either model in the future without breaking code if we decide it's desirable.
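One possible reading of this conservative rule, sketched with hypothetical names `strict_nextind` and `strict_prevind` and assuming start(s) == 1: nextind accepts 0 through ncodeunits(s), prevind accepts 1 through ncodeunits(s) + 1, and anything else throws.

```julia
# Hypothetical model of the conservative rule, assuming start(s) == 1.
function strict_nextind(s::AbstractString, i::Integer)
    0 <= i <= ncodeunits(s) || throw(BoundsError(s, i))
    return i == 0 ? 1 : nextind(s, i)   # from start(s) - 1, step to the first index
end

function strict_prevind(s::AbstractString, i::Integer)
    1 <= i <= ncodeunits(s) + 1 || throw(BoundsError(s, i))
    return i == 1 ? 0 : prevind(s, i)   # from the first index, step to start(s) - 1
end

strict_nextind("abc", 0)   # 1
strict_nextind("abc", 3)   # 4, i.e. ncodeunits(s) + 1
strict_prevind("abc", 4)   # 3
strict_prevind("abc", 1)   # 0, i.e. start(s) - 1
# strict_nextind("abc", 4) and strict_prevind("abc", 0) throw BoundsError
```

Because only the two sentinel indices are ever accepted or returned, code written against this rule would keep working under either of the two models discussed earlier.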

@bkamins
Member Author

bkamins commented Dec 24, 2017

This PR is outdated and should be closed, but I am not sure if current implementation is what was intended (I was unable to follow the whole discussion during String redesign):

@StefanKarpinski What I find inconsistent is the following behavior of nextind/prevind:

julia> x = "1∀∀∀∀∀"
"1∀∀∀∀∀"

julia> y = SubString(x, 1, 1)
"1"

julia> z = String(y)
"1"

julia> nextind(y, 1, 3)
8

julia> nextind(z, 1, 3)
4

I guess it was not intended (but maybe it is). I would assume that y should behave like z disregarding the underlying string x. This would be consistent with docstring:

If i is out of bounds in s return i + 1.

Also, at least for my version of Julia:

julia> versioninfo()
Julia Version 0.7.0-DEV.3078
Commit 8b01dfa* (2017-12-18 16:49 UTC)

The docstring for nextind/prevind seems incorrect, e.g. the last example:

 julia> nextind(str, 9)
  10

throws BoundsError.

nextind(s::SubString, i::Integer) = nextind(s.string, i+s.offset)-s.offset
prevind(s::SubString, i::Integer) = prevind(s.string, i+s.offset)-s.offset
# need to define nextind and prevind only for SubString{String}
# as other cases are handled by definitions for AbstractString
Member

@stevengj stevengj Dec 29, 2017

These generic definitions for SubString{T} should be more efficient than the generic nextind for AbstractString if there is an efficient nextind for T. So, better to leave them in?

Member

Oh never mind, I guess we already discussed that above?:

Member Author

The PR is outdated (it was prepared before "string overhaul"), but the problem it solved is still present in the new code.

Therefore my thinking is to first decide on the desired behavior (I guess @StefanKarpinski will have an opinion here).

Next there is the issue of maximum efficiency of the implementation. My experience is that at least SubString{String} should have a specialized method, as this will be the most common case in practice and there is probably room for efficiency improvement by using custom code.

But first we should have a defined contract we want the implementation to follow.

@bkamins
Member Author

bkamins commented Feb 4, 2018

Closed by #25531.

@bkamins bkamins closed this Feb 4, 2018