-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use codepoint index for indices/1, index/1 and rindex/1 #3065
Conversation
Previsouly byte index was used. Fixes jqlang#1430 jqlang#1624 jqlang#3064
while ((p = _jq_memmem(p, (jstr + jlen) - p, idxstr, idxlen)) != NULL) { | ||
a = jv_array_append(a, jv_number(p - jstr)); | ||
while (lp < p) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make this even more efficient i guess we would need to count codepoints inside memmem somehow
Haven't entirely convinced myself yet that it should be fine to look for matches using the byte representation. Assuming both the needle and haystack is valid utf-8 i'm thinking it should be fine because of utf-8's self-synchronization property. Update: now looking at Line 1374 in c95b34f
|
I'd like to include this. Any objection on changing the behavior in 1.8? |
Ok to merge for me but would be great if someone could have a look or know if my assumption about strings always being valid utf-8 is true. |
@itchyny asked:
This is a major breaking change and it has been my understanding for some years that such changes would have to wait until jq 2.0. Certainly if we were following a strict SemVer policy that would be the case. Since we don't seem to be doing so, the situation is not black-and-white, but if the change is incorporated into 1.8, we should be sure to highlight it. @wader wrote:
Based on past experience, such an assumption would not be warranted, so the question is: could the proposed changes make anything worse? I suppose the major issue would be whether (in the presence of invalid utf-8) the old index would give an accurate byte count but the new version might give an inaccurate codepoint count. Perhaps a starting point would be
|
I can't see how the current behaviour for non-ASCII strings makes any sense or could even be useful in any resonable way? so for me it feels more like a bug.
This is an incomplete surrogates pair? yeap stuff like this i'm concerned about also. |
With this change: $ echo '"a\uDD1Ec"' | ./jq -c '[index("c"), length]'
[2,3] Seems correct assuming broken surrogates codepoints should be allowed. But I think i'm mostly concern if there is any way to produce jq strings that has a byte buffer that is not valid utf-8. If so use of |
As @nicowilliams also expressed this is a bug #1430 (comment) ill merge this now |
Should the docs be updated to make it clear that it's a codepoint index, instead of byte index? |
Good point, i did quick skimming of the docs and it seems like we don't say it's byte offset anywhere but we don't make it that clear that it's codepoints either. So maybe mention for the index-functions and possible also under "Array/String Slice" and/or "Types and Values"? for the regexp functions we do mention things are in codepoints. |
Previsouly byte index was used.
Fixes #1430, fixes #1624, fixes #3064.