Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use codepoint index for indices/1, index/1 and rindex/1 #3065

Merged
merged 1 commit into from
Nov 17, 2024

Conversation

wader
Copy link
Member

@wader wader commented Mar 12, 2024

Previsouly byte index was used.

Fixes #1430, fixes #1624, fixes #3064.

while ((p = _jq_memmem(p, (jstr + jlen) - p, idxstr, idxlen)) != NULL) {
a = jv_array_append(a, jv_number(p - jstr));
while (lp < p) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make this even more efficient i guess we would need to count codepoints inside memmem somehow

@wader
Copy link
Member Author

wader commented Mar 12, 2024

Haven't entirely convinced myself yet that it should be fine to look for matches using the byte representation. Assuming both the needle and haystack is valid utf-8 i'm thinking it should be fine because of utf-8's self-synchronization property.

Update: now looking at jv_string_slice

jq/src/jv.c

Line 1374 in c95b34f

jv jv_string_slice(jv j, int start, int end) {
i'm not sure anymore if one can assume strings are valid utf-8 or is the invalid utf-8 checks not really needed?

@itchyny itchyny changed the title Use codepoint index for indices/1, index/ 1 and rindex/1 Use codepoint index for indices/1, index/1 and rindex/1 Aug 20, 2024
@itchyny
Copy link
Contributor

itchyny commented Aug 20, 2024

I'd like to include this. Any objection on changing the behavior in 1.8?

@wader
Copy link
Member Author

wader commented Aug 24, 2024

Ok to merge for me but would be great if someone could have a look or know if my assumption about strings always being valid utf-8 is true.

@pkoppstein
Copy link
Contributor

pkoppstein commented Aug 24, 2024

@itchyny asked:

Any objection on changing the behavior in 1.8

This is a major breaking change and it has been my understanding for some years that such changes would have to wait until jq 2.0. Certainly if we were following a strict SemVer policy that would be the case. Since we don't seem to be doing so, the situation is not black-and-white, but if the change is incorporated into 1.8, we should be sure to highlight it.

@wader wrote:

if my assumption about strings always being valid utf-8 is true.

Based on past experience, such an assumption would not be warranted, so the question is: could the proposed changes make anything worse? I suppose the major issue would be whether (in the presence of invalid utf-8) the old index would give an accurate byte count but the new version might give an inaccurate codepoint count.

Perhaps a starting point would be "a\uDD1Ec":

echo '"a\uDD1Ec"' | jaq -r .
Error: failed to parse: invalid character with index 56606

echo '"a\uDD1Ec"' | jq -c '[index("c"), length]'
[4,3]

@wader
Copy link
Member Author

wader commented Aug 24, 2024

@itchyny asked:

Any objection on changing the behavior in 1.8

This is a major breaking change and it has been my understanding for some years that such changes would have to wait until jq 2.0. Certainly if we were following a strict SemVer policy that would be the case. Since we don't seem to be doing so, the situation is not black-and-white, but if the change is incorporated into 1.8, we should be sure to highlight it.

I can't see how the current behaviour for non-ASCII strings makes any sense or could even be useful in any resonable way? so for me it feels more like a bug.

@wader wrote:

if my assumption about strings always being valid utf-8 is true.

Based on past experience, such an assumption would not be warranted, so the question is: could the proposed changes make anything worse? I suppose the major issue would be whether (in the presence of invalid utf-8) the old index would give an accurate byte count but the new version might give an inaccurate codepoint count.

Perhaps a starting point would be "a\uDD1Ec":

echo '"a\uDD1Ec"' | jaq -r .
Error: failed to parse: invalid character with index 56606

echo '"a\uDD1Ec"' | jq -c '[index("c"), length]'
[4,3]

This is an incomplete surrogates pair? yeap stuff like this i'm concerned about also.

@wader
Copy link
Member Author

wader commented Aug 24, 2024

With this change:

$ echo '"a\uDD1Ec"' | ./jq -c '[index("c"), length]'
[2,3]

Seems correct assuming broken surrogates codepoints should be allowed. But I think i'm mostly concern if there is any way to produce jq strings that has a byte buffer that is not valid utf-8. If so use of jvp_utf8_decode_length might end up out-of-sync codepoint-wise or pointing outside the byte buffer.

@wader
Copy link
Member Author

wader commented Nov 17, 2024

As @nicowilliams also expressed this is a bug #1430 (comment) ill merge this now

@wader wader merged commit 8619f8a into jqlang:master Nov 17, 2024
28 checks passed
@wader wader deleted the indices-codepoints branch November 17, 2024 09:22
@thaliaarchi
Copy link
Contributor

Should the docs be updated to make it clear that it's a codepoint index, instead of byte index?

@wader
Copy link
Member Author

wader commented Nov 19, 2024

Good point, i did quick skimming of the docs and it seems like we don't say it's byte offset anywhere but we don't make it that clear that it's codepoints either. So maybe mention for the index-functions and possible also under "Array/String Slice" and/or "Types and Values"? for the regexp functions we do mention things are in codepoints.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants