Use codepoint index for indices/1, index/1 and rindex/1 #3065

wader · 2024-03-12T09:01:00Z

Previously byte index was used.

Fixes #1430, fixes #1624, fixes #3064.


wader · 2024-03-12T09:02:58Z

src/jv.c

    while ((p = _jq_memmem(p, (jstr + jlen) - p, idxstr, idxlen)) != NULL) {
-      a = jv_array_append(a, jv_number(p - jstr));
+      while (lp < p) {


To make this even more efficient, I guess we would need to count codepoints inside memmem somehow.
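The incremental counting that the `while (lp < p)` loop in the diff hints at can be sketched as follows. This is a minimal sketch, not the code from the PR: `count_codepoints`, `find_bytes` and `codepoint_indices` are hypothetical names, `find_bytes` is a naive stand-in for a memmem-style search, and both strings are assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts codepoints in [from, to) by counting every byte
 * that is not a UTF-8 continuation byte (10xxxxxx). Assumes valid UTF-8 and
 * that `from` sits on a codepoint boundary. */
static size_t count_codepoints(const char *from, const char *to) {
  size_t n = 0;
  for (; from < to; from++)
    if (((unsigned char)*from & 0xC0) != 0x80)
      n++;
  return n;
}

/* Hypothetical, naive stand-in for a memmem-style byte search. */
static const char *find_bytes(const char *hay, size_t haylen,
                              const char *needle, size_t nlen) {
  if (nlen == 0 || nlen > haylen)
    return NULL;
  for (size_t i = 0; i + nlen <= haylen; i++)
    if (memcmp(hay + i, needle, nlen) == 0)
      return hay + i;
  return NULL;
}

/* Writes up to `max` codepoint indices of `needle` in `hay` into `out` and
 * returns how many were written. Codepoints are only counted between the
 * previous match (lp) and the current one (p), so the haystack is scanned
 * once for counting no matter how many matches there are. */
static size_t codepoint_indices(const char *hay, const char *needle,
                                size_t *out, size_t max) {
  size_t haylen = strlen(hay), nlen = strlen(needle);
  const char *end = hay + haylen;
  const char *p = hay;  /* search cursor (byte position) */
  const char *lp = hay; /* last byte position already counted */
  size_t cp = 0, found = 0;
  while (found < max &&
         (p = find_bytes(p, (size_t)(end - p), needle, nlen)) != NULL) {
    cp += count_codepoints(lp, p);
    lp = p;
    out[found++] = cp;
    p++; /* step one byte so overlapping matches are also found */
  }
  return found;
}
```

For the haystack "aécéc" (with "é" encoded as the two bytes C3 A9), searching for "c" would give byte offsets 3 and 6 but codepoint indices 2 and 4. Counting inside the search itself, as suggested above, would avoid the second pass over the bytes between matches.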

wader · 2024-03-12T09:12:56Z

I haven't entirely convinced myself yet that it should be fine to look for matches using the byte representation. Assuming both the needle and the haystack are valid utf-8, I think it should be fine because of utf-8's self-synchronization property.
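A small sketch of why the self-synchronization argument depends on the needle being valid: in UTF-8, continuation bytes (10xxxxxx) never look like lead bytes, so a needle that starts with a lead or ASCII byte can only match at a codepoint boundary, while a needle starting with a stray continuation byte can match mid-codepoint. The function names here are hypothetical, and `strstr` is used only because the example strings are NUL-terminated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A byte is a UTF-8 continuation byte iff its top two bits are 10. */
static int is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

/* Returns 1 if every byte-level occurrence of `needle` in `hay` starts on a
 * codepoint boundary (i.e. not on a continuation byte), 0 otherwise.
 * Hypothetical helper for illustration; an empty needle trivially passes. */
static int all_matches_on_boundaries(const char *hay, const char *needle) {
  if (strlen(needle) == 0)
    return 1;
  for (const char *p = hay; (p = strstr(p, needle)) != NULL; p++)
    if (is_continuation((unsigned char)*p))
      return 0;
  return 1;
}
```

With the haystack "aéc" (bytes 61 C3 A9 63), the valid needles "é" and "c" only match on boundaries, whereas the invalid one-byte needle A9 matches in the middle of "é". That second case is the kind of input where a byte offset could no longer be converted into a meaningful codepoint index.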

Update: now looking at jv_string_slice

jq/src/jv.c

Line 1374 in c95b34f

jv jv_string_slice(jv j, int start, int end) {

I'm not sure anymore whether one can assume strings are valid utf-8, or whether the invalid utf-8 checks are not really needed?

itchyny · 2024-08-20T12:33:32Z

I'd like to include this. Any objection on changing the behavior in 1.8?

wader · 2024-08-24T14:56:12Z

OK to merge for me, but it would be great if someone could have a look, or knows if my assumption about strings always being valid utf-8 is true.

pkoppstein · 2024-08-24T17:21:21Z

@itchyny asked:

> Any objection on changing the behavior in 1.8

This is a major breaking change and it has been my understanding for some years that such changes would have to wait until jq 2.0. Certainly if we were following a strict SemVer policy that would be the case. Since we don't seem to be doing so, the situation is not black-and-white, but if the change is incorporated into 1.8, we should be sure to highlight it.

@wader wrote:

> if my assumption about strings always being valid utf-8 is true.

Based on past experience, such an assumption would not be warranted, so the question is: could the proposed changes make anything worse? I suppose the major issue would be whether (in the presence of invalid utf-8) the old index would give an accurate byte count but the new version might give an inaccurate codepoint count.

Perhaps a starting point would be "a\uDD1Ec":

    echo '"a\uDD1Ec"' | jaq -r .
    Error: failed to parse: invalid character with index 56606

    echo '"a\uDD1Ec"' | jq -c '[index("c"), length]'
    [4,3]

wader · 2024-08-24T17:48:29Z

> @itchyny asked:
>
> > Any objection on changing the behavior in 1.8
>
> This is a major breaking change and it has been my understanding for some years that such changes would have to wait until jq 2.0. Certainly if we were following a strict SemVer policy that would be the case. Since we don't seem to be doing so, the situation is not black-and-white, but if the change is incorporated into 1.8, we should be sure to highlight it.

I can't see how the current behaviour for non-ASCII strings makes any sense or could even be useful in any reasonable way, so to me it feels more like a bug.

> @wader wrote:
>
> > if my assumption about strings always being valid utf-8 is true.
>
> Based on past experience, such an assumption would not be warranted, so the question is: could the proposed changes make anything worse? I suppose the major issue would be whether (in the presence of invalid utf-8) the old index would give an accurate byte count but the new version might give an inaccurate codepoint count.
>
> Perhaps a starting point would be "a\uDD1Ec":
>
>     echo '"a\uDD1Ec"' | jaq -r .
>     Error: failed to parse: invalid character with index 56606
>
>     echo '"a\uDD1Ec"' | jq -c '[index("c"), length]'
>     [4,3]

Is this an incomplete surrogate pair? Yep, stuff like this is what I'm concerned about too.

wader · 2024-08-24T20:29:12Z

With this change:

    $ echo '"a\uDD1Ec"' | ./jq -c '[index("c"), length]'
    [2,3]

This seems correct, assuming codepoints from broken surrogates should be allowed. But I think I'm mostly concerned about whether there is any way to produce jq strings whose byte buffer is not valid utf-8. If so, use of jvp_utf8_decode_length might end up out of sync codepoint-wise or point outside the byte buffer.
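To make the out-of-bounds concern concrete, here is a minimal sketch of a byte-offset to codepoint-index conversion that steps by lead-byte length but clamps each step to the buffer. It is not jq's implementation: `seq_len` is a hypothetical stand-in for a decode-length helper, and `codepoint_index_clamped` is an illustrative name. The example bytes assume the lone surrogate ends up stored as U+FFFD (EF BF BD), which would be consistent with the byte index 4 reported before the change.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: sequence length implied by a lead byte. Anything that is not
 * a valid lead byte (continuation or invalid byte) counts as length 1 so the
 * caller always makes progress. */
static size_t seq_len(unsigned char b) {
  if (b < 0x80) return 1;
  if ((b & 0xE0) == 0xC0) return 2;
  if ((b & 0xF0) == 0xE0) return 3;
  if ((b & 0xF8) == 0xF0) return 4;
  return 1;
}

/* Converts byte offset `upto` into a codepoint index for buf[0..len).
 * Each step is clamped so a truncated sequence at the end of the buffer
 * cannot move the cursor past `len`. */
static size_t codepoint_index_clamped(const char *buf, size_t len,
                                      size_t upto) {
  size_t i = 0, n = 0;
  if (upto > len)
    upto = len;
  while (i < upto) {
    size_t step = seq_len((unsigned char)buf[i]);
    if (step > len - i)
      step = len - i; /* truncated sequence: stop at the buffer end */
    i += step;
    n++;
  }
  return n;
}
```

For the bytes 61 EF BF BD 63, byte offset 4 maps to codepoint index 2 and the full length maps to 3, matching the `[2,3]` output above. For a buffer that ends in a truncated sequence such as 61 E2 82, the clamp keeps the cursor inside the buffer, though the resulting count is only a best effort for invalid input.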

wader · 2024-11-17T09:22:02Z

As @nicowilliams also expressed in #1430 (comment), this is a bug, so I'll merge this now.

thaliaarchi · 2024-11-19T08:22:55Z

Should the docs be updated to make it clear that it's a codepoint index, instead of byte index?

wader · 2024-11-19T09:11:29Z

Good point. I did a quick skim of the docs and it seems we don't say it's a byte offset anywhere, but we don't make it clear that it's codepoints either. So maybe mention it for the index functions and possibly also under "Array/String Slice" and/or "Types and Values"? For the regexp functions we do mention that things are in codepoints.

Use codepoint index for indices/1, index/ 1 and rindex/1

ca38058

Previously byte index was used. Fixes jqlang#1430 jqlang#1624 jqlang#3064


itchyny added this to the 1.8 release milestone Mar 13, 2024

emanuele6 added feature request libjq labels Apr 29, 2024

wader requested a review from nicowilliams May 14, 2024 21:58

emanuele6 approved these changes Jul 12, 2024


itchyny changed the title ~~Use codepoint index for indices/1, index/ 1 and rindex/1~~ Use codepoint index for indices/1, index/1 and rindex/1 Aug 20, 2024

itchyny approved these changes Nov 17, 2024


wader merged commit 8619f8a into jqlang:master Nov 17, 2024
28 checks passed

wader deleted the indices-codepoints branch November 17, 2024 09:22
