position_from_utf16 in workspace.py may return incorrect result. #302

pyscripter · 2022-12-08T01:35:47Z

The following code, which mimics the way position_from_utf16 calculates its result, demonstrates the issue:

def utf16_unit_offset(chars: str):
    """Calculate the number of characters which need two utf-16 code units.
    Arguments:
        chars (str): The string to count occurrences of utf-16 code units for.
    """
    return sum(ord(ch) > 0xFFFF for ch in chars)

s = '😋😋'

def position_from_utf16(line, pos):
    return pos - utf16_unit_offset(line[:pos])

print(position_from_utf16(s, 2))

The result is zero, but it should be one.
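To see why one is the expected answer, it may help to list where each Python character starts when the string is measured in UTF-16 code units. This is a minimal sketch; the helper name utf16_starts is made up for illustration and is not part of workspace.py.

```python
def utf16_starts(line: str) -> list[int]:
    """Return the UTF-16 code unit offset at which each character starts.

    Hypothetical helper for illustration only.
    """
    starts = []
    units = 0
    for ch in line:
        starts.append(units)
        # Characters above U+FFFF are encoded as a surrogate pair (two units).
        units += 2 if ord(ch) > 0xFFFF else 1
    return starts


s = '😋😋'
print(len(s))                           # 2 Python characters
print(len(s.encode('utf-16-le')) // 2)  # 4 UTF-16 code units
print(utf16_starts(s))                  # [0, 2]
```

UTF-16 offset 2 is where the second character starts, so it corresponds to Python index 1. The snippet in this comment returns 0 because line[:pos] slices by Python index, so both emoji end up counted in the offset.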

Although this may not have a large real-world impact (you don't get many high-plane characters in Python code), it would be nice to have a correct calculation.

pyscripter · 2022-12-09T09:56:14Z

This is valid Python code (!) that may trigger the above failure:

def 㒨㒨虁虁𤯒𤯒():
    pass

if __name__ == '__main__':
    㒨㒨虁虁𤯒𤯒()
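Which characters in that identifier need two UTF-16 code units can be checked directly. The following is only an inspection sketch; any character whose code point is above U+FFFF makes the Python length and the UTF-16 length of the line differ.

```python
line = 'def 㒨㒨虁虁𤯒𤯒():'

for index, ch in enumerate(line):
    # Code points above U+FFFF need a surrogate pair in UTF-16.
    units = 2 if ord(ch) > 0xFFFF else 1
    print(index, f'U+{ord(ch):04X}', units)

python_length = len(line)
utf16_length = len(line.encode('utf-16-le')) // 2
print(python_length, utf16_length)
```

Whenever the two lengths differ, a UTF-16 column sent by a client for a position after such a character is larger than the matching Python index.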

alcarney · 2022-12-13T20:23:29Z

Unfortunately I don't know this part of the LSP spec well enough to understand why the correct result is one!

print(position_from_utf16(s, 2))

I assume this is simulating a client sending a position that is after the first emoji? Something like

😋|😋

where | represents the cursor. And that the correct result is 1 since, in Python strings, this is the index after the first character?

If so... would the fix be to do something like

def position_from_utf16(line, pos):
    idx = max(0, pos-1)
    return pos - utf16_unit_offset(line[:idx])

so that the second emoji is not passed to the utf16_unit_offset function?

pyscripter · 2022-12-13T21:29:06Z

> I assume this is simulating a client sending a position that is after the first emoji? Something like

Yes

> If so... would the fix be to do something like

No, that would not work.
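One input that appears to break the pos-1 variant is a line with three such characters: the slice line[:idx] is still taken in Python indices, so it can still reach past the intended position. A quick check, with the function renamed here purely for illustration:

```python
def utf16_unit_offset(chars: str) -> int:
    """Count the characters that need two UTF-16 code units."""
    return sum(ord(ch) > 0xFFFF for ch in chars)


def position_from_utf16_minus_one(line: str, pos: int) -> int:
    # The suggested variant; only the name is changed for this sketch.
    idx = max(0, pos - 1)
    return pos - utf16_unit_offset(line[:idx])


print(position_from_utf16_minus_one('😋😋', 2))    # 1, as expected
print(position_from_utf16_minus_one('😋😋😋', 4))  # 1, but the expected index is 2
```

In the second call, UTF-16 offset 4 sits after two emoji, yet line[:3] covers all three of them, so one unit too many is subtracted.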

Something like:

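def position_from_utf16(line, pos):
    length = len(line)
    i = p = 0  # i: Python index, p: UTF-16 code units consumed so far
    while (i < length) and (p < pos):
        # Characters above U+FFFF take two UTF-16 code units.
        p += 2 if (ord(line[i]) > 0xFFFF) else 1
        i += 1
    return i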

should work.
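One way to gain some confidence in the loop-based version is to compare it against offsets computed by actually encoding each prefix as UTF-16. This is a sketch under that assumption; utf16_len is a hypothetical helper, and the sample lines are arbitrary.

```python
def position_from_utf16(line: str, pos: int) -> int:
    """Loop-based conversion from a UTF-16 offset to a Python index."""
    index = units = 0
    while index < len(line) and units < pos:
        # Characters above U+FFFF take two UTF-16 code units.
        units += 2 if ord(line[index]) > 0xFFFF else 1
        index += 1
    return index


def utf16_len(text: str) -> int:
    """Number of UTF-16 code units in text (hypothetical helper)."""
    return len(text.encode('utf-16-le')) // 2


for line in ['', 'abc', '😋😋', 'a😋b😋', 'def 㒨㒨虁虁𤯒𤯒():']:
    for index in range(len(line) + 1):
        # The UTF-16 offset of every character boundary must map back to it.
        pos = utf16_len(line[:index])
        assert position_from_utf16(line, pos) == index
```

Two edge cases follow from the loop as written: an offset past the end of the line is clamped to len(line), and an offset that falls in the middle of a surrogate pair maps to the index after that character.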


tombh · 2022-12-14T19:19:50Z

Can you have a look at #304 please?


tombh added a commit that referenced this issue Dec 14, 2022

fix: Correctly cast from UTF16 positions

2a05253

Fixes #302

tombh mentioned this issue Dec 14, 2022

fix: Correctly cast from UTF16 positions #304

Merged


tombh added further commits that referenced this issue, each titled "fix: Correctly cast from UTF16 positions" (Fixes #302):

69a5960 · Dec 21, 2022
94fe887 · Jan 27, 2023
63c0026 · Jun 2, 2023
3b36463 · Jun 2, 2023
138fda6 · Jun 9, 2023
d559282 · Jun 9, 2023
579db68 · Jul 23, 2023
b1edc98 · Jul 26, 2023

tombh closed this as completed in #304 Jul 28, 2023

tombh added a commit that referenced this issue Jul 28, 2023

fix: Correctly cast from UTF16 positions

d5a1212

Fixes #302
