Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

position_from_utf16 in workspace.py may return incorrect result. #302

Closed
pyscripter opened this issue Dec 8, 2022 · 4 comments · Fixed by #304
Closed

position_from_utf16 in workspace.py may return incorrect result. #302

pyscripter opened this issue Dec 8, 2022 · 4 comments · Fixed by #304

Comments

@pyscripter
Copy link

The following code that mimics the way position_from_utf16 calculates the result, demonstrates the issue:

def utf16_unit_offset(chars: str):
    """Calculate the number of characters which need two utf-16 code units.
    Arguments:
        chars (str): The string to count occurrences of utf-16 code units for.
    """
    return sum(ord(ch) > 0xFFFF for ch in chars)

s = '😋😋'

def position_from_utf16(line, pos):
    return pos - utf16_unit_offset(line[:pos])

print(position_from_utf16(s, 2))

The result is zero, but it should be one.

Although, this may not have a large real-world impact (you don't get many high-plane chars in python code), it would be nice to have a correct calculation.

@pyscripter
Copy link
Author

pyscripter commented Dec 9, 2022

This is valid python code(!) that may cause the above failure:

def 㒨㒨虁虁𤯒𤯒():
    pass

if __name__ == '__main__':
    㒨㒨虁虁𤯒𤯒()

@alcarney
Copy link
Collaborator

Unfortunately I don't know this part of the LSP spec well enough to understand why the correct result is one!

print(position_from_utf16(s, 2))

I assume this is simulating a client sending a position that is after the first emoji? Something like

😋|😋

where | represents the cursor. And that the correct result is 1 since in python strings this is the index after the first character?

If so... would the fix be to do something like

def position_from_utf16(line, pos):
    idx = max(0, pos-1)
    return pos - utf16_unit_offset(line[:idx])

so that the second emoji is not passed to the utf16_unit_offset function?

@pyscripter
Copy link
Author

pyscripter commented Dec 13, 2022

I assume this is simulating a client sending a position that is after the first emoji? Something like

Yes

If so... would the fix be to do something like

No that would not work

Something like:

def position_from_utf16(line, pos):
    l = len(line)
    i = p = 0
    while (i < l) and (p < pos):
        p += 2 if (ord(line[i]) > 0xFFFF) else 1
        i += 1
    return i

should work.

@tombh
Copy link
Collaborator

tombh commented Dec 14, 2022

Can you have a look at #304 please?

tombh added a commit that referenced this issue Dec 21, 2022
tombh added a commit that referenced this issue Jan 27, 2023
tombh added a commit that referenced this issue Jun 2, 2023
tombh added a commit that referenced this issue Jun 2, 2023
tombh added a commit that referenced this issue Jun 9, 2023
tombh added a commit that referenced this issue Jun 9, 2023
tombh added a commit that referenced this issue Jul 23, 2023
tombh added a commit that referenced this issue Jul 26, 2023
tombh added a commit that referenced this issue Jul 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants