fix: Correctly cast from UTF16 positions #304

tombh · 2022-12-14T19:16:06Z

Fixes #302

Code review checklist (for code reviewer to complete)

Pull request represents a single change (i.e. not fixing disparate/unrelated things in a single PR)
Title summarizes what is changing
Commit messages are meaningful (see this for details)
Tests have been included and/or updated, as appropriate
Docstrings have been included and/or updated, as appropriate
Standalone docs have been updated accordingly
CONTRIBUTORS.md was updated, as appropriate
Changelog has been updated, as needed (see CHANGELOG.md)

pyscripter · 2022-12-15T04:34:56Z

pygls/workspace.py

+        )
+        log.debug(f"position_from_utf16() line with {utf16_num_units(line)} UTF16 units: `{line}`")
+        position.character = utf16_num_units(line) - 1
+


position can be greater than utf16_num_units(line), if for example the cursor is at the end of the line.
I would just drop this whole if block not least because utf16_num_units slows things down.
_apply_incremental_change is designed to work correctly with positions beyond the end of the line.

i've kept this just to satisfy this test assert doc.offset_at_position(Position(line=1, character=5)) == 12, which seems to make sense to me

@pyscripter I need some help to understand why you consider that existing test line to be wrong. Do you disagree with this assertion assert doc.offset_at_position(Position(line=1, character=5)) == 12?

pyscripter · 2022-12-15T04:51:56Z

pygls/workspace.py

        )
-    except IndexError:
-        return Position(line=len(lines), character=0)
+        line = lines[-1]


There is no need to issue a waring.
_apply_incremental_change is designed to work correctly with positions beyond the end of the file.

I would just replace all the above with:

if position.line >= len(lines): return Position(len(lines), 0) # or return position line = lines[position.line]

pyscripter · 2022-12-15T04:53:12Z

pygls/workspace.py

+    _utf16_index = 0
+    _is_end_of_line = False
+    while True:
+        _current_char = line[conventional_index]


This would fail on empty lines

pyscripter · 2022-12-15T05:24:59Z

Please see the comments in the code.
Please note that position.character may well be >= len(line).
Also note that _apply_incremental_change is designed to work correctly with positions beyond the end of line or file.

I would just use (skipping the comments):

def position_from_utf16(lines: List[str], position: Position) -> Position:
    if position.line >= len(lines):
        # start of the line after last
        return Position(len(lines), 0)  # or just  return position

    line = lines[position.line]
    _utf32_len = len(line)
    _utf32_index = 0
    _utf16_index = 0
    while (_utf16_index < position.character) and (_utf32_index < _utf32_len):
        _current_char = line[_utf32_index]
        is_double_width = is_char_beyond_multilingual_plane(_current_char)
        if (is_double_width):
            _utf16_index += 2
        else:
            _utf16_index += 1
        _utf32_index += 1

    position = Position(
        line=position.line,
        character=_utf32_index
    )
    return position

This works correctly with empty strings as well.
lines = ['', '😋😋', '']
position -> result
(0, 0) -> (0, 0)
(0, 1) -> (0, 0)
(1, 0) -> (1, 0)
(1, 2) -> (1, 1)
(1, 3) -> (1, 2)
(1, 4) -> (1, 2)
(1, 100) -> (1, 2)
(3, 0) -> (3, 0)
(4, 10) -> (3, 0)

tombh · 2022-12-15T18:28:58Z

Thanks for the feedback and extra test cases, they're really useful. Before making these amends, can you have a look at Pygls' Document.offset_at_position(). It's the main reason I made those conditional guards at the beginning of the function. I don't know who uses offset_at_position(), or what it's for, but Pygls supports it, so we should probably continue to do so.

pyscripter · 2022-12-15T19:45:07Z

I don't know who uses offset_at_position()

Well, some editors may internally store the actual text of a file instead of a list of lines and they may want to convert from a position to an offset to the string representation of the file. Since offset_at_position does not account for line breaks, users of the function need to adjust for line breaks. Also if they store the file as say utf-8 instead of utf-32, then this function is of no use.

The function will work correctly if position_from_utf16 provides the correct result, so it should not be a concern.

tombh · 2022-12-21T16:57:02Z

Ok I've pushed some changes. I made the variables names a bit more verbose. I added a guard clause for when a queried Position goes over the length of the line. And I added all your tests cases.

How does it look to you?

tombh · 2023-01-27T19:26:12Z

I'm thinking of merging this as is...

pyscripter · 2023-01-27T19:33:40Z

_utf16_end_of_line = utf16_num_units(_line) - 1

This is definitely wrong. See my comments above.

tombh · 2023-06-01T16:34:20Z

Did you miss my question here? #304 (comment)

pyscripter · 2023-06-01T18:03:24Z

Did you miss my question here? #304 (comment)

I did miss it. But I think I explained it above. Anyway:

Say you have a line "abc" and ^ is the cursor"

"^abc" position 0
"a^bc" position 1
"ab^c" postion 2
"abc^" position 3

so _utf16_end_of_line = utf16_num_units(_line) and not _utf16_end_of_line = utf16_num_units(_line) - 1

Also, this code is used in _apply_incremental_change and if you look at that code, it is designed to work correctly with positions beyond the end of the line and the end of the file.

tombh · 2023-06-09T01:47:35Z

Just taking a small step back for a moment. This is surprisingly a lot more complicated than I first thought. We have 2 things going on here.

The first one, and the reason that this PR exists, is the position_from_utf16 bug you identified in position_from_utf16 in workspace.py may return incorrect result. #302. And I think the PR here addresses that?
The second one is the tangentially associated offset_at_position tests that fail because of the fix for 1. (We have some comments on that in a now stale commit.)

Copying your comments from point 2 here:

But why include the control line end characters in lines?
...
You also seem to assume that the control characters are always /n. This is not true on Windows or for the last line.

Very good questions. I had to look this up and you were indeed right to bring it up. If I'm interpreting it correctly the LSP spec says that that \n and \r\n should be considered and considered as the same length. The relevant section is here. And I've added a new test for that called test_position_for_line_endings.

To further complicate things, I also learnt from reading the spec that the position encodings can be negotiated by the client and server! See:

https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#clientCapabilities

It does say that it defaults to utf-16 positioning, which is what Pygls does already. So maybe we can just create an issue to acknowledge that Pygls doesn't currently support clients requesting specific position encodings.

tombh · 2023-07-23T23:43:40Z

I'm thinking of merging this then.

Fixes #302

tombh · 2023-07-26T13:38:40Z

I think it's time we merged this

alcarney

I've never managed to get my head around this issue, so if you say it's ready I'll take your word for it! 😄

tombh · 2023-07-28T16:34:00Z

After 8 months, it's as ready as it'll ever be I think!

tombh mentioned this pull request Dec 14, 2022

position_from_utf16 in workspace.py may return incorrect result. #302

Closed

pyscripter reviewed Dec 15, 2022

View reviewed changes

tombh force-pushed the utf8-position-bug branch from 2a05253 to 69a5960 Compare December 21, 2022 16:55

tombh force-pushed the utf8-position-bug branch from 69a5960 to 94fe887 Compare January 27, 2023 19:25

tombh requested a review from pyscripter January 27, 2023 19:28

tombh force-pushed the utf8-position-bug branch 3 times, most recently from 3b36463 to 138fda6 Compare June 9, 2023 01:47

tombh force-pushed the utf8-position-bug branch from 138fda6 to d559282 Compare June 9, 2023 01:53

This was referenced Jun 26, 2023

Add remaining features for 3.17 support #346

Closed

Add inlay hint support #342

Merged

tombh mentioned this pull request Jul 10, 2023

Renaming a module adds additional new lines pappasam/jedi-language-server#159

Open

tombh force-pushed the utf8-position-bug branch from d559282 to 579db68 Compare July 23, 2023 23:43

fix: Correctly cast from UTF16 positions

b1edc98

Fixes #302

tombh force-pushed the utf8-position-bug branch from 579db68 to b1edc98 Compare July 26, 2023 13:26

tombh requested review from alcarney and pyscripter and removed request for pyscripter July 26, 2023 13:38

alcarney approved these changes Jul 26, 2023

View reviewed changes

tombh merged commit d5a1212 into main Jul 28, 2023

tombh deleted the utf8-position-bug branch July 28, 2023 16:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Correctly cast from UTF16 positions #304

fix: Correctly cast from UTF16 positions #304

tombh commented Dec 14, 2022 •

edited

Loading

pyscripter Dec 15, 2022 •

edited

Loading

tombh Dec 21, 2022

tombh Jan 27, 2023

pyscripter Dec 15, 2022 •

edited

Loading

pyscripter Dec 15, 2022

pyscripter commented Dec 15, 2022 •

edited

Loading

tombh commented Dec 15, 2022

pyscripter commented Dec 15, 2022 •

edited

Loading

tombh commented Dec 21, 2022

tombh commented Jan 27, 2023

pyscripter commented Jan 27, 2023

tombh commented Jun 1, 2023

pyscripter commented Jun 1, 2023 •

edited

Loading

tombh commented Jun 9, 2023

tombh commented Jul 23, 2023

tombh commented Jul 26, 2023

alcarney left a comment

tombh commented Jul 28, 2023

fix: Correctly cast from UTF16 positions #304

fix: Correctly cast from UTF16 positions #304

Conversation

tombh commented Dec 14, 2022 • edited Loading

Code review checklist (for code reviewer to complete)

pyscripter Dec 15, 2022 • edited Loading

Choose a reason for hiding this comment

tombh Dec 21, 2022

Choose a reason for hiding this comment

tombh Jan 27, 2023

Choose a reason for hiding this comment

pyscripter Dec 15, 2022 • edited Loading

Choose a reason for hiding this comment

pyscripter Dec 15, 2022

Choose a reason for hiding this comment

pyscripter commented Dec 15, 2022 • edited Loading

tombh commented Dec 15, 2022

pyscripter commented Dec 15, 2022 • edited Loading

tombh commented Dec 21, 2022

tombh commented Jan 27, 2023

pyscripter commented Jan 27, 2023

tombh commented Jun 1, 2023

pyscripter commented Jun 1, 2023 • edited Loading

tombh commented Jun 9, 2023

tombh commented Jul 23, 2023

tombh commented Jul 26, 2023

alcarney left a comment

Choose a reason for hiding this comment

tombh commented Jul 28, 2023

tombh commented Dec 14, 2022 •

edited

Loading

pyscripter Dec 15, 2022 •

edited

Loading

pyscripter Dec 15, 2022 •

edited

Loading

pyscripter commented Dec 15, 2022 •

edited

Loading

pyscripter commented Dec 15, 2022 •

edited

Loading

pyscripter commented Jun 1, 2023 •

edited

Loading