Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Correctly cast from UTF16 positions #304

Merged
merged 1 commit into from
Jul 28, 2023
Merged

fix: Correctly cast from UTF16 positions #304

merged 1 commit into from
Jul 28, 2023

Conversation

tombh
Copy link
Collaborator

@tombh tombh commented Dec 14, 2022

Fixes #302

Code review checklist (for code reviewer to complete)

  • Pull request represents a single change (i.e. not fixing disparate/unrelated things in a single PR)
  • Title summarizes what is changing
  • Commit messages are meaningful (see this for details)
  • Tests have been included and/or updated, as appropriate
  • Docstrings have been included and/or updated, as appropriate
  • Standalone docs have been updated accordingly
  • CONTRIBUTORS.md was updated, as appropriate
  • Changelog has been updated, as needed (see CHANGELOG.md)

)
log.debug(f"position_from_utf16() line with {utf16_num_units(line)} UTF16 units: `{line}`")
position.character = utf16_num_units(line) - 1

Copy link

@pyscripter pyscripter Dec 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

position can be greater than utf16_num_units(line), if for example the cursor is at the end of the line.
I would just drop this whole if block not least because utf16_num_units slows things down.
_apply_incremental_change is designed to work correctly with positions beyond the end of the line.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've kept this just to satisfy this test assert doc.offset_at_position(Position(line=1, character=5)) == 12, which seems to make sense to me

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pyscripter I need some help to understand why you consider that existing test line to be wrong. Do you disagree with this assertion assert doc.offset_at_position(Position(line=1, character=5)) == 12?

)
except IndexError:
return Position(line=len(lines), character=0)
line = lines[-1]
Copy link

@pyscripter pyscripter Dec 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to issue a waring.
_apply_incremental_change is designed to work correctly with positions beyond the end of the file.

I would just replace all the above with:

    if position.line >= len(lines):        
        return Position(len(lines), 0)  # or return position

    line = lines[position.line]

_utf16_index = 0
_is_end_of_line = False
while True:
_current_char = line[conventional_index]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would fail on empty lines

@pyscripter
Copy link

pyscripter commented Dec 15, 2022

Please see the comments in the code.
Please note that position.character may well be >= len(line).
Also note that _apply_incremental_change is designed to work correctly with positions beyond the end of line or file.

I would just use (skipping the comments):

def position_from_utf16(lines: List[str], position: Position) -> Position:
    if position.line >= len(lines):
        # start of the line after last
        return Position(len(lines), 0)  # or just  return position

    line = lines[position.line]
    _utf32_len = len(line)
    _utf32_index = 0
    _utf16_index = 0
    while (_utf16_index < position.character) and (_utf32_index < _utf32_len):
        _current_char = line[_utf32_index]
        is_double_width = is_char_beyond_multilingual_plane(_current_char)
        if (is_double_width):
            _utf16_index += 2
        else:
            _utf16_index += 1
        _utf32_index += 1

    position = Position(
        line=position.line,
        character=_utf32_index
    )
    return position

This works correctly with empty strings as well.
lines = ['', '😋😋', '']
position -> result
(0, 0) -> (0, 0)
(0, 1) -> (0, 0)
(1, 0) -> (1, 0)
(1, 2) -> (1, 1)
(1, 3) -> (1, 2)
(1, 4) -> (1, 2)
(1, 100) -> (1, 2)
(3, 0) -> (3, 0)
(4, 10) -> (3, 0)

@tombh
Copy link
Collaborator Author

tombh commented Dec 15, 2022

Thanks for the feedback and extra test cases, they're really useful. Before making these amends, can you have a look at Pygls' Document.offset_at_position(). It's the main reason I made those conditional guards at the beginning of the function. I don't know who uses offset_at_position(), or what it's for, but Pygls supports it, so we should probably continue to do so.

@pyscripter
Copy link

pyscripter commented Dec 15, 2022

I don't know who uses offset_at_position()

Well, some editors may internally store the actual text of a file instead of a list of lines and they may want to convert from a position to an offset to the string representation of the file. Since offset_at_position does not account for line breaks, users of the function need to adjust for line breaks. Also if they store the file as say utf-8 instead of utf-32, then this function is of no use.

The function will work correctly if position_from_utf16 provides the correct result, so it should not be a concern.

@tombh tombh force-pushed the utf8-position-bug branch from 2a05253 to 69a5960 Compare December 21, 2022 16:55
@tombh
Copy link
Collaborator Author

tombh commented Dec 21, 2022

Ok I've pushed some changes. I made the variables names a bit more verbose. I added a guard clause for when a queried Position goes over the length of the line. And I added all your tests cases.

How does it look to you?

@tombh tombh force-pushed the utf8-position-bug branch from 69a5960 to 94fe887 Compare January 27, 2023 19:25
@tombh
Copy link
Collaborator Author

tombh commented Jan 27, 2023

I'm thinking of merging this as is...

@tombh tombh requested a review from pyscripter January 27, 2023 19:28
@pyscripter
Copy link

_utf16_end_of_line = utf16_num_units(_line) - 1

This is definitely wrong. See my comments above.

@tombh
Copy link
Collaborator Author

tombh commented Jun 1, 2023

Did you miss my question here? #304 (comment)

@pyscripter
Copy link

pyscripter commented Jun 1, 2023

Did you miss my question here? #304 (comment)

I did miss it. But I think I explained it above. Anyway:

Say you have a line "abc" and ^ is the cursor"

"^abc" position 0
"a^bc" position 1
"ab^c" postion 2
"abc^" position 3

so _utf16_end_of_line = utf16_num_units(_line) and not _utf16_end_of_line = utf16_num_units(_line) - 1

Also, this code is used in _apply_incremental_change and if you look at that code, it is designed to work correctly with positions beyond the end of the line and the end of the file.

@tombh tombh force-pushed the utf8-position-bug branch 3 times, most recently from 3b36463 to 138fda6 Compare June 9, 2023 01:47
@tombh
Copy link
Collaborator Author

tombh commented Jun 9, 2023

Just taking a small step back for a moment. This is surprisingly a lot more complicated than I first thought. We have 2 things going on here.

  1. The first one, and the reason that this PR exists, is the position_from_utf16 bug you identified in position_from_utf16 in workspace.py may return incorrect result. #302. And I think the PR here addresses that?
  2. The second one is the tangentially associated offset_at_position tests that fail because of the fix for 1. (We have some comments on that in a now stale commit.)

Copying your comments from point 2 here:

But why include the control line end characters in lines?
...
You also seem to assume that the control characters are always /n. This is not true on Windows or for the last line.

Very good questions. I had to look this up and you were indeed right to bring it up. If I'm interpreting it correctly the LSP spec says that that \n and \r\n should be considered and considered as the same length. The relevant section is here. And I've added a new test for that called test_position_for_line_endings.

To further complicate things, I also learnt from reading the spec that the position encodings can be negotiated by the client and server! See:
image
https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#clientCapabilities

It does say that it defaults to utf-16 positioning, which is what Pygls does already. So maybe we can just create an issue to acknowledge that Pygls doesn't currently support clients requesting specific position encodings.

@tombh
Copy link
Collaborator Author

tombh commented Jul 23, 2023

I'm thinking of merging this then.

@tombh tombh force-pushed the utf8-position-bug branch from 579db68 to b1edc98 Compare July 26, 2023 13:26
@tombh tombh requested review from alcarney and pyscripter and removed request for pyscripter July 26, 2023 13:38
@tombh
Copy link
Collaborator Author

tombh commented Jul 26, 2023

I think it's time we merged this

Copy link
Collaborator

@alcarney alcarney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never managed to get my head around this issue, so if you say it's ready I'll take your word for it! 😄

@tombh
Copy link
Collaborator Author

tombh commented Jul 28, 2023

After 8 months, it's as ready as it'll ever be I think!

@tombh tombh merged commit d5a1212 into main Jul 28, 2023
@tombh tombh deleted the utf8-position-bug branch July 28, 2023 16:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

position_from_utf16 in workspace.py may return incorrect result.
3 participants