fix: Correctly cast from UTF16 positions #304
)
log.debug(f"position_from_utf16() line with {utf16_num_units(line)} UTF16 units: `{line}`")
position.character = utf16_num_units(line) - 1
position can be greater than utf16_num_units(line), if for example the cursor is at the end of the line.
I would just drop this whole if block not least because utf16_num_units slows things down.
_apply_incremental_change is designed to work correctly with positions beyond the end of the line.
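To make the cost and the end-of-line case concrete, here is a minimal sketch. The `utf16_num_units` below is a local stand-in written for illustration, not a copy of the pygls helper; it assumes one code unit per character in the Basic Multilingual Plane and two for anything above U+FFFF.

```python
def utf16_num_units(text: str) -> int:
    """Count UTF-16 code units in text (stand-in; scans every character)."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


line = "a😋b"  # 3 code points, 4 UTF-16 code units

# A cursor sitting after the last character has character == number of units,
# so valid cursor columns run from 0 through utf16_num_units(line) inclusive.
end_of_line_character = utf16_num_units(line)
```

Because the count walks the whole line, calling it on every conversion adds a full scan per position, which is the slowdown mentioned above.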
I've kept this just to satisfy this test, which seems to make sense to me: assert doc.offset_at_position(Position(line=1, character=5)) == 12
@pyscripter I need some help to understand why you consider that existing test line to be wrong. Do you disagree with this assertion? assert doc.offset_at_position(Position(line=1, character=5)) == 12
pygls/workspace.py
Outdated
)
except IndexError:
    return Position(line=len(lines), character=0)
line = lines[-1]
There is no need to issue a warning.
_apply_incremental_change is designed to work correctly with positions beyond the end of the file.
I would just replace all the above with:
if position.line >= len(lines):
    return Position(len(lines), 0)  # or return position
line = lines[position.line]
pygls/workspace.py
Outdated
_utf16_index = 0
_is_end_of_line = False
while True:
    _current_char = line[conventional_index]
This would fail on empty lines
Please see the comments in the code. I would just use (skipping the comments):
def position_from_utf16(lines: List[str], position: Position) -> Position:
    if position.line >= len(lines):
        # start of the line after last
        return Position(len(lines), 0)  # or just return position
    line = lines[position.line]
    _utf32_len = len(line)
    _utf32_index = 0
    _utf16_index = 0
    while (_utf16_index < position.character) and (_utf32_index < _utf32_len):
        _current_char = line[_utf32_index]
        is_double_width = is_char_beyond_multilingual_plane(_current_char)
        if is_double_width:
            _utf16_index += 2
        else:
            _utf16_index += 1
        _utf32_index += 1
    position = Position(
        line=position.line,
        character=_utf32_index
    )
    return position
This works correctly with empty strings as well.
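For anyone who wants to try the edge cases, here is a self-contained sketch of the suggested function. `Position` and `is_char_beyond_multilingual_plane` are local stand-ins defined only so the snippet runs on its own; they are assumptions for illustration and not the real pygls types.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Position:
    """Hypothetical stand-in for the LSP Position type."""
    line: int
    character: int


def is_char_beyond_multilingual_plane(char: str) -> bool:
    """True when char needs two UTF-16 code units (a surrogate pair)."""
    return ord(char) > 0xFFFF


def position_from_utf16(lines: List[str], position: Position) -> Position:
    """Convert a UTF-16 based position into a code-point based one."""
    if position.line >= len(lines):
        # Start of the line after the last one.
        return Position(len(lines), 0)
    line = lines[position.line]
    utf32_index = 0
    utf16_index = 0
    # Stop at the requested column or at the end of the line, whichever comes first.
    while utf16_index < position.character and utf32_index < len(line):
        utf16_index += 2 if is_char_beyond_multilingual_plane(line[utf32_index]) else 1
        utf32_index += 1
    return Position(line=position.line, character=utf32_index)


lines = ["x = '😋'\n", "\n", "abc"]
after_emoji = position_from_utf16(lines, Position(0, 7))    # column 7 in UTF-16 units
past_line_end = position_from_utf16(lines, Position(2, 10))  # beyond the end of "abc"
past_file_end = position_from_utf16(lines, Position(5, 2))   # beyond the last line
on_empty_line = position_from_utf16([""], Position(0, 5))    # empty line, no IndexError
```

The loop condition checks the line length before indexing, which is why an empty line never raises.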
Thanks for the feedback and extra test cases, they're really useful. Before making these amends, can you have a look at Pygls'
Well, some editors may internally store the actual text of a file instead of a list of lines, and they may want to convert from a position to an offset into the string representation of the file. Since offset_at_position does not account for line breaks, users of the function need to adjust for line breaks. Also, if they store the file as, say, UTF-8 instead of UTF-32, then this function is of no use. The function will work correctly if position_from_utf16 provides the correct result, so it should not be a concern.
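As a rough illustration of the position-to-offset conversion described here, the sketch below computes a code-point offset into a single string. `offset_at_utf16_position` is a hypothetical helper, not the pygls `offset_at_position`; it assumes the text is split with `keepends=True` so that line breaks are counted.

```python
from typing import List


def offset_at_utf16_position(text: str, line: int, character: int) -> int:
    """Hypothetical helper: code-point offset into text for a UTF-16 position."""
    lines: List[str] = text.splitlines(keepends=True)
    if line >= len(lines):
        # Positions past the last line map to the end of the text.
        return len(text)
    # Line breaks are included because keepends=True leaves them on each line.
    offset = sum(len(previous) for previous in lines[:line])
    units = 0
    for ch in lines[line]:
        if units >= character:
            break
        units += 2 if ord(ch) > 0xFFFF else 1
        offset += 1
    return offset


text = "document\nfor\ntesting"
start_of_second_line = offset_at_utf16_position(text, 1, 0)
end_of_text = offset_at_utf16_position(text, 2, 7)
```

An editor that stores the buffer as UTF-8 bytes would need a further code-point-to-byte step, which this sketch does not attempt.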
2a05253
to
69a5960
Ok I've pushed some changes. I made the variable names a bit more verbose. I added a guard clause for when a queried How does it look to you?
69a5960
to
94fe887
I'm thinking of merging this as is...
_utf16_end_of_line = utf16_num_units(_line) - 1
This is definitely wrong. See my comments above.
Did you miss my question here? #304 (comment)
I did miss it. But I think I explained it above. Anyway: say you have a line "abc" and ^ is the cursor: "^abc" position 0 so
Also, this code is used in _apply_incremental_change, and if you look at that code, it is designed to work correctly with positions beyond the end of the line and the end of the file.
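The cursor argument can be spelled out in a few lines. Both helpers below are illustrative assumptions: `utf16_num_units` is a local stand-in, and `clamp_to_last_unit` is a hypothetical rendering of the `utf16_num_units(line) - 1` clamp, not the exact code in the PR.

```python
def utf16_num_units(text: str) -> int:
    """Count UTF-16 code units in text (local stand-in)."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


def clamp_to_last_unit(character: int, line: str) -> int:
    """Hypothetical version of the clamp: cap the column at units - 1."""
    return min(character, utf16_num_units(line) - 1)


line = "abc"
# Every gap between characters, plus both ends, is a valid cursor column:
# ^abc -> 0, a^bc -> 1, ab^c -> 2, abc^ -> 3
cursor_columns = list(range(utf16_num_units(line) + 1))

# The end-of-line cursor (column 3) is pulled back to column 2 by the clamp.
clamped_end = clamp_to_last_unit(3, line)
```

If the stored line keeps its trailing newline, as in `"for\n"`, the same clamp lands just before the newline, which would explain why the existing test passes while a last line without a newline loses its final cursor column.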
3b36463
to
138fda6
Just taking a small step back for a moment. This is surprisingly a lot more complicated than I first thought. We have 2 things going on here.
Copying your comments from point 2 here:
Very good questions. I had to look this up and you were indeed right to bring it up. If I'm interpreting it correctly the LSP spec says that that To further complicate things, I also learnt from reading the spec that the position encodings can be negotiated by the client and server! See: It does say that it defaults to utf-16 positioning, which is what Pygls does already. So maybe we can just create an issue to acknowledge that Pygls doesn't currently support clients requesting specific position encodings.
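For context on what supporting negotiation might involve, here is a minimal sketch based on my reading of LSP 3.17, where the client lists its encodings under `general.positionEncodings` in order of preference and UTF-16 is the fallback. The function names and the plain-dict shape of the capabilities are assumptions for illustration; nothing here is pygls API.

```python
from typing import Any, Dict, List

DEFAULT_ENCODING = "utf-16"  # the spec's default when nothing is negotiated


def negotiate_position_encoding(
    client_capabilities: Dict[str, Any], supported: List[str]
) -> str:
    """Pick the first client-preferred encoding the server also supports."""
    general = client_capabilities.get("general") or {}
    offered = general.get("positionEncodings") or [DEFAULT_ENCODING]
    for encoding in offered:
        if encoding in supported:
            return encoding
    return DEFAULT_ENCODING


def num_units(text: str, encoding: str) -> int:
    """Length of text in the code units of the given position encoding."""
    if encoding == "utf-8":
        return len(text.encode("utf-8"))
    if encoding == "utf-16":
        return len(text.encode("utf-16-le")) // 2
    return len(text)  # utf-32: one unit per code point


chosen = negotiate_position_encoding(
    {"general": {"positionEncodings": ["utf-8", "utf-16"]}},
    supported=["utf-16", "utf-32"],
)
```

A server that only ever answers with UTF-16, as described above, stays within the spec; the sketch just shows where a different choice would plug in.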
I'm thinking of merging this then.
I think it's time we merged this
I've never managed to get my head around this issue, so if you say it's ready I'll take your word for it! 😄
After 8 months, it's as ready as it'll ever be I think!
Fixes #302
Code review checklist (for code reviewer to complete)