Location APIs for characters #1788
String#[] doesn't use grapheme clusters but codepoints (equivalent to Ruby's characters). I'm not sure about editors and debuggers: do they use byte offsets, codepoint offsets, "UTF-16 code unit" offsets, UTF-16 codepoint offsets, or grapheme cluster offsets? Maybe @vinistock / @joeldrapper know? From a quick search I found microsoft/language-server-protocol#376, so it sounds like the LSP used to use UTF-16 code units, and it may now be possible to use UTF-8 byte offsets and Unicode codepoints (the same as Ruby characters): microsoft/language-server-protocol#376 (comment)?
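For illustration only (not from the original thread), here is a small Ruby snippet showing the distinction the question is about: String#[] indexes by codepoint, not by grapheme cluster or byte.

```ruby
# "é" written as "e" + combining accent: 1 grapheme cluster, 2 codepoints, 3 bytes.
s = "e\u0301x"
s.bytesize          # => 4  (1 + 2 + 1 bytes in UTF-8)
s.length            # => 3  codepoints (Ruby "characters")
s.grapheme_clusters # => ["é", "x"]  (2 grapheme clusters)
s[1]                # => the combining accent alone (codepoint index 1)
s.byteslice(1, 2)   # => the combining accent again (byte offsets 1..2)
```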
For editors, it depends on the editor's implementation. Language servers negotiate the encoding during the initial handshake, but the server can only select an encoding from the list of encodings announced by the client. For UTF-8 clients, the concept of characters matches exactly what Ruby does: regardless of how many bytes a Unicode character takes, it is still considered length 1. For UTF-16 clients based on NodeJS, the situation is a little different. Based on my investigations with @nirvdrum and the context in microsoft/vscode#20170, NodeJS has an optimization in how it calculates string lengths. This leads to a scenario where Unicode characters often have length 2, but regular characters have length 1. This is a known problem for language servers, where we need to perform offset adjustments if the client is UTF-16 (Rust Analyzer and the Ruby LSP use similar mechanisms for counting the UTF-16 length of a Unicode character). So I guess in this case it would be considered a code unit?
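As a sketch of the kind of adjustment described above (assumed for illustration, not the Ruby LSP's actual code): characters outside the Basic Multilingual Plane occupy two UTF-16 code units, so a UTF-16 client sees them as length 2 even though Ruby counts them as one character.

```ruby
# Count the UTF-16 code units in a string: codepoints above U+FFFF need a
# surrogate pair (2 code units), everything else needs 1.
def utf16_code_units(string)
  string.each_char.sum { |char| char.ord > 0xFFFF ? 2 : 1 }
end

utf16_code_units("abc")   # => 3
utf16_code_units("a😀b")  # => 4 (the emoji counts as 2 code units)
"a😀b".length              # => 3 Ruby characters/codepoints
```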
Unless I misunderstood something, I don't believe this is the case. The client announces the supported encodings and then the server has the opportunity to pick one of them. But VS Code only supports UTF-16, for example.
Yes, exactly, that's what UTF-16 code unit means, so a UTF-16 codepoint can be either 1 or 2 UTF-16 code units.
Right, client = VS Code and server = ruby-lsp, so ruby-lsp gets to choose, except that currently VS Code (the client) only announces UTF-16, which is unfortunate. Rereading this comment, it's clear UTF-16 is a mandatory encoding, so UTF-16 code units need to be handled in the server anyway. Regarding UTF-16 code units, I think that's outside the scope of Prism and something that makes more sense for the tools to deal with. For #1770, adding character/codepoint offsets seems convenient if the other side cannot deal with byte offsets and needs codepoint offsets (when byte offsets can be used, they would be far more efficient, as mentioned in my first comment). "codepoint" would be the clearer and better-defined term; "character" is quite ambiguous. On the other hand, Ruby's String has each_char and chars (but also each_codepoint and codepoints).
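For reference (illustrative only), the Ruby methods mentioned above always agree on length: a Ruby "character" is one codepoint, however many bytes it takes in UTF-8.

```ruby
"héllo".chars       # => ["h", "é", "l", "l", "o"]
"héllo".codepoints  # => [104, 233, 108, 108, 111]
"héllo".chars.length == "héllo".codepoints.length # => true
```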
We store everything as an offset of bytes from the start of the source. This is correct and not going to change. However, lots of our consumers are dealing with encoded source strings and would like to be able to use the [] method without having to convert into a number of characters. (Debatably, this should be called "grapheme_clusters", but I'm going with characters for now because it's shorter.)

The APIs that would be nice to provide:
Location#start_character_offset
Location#end_character_offset
Location#start_character_column
Location#end_character_column
A simple implementation of these would be:
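The code block for this appears to have been lost in extraction; a minimal sketch of a naive approach could look like the following. Here `source`, `start_offset`, `end_offset`, `start_line`, `end_line`, and the `line_byte_offset` helper are assumed accessors, not necessarily Prism's actual API.

```ruby
class Location
  # Character (codepoint) offset from the start of the source, recomputed on
  # every call by slicing up to the byte offset and counting characters.
  def start_character_offset
    source.byteslice(0, start_offset).length
  end

  def end_character_offset
    source.byteslice(0, end_offset).length
  end

  # Character column relative to the start of the line; line_byte_offset(n) is
  # an assumed helper returning the byte offset at which line n begins.
  def start_character_column
    line_start = line_byte_offset(start_line)
    source.byteslice(line_start, start_offset - line_start).length
  end

  def end_character_column
    line_start = line_byte_offset(end_line)
    source.byteslice(line_start, end_offset - line_start).length
  end
end
```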
However, this would mean a lot of things being recalculated any time any of these APIs were called. My hypothesis is that if you're calling these APIs once, it's likely you'll call them again. In this case I think we should implement some caching on the Source object for byte offset => character offset and use that cache in these computations.
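A rough sketch of what that caching could look like (an assumption for illustration, not Prism's actual implementation): lazily build a byte-offset => character-offset table on the Source object and answer subsequent lookups from it.

```ruby
class Source
  def initialize(source)
    @source = source
    @character_offsets = nil
  end

  # Returns the number of characters (codepoints) preceding the given byte
  # offset, building the full mapping once and memoizing it for later calls.
  def character_offset(byte_offset)
    @character_offsets ||= build_character_offsets
    @character_offsets[byte_offset]
  end

  private

  # Table of size bytesize + 1: entry b holds the character count of the
  # first b bytes (bytes inside a multi-byte character map to its index).
  def build_character_offsets
    offsets = Array.new(@source.bytesize + 1, 0)
    byte = 0
    characters = 0
    @source.each_char do |char|
      char.bytesize.times do
        offsets[byte] = characters
        byte += 1
      end
      characters += 1
    end
    offsets[byte] = characters
    offsets
  end
end
```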