
Change character units from UTF-16 code unit to Unicode codepoint #376

Closed
MaskRay opened this issue Jan 13, 2018 · 181 comments
Labels
breaking feature-request Request for new features or functionality help wanted Issues identified as good community contribution opportunities initialization

@MaskRay

MaskRay commented Jan 13, 2018

Text document offsets are based on a UTF-16 string representation. This is odd, given that the text contents themselves are transmitted in UTF-8.

Text Documents
… The offsets are based on a UTF-16 string representation.

Here in TextDocumentContentChangeEvent, range is specified in UTF-16 column offsets while text is transmitted in UTF-8.

interface TextDocumentContentChangeEvent {
	range?: Range;
	rangeLength?: number;
	text: string;
}

Is it more reasonable to unify these, remove UTF-16 from the wording, and use UTF-8 as the sole encoding? Line/character could be measured in units of Unicode codepoints instead of UTF-16 code units.
A line cannot be very long, so the extra computation to find the Nth Unicode codepoint would not place much of a burden on editors and language servers.
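For a concrete illustration (a hedged sketch; JavaScript/TypeScript strings are sequences of UTF-16 code units, so .length counts code units while spreading iterates codepoints):

// "𝕊" (U+1D54A) is outside the Basic Multilingual Plane:
// one codepoint, but two UTF-16 code units (a surrogate pair).
const line = "a𝕊b";
console.log(line.length);      // 4 (UTF-16 code units)
console.log([...line].length); // 3 (Unicode codepoints)
// A Position after "𝕊" has character 3 in UTF-16 code units,
// but character 2 when counting codepoints.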

jacobdufault/cquery#57

Survey: counting method of Position.character offsets supported by language servers/clients
https://docs.google.com/spreadsheets/d/168jSz68po0R09lO0xFK4OmDsQukLzSPCXqB6-728PXQ/edit#gid=0

@MaskRay MaskRay changed the title Change character offset calculation from UTF-16 to UTF-8 Change character units from UTF-16 code unit to Unicode codepoint Jan 13, 2018
@szatanjl

szatanjl commented Jan 18, 2018

I would suggest going even one step further. Why should editors and servers need to know which bytes form a Unicode codepoint at all? Right now the specification states that it supports only UTF-8 encoding, but with the Content-Type header I guess there is an idea of supporting other encodings in the future too. I think it would then be even better to use a number of bytes instead of UTF-16 code units or Unicode codepoints.

@dbaeumer
Member

@MaskRay we need to distinguish between the encoding used to transfer the JSON-RPC message and the encoding in which column offsets are counted. We currently use UTF-8 for the transfer, but as the header indicates, this can be changed to any encoding, assuming the encoding is supported by all libraries (Node, for example, supports only a limited set of encodings by default).

The column offsets in a document assume that, after the JSON-RPC message has been decoded and the string parsed, the document content is stored in UTF-16. We chose UTF-16 here since most languages store strings in memory as UTF-16, not UTF-8. To save one encoding pass we could transfer the JSON-RPC message in UTF-16 instead, which is easy to support.

If we want to support UTF-8 for the internal text document representation and line offsets, this would be a breaking change, or it would need to be a capability the client announces.

Regarding byte offsets: there was another discussion about whether the protocol should be offset-based. However, the protocol was designed to support tools and their UI; for example, a reference match in a file could not be rendered in a result list using byte offsets. The client would need to read the content of the file and convert the offset to line/column. We decided to let the server do this, since the server has very likely read the file before anyway.
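For illustration, here is a rough sketch of the conversion a client would otherwise have to perform itself (a hypothetical helper; the offset is counted in the same units as the string index):

function offsetToPosition(text: string, offset: number): { line: number; character: number } {
	let line = 0;
	let lineStart = 0;
	// Scan the document once, tracking where the current line starts.
	for (let i = 0; i < offset && i < text.length; i++) {
		if (text[i] === "\n") {
			line++;
			lineStart = i + 1;
		}
	}
	return { line, character: offset - lineStart };
}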

@dbaeumer dbaeumer added feature-request Request for new features or functionality breaking labels Jan 18, 2018
@dbaeumer dbaeumer added this to the Backlog milestone Jan 18, 2018
@Marwes
Contributor

Marwes commented Jan 18, 2018

We chose UTF-16 here since most languages store strings in memory as UTF-16, not UTF-8.

Source? Isn't the only reason for this that Java/JavaScript/C# use UTF-16 as their string representation? I'd say there is a good case to be made that (in hindsight) UTF-16 was a poor choice of string type in those languages as well, which makes it dubious to optimize for that case. The source code itself is usually UTF-8 (or just ASCII), and as has been said, this is also the case when transferring over JSON-RPC, so I'd say the case is pretty strong for assuming UTF-8 instead of UTF-16.

@puremourning

puremourning commented Jan 18, 2018

We chose UTF-16 here since most languages store strings in memory as UTF-16, not UTF-8. To save one encoding pass we could transfer the JSON-RPC message in UTF-16 instead, which is easy to support.

Citation needed? ;)

Of the 7 downstream language completers we support in ycmd:

  • 1 uses byte offsets (libclang)
  • 6 use Unicode code points (gocode, tern, tsserver*, jedi, racer, omnisharp*)
  • 0 use UTF-16 code units

* full disclosure, I think these use code points, else we have a bug!

The last is a bit of a fib, because we're integrating the Language Server API for Java.

However, as we receive byte offsets from the client and internally use Unicode code points, we have to re-encode the file as UTF-16, do a bunch of hackery to count the code units, then send the file, encoded as UTF-8, over to the language server, with offsets in UTF-16 code units.

Of the client implementations of ycmd (there are about 8, I think), all of them are able to provide line-byte offsets. I don't know for certain about all of them, but certainly the main one (Vim) is not able to provide UTF-16 code units; they would have to be calculated.

Anyway, the point is that it might not be as simple as originally thought :D Though I appreciate that a specification is what it is, and changing it would be breaking. Just my 2p.

@puremourning

Not that SO is particularly reliable, but it happens to support my point, so I'm shamelessly going to quote from: https://stackoverflow.com/questions/30775689/python-length-of-unicode-string-confusion

You have 5 codepoints. One of those codepoints is outside of the Basic Multilingual Plane which means the UTF-16 encoding for those codepoints has to use two code units for the character.

In other words, the client is relying on an implementation detail and is doing something wrong. They should be counting codepoints, not code units. There are several platforms where this happens quite regularly; Python 2 UCS-2 builds are one such, but Java developers often forget about the difference, as do Windows APIs.

@MaskRay
Author

MaskRay commented Jan 19, 2018

Emacs uses an extended UTF-8 internally, and its functions return numbers in units of Unicode codepoints.

https://github.com/emacs-lsp/lsp-mode/blob/master/lsp-methods.el#L657

@vibhavp for Emacs lsp-mode internal representation

@szatanjl

szatanjl commented Jan 19, 2018

I am sorry in advance if I am saying something stupid right now. I have a question for you guys.

My thought process is this: if there is a file in an encoding other than any UTF, and we use an encoding other than UTF for the JSON-RPC (which can happen in the future), then why would the client and server need to know anything about Unicode at all?

Of the client implementations of ycmd (there are about 8, I think), all of them are able to provide line-byte offsets.

That's it. It is easy to provide a line-byte offset. So why would it be better to use Unicode codepoints instead of bytes?

Let's say, for example, we have a file encoded in ISO-8859-1 and we use the same encoding for the JSON-RPC communication. There is a character ä (0xE4) that can be represented in at least two ways in Unicode: as U+00E4 (ä) or as U+0061 (a) followed by U+0308 (combining diaeresis). The former is one Unicode codepoint, the latter is two, and both are equally valid. If the client uses one and the server the other, we have a problem. Simply using line-byte offsets here, we would avoid these problems.

@dbaeumer I think we misunderstood each other, or at least I did. I didn't mean to use a byte offset from the beginning of the file, which would require the client to convert it, but to still use a {line, column} pair, just with the column counted in bytes instead of UTF-16 code units or Unicode codepoints.
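A small sketch of the ambiguity described above, using JavaScript string escapes (both forms are canonically equivalent):

const nfc = "\u00E4";  // precomposed ä (U+00E4)
const nfd = "a\u0308"; // a (U+0061) + combining diaeresis (U+0308)
console.log(nfc === nfd);                                   // false
console.log([...nfc].length, [...nfd].length);              // 1 2 (different codepoint counts)
console.log(nfc.normalize("NFC") === nfd.normalize("NFC")); // true (same text after normalization)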

@eonil

eonil commented Jan 22, 2018

We chose UTF-16 here since most languages store strings in memory as UTF-16, not UTF-8.

If we want to support UTF-8 for the internal text document representation and line offsets, this would be a breaking change, or it would need to be a capability the client announces.

Are you serious? UTF-16 was one of the worst choices of the old days, made for lack of alternatives. Now we have UTF-8, and to choose UTF-16 you need a really good reason, not a very brave assumption about the implementation details of every piece of software in the world, especially future software.

This assumption holds mostly on Microsoft platforms, which will never consider UTF-8. I think some bias toward Microsoft is unavoidable, as the leadership of this project is from Microsoft, but this is too much. It reminds me of the Embrace, extend, and extinguish strategy. If this is the case, it is reason enough for me to boycott LSP, because we are going to see this kind of Microsoft-ish nonsense decision-making forever.

@kdvolder

Just to be clear, I don't work for Microsoft, and generally haven't been a big fan of them (being a Linux user myself). But I feel compelled to defend the LSP/vscode team here. I really don't think there's a big conspiracy here. From where I stand, it looks to me like the VS Code and LSP teams are doing their very best to be inclusive and open.

The UTF-8 vs UTF-16 choice may seem like a big and important point to some, but to others, including myself, the choice probably seems somewhat arbitrary. For decisions like these, it is natural to write into the spec whatever conforms to your current prototype implementation, and I think this is perfectly reasonable.

Some may think that was a mistake. As this is an open spec, subject to change, revision, and discussion, everyone is free to voice their opinion and argue about which choice is right and whether it should be changed... but I think such discussions should stick to technical arguments; there's no need to resort to insinuations of a Microsoft conspiracy (moreover, those insinuations are really unwarranted here, in my opinion).

@eonil

eonil commented Jan 23, 2018

I apologize for injecting my political views into my comment. I was over-sensitive due to traumatic memories of Microsoft in the old days. I now see this spec is in progress and subject to change.

I didn't mention technical reasons because they are mainly repetitions of other people's opinions, or well known. Anyway, I list my technical reasons here.

  • IMO, UTF-8 is the present and future, and UTF-16 is a legacy to avoid. The reason is here.
  • By requiring a dependency on UTF-16, LSP effectively forces implementations to involve that legacy.
  • Simplicity is better than extra complexity and dependencies. One encoding everywhere is better.
  • More complexity and dependencies increase the implementation work a lot.
  • AFAIK, converting indices between different Unicode encodings is fairly expensive.
  • LSP is a new protocol. There is no reason to involve a bad legacy. The only benefit here is a potential benefit to specific platforms with native UTF-16 strings.
  • For now, the only reason to require UTF-16 is to give such a benefit to specific implementations.
  • Other platforms won't be very happy, due to increased complexity and a potential performance penalty in implementation.
  • Such an unfair benefit is likely to divide the community.

... or it would need to be a capability the client announces

I think this is fine: an optional field which designates the encoding mode of the indices, alongside the index numbers. If the encoding mode is set to utf-8, interpret the numbers as UTF-8 code units; if it is utf-16, interpret them as UTF-16 code units. If the field is missing, fall back to UTF-16 for legacy compatibility.
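A minimal sketch of what such a field might look like (the name offsetEncoding is borrowed from the extension clangd later shipped; none of this is part of the spec as it stands):

interface InitializeResultExtension {
	// 'utf-8': character offsets count UTF-8 code units (bytes).
	// 'utf-16' or absent: UTF-16 code units (legacy behavior).
	offsetEncoding?: 'utf-8' | 'utf-16';
}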

@sam-mccall
Contributor

This is causing us some implementation difficulty in clangd, which needs to interop with external indexes.
Using UTF-16 on network protocols is rare, so requiring indexes to provide UTF-16 column numbers is a major yak-shave and breaks abstractions.

@udoprog

udoprog commented Apr 13, 2018

Yup, same problem here working on reproto/reproto#34.

This would be straightforward if "Line/character can be measured in units of Unicode codepoints", as stated in the original description.

@dbaeumer
Member

As mentioned in one of my first comments, this needs to be backwards compatible if introduced. An idea would be:

  • the client announces the position encodings it supports.
  • the server picks an encoding to use.

If no common encoding can be found, the server will not function with the client. So in the end such a change would force clients to support the union of commonly used encodings. Given this, I am actually not sure the LSP server ecosystem will profit from such a change (a server using an encoding not widely adopted by clients is of limited use from an ecosystem perspective). On the other hand, we have only a limited number of clients compared to a large number of servers, so it might not be too difficult for the clients to adopt this.
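Sketched in the style of the existing capability interfaces, the negotiation could look roughly like this (the names match what eventually shipped in LSP 3.17, but at the time of this comment they were only a proposal):

interface ClientCapabilities {
	general?: {
		// Position encodings the client supports, in order of preference.
		positionEncodings?: ('utf-8' | 'utf-16' | 'utf-32')[];
	};
}

interface ServerCapabilities {
	// The encoding the server picked; 'utf-16' is assumed if omitted.
	positionEncoding?: 'utf-8' | 'utf-16' | 'utf-32';
}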

I would appreciate a PR for this that for example does the following:

@dbaeumer dbaeumer added the help wanted Issues identified as good community contribution opportunities label Apr 27, 2018
@jclc

jclc commented May 13, 2018

What about using byte indices directly? Using codepoints still requires going through every single character.

@udoprog

udoprog commented May 13, 2018

@jclc using byte indices is not a bad idea, but I want to outline the implications of such a choice:

Either servers or clients need to communicate which encoding ranges are sent in, and one of them needs to adapt to the other's requirements. Since clients are less numerous, it would seem the more economical choice for this responsibility to fall on them.
To be backwards compatible, the exchange has to be bi-directional. All servers have to support UTF-16 and fall back to it when the client indicates that this is its only capability, at least until a new major revision of the LSP has been widely adopted and the old one deprecated.

Using codepoints still requires to go through every single character.

This depends a bit on the language, but rows are generally unambiguous. They can be stored in such a way that we don't have to decode all the characters up to a given row (e.g. when using a specialized rope structure). With this approach we only have to decode the content of the addressed rows. Some transcoding work will happen unless the internal encodings of both server and client match.

Edit: The reason I prefer codepoints over bytes is that they are inherently unambiguous. Every language dealing with Unicode must have a way to traverse strings and relate codepoint counts to indexes, regardless of which specific encodings are well supported.
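As a hedged sketch of that idea (helper names are illustrative): precompute line starts once, so only the addressed line needs per-codepoint work.

function lineStarts(text: string): number[] {
	const starts = [0];
	for (let i = 0; i < text.length; i++) {
		if (text[i] === "\n") starts.push(i + 1);
	}
	return starts;
}

// Convert a codepoint column within one line to a UTF-16 code unit column.
function codepointToUtf16Column(lineText: string, codepoints: number): number {
	let units = 0;
	for (const cp of lineText) { // for..of iterates by codepoint
		if (codepoints-- <= 0) break;
		units += cp.length;      // 1 or 2 UTF-16 code units per codepoint
	}
	return units;
}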

@eonil

eonil commented May 14, 2018

I think all of these problems arise from the lack of a precise definition of "character" in the LSP spec. The term "character" is used everywhere in the spec, but it is never actually well-defined on its own.

Anyway, the LSP3 spec defines "character offset" as a UTF-16 code unit, which means it implicitly defines the term "character" as a UTF-16 code unit as well. This is (1) nonsense, as a Unicode code unit is not intended to be a character, and (2) inconsistent with the other parts of the protocol, which are UTF-8 based.

In my opinion, the first thing we have to do is define the term "character" precisely, or replace it with something else. The lack of a precise definition of "character" increases ambiguity and the potential for bugs.


As far as I know, Unicode defines three relevant units of text:

  • Code Unit
  • Code Point
  • Grapheme Cluster

And the closest concept to a human's perceived "character" is the grapheme cluster, as it counts glyphs rather than code.

As @udoprog pointed out, the transcoding cost is negligible, so accept the cost and choose the logically ideal unit: grapheme cluster counting. This is better than code points and less ambiguous, in my opinion.

Furthermore, the grapheme cluster count is very likely already tracked by code editors in order to provide precise line/column (or character offset) information to end users. Tracking the grapheme cluster count wouldn't be a problem for them.

There would be two distinct position/offset counting modes: (1) legacy-compatible and (2) grapheme-cluster counting.

  • legacy-compatible mode is the same as today: it defines "character" as a UTF-16 code unit.
  • grapheme-cluster-counting mode defines "character" as a grapheme cluster and uses the grapheme cluster count as the position offset.

In LSP3, servers should support both the legacy-compatible (deprecated but default) and grapheme-cluster-counting modes.
In LSP4, grapheme-cluster counting would be the only counting method.


If grapheme cluster counting is unacceptable, UTF-8 code unit counting (i.e. the encoded byte count) could be considered instead. The character offset becomes an irregular index, but it would be consistent with the rest of the spec.
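To illustrate how the three units can disagree, a sketch using Intl.Segmenter (assuming a runtime that provides it; as noted in the reply below, the result can vary with the Unicode version):

const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
const text = "e\u0301\u{1F1FA}\u{1F1F3}"; // "e" + combining accent, then a regional-indicator flag
console.log([...seg.segment(text)].length); // 2 (grapheme clusters)
console.log([...text].length);              // 4 (codepoints)
console.log(text.length);                   // 6 (UTF-16 code units)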

@udoprog

udoprog commented May 14, 2018

@eonil Regarding grapheme clusters:

The exact composition of clusters is permitted to vary across (human) languages and locales (tailored grapheme clusters). They also naturally vary from one revision of the Unicode spec to another as new clusters are added. Finally, iteration over grapheme clusters is not commonly found in standard libraries, in my experience.

@eonil

eonil commented May 14, 2018

@udoprog I see. If grapheme clusters are unstable across implementations, I think they should not be used.

@hrsh7th

hrsh7th commented Apr 1, 2022

Hi, thank you for your great work.

As one of the client implementers, I think it would be easier to implement if the encoding were carried in the form of Position.encoding rather than advertised as a ServerCapability.

This is just one opinion, not a strong request.

@puremourning

Unless I'm missing something, Position.encoding would be transferred on every message in a number of places (redundantly) and would make it possible for the encoding to differ between positions in the same message, so I think (as another client author...) I'd prefer a capability, fixing it for the length of the LSP session.

@hgs3

hgs3 commented Apr 1, 2022

Thank you for moving this issue forward.

I didn't limit this to UTF-8 so we can start experimenting with different solutions as well (e.g. code points), although those are not encodings (but I couldn't come up with a better name).

I'm glad to hear code points are still under consideration. You are correct that they are not an encoding, but rather what is being encoded. As far as naming goes, this new property basically quantifies the atomic unit of text during communication, so a name that conveys that would work. Alternatively, you could keep the name positionEncoding but come up with a generic-sounding value so it still makes sense, e.g. "unicode" or the like:

positionEncoding?: 'utf-8' | 'utf-16' | 'unicode' | string;

@dbaeumer
Member

dbaeumer commented Apr 4, 2022

@hrsh7th as @puremourning correctly pointed out, this makes things more complicated, since the encoding could change from request to request. So I am actually against it.

@michaelmesser
Contributor

michaelmesser commented Apr 4, 2022

Aren't code points just UTF-32 code units? Why would we need another name?

@hrsh7th

hrsh7th commented Apr 4, 2022

I think actual clients often have functions like the following:

function getRangeOfText(range: LSP.Range): string[] {
	...
}

What I wanted to say was that passing the information about which server created a Position into every such implementation would require a lot of changes.
However, there is no doubt that the payload would be redundant. So it's not a strong opinion.

@dbaeumer
Member

dbaeumer commented Apr 4, 2022

I will add utf-32 as well.

@hgs3

hgs3 commented Apr 4, 2022

Aren't code points just UTF-32 code units? Why would we need another name?

@michaelmesser A Unicode code point is a 21-bit integer, and UTF-32 is a standardization of using a 32-bit integer as its storage mechanism. If they wanted to, the Unicode Consortium could introduce UTF-24 (a fixed-width, 3-byte/24-bit encoding). Even today, without it being formally standardized, a server could choose to use UTF-24. The "issue" is that if the language server spec, now or in the future, ever transmits a byte offset, that offset would be 4-byte aligned, whereas a server implementing UTF-24 would place code points at 3-byte boundaries. As long as the LSP maintainers guarantee never to transmit a byte offset under any circumstance, this mismatch won't ever occur, and you're correct that the UTF-32 code unit offset would be equivalent to the code point offset.

@digama0

digama0 commented Apr 4, 2022

3-byte aligned

That's not a thing. Alignment is a power of 2.

@michaelpj
Contributor

Aren't code points just UTF-32 code units? Why would we need another name?

Because it's confusing and requires some smarts on the part of the server implementor.

A nice thing about code points is that it doesn't matter which Unicode encoding you use for your document: you can always index by code points (perhaps inefficiently) without having to re-encode. So if I see that I'm being sent code points, I know: great, I don't have to worry about which text encoding I've chosen.

On the other hand, if I see that I'm being sent UTF-32 code units I might think that I need to re-encode my document as UTF-32 in order to handle that position! This isn't true: you just interpret them as code points in your existing document and you're good to go, but it's confusing and isn't saying what you mean.

I would much prefer to have code points as an explicit option.
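As a sketch of why codepoint positions need no re-encoding, even against a raw UTF-8 buffer: in UTF-8, every byte that is not of the form 10xxxxxx starts a new codepoint, so codepoints can be counted in a single pass (hypothetical helper):

function codepointIndexToByteOffset(utf8: Uint8Array, index: number): number {
	let seen = 0;
	for (let i = 0; i < utf8.length; i++) {
		if ((utf8[i] & 0xc0) !== 0x80) { // lead byte: starts a new codepoint
			if (seen === index) return i;
			seen++;
		}
	}
	return utf8.length; // index points past the last codepoint
}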

@nemethf

nemethf commented Apr 9, 2022

@dbaeumer, would it be possible to extend the current text with a clarifying explanation of the different encodings? The messages above show it is not always easy to understand the connection between an encoding and a column position. Thank you.

This is what clangd has:

Well-known encodings are:

    utf-8: character counts bytes
    utf-16: character counts code units
    utf-32: character counts codepoints
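A compact way to see all three side by side (a hedged sketch; TextEncoder is available in browsers and Node):

function characterCounts(lineText: string) {
	return {
		'utf-8': new TextEncoder().encode(lineText).length, // bytes
		'utf-16': lineText.length,                          // UTF-16 code units
		'utf-32': [...lineText].length,                     // codepoints
	};
}

console.log(characterCounts("a𝕊b")); // { 'utf-8': 6, 'utf-16': 4, 'utf-32': 3 }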

@michaelpj
Contributor

@nemethf I added some explanations in this PR: #1442

I also included a short note pointing out the coincidence between UTF-32 code units and code points, so even if we don't add code points as an explicit option, this may make the situation clearer.

@dbaeumer
Member

Added in 3.17, which shipped today.

@eric-wieser

Am I right in saying that the vscode client does not actually support anything other than UTF-16, even though the protocol can now carry this information? I was able to find:

https://github.com/microsoft/vscode-languageserver-node/blob/8b3c375adc4fa69f63726ef58d911c8e27b261c5/client/src/common/client.ts#L1120-L1122

@soc

soc commented May 18, 2022

@eric-wieser It's going to support it real fast as soon as enough servers decide not to do the UTF-16 dance.

@digama0

digama0 commented May 18, 2022

There is a bit of a chicken-and-egg problem there, though. Servers have no motivation to implement UTF-8 support if it doesn't lead to any improvement for users because the client doesn't support it, and they can't test it anyway, so it would be irresponsible to ship. Of course, VS Code isn't the only client out there, but I generally expect it to be the first adopter of new LSP features, since LSP and VS Code are developed by more or less the same people.

There should probably be another issue opened to track implementation of UTF-8 support in VS Code.

@dbaeumer
Member

@digama0 as I tried to point out, the conversion should, when possible, be done where the file lives (see #376 (comment)). Doing this generically on the client would have bad performance implications, since it would require the client to read the file to do the conversion. This could be many files, in the case of a reference result or when reporting many diagnostics.

@eric-wieser

I'm a little confused here; it seems to me that vscode stores editor contents as codepoints internally, as

https://github.com/microsoft/vscode/blob/bd95a8d4516d590256eefecfce465a2f10e8f5e5/src/vs/workbench/browser/parts/editor/editorStatus.ts#L613-L615

seems to print out the correct column number for inputs for which the extension API would return the wrong position.

@michaelpj
Contributor

Supporting code points (i.e. UTF-32 code units) at least would be super, since that would give us a format that doesn't require anyone to do re-encoding.
