
Add option to allow column position report in characters #45

Closed
xpol opened this issue Jan 5, 2016 · 12 comments

xpol (Contributor) commented Jan 5, 2016

Currently, column positions are reported in bytes.

E.g.:

local hand = {['']={}}
print(hand[''][i])

luacheck --ranges test.lua reports the following warning:

spec/Hand_spec.lua:2:19-19: accessing undefined variable i

In Atom, linter-luacheck displays:

(screenshot)

The problem is that the key has 3 bytes in UTF-8, which causes a 2-character offset in Atom.

Is it possible to add an option like --characters to luacheck to report column positions in characters rather than bytes? Maybe a related option such as --encoding is needed (or having --characters treat input as UTF-8 by default is enough).
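To make the mismatch concrete, here is a minimal sketch (in Python, since the arithmetic is language-agnostic) using '€' (3 bytes in UTF-8) as a hypothetical stand-in for the multi-byte key in the example above:

```python
# '€' (U+20AC) is a hypothetical stand-in for the 3-byte UTF-8 key above.
line = "print(hand['€'][i])"

# Character column of the variable `i` (1-based): index of "[i]",
# +1 to step over '[', +1 to convert to a 1-based column.
char_col = line.index("[i]") + 2

# Byte column of `i` (1-based), which is what a byte-based report gives.
byte_col = line.encode("utf-8").index(b"[i]") + 2

print(char_col, byte_col)  # the 3-byte '€' shifts the byte column by 2
```

The byte column comes out 2 higher than the character column, matching the offset seen in Atom.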

@xpol xpol changed the title Should column position report in characters or bytes? Add option to allow column position report in characters Jan 5, 2016
mpeterv (Owner) commented Jan 5, 2016

I don't really want to implement any encodings other than utf8, so I prefer --characters (maybe call it --utf8 then).

xpol (Contributor, Author) commented Jan 5, 2016

I personally don't use Lua files encoded in anything other than UTF-8.

And I'd prefer --characters or --count-in-characters rather than --utf8 😄.

mpeterv (Owner) commented Feb 17, 2016

I think it's OK to just always use UTF8, without any options. Then this feature won't have strange interactions with caching, and can be implemented right away.

mpeterv (Owner) commented Aug 27, 2017

There is also another use case: there is value in being able to check that input files are purely ASCII. E.g. it ensures that there are no comments in some other alphabet in a codebase that's supposed to use purely English comments. So there should be an option to use an ascii encoding. Then it makes sense to have an --encoding option with possible values being:

  • latin: current behaviour, one byte = one character.
  • ascii: current behaviour for bytes in the 0..127 range, error if there is a byte in the 128..255 range.
  • utf8: decode input as UTF-8, report everything in characters, error on invalid byte sequences.

An option to use UTF8 but ignore decoding errors does not seem to be very useful.
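A minimal sketch of how those three modes might behave, with hypothetical names chosen here for illustration (this is not luacheck's actual code):

```python
def line_columns(raw: bytes, encoding: str) -> int:
    """Number of columns reported for one line, under each proposed mode."""
    if encoding == "latin":
        # Current behaviour: one byte = one character.
        return len(raw)
    if encoding == "ascii":
        # Same as latin for bytes 0..127; any byte in 128..255 is an error.
        if any(b > 127 for b in raw):
            raise ValueError("non-ASCII byte in ascii mode")
        return len(raw)
    if encoding == "utf8":
        # Count code points; invalid sequences raise UnicodeDecodeError.
        return len(raw.decode("utf-8"))
    raise ValueError("unknown encoding: " + encoding)
```

For example, the bytes of "héllo" count as 6 columns under latin but 5 under utf8, and are rejected outright under ascii.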

johnd0e commented Aug 30, 2017

> An option to use UTF8 but ignore decoding errors does not seem to be very useful.

Maybe it would be useful to have another option: try UTF-8, but fall back to latin in case of errors.

Or another option: detect UTF-8 by BOM.

It would come in handy if we have to check a mixed set of files.
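The fallback idea could be sketched like this (a hypothetical helper, not part of luacheck; latin-1 is used because it maps every byte to exactly one character, so it never fails):

```python
def decode_with_fallback(raw: bytes) -> str:
    """Try UTF-8 first; on invalid input, fall back to latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")
```

Valid UTF-8 input decodes normally; a stray high byte simply becomes one column instead of aborting the check.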

mpeterv (Owner) commented Aug 30, 2017

It will be possible to set the encoding per file in the config. Autodetection can be implemented separately as --encoding=auto, which may then become the default if it works well enough.

Alloyed commented Sep 4, 2017

I've got a particularly fun job with respect to this one.

Visual Studio Code, and by extension anybody who wants to implement an LSP server, uses the word "character" to mean UTF-16 code units. I don't realistically expect you to implement this, but it means that to match the spec I need to go from bytes -> Unicode code points -> UTF-16 code units.

I am doing this using the Lua API, so I prefer working in bytes so that the built-in string.sub() works. I don't mind an opt-in choice to use codepoints, but there are enough ambiguous definitions of the word "character" that I'd rather use the term code point in the docs.
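The bytes -> code points -> UTF-16 code units conversion described above can be sketched like this (a hypothetical helper in Python; the same logic applies in Lua):

```python
def byte_to_utf16_col(line: bytes, byte_offset: int) -> int:
    """Convert a 0-based byte offset into a UTF-8 line to the 0-based
    UTF-16 code-unit offset that LSP positions expect."""
    prefix = line[:byte_offset].decode("utf-8")
    # Each code point is one UTF-16 code unit, or two if it lies outside
    # the Basic Multilingual Plane (i.e. it needs a surrogate pair).
    return sum(2 if ord(c) > 0xFFFF else 1 for c in prefix)
```

For example, '😀' (U+1F600) is 4 bytes in UTF-8 but 2 UTF-16 code units, while '€' (U+20AC) is 3 bytes but only 1 code unit.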

mpeterv (Owner) commented Sep 4, 2017

@Alloyed weird cases like that could be implemented as a separate encoding in luacheck, or maybe luacheck could load custom encodings from external modules, like with formatters.

And yes, ideally columns should be reported in grapheme clusters and not just code points. Grapheme clusters seem to be pretty clearly defined in the unicode spec.

mpeterv (Owner) commented Sep 6, 2018

Just a bit late, but on the unicode branch, column and line-length warnings are now in UTF-8 codepoints (with a fallback to the old behaviour on decoding errors).

mpeterv (Owner) commented Sep 6, 2018

@Alloyed could you clarify the LSP issue? I'm not sure what "Visual Studio Code ... uses the word 'character' to mean utf16 code units" means. UTF-16 encoded text can't be syntactically correct Lua.

Alloyed commented Sep 7, 2018

VSCode stores its text buffers internally as UTF-16, even when the file on disk is saved as UTF-8.

This means (by intention or by accident) that when the LSP protocol wants to communicate a position in a text file, it sends it as {line number, UTF-16 code unit offset}, even when the actual encoding of the text being sent back and forth is UTF-8.

See
microsoft/language-server-protocol#376
for more info, and
https://github.com/Alloyed/lua-lsp/blob/master/lua-lsp/methods.lua#L292
for what this means in code.

mpeterv (Owner) commented Sep 8, 2018

Unicode support has been merged into master, so I'm closing this; a new release is coming soon. @Alloyed, if there are any issues with this, please open a new one.

@mpeterv mpeterv closed this as completed Sep 8, 2018