Test: encoding auto detection #23322

bpasero · 2017-03-28T00:30:15Z

Windows - @gregvanl
Linux - @roblourens
Mac - @joaomoreno

We are now using the jschardet library to try to guess the encoding from the contents of a file. As a preparation for testing, try to get some text files in the supported encodings. There is a shift-jis file checked in that is known to work but other encodings would also be interesting.

From @joaomoreno

I've created this repository with several encoding samples:

https://github.com/joaomoreno/encodings

There are two scenarios to test:

From the new files.autoGuessEncoding setting:

verify that the files.encoding setting is still being used as long as files.autoGuessEncoding is not enabled
verify that the encoding is detected from the file where possible once files.autoGuessEncoding is enabled; try files whose encoding differs from the configured workspace encoding (e.g. shift-jis)
verify that the configured encoding is used (or utf8 by default) if detection does not return a more specific encoding (in particular, opening an ASCII file should show you utf8)
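
The three checks above amount to a small decision rule. A minimal sketch of that rule, assuming hypothetical names (`EncodingSettings`, `resolveEncoding`) and assuming that a plain `ascii` guess counts as "nothing more specific"; this is not VS Code's actual implementation:

```typescript
// Hypothetical shape mirroring the two settings named in the test plan.
interface EncodingSettings {
	encoding?: string;          // files.encoding; utf8 is used when unset
	autoGuessEncoding: boolean; // files.autoGuessEncoding
}

// `guessed` stands in for whatever the detector returned, or null when it
// could not determine an encoding at all.
function resolveEncoding(settings: EncodingSettings, guessed: string | null): string {
	const configured = settings.encoding || 'utf8';
	if (!settings.autoGuessEncoding) {
		return configured; // guessing disabled: files.encoding wins
	}
	if (guessed === null || guessed === 'ascii') {
		return configured; // assumption: ASCII is not treated as a more specific result
	}
	return guessed;
}
```

With `files.encoding` unset and guessing enabled, an ASCII file resolves to `utf8`, which matches the last check in the list.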

From the encoding selector picker:

we now guess the encoding from the currently active file as soon as you open the encoding picker (either when choosing "Reopen with encoding" or "Save with encoding")
verify that when the encoding is detected and it differs from the configured encoding, you see it showing up first
verify that you do not see the hint about the guessed encoding when the configured encoding matches the detected one (e.g. you set files.encoding to shift-jis)
verify you do not end up with duplicate encodings (guessed, and from the list)
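
The picker expectations above can be sketched as a function that builds the list of entries. The names (`PickerItem`, `buildPickerItems`) and the hint text are hypothetical and only illustrate the ordering and de-duplication being tested:

```typescript
interface PickerItem {
	id: string;
	description?: string; // set only on the guessed entry
}

function buildPickerItems(supported: string[], configured: string, guessed: string | null): PickerItem[] {
	// Show the guess as a separate first entry only when it differs from the configured encoding.
	const hint = guessed !== null && guessed !== configured ? guessed : null;
	const items: PickerItem[] = [];
	if (hint !== null) {
		items.push({ id: hint, description: 'Guessed from content' });
	}
	for (const id of supported) {
		if (id !== hint) { // skip the entry already listed first, so there are no duplicates
			items.push({ id });
		}
	}
	return items;
}
```

When the configured encoding equals the guess, `hint` stays `null` and the list is returned unchanged, without the hint entry.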


joaomoreno · 2017-03-28T13:00:16Z

I've created this repository with several encoding samples:

https://github.com/joaomoreno/encodings

joaomoreno · 2017-03-28T13:06:53Z

@bpasero Please put a complexity on your test plan items in the future, this is quite extensive. cc @isidorn

eamodio · 2017-03-28T13:22:55Z

Will these changes also address #21146? Or is there still more to be done to get this into the diff editor?

duanehutchins · 2017-03-28T16:14:01Z

There is still a persistent issue where VSCode will ignore the current file encoding in favor of the files.encoding setting or the guess when opening an existing file, even when the files.encoding setting isn't the right file encoding.

Example: I create a new Shift-JIS encoded file with corresponding characters, save it with Shift-JIS encoding, close it, and open it again. VSCode opens the file as UTF-8. It happens with other encodings as well, such as Latin-1 (ISO-8859-1). When I reopen an encoded file, it will sometimes pick the wrong encoding instead of using the existing encoding.

Here is a gif of the bug in action:

TXT files referenced: iso-8859-1.txt and shift-jis.txt

You can see how the files are encoded as Shift-JIS and ISO-8859-1, but when I reopen the files, they are opened as Windows 1252 and UTF-8 encoding respectively, and this breaks some characters. If I then save the file, it is saved with the wrong encoding and the characters remain broken.

bpasero · 2017-03-28T20:32:10Z

@duanehutchins that is why it is called "guessing". There is no such thing as detection with 100% certainty. I suggest you report your samples to the library we are using: jschardet.
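
To illustrate why this is a guess rather than a detection: a detector such as jschardet reports an encoding name together with a confidence value, and the caller has to decide when to trust it. The sketch below is only an assumption about how such a cut-off could look; `guessOrFallback`, the injected `Detector` and the 0.8 threshold are hypothetical and not taken from VS Code:

```typescript
// Shape of a detection result: an encoding name plus a confidence between 0 and 1.
interface Guess {
	encoding: string | null;
	confidence: number;
}

// The detector is injected so the sketch stays self-contained.
type Detector = (bytes: Uint8Array) => Guess;

function guessOrFallback(bytes: Uint8Array, detect: Detector, fallback: string, minConfidence: number = 0.8): string {
	const guess = detect(bytes);
	if (guess.encoding !== null && guess.confidence >= minConfidence) {
		return guess.encoding;
	}
	return fallback; // low confidence or no result: keep the configured encoding
}
```

Any threshold trades wrong guesses against missed ones, which is why samples that are guessed incorrectly are best reported to the detection library.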

roblourens · 2017-03-28T21:43:46Z

Hopefully it isn't a problem that this doesn't apply to search. files.encoding is applied but ripgrep doesn't autodetect anything besides utf-16.

buzzzzer · 2017-03-29T06:28:38Z

1. Some jschardet charset names do not correspond to vscode charset aliases

For example:
jschardet: IBM866
vscode: CP 866

test sample:
test_cp866.txt

2. Autodetection does not work on search

The fix will likely be in searchworker.ts:

private readlinesAsync(filename: string, perLineCallback: (line: string, lineNumber: number) => void, options: ReadLinesOptions): TPromise<void> {
...
...
// Check for BOM offset
switch (mimeAndEncoding.encoding) {
	case UTF8:
		pos = i = bomLength(UTF8);
		options.encoding = UTF8;
		break;
	case UTF16be:
		pos = i = bomLength(UTF16be);
		options.encoding = UTF16be;
		break;
	case UTF16le:
		pos = i = bomLength(UTF16le);
		options.encoding = UTF16le;
		break;
// fix here
	default:
		if (mimeAndEncoding.encoding) {
			pos = i = 1;
			options.encoding = mimeAndEncoding.encoding;
		}
		break;
}
....
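
For reference, the BOM check that the snippet above switches on can be sketched on its own. This is a standalone illustration, not the code from searchworker.ts; the `detectBom` name and return shape are hypothetical, while the byte sequences are the standard UTF-8 and UTF-16 byte order marks:

```typescript
type BomEncoding = 'utf8' | 'utf16be' | 'utf16le';

// Returns the encoding implied by a leading byte order mark and the number
// of bytes to skip, or null when the buffer does not start with a known BOM.
function detectBom(bytes: Uint8Array): { encoding: BomEncoding; length: number } | null {
	if (bytes.length >= 3 && bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
		return { encoding: 'utf8', length: 3 };
	}
	if (bytes.length >= 2 && bytes[0] === 0xFE && bytes[1] === 0xFF) {
		return { encoding: 'utf16be', length: 2 };
	}
	if (bytes.length >= 2 && bytes[0] === 0xFF && bytes[1] === 0xFE) {
		return { encoding: 'utf16le', length: 2 };
	}
	return null;
}
```

Files in encodings such as cp866 or windows1251 carry no BOM, so they fall through to `null`; that is the case the proposed `default` branch tries to cover with a guessed encoding.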

bpasero · 2017-03-29T13:18:54Z

@buzzzzer I fixed that case with CP866, if there are more cases, let me know.
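
A mismatch like the IBM866 / CP 866 one reported earlier in the thread is typically handled by normalizing the detector's name before looking it up. The following is only a sketch of that idea; the map contents and the `normalizeDetectedEncoding` name are hypothetical, and the actual fix is in the referenced commit:

```typescript
// Hypothetical alias table: normalized detector name -> editor encoding id.
const DETECTED_TO_EDITOR: { [name: string]: string } = {
	ibm866: 'cp866'
};

function normalizeDetectedEncoding(detected: string): string {
	// Lowercase and drop everything except letters and digits,
	// so "IBM866", "ibm-866" and "Ibm_866" all produce the same key.
	const key = detected.toLowerCase().replace(/[^a-z0-9]/g, '');
	return Object.prototype.hasOwnProperty.call(DETECTED_TO_EDITOR, key) ? DETECTED_TO_EDITOR[key] : key;
}
```

Names without an entry in the table pass through in normalized form, e.g. `SHIFT_JIS` becomes `shiftjis`.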

Encoding detection is only for files, not for search right now.

buzzzzer · 2017-03-29T13:33:07Z

@bpasero

Encoding detection is only for files, not for search right now.

It's a pity.
The fix is simple, yet we will have to wait another year for it :(

bpasero · 2017-03-29T13:39:45Z

Sometimes simple things need years to get right.

roblourens · 2017-03-29T16:55:16Z

@buzzzzer Search is provided by ripgrep in 1.11, not searchWorker.ts, and ripgrep only does encoding autodetection for utf-8/16. You can still set files.encoding to search in those files.

bpasero · 2017-03-29T17:05:48Z

@roblourens so to be clear, ripgrep allows setting the encoding, but only for all files it goes through, not per particular file/folder?

roblourens · 2017-03-29T17:19:47Z

Yes, pre-ripgrep search worked the same way
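
To make the limitation concrete: ripgrep takes a single `--encoding` (`-E`) value per invocation, so one search run applies the same encoding to every file it visits. A minimal sketch of building such an argument list; `buildRipgrepArgs` is a hypothetical helper and not how VS Code actually assembles its ripgrep command line:

```typescript
// Builds an argument list for one ripgrep run. The encoding, when given,
// applies to all files searched in that run.
function buildRipgrepArgs(pattern: string, folder: string, encoding?: string): string[] {
	const args: string[] = [];
	if (encoding) {
		args.push('--encoding', encoding);
	}
	args.push('--regexp', pattern, '--', folder);
	return args;
}
```

The encoding label has to be one ripgrep understands (for example `shift_jis` or `ibm866`); translating a files.encoding value into such a label is left out of this sketch.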

buzzzzer · 2017-04-06T08:26:59Z

@joaomoreno

You can still set files.encoding to search in those files.

And what should I set files.encoding to
if in one workspace I have 4 different encodings (cp866, windows1251, utf8 & utf16)?

buzzzzer · 2017-04-11T05:43:07Z

@roblourens, @bpasero

Yes, pre-ripgrep search worked the same way

I have been using pre-ripgrep search (searchworker) with autodetect for about 4 months, and that works for me.

Are there any plans to support files.autoGuessEncoding with ripgrep search?

Will pre-ripgrep search be retained or removed in future versions?

roblourens · 2017-04-11T06:19:59Z

Encoding autodetect has only existed for less than a month, and doesn't apply to non-ripgrep search, so I'm not sure how that was any different for you.

I think it's unlikely that ripgrep would get fancier encoding detection since that will slow it down, and their focus is entirely on speed.

I'll keep the pre-ripgrep search in the short term. If there is a use case that just can't be handled by ripgrep, I might keep it longer.

buzzzzer · 2017-04-11T06:25:45Z

@roblourens

Encoding autodetect has only existed for less than a month, and doesn't apply to non-ripgrep search, so I'm not sure how that was any different for you.

I build vscode from source with a modification for search (#10013) for every stable build.
Now I do not know what I should do once ripgrep is integrated.
I really need search across different encodings :(

phobos2077 · 2017-10-28T16:52:55Z

Auto-detection doesn't seem to work for "windows1251" encoding. It thinks that it is UTF-8...

bpasero added the testplan-item label Mar 28, 2017

bpasero added this to the March 2017 milestone Mar 28, 2017

bpasero assigned bpasero and unassigned bpasero Mar 28, 2017

bpasero mentioned this issue Mar 28, 2017

Automatically detect encoding of the file opened #5388

Closed

isidorn assigned joaomoreno, roblourens and gregvanl Mar 28, 2017

joaomoreno mentioned this issue Mar 28, 2017

Enable files.autoGuessEncoding by default #23417

Closed

joaomoreno mentioned this issue Mar 28, 2017

Have contextual notification in encoding status bar item #23428

Closed

joaomoreno removed their assignment Mar 28, 2017

gregvanl removed their assignment Mar 28, 2017

roblourens removed their assignment Mar 28, 2017

roblourens closed this as completed Mar 28, 2017

roblourens mentioned this issue Mar 28, 2017

Encoding autoguess - weird case #23508

Closed

bpasero added a commit that referenced this issue Mar 29, 2017

encoding normalization (reported in #23322)

56d3f11

vscodebot bot locked and limited conversation to collaborators Nov 17, 2017
