Test: encoding auto detection #23322

Closed
3 tasks done
bpasero opened this issue Mar 28, 2017 · 18 comments

Comments

@bpasero
Member

bpasero commented Mar 28, 2017

Refs: #5388

We are now using the jschardet library to try to guess the encoding from the contents of a file. As a preparation for testing, try to get some text files in the supported encodings. There is a shift-jis file checked in that is known to work but other encodings would also be interesting.


From @joaomoreno

I've created this repository with several encoding samples:

https://github.com/joaomoreno/encodings


There are two scenarios to test:

From the new files.autoGuessEncoding setting:

  • verify that the files.encoding setting is still being used as long as files.autoGuessEncoding is not enabled
  • verify that encoding is detected from the file if possible once files.autoGuessEncoding is enabled and you try with files that use a specific encoding not set as configured workspace encoding (e.g. shift-jis)
  • verify that the configured encoding is used (or utf8 by default) if the detection is not returning any more specialized encoding (in particular, opening an ASCII file should show you utf8)
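The fallback order described in these three checks can be sketched roughly as follows. This is only an illustration: `resolveEncoding`, `EncodingSettings` and the idea of treating an `ascii` guess as "nothing more specialized" are assumptions made for the example, not names or logic taken from the VS Code sources.

```typescript
// Hypothetical sketch of the fallback order; not the actual VS Code implementation.
interface EncodingSettings {
    encoding?: string;          // value of files.encoding, if configured
    autoGuessEncoding: boolean; // value of files.autoGuessEncoding
}

const DEFAULT_ENCODING = 'utf8';

function resolveEncoding(settings: EncodingSettings, guessed: string | null): string {
    const configured = settings.encoding || DEFAULT_ENCODING;

    // Guessing disabled: files.encoding (or the default) always wins.
    if (!settings.autoGuessEncoding) {
        return configured;
    }

    // Nothing detected: fall back to the configured encoding.
    if (!guessed) {
        return configured;
    }

    // Assumption: a plain ASCII guess is not "more specialized" than the
    // configured encoding, so an ASCII file still shows up as utf8 by default.
    const normalized = guessed.toLowerCase();
    if (normalized === 'ascii') {
        return configured;
    }

    return normalized;
}
```

With guessing enabled and `files.encoding` left unset, a detected `SHIFT_JIS` would be used for the file, while an `ascii` result would fall back to `utf8`.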

From the encoding selector picker:

  • we now guess the encoding from the currently active file as soon as you open the encoding picker (either when choosing "Reopen with encoding" or "Save with encoding")
  • verify that when the encoding is detected and it differs from the configured encoding, you see it showing up first
  • verify that you do not see the hint about the guessed encoding when the configured encoding matches the detected one (e.g. you set files.encoding to shift-jis)
  • verify you do not end up with duplicate encodings (guessed, and from the list)
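The picker behaviour listed above (guessed encoding first, no hint when it matches the configured one, no duplicates) could look something like this. `buildPickerEntries` and `PickerEntry` are hypothetical names, and the rule that a guess is only shown when it is also in the supported list is an assumption of this sketch.

```typescript
// Hypothetical sketch of how the picker entries might be ordered.
interface PickerEntry {
    id: string;
    description?: string;
}

function buildPickerEntries(supported: string[], configured: string, guessed: string | null): PickerEntry[] {
    const entries: PickerEntry[] = [];

    // Show the guess as a hint only if it differs from the configured encoding.
    // Assumption: guesses outside the supported list are ignored.
    const hint = guessed !== null && guessed !== configured && supported.indexOf(guessed) >= 0
        ? guessed
        : null;

    if (hint !== null) {
        entries.push({ id: hint, description: 'Guessed from content' });
    }

    // Add the regular list, skipping the entry already shown as the hint.
    for (const id of supported) {
        if (id !== hint) {
            entries.push({ id });
        }
    }

    return entries;
}
```

Either way the list keeps the same length as the supported set, which is the "no duplicates" condition from the last check.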

image

@joaomoreno
Member

I've created this repository with several encoding samples:

https://github.com/joaomoreno/encodings

@joaomoreno
Member

joaomoreno commented Mar 28, 2017

@bpasero Please put a complexity on your test plan items in the future, this is quite extensive. cc @isidorn

@eamodio
Contributor

eamodio commented Mar 28, 2017

Will these changes also address #21146? Or is there still more to be done to get this into the diff editor?

@duanehutchins

duanehutchins commented Mar 28, 2017

There is still a persistent issue where VSCode will ignore the current file encoding in favor of the files.encoding setting or the guess when opening an existing file, even when the files.encoding setting isn't the right file encoding.

Example: I create a new Shift-JIS encoded file with corresponding characters, save it with Shift-JIS encoding, close it, and open it again. VSCode opens the file as UTF-8. It happens with other encodings as well, such as Latin-1 (ISO-8859-1). When I reopen an encoded file, it will sometimes pick the wrong encoding instead of using the existing encoding.

Here is a gif of the bug in action:
vscode-fileencode
TXT files referenced: iso-8859-1.txt and shift-jis.txt

You can see how the files are encoded as Shift-JIS and ISO-8859-1, but when I reopen the files, they are opened as Windows 1252 and UTF-8 encoding respectively, and this breaks some characters. If I then save the file, the encoding is saved wrong and the characters will remain broken.

@gregvanl gregvanl removed their assignment Mar 28, 2017
@bpasero
Member Author

bpasero commented Mar 28, 2017

@duanehutchins That is why it is called "guessing". There is no such thing as detection with 100% certainty. I suggest you report your samples to the library we are using: jschardet.

@roblourens roblourens removed their assignment Mar 28, 2017
@roblourens
Member

roblourens commented Mar 28, 2017

Hopefully it isn't a problem that this doesn't apply to search. files.encoding is applied but ripgrep doesn't autodetect anything besides utf-16.

@buzzzzer

buzzzzer commented Mar 29, 2017

1. Some jschardet charset names do not correspond to vscode charset aliases

For example:
jschardet: IBM866
vscode: CP 866
bug

test sample:
test_cp866.txt
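One way to bridge such naming differences is a small alias table applied to the detected name before it is matched against the editor's list. In the sketch below only the IBM866 / CP 866 pair comes from this report; the function name `toEditorEncoding`, the lowercase `cp866` identifier and the normalization rule are assumptions for illustration.

```typescript
// Hypothetical alias table; only the IBM866 -> CP 866 pair is taken from this thread.
const DETECTED_TO_EDITOR: { [name: string]: string } = {
    ibm866: 'cp866'
};

function toEditorEncoding(detected: string): string {
    // Normalize: lowercase and drop separators such as '-', '_' and spaces.
    const key = detected.toLowerCase().replace(/[^a-z0-9]/g, '');

    // Use the alias when one is known, otherwise keep the normalized name.
    return DETECTED_TO_EDITOR[key] || key;
}
```

Names without an entry in the table pass through in normalized form, so `windows-1251` becomes `windows1251`.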

2. Autodetection does not work on search

The fix will likely be in searchworker.ts:

private readlinesAsync(filename: string, perLineCallback: (line: string, lineNumber: number) => void, options: ReadLinesOptions): TPromise<void> {
...
...
// Check for BOM offset
switch (mimeAndEncoding.encoding) {
	case UTF8:
		pos = i = bomLength(UTF8);
		options.encoding = UTF8;
		break;
	case UTF16be:
		pos = i = bomLength(UTF16be);
		options.encoding = UTF16be;
		break;
	case UTF16le:
		pos = i = bomLength(UTF16le);
		options.encoding = UTF16le;
		break;
// fix here
	default:
		if (mimeAndEncoding.encoding) {
			pos = i = 1;
			options.encoding = mimeAndEncoding.encoding;
		}
		break;
}
....
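For context, the BOM check that the first three cases rely on can be sketched as a standalone function. This is a minimal illustration, not the searchWorker.ts code: `detectBom` and `BomInfo` are made-up names, and only the UTF-8 and UTF-16 byte order marks are handled.

```typescript
// Minimal BOM sniffing sketch; names are hypothetical.
interface BomInfo {
    encoding: 'utf8' | 'utf16be' | 'utf16le';
    length: number; // number of BOM bytes to skip before decoding
}

function detectBom(bytes: Uint8Array): BomInfo | null {
    // UTF-8 BOM: EF BB BF
    if (bytes.length >= 3 && bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
        return { encoding: 'utf8', length: 3 };
    }
    // UTF-16 big endian BOM: FE FF
    if (bytes.length >= 2 && bytes[0] === 0xFE && bytes[1] === 0xFF) {
        return { encoding: 'utf16be', length: 2 };
    }
    // UTF-16 little endian BOM: FF FE (UTF-32 is deliberately not handled here)
    if (bytes.length >= 2 && bytes[0] === 0xFF && bytes[1] === 0xFE) {
        return { encoding: 'utf16le', length: 2 };
    }
    // No BOM: the encoding has to come from a setting or from guessing.
    return null;
}
```

Files in encodings such as cp866 or Shift-JIS carry no BOM, so a check like this returns `null` for them, which is where the proposed default branch would come into play.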

@bpasero
Member Author

bpasero commented Mar 29, 2017

@buzzzzer I fixed that case with CP866, if there are more cases, let me know.

Encoding detection is only for files, not for search right now.

@buzzzzer

@bpasero

Encoding detection is only for files, not for search right now.

It's a pity.
The fix is simple, and now it will take another year of waiting :(

@bpasero
Member Author

bpasero commented Mar 29, 2017

Sometimes simple things need years to get right.

@roblourens
Member

@buzzzzer Search is provided by ripgrep in 1.11, not searchWorker.ts, and ripgrep only does encoding autodetection for utf-8/16. You can still set files.encoding to search in those files.

@bpasero
Member Author

bpasero commented Mar 29, 2017

@roblourens So to be clear, ripgrep allows setting the encoding, but only for all the files it goes through, not per particular file/folder?

@roblourens
Member

Yes, pre-ripgrep search worked the same way

@buzzzzer

buzzzzer commented Apr 6, 2017

@joaomoreno

You can still set files.encoding to search in those files.

And what should I set files.encoding to if I have 4 different encodings in one workspace (cp866, windows1251, utf8 & utf16)?

@buzzzzer

buzzzzer commented Apr 11, 2017

@roblourens, @bpasero

Yes, pre-ripgrep search worked the same way

I have been using pre-ripgrep search (searchworker) with autodetect for about 4 months, and that works for me.


Are there any plans to support autoGuessEncoding with ripgrep search?

Will pre-ripgrep search be retained or removed in future versions?

@roblourens
Member

Encoding autodetect has only existed for less than a month, and doesn't apply to non-ripgrep search, so I'm not sure how that was any different for you.

I think it's unlikely that ripgrep would get fancier encoding detection since that will slow it down, and their focus is entirely on speed.

I'll keep it in the short term. If there is a use case that just can't be handled by ripgrep, I might keep it longer.

@buzzzzer

buzzzzer commented Apr 11, 2017

@roblourens

Encoding autodetect has only existed for less than a month, and doesn't apply to non-ripgrep search, so I'm not sure how that was any different for you.

I build vscode from source for every stable release, with a modified (for search) version of #10013 applied.
Now I do not know what I should do after the ripgrep integration.
I really need search across different encodings :(

@phobos2077

Auto-detection doesn't seem to work for "windows1251" encoding. It thinks that it is UTF-8...

@vscodebot vscodebot bot locked and limited conversation to collaborators Nov 17, 2017