
RFC0020 Default File Encoding #71

Closed
JamesWTruher opened this issue Feb 22, 2017 · 53 comments

Comments

@JamesWTruher
Contributor

JamesWTruher commented Feb 22, 2017

The Committee has voted and accepted this RFC: https://github.com/PowerShell/PowerShell-RFC/blob/master/2-Draft-Accepted/RFC0020-DefaultFileEncoding.md

Feedback for https://github.com/PowerShell/PowerShell-RFC/blob/master/1-Draft/RFC0020-DefaultFileEncoding.md

@smurawski

I am so in favor of this @JamesWTruher! Dealing with BOM just makes life sad in a cross-platform world.

@iSazonov
Contributor

If we use UTF8NoBOM for Linux, then we should use UTF8 (= UTF8-with-BOM) for Windows.
So it seems Legacy is unneeded.

And what will the standard be in the future? If it is UTF-8 without a BOM, we should interpret our UTF8 value as UTF-8 without a BOM, add UTF8BOM, and migrate Windows PowerShell scripts to UTF8BOM. This would be a breaking change, but a justifiable one, since it brings us toward the standard.

It would also be nice to consider a target-system scenario, where a Windows script works with files on another system - Linux - and vice versa.
It is also interesting how this will affect writing portable scripts.

@rkeithhill You're welcome to comment on the RFC.

@rkeithhill

At first glance I do like the idea of PS v6 on Windows defaulting to UTF8 (or UTF8NoBOM) but I think you would still need a way to set the old Legacy behavior for scripts. In fact, it would be nice if PowerShell could interpret #requires -Version <anything less than 6> as a request to set this variable to Legacy for the script. That said, this is a big change ... need time to think through the ramifications.

@joeyaiello
Contributor

@iSazonov I've talked about this at length with the @PowerShell/powershell-committee and @JamesWTruher: we don't want to break existing scripts on the Windows side, hence the need for the Legacy value. As you can see from the RFC, the smattering of different default encodings across different cmdlets is virtually random, so the likelihood of breaking people's existing scripts where they've worked around that behavior is high.

Add onto that the problem that the market still hasn't decided on a default encoding--MS and Windows still heavily use UTF-16 and UTF-8 with BOMs, and Linux is almost entirely UTF8NoBOM. We thought about pushing for the latter as a new default, but Windows apps (e.g. Notepad) will default to rendering ASCII in the absence of a BOM, and it turns out that there's not a 1:1 fallback from UTF-8 to ASCII.

@rkeithhill that's absolutely something we need to start solving sooner rather than later. There are a ton of platform differences for which we should have some configurable mechanisms--e.g. file encodings, path delimiters, case sensitivity, line endings, etc.--and it's clear to me that we need some way to flip all of them between the Windows and Linux defaults. I'd rather not hide all of that behind #requires -Version, as it's not immediately clear to the user what's going on, but we've floated the idea of "modes", an environment variable, a cmdlet or two (coupled maybe with some namespace for the $PS* variables), etc. If you don't see an RFC from me on the matter in the next week or two, yell at me. :)

@iSazonov
Contributor

iSazonov commented Mar 3, 2017

@joeyaiello Thanks for clarifying! I agree with Legacy.

I have more questions.

  1. Why do we talk only about "file create" cmdlets? If I set $PSDefaultFileEncoding = "UTF32", I expect it to be the default for Get-Content too, not only for Set-Content, because the variable name is not "DefaultEncodingForFileCreation".
  2. "UTF8" and "Unicode" can confuse Unix users. Perhaps it makes sense to add "UTF8BOM" and "UTF16" to FileSystemCmdletProviderEncoding. Moreover, Windows writes a BOM for all Unicode files, so we should have UTF8BOM/UTF8NoBOM, UTF16BOM/UTF16NoBOM, UTF32BOM/UTF32NoBOM, UTF7BOM/UTF7NoBOM.
  3. For discussion. I thought about portable scripts: if I use Set-Content -Encoding "UTF8" on Windows, then UTF8 is interpreted as UTF8BOM; if I use Set-Content -Encoding "UTF8" on Unix, then UTF8 is interpreted as UTF8NoBOM. Such a script would create a "correct" file on both platforms (see the sketch below).
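
As a minimal sketch of how a script can stay deterministic today, regardless of platform, one can bypass the cmdlet enums and pass an explicit System.Text.Encoding object to the .NET file APIs; the path and file name below are just illustrative assumptions:

# Hypothetical example path; the encoding object makes the BOM decision explicit.
$path = Join-Path $env:TEMP 'portable.txt'
$utf8NoBom = [System.Text.UTF8Encoding]::new($false)   # passing $true would emit a BOM instead
[System.IO.File]::WriteAllText($path, 'héllo', $utf8NoBom)
[System.IO.File]::ReadAllBytes($path).Count            # 6 bytes - no 3-byte BOM prefix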

@DerpMcDerp

I think for this proposal to be implemented correctly, PowerShell should first offer a way to treat existing enums as subsets of one another e.g.

enum encoding {
    Unknown
    String
    Unicode
    BigEndianUnicode
    UTF8
    UTF7
    UTF32
    Ascii
    Default
    Oem
    BigEndianUTF32
}

enum fsencoding : encoding {
	Byte
}

should be as if you did:

enum fsencoding : [enum]::GetUnderlyingType([encoding]) {
// [encoding] members pasted in:
	Unknown
	String
	...
	Oem
	BigEndianUTF32
// and now [fsencoding] members:
	Byte
}

# And the following should also work:
[fsencoding]$foo = [encoding]::Ascii

The reason is that:

  1. it only makes sense for a subset of the cmdlets which take -Encoding to accept Byte
  2. the Unknown option doesn't make sense to be assigned to $PSDefaultFileEncoding

@joeyaiello
Contributor

@iSazonov:

  1. I believe that for Get/Set-Content, we try to be intelligent by reading the BOM and using the existing encoding. If we were to allow $PSDefaultFileEncoding to override that behavior, you'd end up with a bunch of file encodings getting unintentionally changed (or worse, you'd end up with mixed encodings in the same file).
  2. Unicode as UTF-16 is a holdover from a choice (in my opinion, a mistake) that .NET made at some point in the System.Text.Encoding class. We certainly shouldn't remove Unicode, but I think it's worth considering adding UTF-16 for discoverability purposes (even though it would do the same thing as Unicode). As for other encodings without a BOM, they don't really exist out in the wild. With UTF-16, you need to have a BOM to know whether your characters are big or little endian (see the sketch after this list). (Side note, I highly recommend this article on Unicode, and I still refer to this chart on Wikipedia all the time.)
  3. I understand where you're going, but I think it's more important that explicitly specifying a parameter is deterministic no matter which platform you're on. For those who don't specify a parameter, we should try and "do the right thing" based on platform (which I believe this RFC will address), but if you specify a parameter, that behavior should be well-known and not try to do any black magic behind the scenes. (Furthermore, UTF-8, by definition, has a BOM even on Linux. It's just that most utilities on Linux write UTF-8 without a BOM as a matter of practice, so the enum values here are still accurate.)
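
As a small illustration of the BOM/endianness point above (a sketch only; these are the standard .NET encoding classes, nothing specific to this RFC):

[System.Text.Encoding]::Unicode.GetPreamble()           # returns bytes 0xFF 0xFE - the little-endian UTF-16 BOM
[System.Text.Encoding]::BigEndianUnicode.GetPreamble()  # returns bytes 0xFE 0xFF - the big-endian UTF-16 BOM
[System.Text.Encoding]::Unicode.GetBytes('A')           # bytes 0x41 0x00
[System.Text.Encoding]::BigEndianUnicode.GetBytes('A')  # bytes 0x00 0x41 - same text, different byte order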

@DerpMcDerp: that would be useful to have, but given that this is implemented in C#, is that actually necessary to implement for this RFC? (I'm not a C# wiz, but my gut tells me no.)

@iSazonov
Contributor

iSazonov commented Mar 4, 2017

@joeyaiello

we try to be intelligent by reading the BOM and using the existing encoding.

Yes. I believe we should use $PSDefaultFileEncoding to set defaults, not to override the current behavior (i.e., not to remove the "be intelligent by reading the BOM" logic). If so, we should set the defaults in the file provider's GetContentReader too.

As for other encodings without a BOM, they don't really exist out in the wild

I used the article:

A full featured character encoding converter will have to provide the following 13 encoding variants of Unicode and UCS:
UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4LE, UCS-4BE, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE

From Unicode FAQ:

Q: Why do some of the UTFs have a BE or LE in their label, such as UTF-16LE?
A: UTF-16 and UTF-32 use code units that are two and four bytes long respectively. For these UTFs, there are three sub-flavors: BE, LE and unmarked. The BE form uses big-endian byte serialization (most significant byte first), the LE form uses little-endian byte serialization (least significant byte first) and the unmarked form uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used. [AF]

From the same FAQ:

Q: Which of the UTFs do I need to support?
A: UTF-8 is most common on the web. UTF-16 is used by Java and Windows. UTF-8 and UTF-32 are used by Linux and various Unix systems. The conversions between all of them are algorithmically based, fast and lossless. This makes it easy to support data input or output in multiple formats, while using a particular UTF for internal storage or processing. [AF]

As a non-English user I often have to re-encode in my scripts. Sometimes it's a real headache. We should definitely get rid of the diversity of encodings, so I would prefer that PowerShell Core used Unicode everywhere.

Many programs are sensitive to the system/session locale. I believe we must fix the regression with Default and OEM, all the more so because Linux is also locale-based.

Based on this, it would be great if PowerShell Core used UTF16 (with BOM) as the default on Windows, with a Legacy option for backward compatibility with Windows PowerShell, and UTF8 (no BOM) as the default on Unix.

@DerpMcDerp

I just skimmed how PowerShell currently handles encoding, and it's an inconsistent clutter. Here is an approximation of how PowerShell would currently look if it used explicit enums instead of [ValidateSet()] everywhere.

// we need the following existing enum
public enum FileSystemCmdletProviderEncoding {
	Unknown,
	String,
	Unicode,
	Byte,
	BigEndianUnicode,
	UTF8,
	UTF7,
	UTF32,
	Ascii,
	Default,
	Oem,
	BigEndianUTF32
}

// and the following 2 synthesized enums:
enum FakeEncodingEnum1 {
	Unicode,
	BigEndianUnicode,
	UTF8,
	UTF7,
	UTF32,
	Ascii,
	Default,
	Oem
}

enum FakeEncodingEnum2 {
	Unknown,
	String,
	Unicode,
	BigEndianUnicode,
	UTF8,
	UTF7,
	UTF32,
	Ascii,
	Default,
	Oem
}

And then here is how the existing cmdlets operate:

Cmdlet | -Encoding enum | Notes
Add-Content | FileSystemCmdletProviderEncoding | BigEndianUnicode doesn't use a BOM. Invalid encodings default to Unicode.
Export-Clixml | FakeEncodingEnum1 | -Encoding defaults to Unicode. BigEndianUnicode uses a BOM. Invalid encodings throw an exception.
Export-Csv | FakeEncodingEnum1 | With -Append:$false, -Encoding defaults to ASCII. With -Append:$true, -Encoding defaults to encoding detection, falling back to UTF8NoBOM if no preamble is detected. BigEndianUnicode uses a BOM.
Export-PSSession | FakeEncodingEnum1 | -Encoding defaults to UTF8. BigEndianUnicode uses a BOM.
Get-Content | FileSystemCmdletProviderEncoding | BigEndianUnicode doesn't use a BOM. Invalid encodings default to Unicode.
Import-Csv | FakeEncodingEnum1 | -Encoding defaults to encoding detection, falling back to UTF8NoBOM if no preamble is detected. BigEndianUnicode uses a BOM.
Out-File | FakeEncodingEnum2 | -Encoding defaults to Unicode. BigEndianUnicode uses a BOM.
Select-String | FakeEncodingEnum1 | -Encoding defaults to UTF8. BigEndianUnicode uses a BOM.
Send-MailMessage | FakeEncodingEnum2 | -Encoding defaults to ASCII. BigEndianUnicode uses a BOM.
Set-Content | FileSystemCmdletProviderEncoding | BigEndianUnicode doesn't use a BOM. Invalid encodings default to Unicode.
  1. The cmdlets that take FileSystemCmdletProviderEncoding use Default if String is specified. The other cmdlets use Unicode if String is somehow specified.

  2. Whether or not BigEndianUnicode writes a BOM is cmdlet-dependent.

  3. String and Unknown really should be gotten rid of. If they can't be, then just make them aliases for Unicode to simplify.

  4. Also, it looks like PathUtils.MasterStreamOpen could be simplified to look like:

enum FakeEncodingEnum3 {
	Unknown,
	String,
	Unicode,
	BigEndianUnicode,
	UTF8,
	UTF7,
	UTF32,
	Ascii,
	Default,
	Oem,
	UTF8NoBom
}

MasterStreamOpen(
	PSCmdlet cmdlet,
	string filePath,
//	string encoding,
//	bool defaultEncoding,
	FakeEncodingEnum3 encoding, /* new */
	bool Append,
	bool Force,
	bool NoClobber,
	out FileStream fileStream,
	out StreamWriter streamWriter,
	out FileInfo readOnlyFileInfo,
	bool isLiteralPath)

@iSazonov
Contributor

iSazonov commented Mar 13, 2017

We should pay attention to the fact that Default and OEM do not work exactly as in Windows PowerShell. See PowerShell/PowerShell#3248

@iSazonov
Contributor

It should be noted that CoreFX uses UTF8NoBOM as the default for streams.

@mklement0
Contributor

mklement0 commented Mar 16, 2017

@DerpMcDerp: Great sleuthing. Two points (not sure if I'm misinterpreting your table):

  • Set-Content and Add-Content (if the file doesn't already exist) do create a BOM with -Encoding BigEndianUnicode - using any of the standard Unicode encoding schemes explicitly (with the exception of the nonstandard UTF-7 encoding) results in a BOM.

    • (Set-Content -NoNewline -Value '' -Encoding BigEndianUnicode t.txt); (Get-Item t.txt).Length yields 2, the length of the BOM in bytes.
  • From what I can tell, all standard cmdlets that read text - including Get-Content - only recognize BigEndianUnicode encoding with a BOM.

    • (In line with your table) BOM-less files are blindly interpreted as either "ANSI"-encoded (Get-Content, Import-PowerShellDataFile) or UTF-8-encoded (Import-Csv, Import-CliXml, Select-String) - see the repro sketch below.
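
A minimal repro of that split, under the stated assumptions that it runs in Windows PowerShell on a machine whose "ANSI" code page is Windows-1252 and that the path is just an example:

$path = Join-Path $env:TEMP 't.txt'                 # example path
[System.IO.File]::WriteAllText($path, 'café')       # .NET default: BOM-less UTF-8 ('é' = 0xC3 0xA9)
Get-Content $path                                   # -> 'cafÃ©' (bytes misread as "ANSI")
Select-String -Path $path -Pattern 'café'           # matches - Select-String assumes UTF-8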

@rkeithhill

There are already significant breaking changes between PS v5 and PS Core v6 on Windows. Why not take this chance to bring some consistency to the default file encoding on Windows? And since you are giving us the ability to set $PSDefaultFileEncoding (you are, aren't you?), I will change it to UTF8. That means other folks' scripts could break on my system. So I need some sort of mechanism to run these scripts in Legacy mode, which is easy:

& { $PSDefaultFileEncoding='Legacy'; ./old-script.ps1 }

Or maybe I pester the script author to set $PSDefaultFileEncoding to Legacy for their script. So this can be fixed from within and without.

Perhaps this could be a reason to have a .ps2 file? PS1 files would run with Legacy file encoding on Windows and UTF8NoBOM on Linux. PS2 files would run with UTF8 on Windows and UTF8NoBOM on Linux. Are there other PS Core changes that could be handled by .PS2? Maybe PSSA could recognize PS2 files as explicitly portable and warn about use of COM, WMI, and other non-portable features?

@iSazonov
Contributor

The "ps2" idea is a beautiful one, but it would make our code too complex to support (practically every subsystem would have to implement a compatibility mode; for parameter binding, for example, it would be a big headache).

I think it is unnecessary to support "ps2", because we can use Windows PowerShell directly for old scripts and PowerShell Core for new scripts, and we already have #Requires -Version. I believe that is much easier than supporting "ps2". Therefore, I also agree with:

There are already significant breaking changes between PS v5 and PS Core v6 on Windows. Why not take this chance to bring some consistency to the default file encoding on Windows?

@dahlbyk

dahlbyk commented Mar 21, 2017

From a Linux-aware developer standpoint, I have long been frustrated by PowerShell's handling of encoding (for > in particular), so I am thrilled to find this discussion. To clarify PowerShell/PowerShell#707 (comment) now that I have been directed to the appropriate place for encoding commentary...

$PSDefaultFileEncoding

👍 If the only change to come from this RFC is the ability to change > encoding, I would still be thrilled.

We should take this opportunity to rationalize our use of the Encoding parameter...

👍 Even better.

Naturally, specific use of the -encoding parameter when invoking the cmdlet shall override $PSDefaultFileEncoding.

👍 I love the thought of my functions with -Encoding just setting $PSDefaultFileEncoding = $Encoding and getting consistency for free.

When $PSDefaultFileEncoding is set to Legacy...the irregular file encoding on non-Windows platforms [is persisted]..."

👍 Important to keep this as an option for legacy scripts.

The default on Windows systems shall remain unchanged (the value for $PSDefaultFileEncoding shall be set to Legacy)...

👎 I'm with @rkeithhill: it's already unreasonable to expect v5 scripts to work unaltered with Core v6. Make a big deal about encoding on the migration checklist, and use this as an opportunity to educate people about the perils of not using Unicode.

And since you are giving us the ability to set $PSDefaultFileEncoding (you are, aren't you?), I will change it to UTF8. That means other folks' scripts could break on my system.

👍 More generally, this is a chance to educate people about the perils of not being explicit about file encoding. Once this is supported, scripts that expect files to be written with a specific encoding will need to be explicit regardless of system defaults.

Since scripts targeting v5 are expected to require alteration, and since scripts targeting Core are reasonably expected to set $PSDefaultFileEncoding if they need to be deterministic, the choice of default encoding seems relatively unconstrained. Especially with WSL, I vote for aligning with Linux. And really, everything gets easier if you can assume any BOM-less byte stream of text is UTF-8. That's not a fair assumption today on Windows, but we'll never get to the point where you can make any reasonable assumption about text encoding without taking a stand at some point. .NET Core and PS Core seem like great places to start.

@mklement0
Contributor

@dahlbyk: All good points.

As an aside: In PSv5.1+ you now can change the encoding of > / >>, but you're limited to the encodings that Out-File -Encoding supports, which notably does not include "BOM-less" UTF-8 (yet):

# PSv5.1+: Make Out-File/>/>> create UTF-8 files with BOM by default:
$PSDefaultParameterValues['Out-File:Encoding'] = 'utf8' 

@joeyaiello
Contributor

All great points, thanks for the feedback everyone.

First, we really don't want to go down the ps2 path. It was discussed internally here as far back as 2015, and given how much will be totally compatible between Windows PowerShell and PowerShell Core, we want people to err on the side of just trying stuff. Furthermore, as soon as we open the "throw compatibility to the winds" floodgates, I'm positive that we would break a ton of stuff, and that's not the intent of PowerShell 6. Think of it more like a .NET Standard/Core 2.0 strategy than a .NET Core 1.0 strategy.

That being said, we know that we need some higher-level mechanism to encapsulate and expose the differences in defaults on *nix and Windows. Using the encoding issue as a strawman for some of the other usability defaults (e.g. aliases, path delimiters, etc.), we want to somehow make these things modifiable in one fell swoop so that, for instance, a script, module, or user's profile could put itself in "Linux mode" on a Windows machine. I owe an RFC on this.

With regards to encoding itself, I'm glad to see general consensus that this is the right approach. Given the level of feedback I've heard on "change the Windows defaults", I think it's something we need to seriously consider. I'd challenge those in favor of changing the Windows defaults to come up with some negative cases that invalidate that opinion--in other words, what's the worst possible thing that could happen to an existing script/user/workload if we make all versions of PowerShell 6 do UTF8NoBOM by default?

@dahlbyk

dahlbyk commented Mar 22, 2017

what's the worst possible thing that could happen to an existing script/user/workload if we make all versions of PowerShell 6 do UTF8NoBOM by default?

The three risks that immediately come to mind are:

  1. Locales that default to a multi-byte encoding. Do they all use a BOM? Ran into an issue related to this here.
  2. Interop with cmd.exe. I vaguely know of chcp, but have never had to use it and don't know to what extent it impacts PowerShell now, if at all.
  3. Legacy apps. As I alluded to earlier, it's still often incorrect on Windows to assume text without a BOM is UTF-8. If UTF-16 is not supported, text looks like nonsense; treating Windows-1252 as UTF-8 has more subtle defects (see the sketch below). I'm unsure what scenarios exist that would actually cause data loss from round-tripping non–UTF-8 through a UTF-8 decode/encode cycle.
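
A hedged sketch of the kind of subtle defect meant here, assuming it runs in Windows PowerShell where code page 1252 is available: the same three bytes decode to different text depending on whether they are treated as Windows-1252 or as UTF-8 (0xE9 is not a valid UTF-8 sequence, so it becomes the U+FFFD replacement character):

$bytes = [byte[]](0x66, 0xE9, 0x65)                           # 'fée' in Windows-1252
[System.Text.Encoding]::GetEncoding(1252).GetString($bytes)   # -> fée
[System.Text.Encoding]::UTF8.GetString($bytes)                # -> 'f�e' (0xE9 replaced by U+FFFD) - data changed on decode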

@iSazonov
Contributor

iSazonov commented Mar 23, 2017

what's the worst possible thing that could happen to an existing script/user/workload if we make all versions of PowerShell 6 do UTF8NoBOM by default?

I created a new UTF8NoBOM file in VS Code with Latin and Russian characters. Opened it in Notepad, modified it, saved it - all works well. I also tested that Excel 2013 loads and parses UTF8NoBOM files with no problems. It seems native Windows apps understand UTF8NoBOM (they just still default to UTF16LE). Tested on Windows 10.
I have scripts using Default and OEM encodings because of communication with legacy apps (old Oracle). If we fix Default and OEM (to behave as in Windows PowerShell), those scripts will work well on PowerShell Core.
I don't see any problems yet.

(Current Windows behavior is: read files with auto-detection of encoding (defaulting to UTF16LE) and create new files with UTF16LE by default.)

By the way, the .NET team has already gone through this stage and reached a conclusion on this issue, because they changed some defaults to UTF-8. If that was internal, it would be great to publish the conclusion, motivation, and reasoning.

@mklement0
Contributor

mklement0 commented Mar 23, 2017

@iSazonov:

Open it in Notepad, modify, save - all works well

Actually, on saving, Notepad blindly prepends a UTF-8 BOM, even if the file had none (tried on Windows 7 and Windows 10). In other words: it correctly detects BOM-less UTF-8 and preserves UTF-8 encoding on saving in principle, but always prepends a BOM.
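
A quick, hedged way to verify this kind of behavior is to inspect a file's first bytes for the UTF-8 BOM (EF BB BF); t.txt below is just an example path, and -Encoding Byte is the Windows PowerShell syntax (PowerShell Core later uses -AsByteStream instead):

$firstBytes = Get-Content -Path t.txt -Encoding Byte -TotalCount 3
[System.BitConverter]::ToString([byte[]]$firstBytes)   # 'EF-BB-BF' means a UTF-8 BOM was written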

Excel's behavior is wildly inconsistent (tested on 2007 and 2016): BOM-less UTF-8 in *.txt or *.csv files is correctly detected when you interactively open a file (by the wizard UI); by contrast, passing a file from the command line defaults to "ANSI" interpretation. On saving such a file, the encoding reverts to "ANSI". Additionally, Excel cannot handle big-endian Unicode files even with a BOM, and for little-endian UTF-32 with BOM reverts to UTF-16 LE with BOM on saving.

create new files with UTF16LE default.

Notepad creates "ANSI"-encoded files by default - at least on my Windows 7 and Windows 10 machines.
If you paste characters into a new document that cannot be represented in the active "ANSI" code page, Notepad warns you that information will be lost when you try to save.

Excel creates "ANSI" *.txt and *.csv files by default, and only creates UTF16LE *.txt files if you explicitly save in format "Unicode text"; for CSV files there is no such option, but in Excel 2016 you can save with UTF-8-with-BOM encoding ("CSV UTF-8 (Comma delimited)") - no such luck in 2007.

In short: there is no consistency in the world of Windows applications, at least with respect to these two frequently used applications.

they've changed some defaults on UTF-8

The .NET Framework's true default behavior (which also applies to .NET Core) - what encoding is used if you do not specify one explicitly - both for reading and writing - has always been BOM-less UTF-8, at least for the [System.IO.File] type.
On reading, BOMs are detected (note that the docs mistakenly only mention UTF-8 and UTF-32 BOMs, but UTF-16 BOMs are detected too), but a BOM-less file is blindly interpreted as UTF-8, so reading an "ANSI" file (or any other BOM-less encoding other than UTF-8, for that matter) breaks.
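
A minimal sketch of that default [System.IO.File] behavior; the $env:TEMP\demo.txt path is just an example:

[System.IO.File]::WriteAllText("$env:TEMP\demo.txt", 'héllo')   # writes UTF-8 without a BOM
(Get-Item "$env:TEMP\demo.txt").Length                          # 6 bytes - no 3-byte BOM prefix
[System.IO.File]::ReadAllText("$env:TEMP\demo.txt")             # read back as UTF-8 -> héllo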

This is also how it works on Unix with the now-ubiquitous UTF-8-based locales, although there you don't get the benefit of BOM detection - the only thing the Unix utilities natively understand is BOM-less UTF-8.

Trying to emulate default Windows behavior on Unix is problematic, but without it, any existing scripts that use "ANSI" encoding themselves (to encode the source code) - which is quite likely - are potentially misread by PowerShell Core itself - which is annoying in comments, but the real problem arises for scripts / module manifests (data files) that have data with non-ASCII characters, such as identifiers (function names, ...), string literals, and DATA sections.

Note that this is already a problem: popular cross-platform editors such as VSCode, Atom, and Sublime Text all default to BOM-less UTF-8, so if you create your *.ps1 files with them and use non-ASCII identifiers or data, things break, because PowerShell blindly interprets BOM-less files as "default"-encoded, which means:

  • in Windows PowerShell: "ANSI"-encoded (based on the active code page implied by the legacy system locale)
  • in PowerShell Core, currently: ISO-8859-1, which not only means that the active "ANSI" code page is not respected, but that the following characters in identifiers / data in source code break even scripts that use the most widely used "ANSI" code page, Windows-1252: € ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ - note the presence of the € symbol.

The question is: what is the expectation with respect to using Windows-originated source-code files on Unix? Are they expected to work as-is, without re-encoding?
If so, then emulating current Windows behavior is a must.

@iSazonov
Contributor

@mklement0 I updated my post (tested on Windows 10 and Excel 2013).

@iSazonov
Contributor

Trying to emulate default Windows behavior on Unix is problematic, but without it, any existing scripts that use "ANSI" (or OEM) encoding themselves (to encode the source code) - which is quite likely - are misread by PowerShell itself - which is annoying in comments, but the real problem arises for scripts / module manifests (data files) that have data with non-ASCII characters, such as identifiers (function names, ...), string literals, and DATA sections.

I believe that is a subject for Script Analyzer. We could also add warnings in Windows PowerShell and maybe in PowerShell Core.

@iSazonov
Contributor

I opened a question issue in the .NET Core repo and got great comments about the defaults: dotnet/standard#260 (comment)

@iSazonov
Contributor

iSazonov commented Apr 5, 2017

The following heuristics may be useful.
We can mount NTFS volumes on Unix and EXT volumes on Windows.
Although the NTFS file system is case-sensitive, the Windows API makes it case-insensitive.
So our expectations for volumes are always:

NTFS/SMB

  • case-Insensitive
  • Unicode (UTF16LE with BOM)

EXT

  • case-Sensitive
  • UTF8NoBOM

Sample scenario: a Unix user mounts an SMB share and generates files for Windows users.

@mklement0
Contributor

@iSazonov:

Basing the case-sensitivity on the underlying filesystem sounds like a great idea; note that the macOS filesystem, HFS+, is by default also case-insensitive.

By contrast, deciding what default encoding to use based on the filesystem strikes me as problematic.
In general, in the interest of predictability, I would avoid any heuristics (I wouldn't consider respecting the filesystem's case-sensitivity a heuristic).

@TSlivede

Hi, I know I'm very late (comments were due 4/16/2017). I hope it's not too impolite if I still comment on this RFC...

I absolutely like the idea of making the -Encoding parameter consistent across cmdlets.
However, could it somehow be possible to allow arguments of type System.Text.Encoding in addition to Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding?

This way, it would be easy to use custom or unusual encodings, and the user would not be restricted to the encodings in Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding.

@iSazonov
Contributor

@TSlivede The discussion is open and very important - welcome!

@iSazonov
Contributor

I want to make an interim summary based on the discussion and other discussions over the past six months.

There is no clear-cut solution.
Users must be able to change settings at the global, session, remote session, and cmdlet levels.

The settings are:

  1. Case-sensitivity for paths and file names.
  2. On/Off BOM
    2.1 On/Off auto-detect for read (explicit BOM-noBOM)
    2.2 On/Off BOM for output
  3. On/Off byte stream

We should support backward-compatibility settings and modern settings.

All file systems have the same capabilities; the different flavors are introduced at a higher level, by the OS API. So we should be OS-based, not filesystem-based.

We should support sets of encodings (maybe as bitwise options - see the sketch below):

  1. Backward-compatibility set encodings (FileSystemCmdletProviderEncoding)
  2. Modern set of encodings
  3. Full set System.Text.Encoding encodings
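
A rough sketch of the "bitwise options" idea, assuming a PowerShell 5+ class-style enum; the names are illustrative, not from the RFC:

[Flags()] enum EncodingSet {
    Legacy = 1      # the backward-compatible FileSystemCmdletProviderEncoding values
    Modern = 2      # the proposed modern set
    Full   = 4      # everything System.Text.Encoding exposes
}
[EncodingSet]'Legacy, Modern'   # flags combine -> Legacy, Modern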

@masaeedu

masaeedu commented Jun 5, 2017

I think the best file encoding is no file encoding. In my ideal world the result of (echo.exe "foo`nbar").GetType() would be byte[], not Object[]. In order to avoid breaking previous behavior and creating inconvenience, consumers of byte[] like the | operator or various Out-* cmdlets should try to guess the encoding of the stream.

This way we can have a | operator that decodes the byte[] into an array of lines before passing it on, as we have always had, but we can additionally have a || operator that simply invokes the next item without adjusting its arguments and instead feeds the byte array into its stdin. Since cmdlets aren't full processes, they can have some corresponding parameter through which they are fed the byte array (and can then proceed to guess the encoding or do whatever they want).

@joeyaiello
Contributor

After a long conversation and triage of all this feedback with @JamesWTruher, I think we're ready to pull the trigger with the @PowerShell/powershell-committee. Both of us feel very strongly that, given the shift toward cross-platform tooling on Windows (e.g. VS Code, AWS tools, the decent heuristics in Notepad, etc.), and given our desire to maintain maximum consistency between platforms, the default across PowerShell 6.0 should be BOM-less UTF-8.

I want to thank everyone in this thread, especially @mklement0 and @iSazonov, for their extremely thorough analysis that enabled us to make this decision.

We still need a formal approval on this tomorrow. @JamesWTruher and I will be updating the existing RFC to reflect our thinking.

@iSazonov
Contributor

iSazonov commented Jun 7, 2017

I don't like UTF8 :-) but I love beautiful ideas that are always simple and clear, so I'll be happy with a BOM-less UTF-8 PowerShell.

@BatmanAMA

With UTF8 no BOM being the default, is there still going to be a global variable to change this on systems that have issues? I like the direction of being more cross-platform, but the unfortunate truth of the enterprise is that I assume I will inevitably have to deal with places where this is an issue.

@be5invis

be5invis commented Jun 7, 2017

@BatmanAMA They can start PowerShell in some "Version 5" mode...

@mklement0
Contributor

mklement0 commented Jun 7, 2017

Thanks, @joeyaiello; that is great news indeed.

A few more thoughts:

  • If PS becomes BOM-less UTF-8 by default, the only "meta" $PSDefaultFileEncoding value needed will be Legacy (or perhaps WindowsLegacy, for clarity), which, I presume, will be the preset in the Desktop edition, and unset in Core.

  • In addition to defaulting all file-producing cmdlets (including those that don't support a specifiable encoding, such as New-Item and New-ModuleManifest) to BOM-less UTF-8, it is vital that:

    • anywhere a file without a BOM is read, it be interpreted as UTF-8.
    • the engine itself interpret source code, manifests, data files and the like as BOM-less UTF-8 by default (in the absence of a BOM).
    • $OutputEncoding default to BOM-less UTF-8
    • console windows be effectively initialized as follows: $OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = [System.Text.UTF8Encoding]::new(), where on Windows setting [console]::InputEncoding to UTF-8 makes chcp report 65001, the UTF-8 code page.
    • (Ironically, New-Item -Type File, unbeknownst to many, already - and invariably - creates BOM-less UTF-8 files.)
  • The (only) encoding forms mandated by the Unicode standard are UTF-8, UTF-16, and UTF-32 and their byte-order variants; currently, support for UTF-32BE is missing.

  • As has been suggested, being able to pass instances of System.Text.Encoding directly to -Encoding would be a nice addition.

  • Terminology (encoding names):

    • Unambiguous encoding names UTF8NoBOM and UTF8BOM should be introduced.

      • Arguably, UTF8 should be repurposed to BOM-less by default, and would only refer to a with-BOM encoding if $PSDefaultFileEncoding is set to Legacy.
    • Encoding name Unicode should be deprecated in favor of UTF16LE.

    • BigEndianUnicode should be deprecated in favor of UTF16BE

    • Default should be deprecated in favor of ANSI.

      • Perhaps Default can then reflect the effective default, i.e., BOM-less UTF-8 by default, and when $PSDefaultFileEncoding is set to Legacy, ANSI (system locale code page).
    • UTF32 should be deprecated in favor of UTF32LE, and UTF32BE introduced (along with underlying support for it).

    • As an aside: The term "BOM" is firmly entrenched by now (and wonderfully short), but the proper name for a sequence of bytes at the start of a file identifying an encoding scheme is Unicode signature.

  • It's understandable not to want to bother with Unix legacy encodings, but I wonder if a courtesy warning could be issued when PowerShell starts up with an LC_CTYPE (as reported by locale) value other than UTF-8.
    Note that the standard Unix utilities still do support legacy encodings.

  • The Unix model is alluringly simple: all utilities (except special purpose re-encoding ones) blindly assume that all byte streams that represent text use the encoding defined in the locale (LC_CTYPE) - they neither understand nor produce any other encoding (and have no concept of BOMs).

    • PowerShell's seamless ability to recognize BOMs - i.e., the ability to read encodings other than the default one - distinguishes it from other shells, but it does introduce the problem of preserving the specific input encoding on output; somehow making information about the input encoding available to user code would be helpful, as @rkeithhill has suggested before.

    • As an aside: using cat to illustrate BOM problems on Unix (in the current version of the RFC) is perhaps not the best choice: cat is not text-aware and blindly copies bytes from stdin to stdout, which includes not just the BOM, but also the NULs that result from the UTF-16LE-encoded ASCII-range characters; the terminal just happens not to render the NULs; you can verify their existence as follows: PS> 'hi'>t.txt; cat -v t.txt.

@mklement0
Contributor

mklement0 commented Jun 7, 2017

@BatmanAMA: The RFC being discussed here proposes a new preference variable, $PSDefaultEncoding that, when set to Legacy, will continue to exhibit v5 behavior.

Technically, with .NET Core 2.0, I think this legacy behavior could now even be made available on Unix platforms, though I don't know why someone would need it there (the ability to selectively use the same encodings as available on Windows is helpful, however).

@masaeedu: Binary (raw-byte) pipelines and output redirection are being discussed here.

@ferventcoder

@joeyaiello sounds great! Do you plan to address PowerShell/PowerShell#3466 for Authenticode signature verification?

@mklement0
Contributor

mklement0 commented Jun 7, 2017

@ferventcoder:

If I understand the plan correctly, v6 will read everything that doesn't have a BOM as UTF-8 by default (unless $PSDefaultEncoding is set to Legacy), which will make the linked problem go away.

However, do note that using BOM-less UTF-8 script files is currently not just broken with respect to signature verification, but more fundamentally: your UTF-8-encoded non-ASCII characters in that script are misinterpreted as "ANSI"-encoded characters (the culture-specific, extended-ASCII, single-byte legacy encoding; typically, Windows-1252) - which may or may not surface as a problem, depending on whether these chars. are part of data in the script.

In short: with the current encoding behavior, UTF-8 scripts / files only work reliably as expected with a BOM.

This brings up an interesting migration pain point:

Say you switch to v6 and do not opt for the legacy encoding behavior: any "ANSI"-encoded legacy scripts would then have to be converted to UTF-8 in order to be read correctly (and given that there's no BOM for "ANSI" encoding, there's no other choice).
That said, given that UTF-8 is a superset of ASCII, only scripts that contain non-ASCII chars. are affected.
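
A hedged sketch of such a one-time conversion in Windows PowerShell (old-script.ps1 is just an example path): read the script as "ANSI" (Default, i.e. the active code page), then rewrite it as BOM-less UTF-8.

$path = Convert-Path old-script.ps1
$text = Get-Content -Raw -Encoding Default $path
[System.IO.File]::WriteAllText($path, $text, [System.Text.UTF8Encoding]::new($false))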

@ferventcoder

I was leaning more toward the point that you could have a UTF-8 file that works with Authenticode signature verification in PowerShell v6, but not in any earlier version if it has Unicode characters and no BOM. I'm fine with this; I just wanted to be sure this will be added to the known incompatibilities in the documentation when the change is made.

@iSazonov
Contributor

iSazonov commented Jun 18, 2017

From @mklement0:

If I understand the plan correctly, v6 will read everything that doesn't have a BOM as UTF-8 by default (unless $PSDefaultEncoding is set to Legacy ), which will make the linked problem go away.

The RFC only talks about the file creation cmdlets.


We're almost done with "Add dynamically generated set in ValidateSetAttribute for binary cmdlets". This opens up enormous opportunities in different areas. My main motivation was the encoding parameters. By means of valid-value generators, we can implement encoding parameters with any degree of flexibility.
This will allow us to support all encodings from System.Text.Encoding.
It will also allow us to easily switch to WindowsLegacy. We shouldn't enhance FileSystemCmdletProviderEncoding but rather use it as the legacy enum. Instead, we should introduce a modern counterpart - PS6FileSystemCmdletProviderEncoding - with members:

  • ISO-8859-1
  • UTF32BE
  • UTF32LE
  • UTF16BE
  • UTF16LE
  • UTF8

Also, the Byte member in the enum looks like legacy too. It is not an encoding - it is a stream type. We should add it as a separate parameter. We have some requests to support byte streams (e.g., for native commands), and this approach paves the way for that.

Set-Content -Encoding UTF8 -NoBOM -Byte

This is more intuitive for users, consistent with .NET Standard, and makes it easier for us to manage the default settings.

@mklement0
Contributor

The RFC says only about file creation cmdlets.

Yes, but it was clearly just a first draft, and the discussion here has shown us that the defaults that matter in the Unix world - and therefore are crucial to cross-platform success - do not just cover file creation. To recap, the defaults that matter are:

  • what encoding to use when writing without a BOM.
  • what encoding to use when reading without a BOM.

I think not ubiquitously reading and writing (BOM-less) UTF-8 (by default, unless overridden by configuration, or in situations where an established standard for a specific file format / protocol prescribes a different encoding) would prove very problematic in the long run.


Being able to dynamically generate the set of valid parameter values is a great addition.
How, specifically, do you envision their use with -Encoding? Have the single, existing parameter accept the previous enumeration values and instances of [System.Text.Encoding] / their names?

Ascii encoding shouldn't be replaced by ISO-8859-1, as there are cases when you truly only want 7-bit characters.

If we include ISO-8859-1 as an explicit enumeration value, we need to make it very clear that it is NOT the same encoding as Windows-1252 - even though the two are frequently conflated (to a point where HTML5 basically prescribes: treat ISO-8859-1 as if Windows-1252 had been specified).
ISO-8859-1 is a subset of Windows-1252, and both .NET and Unix locales define them properly, as distinct encodings. A notable absence is €, so that if you read a Windows-1252-encoded file that has € chars. in it as ISO-8859-1, they are transliterated to literal ? chars. (default behavior in .NET) or de/encoding breaks (iconv Unix utility); try
[System.Text.Encoding]::GetEncoding(28591).GetBytes('€') | Format-Hex

Enumeration name pairs such as BigEndianUTF16 / UTF16 to represent endianness are still Windows-centric and imprecise. In Unicode speak, UTF16 (UTF-16) is an abstract encoding form that is endianness-agnostic. Adding an endianness to an encoding form makes it a concrete encoding scheme, requiring an endianness suffix: LE or BE. In other words: To be standards-conformant (leaving punctuation aside), UTF16 should be UTF16LE, and BigEndianUnicode should be UTF16BE; while we need to preserve the old names for backward compatibility, this is an opportunity to introduce the proper names.

True, Byte is not an encoding - it is the opposite of one, if you will. That said, given that -Encoding, loosely speaking, tells the cmdlet how to interpret the data, including Byte as the do-not-interpret value makes sense to me, and then we needn't deal with a separate parameter that is mutually exclusive with -Encoding.

@iSazonov
Contributor

@joeyaiello Was there a special opinion on the Out-File cmdlet? Why is it placed in an optional section when it is a file creation cmdlet?

The RFC is aimed only at file creation cmdlets. I guess we'll have other RFCs for the file and web cmdlets.

@mklement0

To be standards-conformant (leaving punctuation aside), UTF16 should be UTF16LE , and BigEndianUnicode should be UTF16BE

Fixed.

Ascii encoding shouldn't be replaced by ISO-8859-1, as there are cases when you truly only want 7-bit characters.

I'm sorry, I didn't explain: ASCII should stay in the legacy set; in the modern set it is better to use ISO-8859-1. If anybody wants 7-bit characters, they should use System.Text.Encoding.
And we could continue to treat ISO-8859-1 as Windows-1252, or introduce a new name.

How, specifically, do you envision their use with -Encoding ? Have the single, existing parameter accept the previous enumeration values and instances of [System.Text.Encoding] / their names?

Encoding should be [String]. A valid-value generator is a method which can accept all the defaults (machine/user/powershell/platform/session/script/module/cmdlet) as parameters and generate the correct list of required encodings.
We could even add an option (-ExtendedEncoding AcceptOnly|Suggest/Enable|Disable) to accept extended encoding values (from System.Text.Encoding) but not suggest them in IntelliSense.

Byte as the do-not-interpret value makes sense to me

My thoughts were about new requests that we received: binary streams, preserving encoding/re-encoding in pipes, redirections to/from native commands.

@amckinlay

@mklement0

Enumeration name pairs such as BigEndianUTF16 / UTF16 to represent endianness are still Windows-centric and imprecise. In Unicode speak, UTF16 (UTF-16) is an abstract encoding form that is endianness-agnostic. Adding an endianness to an encoding form makes it a concrete encoding scheme, requiring an endianness suffix: LE or BE. In other words: To be standards-conformant (leaving punctuation aside), UTF16 should be UTF16LE, and BigEndianUnicode should be UTF16BE; while we need to preserve the old names for backward compatibility, this is an opportunity to introduce the proper names.

This is wrong. An encoding scheme does not require an endianness suffix in its name. The UTF-16 encoding scheme is a perfectly valid encoding scheme that can be either big-endian or little-endian, and may or may not have a BOM. The presence of a BOM is agnostic of the endianness. If a BOM is not present, the encoding scheme is big-endian unless some higher-level protocol says otherwise.

The UTF-16BE/LE encoding schemes do not permit a BOM. If U+FEFF is present at the beginning of the stream, it is interpreted as ZERO WIDTH NO-BREAK SPACE.

The same goes for the UTF-32 encoding schemes.

This is all according to section 3.10 in Unicode 10.

@mklement0
Contributor

mklement0 commented Nov 23, 2017

An encoding scheme does not require an endianness suffix in its name.

Yes, it does. Without an endianness suffix, it is an abstract encoding form that does not prescribe byte order.

The UTF-16 encoding scheme is a perfectly valid encoding scheme that can be either big-endian or little-endian,

No: A scheme implies a specific byte order.

The UTF-16BE/LE encoding schemes do not permit a BOM

Indeed: a scheme - by virtue of implying the byte order - does not require or support a BOM.

That said, if you want to create a file, you do need a BOM, whose byte sequence the -Encoding scheme name determines; that BOM (Unicode signature), in the absence of encoding metadata, will tell readers of that file what encoding scheme was used to write the file.

For the distinction between encoding forms and schemes, see the relevant part of the Glossary of Unicode Terms.

Section 3.10 in chapter 3 of the v10 standard you mention deals with encoding schemes exclusively.

@TSlivede

TSlivede commented Nov 23, 2017

@mklement0

No: A scheme implies a specific byte order.

I'm sorry that I have to disagree, but this quote from your link (chapter 3 of the v10 standard) says otherwise:

D98: UTF-16 encoding scheme: The Unicode encoding scheme that serializes a UTF-16 code unit sequence as a byte sequence in either big-endian or little-endian format.
• In the UTF-16 encoding scheme, the UTF-16 code unit sequence <004D 0430 4E8C D800 DF02> is serialized as <FE FF 00 4D 04 30 4E 8C D8 00 DF 02> or <FF FE 4D 00 30 04 8C 4E 00 D8 02 DF> or <00 4D 04 30 4E 8C D8 00 DF 02>.
• In the UTF-16 encoding scheme, an initial byte sequence corresponding to U+FEFF is interpreted as a byte order mark; it is used to distinguish between the two byte orders. An initial byte sequence <FE FF> indicates big-endian order, and an initial byte sequence <FF FE> indicates little-endian order. The BOM is not considered part of the content of the text.
• The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian.

PDF-Page: 61/89; Page in Headline: 132

@mklement0
Contributor

mklement0 commented Nov 23, 2017

There's clearly an inconsistency in the standard with respect to terminology, as this passage from the start of section 3.9 illustrates (emphasis added):

The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and
UTF-8.

Clearly, calling something (UTF-16) both a form and a scheme, when these two words clearly mean different things (as my earlier quote from the glossary shows), is problematic.

Pragmatically speaking, there are only UTF-16 encoding schemes (plural!) - big-endian and little-endian - and the fact that the absence of a BOM falls back to big-endian only goes to show that one or the other must be chosen.

So, yes, you could say that UTF-16 without a BOM should effectively be UTF-16 BE (whether that is heeded in practice is a different matter - see bottom).

However, relying on this in encoding names seems ill-advised, when UTF16LE and UTF16BE communicate the intent unambiguously.


The fact that PowerShell's Unicode encoding name creates UTF-16 LE files (and has a contrasting BigEndianUnicode name) shows that its view of UTF-16 is little-endian-centric.

Similarly, when you (artificially) create a BOM-less file that is LE-encoded, Notepad still recognizes it as such, as opposed to blindly assuming BE, as the standard prescribes.
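
A hedged sketch of how such a BOM-less little-endian test file can be created (le.txt under $env:TEMP is just an example path); UnicodeEncoding's constructor takes (bigEndian, byteOrderMark):

$utf16leNoBom = [System.Text.UnicodeEncoding]::new($false, $false)   # little-endian, no BOM
[System.IO.File]::WriteAllText("$env:TEMP\le.txt", 'hi', $utf16leNoBom)
(Get-Item "$env:TEMP\le.txt").Length                                 # 4 bytes: 68 00 69 00, no BOM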
