RFC0020 Default File Encoding #71
Comments
I am so in favor of this @JamesWTruher! Dealing with BOM just makes life sad in a cross-platform world.
If we use […], what will be the standard in the future? If it is UTF8, as […], it would be nice to even consider a scenario with a target system, where a Windows script works with files on another system - Linux - and vice versa. @rkeithhill You're welcome to comment on the RFC.
At first glance I do like the idea of PS v6 on Windows defaulting to UTF8 (or UTF8NoBOM), but I think you would still need a way to set the old […].
@iSazonov I've talked about this at length with the @PowerShell/powershell-committee and @JamesWTruher: we don't want to break existing scripts on the Windows side, hence the need for the […]. Add onto that the problem that the market still hasn't decided on a default encoding--MS and Windows still heavily use […].

@rkeithhill that's absolutely something we need to start solving sooner rather than later. There's a ton of platform differences for which we should have some configurable mechanisms--e.g. file encodings, path delimiters, case sensitivity, line endings, etc.--and it's clear to me that we need some way to flip all of them between the Windows and Linux defaults. I'd rather not hide that all behind the […].
@joeyaiello Thanks for clarifying! I agree with […]. I have more questions.
I think for this proposal to be implemented correctly, PowerShell should first offer a way to treat existing enums as subsets of one another, e.g.:

```powershell
enum encoding {
Unknown
String
Unicode
BigEndianUnicode
UTF8
UTF7
UTF32
Ascii
Default
Oem
BigEndianUTF32
}
enum fsencoding : encoding {
Byte
}
```

should be as if you did:

```powershell
enum fsencoding : [enum]::GetUnderlyingType([encoding]) {
# [encoding] members pasted in:
Unknown
String
...
Oem
BigEndianUTF32
# and now [fsencoding] members:
Byte
}
# And the following should also work:
[fsencoding]$foo = [encoding]::Ascii
```

The reason is that: […]
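For context, a minimal sketch (with hypothetical enum names) of the closest thing that works today: the "subset" relationship has to be duplicated by hand and kept in sync manually, which is exactly the fragility the proposal above would remove.

```powershell
# Hypothetical stand-ins for the proposal above: today the two enums must be
# defined independently and kept in sync by hand.
enum BaseEncoding {
    Unknown
    Ascii
    UTF8
}

enum FsEncoding {
    Unknown
    Ascii
    UTF8
    Byte        # the one extra member the file-system cmdlets need
}

# Conversion only works because the numeric values happen to line up:
[FsEncoding][int][BaseEncoding]::UTF8    # -> UTF8
```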
@DerpMcDerp: that would be useful to have, but given that this is implemented in C#, is it actually necessary to implement for this RFC? (I'm not a C# wiz, but my gut tells me no.)
Yes. I believe we use […]
I used the article: […]
From the Unicode FAQ: […]
From the same FAQ: […]
As a non-English user I often do re-encoding in my scripts. Sometimes it's a real headache. We should definitely get rid of the diversity of encodings in use, so I would prefer that PowerShell Core used Unicode everywhere. Many programs are sensitive to the system/session locale. I believe we must remove the regression with Default and OEM, all the more so because Linux is also locale-based. Based on this, it would be great if PowerShell Core used UTF16 (with BOM) as the default on Windows, with a Legacy option for backward compatibility with Windows PowerShell, and UTF8 (no BOM) as the default on Unix.
I just skimmed how PowerShell currently handles encoding, and it's an inconsistent clutter. Here is an approximation of how PowerShell currently works, if PowerShell used explicit enums instead of […]:

```csharp
// we need the following existing enum
public enum FileSystemCmdletProviderEncoding {
Unknown,
String,
Unicode,
Byte,
BigEndianUnicode,
UTF8,
UTF7,
UTF32,
Ascii,
Default,
Oem,
BigEndianUTF32
}
// and the following 2 synthesized enums:
enum FakeEncodingEnum1 {
Unicode,
BigEndianUnicode,
UTF8,
UTF7,
UTF32,
Ascii,
Default,
Oem
}
enum FakeEncodingEnum2 {
Unknown,
String,
Unicode,
BigEndianUnicode,
UTF8,
UTF7,
UTF32,
Ascii,
Default,
Oem
}
```

And then here is how the existing cmdlets operate:

```csharp
enum FakeEncodingEnum3 {
Unknown,
String,
Unicode,
BigEndianUnicode,
UTF8,
UTF7,
UTF32,
Ascii,
Default,
Oem,
UTF8NoBom
}
MasterStreamOpen(
PSCmdlet cmdlet,
string filePath,
// string encoding,
// bool defaultEncoding,
FakeEncodingEnum3 encoding, /* new */
bool Append,
bool Force,
bool NoClobber,
out FileStream fileStream,
out StreamWriter streamWriter,
out FileInfo readOnlyFileInfo,
bool isLiteralPath)
```
We should pay attention to what […].
It should be noted that CoreFX uses UTF8NoBOM as the default for Streams.
@DerpMcDerp: Great sleuthing. Two points (not sure if I'm misinterpreting your table): […]
There are already significant breaking changes between PS v5 and PS Core v6 on Windows. Why not take this chance to bring some consistency to the default file encoding on Windows? And since you are giving us the ability to set $PSDefaultFileEncoding (you are, aren't you?), I will change it to UTF8. That means other folks' scripts could break on my system. So I need some sort of mechanism in which to run those scripts, e.g.:

```powershell
& { $PSDefaultFileEncoding='Legacy'; ./old-script.ps1 }
```

Or maybe I pester the script author to set $PSDefaultFileEncoding to Legacy for their script. So this can be fixed within and without. Perhaps this could be a reason to have a […].
It's a "ps2" beautiful idea. But it will make our code too complex to support (practically each subsystem will ensure compatibility mode: ex., for parameter bindings it will be a big headache). I think it unnecessarily support "ps2" because we can directly use Windows PowerShell for old scripts and PowerShell Core for new scripts and we already have
|
From a Linux-aware developer standpoint, I have long been frustrated by PowerShell's handling of encoding (for […]).
👍 If the only change to come from this RFC is the ability to change […].
👍 Even better.
👍 I love the thought of my functions with […].
👍 Important to keep this as an option for legacy scripts.
👎 I'm with @rkeithhill: it's already unreasonable to expect v5 scripts to work unaltered with Core v6. Make a big deal about encoding on the migration checklist, and use this as an opportunity to educate people about the perils of not using Unicode.
👍 More generally, this is a chance to educate people about the perils of not being explicit about file encoding. Once this is supported, scripts that expect files to be written with a specific encoding will need to be explicit regardless of system defaults. Since scripts targeting v5 are expected to require alteration, and since scripts targeting Core are reasonably expected to set […].
@dahlbyk: All good points. As an aside: in PSv5.1+ you can now change the encoding of […]:

```powershell
# PSv5.1+: Make Out-File / > / >> create UTF-8 files with BOM by default:
$PSDefaultParameterValues['Out-File:Encoding'] = 'utf8'
```
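A quick way to check the effect of that preference in Windows PowerShell 5.1 (a sketch; the temp-file path is arbitrary, and -Encoding Byte is the Windows PowerShell spelling for raw bytes):

```powershell
$PSDefaultParameterValues['Out-File:Encoding'] = 'utf8'
'hü' > "$env:TEMP\enc-test.txt"    # > and >> go through Out-File, so they pick up the default
Get-Content "$env:TEMP\enc-test.txt" -Encoding Byte -TotalCount 3
# 239 187 191  -> the UTF-8 BOM (0xEF 0xBB 0xBF) instead of the usual UTF-16 LE 0xFF 0xFE
```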
All great points, thanks for the feedback everyone. First, we really don't want to go down the […]. That being said, we know that we need some higher-level mechanism to encapsulate and expose the differences in defaults on *nix and Windows. Using the encoding issue as a strawman for some of the other usability defaults (e.g. aliases, path delimiters, etc.), we want to somehow make these things modifiable in one fell swoop so that, for instance, a script, module, or user's profile could put itself in "Linux mode" on a Windows machine. I owe an RFC on this.

With regards to encoding itself, I'm glad to see general consensus that this is the right approach. Given the level of feedback I've heard on "change the Windows defaults", I think it's something we need to seriously consider. I'd challenge those in favor of changing the Windows defaults to come up with some negative cases that invalidate that opinion--in other words, what's the worst possible thing that could happen to an existing script/user/workload if we make all versions of PowerShell 6 do […]?
The three risks that immediately come to mind are: […]
I created a new UTF8NoBOM file in VS Code with Latin and Russian characters. Opened it in Notepad, modified it, saved - all works well. I tested that Excel 2013 loads and parses UTF8NoBOM files with no problems. It seems native Windows apps understand UTF8NoBOM (they just still default to UTF16LE). Tested on Windows 10. (The current Windows behavior is: read files with auto-detection of encoding, defaulting to UTF16LE, and create new files with UTF16LE by default.) By the way, the .Net team has already had to go through this stage and reached a conclusion on this issue, because they've changed some defaults to UTF-8. If that discussion was internal, it would be great to publish the conclusion, motivation and reasoning.
Actually, on saving, Notepad blindly prepends a UTF-8 BOM, even if the file had none (tried on Windows 7 and Windows 10). In other words: it correctly detects BOM-less UTF-8 and preserves UTF-8 coding on saving in principle, but always prepends a BOM. Excel's behavior is wildly inconsistent (tested on 2007 and 2016): BOM-less UTF-8 in *.txt or *.csv files is correctly detected when you interactively open a file (by the wizard UI); by contrast, passing a file from the command line defaults to "ANSI" interpretation. On saving such a file, the encoding reverts to "ANSI". Additionally, Excel cannot handle big-endian Unicode files even with a BOM, and for little-endian UTF-32 with BOM reverts to UTF-16 LE with BOM on saving.
Notepad creates "ANSI"-encoded files by default - at least on my Windows 7 and Windows 10 machines. Excel creates "ANSI" *.txt and *.csv files by default, and only creates UTF16LE *.txt files if you explicitly save in format "Unicode text"; for CSV files there is no such option, but in Excel 2016 you can save with UTF-8-with-BOM encoding ("CSV UTF-8 (Comma delimited)") - no such luck in 2007. In short: there is no consistency in the world of Windows applications, at least with respect to these two frequently used applications.
The .NET Framework's true default behavior (which also applies to .NET Core) - what encoding is used if you do not specify one explicitly, both for reading and writing - has always been BOM-less UTF-8, at least for the […].

This is also how it works on Unix with the now-ubiquitous UTF-8-based locales, although there you don't get the benefit of BOM detection - the only thing the Unix utilities natively understand is BOM-less UTF-8.

Trying to emulate default Windows behavior on Unix is problematic, but without it, any existing scripts that themselves use "ANSI" encoding (to encode the source code) - which is quite likely - are potentially misread by PowerShell Core itself. That is annoying in comments, but the real problem arises for scripts / module manifests (data files) that have data with non-ASCII characters, such as identifiers (function names, ...), string literals, and DATA sections.

Note that this is already a problem: popular cross-platform editors such as VSCode, Atom, and Sublime Text all default to BOM-less UTF-8, so if you create your […]
The question is: what is the expectation with respect to using Windows-originated source-code files on Unix? Are they expected to work as-is, without re-encoding?
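To make the .NET default concrete, a small illustration (a sketch; the file path is arbitrary) of what you get when you let StreamWriter/StreamReader pick the encoding:

```powershell
# When no encoding is specified, System.IO.StreamWriter writes BOM-less UTF-8,
# and System.IO.StreamReader assumes UTF-8 but honors a BOM if one is present.
$path = Join-Path ([IO.Path]::GetTempPath()) 'net-default-demo.txt'

$writer = [System.IO.StreamWriter]::new($path)   # no encoding argument
$writer.Write('Привет, Welt')                    # non-ASCII round-trips fine
$writer.Dispose()

[System.IO.File]::ReadAllBytes($path)[0..2]      # 208 159 ... -> no 0xEF 0xBB 0xBF BOM

$reader = [System.IO.StreamReader]::new($path)   # defaults to UTF-8 + BOM detection
$reader.ReadToEnd()
$reader.Dispose()
```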
@mklement0 I updated my post (tested on Windows 10 and Excel 2013).
I believe this is a subject for ScriptAnalyzer. We could also add warnings in Windows PowerShell and maybe in PowerShell Core.
I opened an issue/question in the .Net Core repo and got great comments about the defaults: dotnet/standard#260 (comment)
The following heuristics may be useful:

NTFS/SMB: […]
EXT: […]

Sample scenario: a Unix user mounts an SMB share and generates files for Windows users.
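If one did want to experiment with such a heuristic, the underlying filesystem of a path can be queried from .NET (a rough sketch; the function name is hypothetical, and mapping a filesystem name to an encoding policy is the part the heuristic would still have to define):

```powershell
# Report the filesystem format of the drive/volume a path lives on.
# On Windows this yields e.g. 'NTFS'; on Linux with .NET Core, e.g. 'ext4'.
function Get-PathFileSystem {
    param([string] $Path)
    $root = [System.IO.Path]::GetPathRoot((Resolve-Path $Path).ProviderPath)
    ([System.IO.DriveInfo]::new($root)).DriveFormat
}

Get-PathFileSystem -Path $HOME
```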
Basing the case-sensitivity on the underlying filesystem sounds like a great idea; note that the macOS filesystem, HFS+, is by default also case-insensitive. By contrast, deciding what default encoding to use based on the filesystem strikes me as problematic.
Hi, I know I'm very late (…). I absolutely like the idea to make the […]. This way, it would be easy to use custom or unusual encodings, and the user is not restricted to the encodings in […].
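A sketch of what that flexibility might look like, assuming the idea is to let -Encoding accept arbitrary [System.Text.Encoding] instances rather than only a fixed set of enum names (the cmdlet usage shown in the comment is hypothetical at this point):

```powershell
# Any encoding .NET knows about could then be passed directly:
$latin1    = [System.Text.Encoding]::GetEncoding('iso-8859-1')   # legacy single-byte encoding
$utf8NoBom = [System.Text.UTF8Encoding]::new($false)             # UTF-8 without BOM

# Hypothetical usage, if -Encoding accepted Encoding objects:
# Set-Content -Path out.txt -Value $text -Encoding $latin1
```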
@TSlivede The discussion is open and very important - welcome!
I want to make an interim summary based on this discussion and other discussions over the past six months. There is no clear-cut solution. The settings are: […]
We should support backward-compatibility settings and modern settings. All file systems have the same capabilities; the different flavors are introduced at a higher level, by the OS API. So we should be OS-based, not filesystem-based. We should support sets of encodings (maybe as bitwise options): […]
I think the best file encoding is no file encoding. In my ideal world the result of […]. This way we can have a […].
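For reference, the closest the current cmdlets come to "no encoding" is byte mode (a sketch for Windows PowerShell 5.1, where this is spelled -Encoding Byte; the file names are arbitrary):

```powershell
# Read a file as raw bytes and write them back untouched - no text decoding involved.
$bytes = Get-Content -Path .\input.bin -Encoding Byte -ReadCount 0
Set-Content -Path .\copy.bin -Value $bytes -Encoding Byte
```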
After a long conversation and triage of all this feedback with @JamesWTruher, I think we're ready to pull the trigger with the @PowerShell/powershell-committee. Both of us feel very strongly that, given the shift in cross-platform tooling on Windows (e.g. VS Code, AWS tools, the decent heuristics in Notepad, etc.), and given our desire to maintain maximum consistency between platforms, the default across PowerShell 6.0 should be BOM-less UTF-8.

I want to thank everyone in this thread, especially @mklement0 and @iSazonov, for their extremely thorough analysis that enabled us to make this decision.

We still need a formal approval on this tomorrow. @JamesWTruher and I will be updating the existing RFC to reflect our thinking.
I don't like UTF8 :-) but I love beautiful ideas that are always simple and clear, so I'll be happy with BOM-less UTF-8 PowerShell.
With UTF8 no BOM being the default, is there still going to be a global variable to change this on systems that have issues? I like the direction of being more cross-platform, but the unfortunate truth of enterprise is that I assume I will inevitably have to deal with places where this is an issue.
@BatmanAMA They can start PowerShell in some "Version 5" mode...
Thanks, @joeyaiello; that is great news indeed. A few more thoughts: […]
@BatmanAMA: The RFC being discussed here proposes a new preference variable, […]. Technically, with .NET Core 2.0, I think this legacy behavior could now even be made available on Unix platforms, though I don't know why someone would need it there (the ability to selectively use the same encodings as available on Windows is helpful, however).

@masaeedu: Binary (raw-byte) pipelines and output redirection are being discussed here.
@joeyaiello sounds great! Do you plan to address PowerShell/PowerShell#3466 for Authenticode signature verification?
If I understand the plan correctly, v6 will read everything that doesn't have a BOM as UTF-8 by default (unless […]).

However, do note that using BOM-less UTF-8 script files is currently not just broken with respect to signature verification, but more fundamentally: your UTF-8-encoded non-ASCII characters in such a script are misinterpreted as "ANSI"-encoded characters (the culture-specific, extended-ASCII, single-byte legacy encoding; typically, Windows-1252) - which may or may not surface as a problem, depending on whether those characters are part of data in the script. In short: with the current encoding behavior, UTF-8 scripts / files only work reliably as expected with a BOM.

This brings up an interesting migration pain point: say you switch to v6 and do not opt for the legacy encoding behavior: any "ANSI"-encoded legacy scripts would then have to be converted to UTF-8 in order to be read correctly (and given that there's no BOM for "ANSI" encoding, there's no other choice).
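To make the misinterpretation concrete, a small sketch for Windows PowerShell 5.1 (the temp path is arbitrary; [System.IO.File]::WriteAllText writes BOM-less UTF-8 by default):

```powershell
$path = Join-Path $env:TEMP 'bomless-utf8.txt'
[System.IO.File]::WriteAllText($path, 'café')    # BOM-less UTF-8 on disk

# Windows PowerShell treats BOM-less files as "ANSI" (e.g. Windows-1252), so the
# two UTF-8 bytes of 'é' (0xC3 0xA9) come back as two separate characters:
Get-Content $path    # -> cafÃ©
```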
I was getting more at the fact that you could have a UTF8 file that works with Authenticode signature verification in PowerShell v6, but not in any earlier versions, if it has Unicode characters and no BOM. I'm fine with this; I just wanted to be sure it will be added to the known incompatibilities in the documentation when the change is made.
From @mklement0: […]

The RFC only talks about file creation cmdlets. We're almost done with "Add dynamically generated set in ValidateSetAttribute" for binary cmdlets. This opens up enormous opportunities in different areas. My main motivation was encoding parameters: by means of valid-value generators, we can implement encoding parameters with any flexibility.

Also consider something like `Set-Content -Encoding UTF8 -NoBOM -Byte`. This is more intuitive for users, consistent with .Net Standard, and it is easier for us to manage the default settings.
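For illustration, a sketch of what such a valid-values generator could look like once the dynamic-ValidateSet work lands (the class name, the encoding list, and the Set-Stuff function are hypothetical; the IValidateSetValuesGenerator interface is the mechanism that work is introducing):

```powershell
# A generator class that produces the currently valid encoding names;
# a real implementation could consult machine/user/session defaults.
class EncodingNameGenerator : System.Management.Automation.IValidateSetValuesGenerator {
    [string[]] GetValidValues() {
        return 'ASCII', 'Unicode', 'UTF8', 'UTF8BOM', 'UTF8NoBOM', 'UTF32', 'OEM'
    }
}

function Set-Stuff {
    param(
        [ValidateSet([EncodingNameGenerator])]
        [string] $Encoding
    )
    "Writing with encoding: $Encoding"
}
```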
Yes, but it was clearly just a first draft, and the discussion here has shown us that the defaults that matter in the Unix world - and therefore are crucial to cross-platform success - do not just cover file creation. To recap, the defaults that matter are: […]
I think not ubiquitously reading and writing (BOM-less) UTF-8 (by default, unless overridden by configuration, or in situations where an established standard for a specific file format / protocol prescribes a different encoding) would prove very problematic in the long run. Being able to dynamically generate the set of valid parameter values is a great addition.
If we include […]. Enumeration name pairs such as True, […]
@joeyaiello Was there a particular opinion on the Out-File cmdlet? Why is it placed in an optional section when it is a file creation cmdlet? The RFC is only aimed at file creation cmdlets. I guess we'll have other RFCs for file and web cmdlets.
Fixed.
I'm sorry, I didn't explain that ASCII should stay in the Legacy sets; in the modern set it is better to use ISO-8859-1. If anybody wants 7-bit characters, he should use […].
Encoding should be [String]. A valid value generator is a method which can accept all defaults (machine/user/powershell/platform/session/script/module/cmdlet) as parameters and generate the correct list of required encodings.
My thoughts were about new requests that we received: binary streams, preserving encoding/re-encoding in pipes, redirections to/from native commands.
This is wrong. An encoding scheme does not require an endianness suffix in its name. The UTF-16 encoding scheme is a perfectly valid encoding scheme that can be either big-endian or little-endian, and may or may not have a BOM. The presence of a BOM is agnostic of the endianness. If a BOM is not present, the encoding scheme is big-endian unless some higher-level protocol says otherwise. The UTF-16BE/LE encoding schemes do not permit a BOM. If […]. The same goes for the UTF-32 encoding schemes. This is all according to section 3.10 in Unicode 10.
Yes, it does. Without an endianness suffix, it is an abstract encoding form that does not prescribe byte order.
No: A scheme implies a specific byte order.
Indeed: a scheme - by virtue of implying the byte order - does not require or support a BOM. That said, if you want to create a file, you do need a BOM, whose byte sequence the […].

For the distinction between encoding forms and schemes, see the relevant part of the Glossary of Unicode Terms. Section 3.10 in chapter 3 of the v10 standard you mention deals with encoding schemes exclusively.
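The BOM byte sequences themselves are easy to inspect from .NET (a quick sketch):

```powershell
# GetPreamble() returns the BOM an encoding instance would write.
[System.Text.Encoding]::Unicode.GetPreamble()            # 255 254      -> UTF-16 LE BOM
[System.Text.Encoding]::BigEndianUnicode.GetPreamble()   # 254 255      -> UTF-16 BE BOM
[System.Text.Encoding]::UTF8.GetPreamble()               # 239 187 191  -> UTF-8 BOM
[System.Text.UnicodeEncoding]::new($true, $false).GetPreamble()   # (empty) big-endian, no BOM
```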
I'm sorry that I have to disagree, but this quote from your link (chapter 3 of the v10 standard) says otherwise: […]
(PDF page: 61/89; page number in the headline: 132)
There's clearly an inconsistency in the standard with respect to terminology, as this passage from the start of section 3.9 illustrates (emphasis added): […]
Clearly, calling something (UTF-16) both a form and a scheme, when these two words clearly mean different things (as my earlier quote from the glossary shows), is problematic. Pragmatically speaking, there are only UTF-16 encoding schemes (plural!) - big-endian and little-endian; that the absence of a BOM falls back to big-endian only goes to show that one or the other must be chosen. So, yes, you could say that […]. However, relying on this in encoding names seems ill-advised, when […]. The fact that PowerShell's […]. Similarly, when you (artificially) create a BOM-less file that is LE-encoded, Notepad still recognizes it as such, as opposed to blindly assuming BE, as the standard prescribes.
Committee has voted and accepted this RFC: https://github.com/PowerShell/PowerShell-RFC/blob/master/2-Draft-Accepted/RFC0020-DefaultFileEncoding.md
Feedback for https://github.com/PowerShell/PowerShell-RFC/blob/master/1-Draft/RFC0020-DefaultFileEncoding.md