Add 'utf8-sig' encoding option. #4039
Comments
What does "sig" mean?
It was taken from python and means "UTF-8 codec with BOM signature".
What issue/discussion is that?
@koichik Thanks. I have no strong opinions. Auto-stripping the BOM is convenient (auto-writing is probably more involved), but it's easily handled by user code. Thoughts?
I would vote for auto-stripping with the default 'utf8' encoding, if it's possible.
Answering @koichik's discussion from 11 months ago:
Conversion to utf8 is not lossless, so one shouldn't expect to get the same result anyway. So, if anybody wants to preserve non-text data, they can and should use buffers for that. PS: I've seen too much code that should strip the BOM but doesn't... so changing this would probably gain more than it loses.
I would be in favor of an auto-strip on read... however, it's writing that tends to be problematic, depending on what other applications you are interacting with. The larger issue tends to be reading rather than writing. It was discussed at length before, and there's enough resistance that the existing encoding should probably not change behavior... but adding this minor adjustment could reuse most of the code in place, with the minor difference of stripping on read and injecting on write, so the actual text is transparent to the application. I'm currently seeing the issue a lot more in Windows land, where a lot of programs tend to add it; interacting with those files becomes a PITA and would be much better if stripped. I'm hugely in favor of auto-strip on read, regardless.
I am opposed to "auto-stripping" and "auto-writing" the BOM. I am also opposed to adding a new encoding type to handle this. The character U+FEFF is only interpreted as a BOM when it exists "at the beginning of a data stream"; otherwise, it's interpreted as "Zero Width No-Break Space" (ZWNBSP). General-purpose unicode processors (like encoders & decoders) must be able to handle the ZWNBSP character appropriately when it occurs inline as "part of the content of the file or string". See What should I do with U+FEFF in the middle of a file?. Our current approach simply decodes the content as-is, and stripping a leading BOM in application code is trivial:
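A minimal sketch of the application-level stripping being described (the helper name `stripBOM` is my own, not a node API; once decoded, a UTF-8 BOM surfaces as a leading U+FEFF character):

```javascript
// Hypothetical helper (not part of node's API): strip a leading U+FEFF
// from a decoded string. Only call this when the string is known to be
// the start of a data stream; elsewhere U+FEFF is legitimate content.
function stripBOM(str) {
  return str.charCodeAt(0) === 0xFEFF ? str.slice(1) : str;
}

// A UTF-8 BOM (EF BB BF) decodes to the single character U+FEFF.
const decoded = Buffer.concat([
  Buffer.from([0xEF, 0xBB, 0xBF]),
  Buffer.from('hello', 'utf8'),
]).toString('utf8');

console.log(stripBOM(decoded)); // 'hello'
```

Note that the helper deliberately leaves a U+FEFF in the middle of a string alone; that is exactly the ZWNBSP case the comment is warning about.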
Writing a valid BOM to an output stream is just as trivial:
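For instance (a sketch of the application-level approach, not a node API; EF BB BF is the UTF-8 encoding of U+FEFF):

```javascript
// Prepend the UTF-8 BOM bytes only at the very start of the output;
// all subsequent writes must go through unchanged.
const bom = Buffer.from([0xEF, 0xBB, 0xBF]);
const body = Buffer.from('hello', 'utf8');
const out = Buffer.concat([bom, body]);

// Decoded back, the BOM shows up as a leading U+FEFF.
console.log(out.toString('utf8') === '\uFEFFhello'); // true
```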
The point is that only the application knows when it's dealing with the beginning of the file (or "data stream") and when it's working somewhere in the middle.
@coltrane So, the file system methods do not know when they're feeding the beginning of a file into the Buffer? The fact is, this is a VERY common issue, and one that imho should be handled at the framework level, as opposed to re-invented by every application or module that deals with text files. The whole purpose of a BOM at the beginning of a file is to be a hint about the content, not to be part of the content... strip the sucker out.
@tracker1 The purpose of a BOM is to indicate byte order for UTF-16 and UTF-32 encodings. The whole practice of using it in UTF-8 is a Windows Notepad hack. And the file system methods probably don't know, as you might be opening things that aren't files, or not opening things at the beginning.
@tracker1: it probably would be possible to add this functionality to a few specific filesystem methods as a special case, but it would not be possible to add it, with any sort of reasonable semantics, to all the other places where encodings are used (Buffer, StringDecoder, net streams, and many other places). IMO, handling the BOM belongs in application logic. If you'd like to avoid re-inventing that logic in each application, then build re-usable classes that will wrap node's lower-level streams, to provide more full-featured handling of complete unicode files.
Agree. And it means that
Don't agree. According to Wikipedia, "In Unicode 3.2, this usage is deprecated in favour of the 'Word Joiner' character, U+2060. This allows U+FEFF to be only used as a BOM." It means that since 3/27/2002, U+FEFF should be used only as a BOM, and it means that we can safely strip all U+FEFFs whatsoever.
The Unicode Consortium uses the term "deprecated" slightly differently than most other standards bodies. Unicode Character Database, Section 5.12 explains. It concludes by pointing out:
The Unicode standard disagrees.
At the application-level you might choose to strip all U+FEFFs if you know that you're working with a file or protocol where ZWNBSP should never occur. But Node's encoder/decoder is a generic utility that might operate on any kind of Unicode document or fragment. It needs to be able to decode all valid Unicode content correctly, even if that content includes ZWNBSP.
@coltrane, all right, I'll try to make another argument here. You are right, Buffer does not have the necessary context to determine whether it has the beginning of a data stream. And that's why Buffer must assume that it starts with the beginning of a data stream, and it's up to the application to provide the necessary context if it doesn't. If an application tries to fill a buffer with an arbitrary part of a data stream and convert it to unicode, then it should be prepared for data loss. If it is not, it is wrong, it is broken, and you should file a bug report there. PS: So why on Earth do you think that
Is this discussion about stripping the beginning of a buffer? A Buffer is NOT a stream. A stream has no encoding (in node), right? Everything else should be done in userland.
@langpavel, this discussion is about how to deal with a BOM at the beginning of a buffer when doing Buffer.toString().
@rlidwka I know. But can you predict the impact of introducing this? And the strange, hard-to-understand bugs? This is deferred from core to userland, and that is good from my point of view. We cannot distinguish between the beginning of a stream and the next packet in
@langpavel, we already have these bugs. If some software converts a packet to utf-8 on its own, it will fail when the first byte of a multibyte utf-8 sequence appears in one packet and the last byte in another. This way "тест" will become "те��т" if the 3rd letter gets split between two packets like that: So yes, I can predict the impact of this. Good software will still be good, buggy software will still be buggy. But some software that didn't know about the BOM may start to work right.
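The split described above is easy to reproduce (the byte offsets are my own choice; each Cyrillic letter is two bytes in UTF-8, so a cut at byte 5 lands in the middle of the third letter):

```javascript
// 'тест' encodes to 8 bytes: D1 82 D0 B5 D1 81 D1 82 (2 bytes per letter).
const whole = Buffer.from('тест', 'utf8');

// Simulate the packet boundary falling mid-character.
const packet1 = whole.slice(0, 5); // ends with a lone lead byte D1
const packet2 = whole.slice(5);    // starts with a lone trail byte 81

// Decoding each packet independently yields U+FFFD replacement characters.
const naive = packet1.toString('utf8') + packet2.toString('utf8');
console.log(naive);            // 'те��т'
console.log(naive === 'тест'); // false
```

Decoding the whole buffer at once, of course, round-trips cleanly; the bug only appears when fragments are decoded separately.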
@rlidwka ok, agreed. Can you explain the scenario a little more? I think this should be described as another bug. What you're saying is that we try to decode a second (or later) data packet before concatenating. Ok, this can be a bug, but not at this level. The solution should be a transparent _encoding stream_ giving you only valid characters. You cannot give some bytes to
My suggestion is a new
I'm trying to say that
But if you don't have data, but a stream, you should use
So that argument about "encoder/decoder is a generic utility that might operate on any kind of Unicode document or fragment" is invalid; Buffer should always have all available data and never work with fragments.
This encoding package is called |
Ok, my mistake.
Is there a problem with
This suggests introducing a flag on the stream (or a separate encoding name, as you suggest; I'm not sure if this is fine). Is there a reason to default this flag to truthy on non-HTTP...? Sorry for the hasty reply... Summary:
Yep, but fs.readFile() returns one big Buffer, if I'm not mistaken; no streams there. So the whole discussion is about buffers.
@rlidwka But you can give an encoding to
Yeah, and that's how it's implemented: https://github.com/joyent/node/blob/master/lib/fs.js#L184
There's also
No activity in over a year. Closing. |
Per an earlier issue/discussion, a separate 'utf8-sig' encoding option that strips the BOM when reading and adds it when writing would be beneficial. This mirrors python's solution to the same issue.
It would also be good for various utilities (less, uglify, etc.) that often read files in utf-8 to have an upstream option for this.