
Add 'utf8-sig' encoding option. #4039

Closed
tracker1 opened this issue Sep 20, 2012 · 27 comments

Comments

@tracker1

Per an earlier issue/discussion, a separate 'utf8-sig' encoding option that strips the BOM when reading and adds it when writing would be beneficial. This mirrors Python's solution to the same issue.

It would also be helpful for various utilities (less, uglify, etc.) that often read files as UTF-8 to have an upstream option for this.
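For illustration, usage might look something like the sketch below; note that 'utf8-sig' is the proposed, hypothetical option and was never implemented in core:

var fs = require('fs');

// Reading: a leading BOM (the bytes EF BB BF) would be stripped automatically.
var text = fs.readFileSync('foo.txt', 'utf8-sig');

// Writing: a BOM would be prepended automatically.
fs.writeFileSync('foo.txt', text, 'utf8-sig');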

@OrangeDog

What does "sig" mean?
"utf8-bom" would make more sense to me.

@dougwilson
Member

It was taken from Python and means "UTF-8 codec with BOM signature".

@bnoordhuis
Member

Per an earlier issue/discussion

What issue/discussion is that?

@koichik

koichik commented Sep 24, 2012

@bnoordhuis - #1918

@bnoordhuis
Member

@koichik Thanks.

I have no strong opinions. Auto-stripping the BOM is convenient (auto-writing is probably more involved) but it's easily handled by user code. Thoughts?

@rlidwka

rlidwka commented Sep 24, 2012

I would vote for auto-stripping with the default 'utf8' encoding, if it's possible.

Per an earlier issue/discussion, a separate 'utf8-sig' encoding option to strip the BOM when reading

utf8-sig is not good enough, because it's just another encoding that would have to be maintained.

Replying to @koichik's comment from 11 months ago:

var text = fs.readFileSync('foo.txt', 'utf8');
fs.writeFileSync('foo.txt', text, 'utf8');

Conversion to UTF-8 is not lossless, so one shouldn't expect to get back identical bytes anyway. So, if anybody wants to preserve non-text data, they can and should use buffers for that.

PS: I've seen too much code that should strip the BOM but doesn't... so changing this would probably gain more than it loses.

@tracker1
Author

I would be in favor of auto-stripping on read. However, it's writing that tends to be problematic, depending on what other applications you are interacting with; the larger issue is reading, not writing. It was discussed at length before, and there's enough resistance that the existing encoding should probably not change behavior... but adding this minor adjustment could reuse most of the code already in place, with the minor difference of stripping on read and injecting on write, so the handling is transparent to the application.

I'm currently seeing the issue a lot more in Windows land, where a lot of programs tend to add it, and interacting with those files becomes a PITA; it would be much better if the BOM were stripped. I'm hugely in favor of auto-stripping on read, regardless.

@coltrane

I am opposed to "auto-stripping" and "auto-writing" the BOM. I am also opposed to adding a new encoding type to handle this.

The character U+FEFF is only interpreted as a BOM when it exists "at the beginning of a data stream"; otherwise, it's interpreted as "Zero Width No-Break Space" (ZWNBSP). General-purpose Unicode processors (like encoders and decoders) must be able to handle the ZWNBSP character appropriately when it occurs inline as "part of the content of the file or string". See "What should I do with U+FEFF in the middle of a file?".

Since decoding is done by Buffer, we don't have the context necessary to determine whether we're starting from the "beginning of a data stream", or from somewhere in the middle. Indiscriminately dropping occurrences of U+FEFF is not an option, as it would result in also dropping perfectly valid ZWNBSP characters from the content stream.

OTOH, our current approach simply decodes EF BB BF to the character U+FEFF -- this is always correct, regardless of whether it represents a BOM or ZWNBSP. As long as we continue to do this consistently, it's trivial for the application to check the first character of a file or other "data stream", and skip it if it represents a BOM.

var myStr = myBuffer.toString('utf8');
if (myStr.charAt(0) === '\uFEFF') myStr = myStr.substr(1);

Writing a valid BOM to an output stream is just as trivial:

myOutStream.write('\uFEFF' + myStr, 'utf8');

The point is that, only the application knows when it's dealing with the beginning of the file (or "data stream"), and when it's working somewhere in the middle.

@tracker1
Author

@coltrane So the file system methods do not know when they're feeding the beginning of a file into the Buffer?

The fact is, it is a VERY common issue, and one that imho should be handled at the framework level, as opposed to re-invented for every application or module that deals with text files.

The whole purpose of a BOM at the beginning of a file is to be a hint about the content... not to be part of the content... strip the sucker out.

@OrangeDog

@tracker1 The purpose of a BOM is to indicate byte order for the UTF-16 and UTF-32 encodings. The whole idea of using it in UTF-8 is a Windows Notepad hack.

And the file system methods probably don't know, as you might be opening things that aren't files, or not reading from the beginning.

@coltrane

@tracker1: it probably would be possible to add this functionality to a few specific filesystem methods as a special case, but it would not be possible to add it, with any sort of reasonable semantics, to all the other places where encodings are used (Buffer, StringDecoder, net streams, and many other places).

IMO, handling the BOM belongs in application logic. If you'd like to avoid re-inventing that logic in each application, then build re-usable classes that will wrap node's lower-level streams, to provide more full-featured handling of complete unicode files.
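For example, a minimal sketch of such a userland wrapper (readTextFile is a made-up name, not part of node):

var fs = require('fs');

// Wrap fs.readFile so callers get UTF-8 text with any leading BOM removed.
function readTextFile(path, callback) {
  fs.readFile(path, 'utf8', function(err, text) {
    if (err) return callback(err);
    // Only the first character of a file can be a BOM; inline U+FEFF is ZWNBSP.
    if (text.charAt(0) === '\uFEFF') text = text.substr(1);
    callback(null, text);
  });
}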

@rlidwka

rlidwka commented Sep 25, 2012

Since decoding is done by Buffer, we don't have the context necessary to determine whether we're starting from the "beginning of a data stream", or from somewhere in the middle.

Agreed. And it means that Buffer should not insert a BOM by default.

Indiscriminately dropping occurrences of U+FEFF is not an option, as it would result in also dropping perfectly valid ZWNBSP characters from the content stream.

I don't agree. According to Wikipedia: "In Unicode 3.2, this usage is deprecated in favour of the 'Word Joiner' character, U+2060. This allows U+FEFF to be only used as a BOM."

It means that since 3/27/2002, U+FEFF should be used only as a BOM, which means we can safely strip all U+FEFFs outright.
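In code, the blanket strip being proposed would be something like this one-liner (coltrane pushes back on this below):

text = text.replace(/\uFEFF/g, '');  // drop every U+FEFF, BOM or not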

@coltrane

@rlidwka
According to Wikipedia: "In Unicode 3.2, this usage is deprecated in favour of the 'Word Joiner' character, U+2060. This allows U+FEFF to be only used as a BOM."

The Unicode Consortium uses the term "deprecated" slightly differently than most other standards bodies. Unicode Character Database, Section 5.12 explains. It concludes by pointing out:

... Conformant implementations of Unicode processes such as Unicode normalization *must* 
handle even deprecated characters correctly.

It means that since 3/27/2002, U+FEFF should be used only as a BOM, which means we can safely strip all U+FEFFs outright.

The Unicode standard disagrees.
From Unicode Standard, Chapter 16.8:

If U+FEFF had only the semantics of a signature code point, it could be freely deleted from 
text without affecting the interpretation of the rest of the text. Carelessly appending files 
together, for example, can result in a signature code point in the middle of text. Unfortunately, 
U+FEFF also has significance as a character. As a zero width no-break space, it indicates 
that line breaks are not allowed between the adjoining characters. Thus U+FEFF affects the
interpretation of text and cannot be freely deleted.

At the application-level you might choose to strip all U+FEFFs if you know that you're working with a file or protocol where ZWNBSP should never occur. But Node's encoder/decoder is a generic utility that might operate on any kind of Unicode document or fragment. It needs to be able to decode all valid Unicode content correctly -- even if that content includes ZWNBSP.

@rlidwka

rlidwka commented Sep 26, 2012

@coltrane, all right, I'll try to make another argument here.

You are right: Buffer does not have the necessary context to determine whether it holds the beginning of a data stream. And that's why Buffer must assume that it starts at the beginning of a data stream, and it's up to the application to provide the necessary context when it doesn't.

If an application fills a buffer with an arbitrary part of a data stream and then converts it to Unicode, it should be prepared for data loss. If it is not, it is wrong, it is broken, and you should file a bug report against it.

PS:
Buffer([0xd0]).toString() + Buffer([0xb0]).toString() doesn't equal Buffer([0xd0, 0xb0]).toString(); you can test that.

So why on Earth do you think that Buffer([0x20]).toString() + Buffer([0xef, 0xbb, 0xbf]).toString() should equal Buffer([0x20, 0xef, 0xbb, 0xbf]).toString()?
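A quick check bears this out (results shown as comments; the split bytes each decode to the U+FFFD replacement character):

var a = Buffer([0xd0]).toString() + Buffer([0xb0]).toString();
var b = Buffer([0xd0, 0xb0]).toString();
console.log(a === b);  // false
console.log(b);        // 'а' -- CYRILLIC SMALL LETTER A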

@langpavel

Is this discussion about stripping from the beginning of a buffer? A Buffer is NOT a stream, and a stream has no encoding (in node), right? Everything else should be done in userland.

@rlidwka

rlidwka commented Sep 28, 2012

@langpavel, this discussion is about how to deal with a BOM at the beginning of a buffer when doing Buffer.toString().

@langpavel

@rlidwka I know. But can you predict the impact of introducing this? And the strange, hard-to-understand bugs? This is deferred from core to userland, and that is good from my point of view. We cannot distinguish between the beginning of a stream and a subsequent packet in a Buffer.

@rlidwka

rlidwka commented Sep 28, 2012

@langpavel, we already have these bugs.

If some software converts packets to UTF-8 one at a time, it will fail whenever the first byte of a multibyte UTF-8 sequence appears in one packet and the last byte in another. This way "тест" becomes "те��т" if the 3rd letter gets split between two packets like this: d1 82 d0 b5 d1 + 81 d1 82.

So yes, I can predict the impact of this. Good software will still be good, buggy software will still be buggy. But some software that doesn't know about the BOM may start to work correctly.

@langpavel

@rlidwka ok, agreed. Can you explain the scenario a little more? I think this should be filed as a separate bug. What you're saying is that we try to decode the second (or a later) data packet before concatenating... OK, that can be a bug, but not in Buffer. That is the point: you want to translate something without context.

The solution should be a transparent _encoding stream_ that gives you only valid characters.

You cannot give a Buffer some arbitrary bytes and expect correct output.

My suggestion is a new encoding package that behaves as a transparent stream.

@rlidwka

rlidwka commented Sep 28, 2012

OK, that can be a bug, but not in Buffer. That is the point: you want to translate something without context.

I'm trying to say that Buffer.toString() should be called only when the Buffer has all the data (as when reading an entire file). And that means stripping a BOM at the beginning of a Buffer is quite natural.

But if you don't have all the data, only a stream, you should use stream.setEncoding('utf-8') instead.

So the argument that the "encoder/decoder is a generic utility that might operate on any kind of Unicode document or fragment" is invalid: Buffer should always have all the available data and never work with fragments.

My suggestion is new encoding package which should behave as transparent stream.

That encoding package already exists: it's called string_decoder, and it is used implicitly when you call stream.setEncoding, so everything is right here (except perhaps the unclear documentation).
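For illustration, here is string_decoder correctly reassembling the split "тест" packets from the earlier comment (results shown as comments):

var StringDecoder = require('string_decoder').StringDecoder;
var decoder = new StringDecoder('utf8');

var out = decoder.write(Buffer([0xd1, 0x82, 0xd0, 0xb5, 0xd1]));  // 'те' -- the trailing 0xd1 is buffered
out += decoder.write(Buffer([0x81, 0xd1, 0x82]));                 // 'ст'
console.log(out);  // 'тест'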

@langpavel

That encoding package already exists: it's called string_decoder, and it is used implicitly when you call stream.setEncoding, so everything is right here (except perhaps the unclear documentation).

Ok, my mistake.

I'm trying to say that Buffer.toString() should be called only when the Buffer has all the data (as when reading an entire file). And that means stripping a BOM at the beginning of a Buffer is quite natural.

Is there a problem with setEncoding? You're suggesting stripping the BOM from the beginning, right? OK, stripping the BOM at the beginning of a UTF-8 stream should be the default. (Is there a case NOT to?)

This suggests introducing a flag on the stream (or a separate encoding name, as you suggest; I'm not sure that's a good idea... is there a reason to default this flag to truthy? -- for non-HTTP...)? Sorry, hastily written...

Summary:

  • No encoding on Buffer. A Buffer is bytes.
  • A stream should detect the BOM (UTF-8 for now, but in the future not only that (no magic, for sure)) and emit only valid strings (if asked for; otherwise Buffers).

@rlidwka

rlidwka commented Sep 28, 2012

Yep, but fs.readFile() returns one big Buffer, if I'm not mistaken; no streams there. So the whole discussion is about buffers.

@langpavel

@rlidwka But you can pass an encoding to the fs.readFile call, can't you? And in that case you get what you want; it defaults to a binary Buffer instance otherwise, right?

@rlidwka

rlidwka commented Sep 28, 2012

But you can pass an encoding to the fs.readFile call, can't you? And in that case you get what you want; it defaults to a binary Buffer instance otherwise, right?

Yeah, and that's how it's implemented: https://github.com/joyent/node/blob/master/lib/fs.js#L184

@bnoordhuis
Member

Yep, but fs.readFile() returns one big Buffer, if I'm not mistaken; no streams there. So the whole discussion is about buffers.

There's also fs.createReadStream() and fs.createWriteStream().

@bnoordhuis
Member

No activity in over a year. Closing.
