Add TextEncoderStream and TextDecoderStream transform streams #149

ricea · 2018-07-18T09:23:22Z

Integrate with the streams standard by adding TextEncoderStream and
TextDecoderStream transform streams to the standard. These enable
binary<>text conversions on a ReadableStream using the pipeThrough()
method (see https://streams.spec.whatwg.org/#rs-pipe-through).

A TextEncoderStream object can be used to transform a stream of strings
to a stream of bytes in UTF-8 encoding. A TextDecoderStream object can
be used to transform a stream of bytes in the encoding passed to the
constructor to strings.
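
A minimal sketch of the usage described above, assuming a runtime that exposes `ReadableStream`, `TextEncoderStream` and `TextDecoderStream` as globals (a recent browser or Node.js 18+). The helper names `streamFrom` and `roundTrip` are made up for illustration and do not come from the PR.

```javascript
// Build a ReadableStream that emits the given chunks and then closes.
function streamFrom(chunks) {
  return new ReadableStream({
    start(controller) {
      for (const chunk of chunks) controller.enqueue(chunk);
      controller.close();
    },
  });
}

// Encode strings to UTF-8 bytes, decode them again, and return the text.
async function roundTrip(strings) {
  const decoded = streamFrom(strings)
    .pipeThrough(new TextEncoderStream())         // strings -> Uint8Array chunks
    .pipeThrough(new TextDecoderStream("utf-8")); // bytes -> strings

  let text = "";
  const reader = decoded.getReader();
  for (;;) {
    const { value, done } = await reader.read();
    if (done) return text;
    text += value;
  }
}
```

Calling `roundTrip(["héllo, ", "wörld"])` should resolve to the concatenation of the input strings, since every string passes through both transforms unchanged in content.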

There is a prollyfill and tests for the new functionality at
https://github.com/GoogleChromeLabs/text-encode-transform-prollyfill.

Closes #72.


domenic

So sorry for the delay on this. Will be more responsive in the future.

I left editorial comments about spec organization. The algorithms seem solid to me from the streams perspective, but I am not an expert on the encoding side of things.

Overall this looks pretty good. Hopefully @annevk can review as well.

domenic · 2018-07-31T16:01:07Z

encoding.bs

@@ -15,6 +15,19 @@ Translate IDs: dictdef-textdecoderoptions textdecoderoptions,dictdef-textdecodeo
 spec:infra; type:dfn;
    text:code point
    text:ascii case-insensitive
+spec:streams;
+    type:interface; text:ReadableStream
+    type:dfn; text:chunk


We should export these in streams so that this is not necessary, for the dfns at least. For the interface, that's a separate issue: whatwg/fetch#780 so we can leave it here for now.

All except "transform stream" were already exported. I created whatwg/streams#949 to export that too.

domenic · 2018-07-31T16:02:43Z

encoding.bs

+and <dfn for=TextDecoderAttributes>error mode</dfn>.
+
+<p>The <dfn attribute for=TextDecoderAttributes><code>encoding</code></dfn> attribute's getter must
+return <a for=TextDecoderAttributes>encoding</a>'s <a for=encoding>name</a> in <a>ASCII


While moving things around maybe use the modern phrasing "this object's encoding's" or "this TextDecoderAttribute object's encoding's". Similarly for the rest.

I've added it to the accessors, but I will probably need to add it to some of the algorithms as well.

domenic · 2018-07-31T16:04:41Z

encoding.bs

@@ -1038,6 +1051,33 @@ function decodeArrayOfStrings(buffer, encoding) {
 </div>


+<h3 id=interface-mixin-textdecoderattributes>Interface Mixin {{TextDecoderAttributes}}</h3>


Nit: lowercase "mixin" in headings, here and below. (Headings are sentence case.)

domenic · 2018-07-31T16:05:17Z

encoding.bs

+</pre>
+
+<p>An object that includes GenericTransformStream always has an associated
+<dfn for=GenericTransformStream>transform</dfn>.


Probably worth stating its type.

domenic · 2018-07-31T16:08:11Z

encoding.bs

+};
+</pre>
+
+<p>An object that includes GenericTransformStream always has an associated


{{GenericTransformStream}}

domenic · 2018-07-31T16:12:10Z

encoding.bs

+
+<p>A {{TextDecoderStream}} object also has an associated <dfn id=concept-tds-serialize
+for=TextDecoderStream>serialize stream</dfn> algorithm, that is identical to the <a
+for=TextDecoder>serialize stream</a> algorithm for the equivalent {{TextDecoder}}.


Can this move to TextDecoderAttributes since it's shared? Not sure myself.

I think I can do it if I rename TextDecoderAttributes to TextDecoderCommon and move more of the object's "slots" over. I won't have time to try it until next week.

domenic · 2018-07-31T16:13:27Z

encoding.bs

+  <p>Typically this will be used via the {{ReadableStream/pipeThrough()}} method on a
+  {{ReadableStream}} source.
+
+  <pre class=example id=example-textdecoderstream-writable><code class=lang-javascript>


FWIW in Bikeshed you can use any indentation/whitespacing you want.

domenic · 2018-07-31T16:15:13Z

encoding.bs

+ promise rejected with a {{TypeError}} exception.
+
+ <li><p><a>Push</a> a <a lt="get a copy of the buffer source">copy of</a> <var>chunk</var> to
+ <var>dec</var>'s <a for=TextDecoderStream>stream</a>.


Hmm one day we should do a transfer variant, of this and all other web platform APIs...

domenic · 2018-07-31T16:16:47Z

encoding.bs

+<p class="note no-backref">A {{TextEncoderStream}} object offers no <var>label</var> argument as it
+only supports <a>UTF-8</a>.
+
+<hr>


Not so sure about this <hr>

Me neither. I've taken it out for now.

domenic · 2018-07-31T16:18:21Z

encoding.bs

+
+<p class=note>This is equivalent to the "<a>convert</a> a <a>JavaScript string</a> into a <a>scalar
+value string</a>" algorithm from the Infra Standard, but allows for surrogate pairs that are split
+between strings.


Add [[INFRA]] for a nice stylistic flourish

* Remove unnecessary link-defaults
* Make headers sentence case
* Linkify GenericTransformStream
* Indicate the GenericTransformStream's *transform* is a TransformStream.
* Add [INFRA] to the reference to the Infra Standard.

Remove the encoding, ignore BOM flag and error mode slots from TextDecoder and TextDecoderStream and reference the versions in TextDecoderAttributes instead. Also remove the "transform" slot from TextDecoderStream and TextEncoderStream and reference GenericTransformStream's version instead.

domenic · 2018-08-03T15:21:21Z

Did a re-review and it all looks great, pending further explorations into sharing things more (especially with regard to the encoder classes).

MattiasBuelens

I think I found a few issues. Quick overview:

* The documentation for TextDecoderStream.writable claims that you can enqueue BufferSource chunks, but the spec text only allows ArrayBuffers. I believe the spec text should be updated.
* We should avoid enqueuing empty strings (in TextDecoderStream) or empty Uint8Arrays (in TextEncoderStream).

MattiasBuelens · 2018-08-04T11:12:24Z

encoding.bs

+
+ <dt><code><var>decoder</var> . <a attribute for=GenericTransformStream>writable</a></code>
+ <dd>
+  <p>Returns a <a>writable stream</a> which accepts {{BufferSource}} chunks and runs them through


A BufferSource is an ArrayBufferView or an ArrayBuffer. However, step 1 of "decode and enqueue a chunk" only accepts non-detached non-shared ArrayBuffers:

If Type(chunk) is not Object, or chunk does not have an [[ArrayBufferData]] internal slot, or IsDetachedBuffer(chunk) is true, or IsSharedArrayBuffer(chunk) is true, then return a new promise rejected with a TypeError exception.

I believe this step should also accept an ArrayBufferView chunk that has a non-detached non-shared backing ArrayBuffer.

Thanks for catching this. My attempt to fix it (6a65ad5) is ugly, but hopefully between us we can find a way to say it nicely.
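
The behaviour requested in this thread can be illustrated with a small sketch, assuming a runtime with a global `TextDecoderStream` (a recent browser or Node.js 18+). The names `decodeAll` and `exampleChunks` are hypothetical helpers; the point is only that both an `ArrayBuffer` and typed-array views are fed to the same writable side.

```javascript
// Pipe the given BufferSource chunks through a TextDecoderStream and
// return all decoded text as one string.
async function decodeAll(chunks, label = "utf-8") {
  const source = new ReadableStream({
    start(controller) {
      for (const chunk of chunks) controller.enqueue(chunk);
      controller.close();
    },
  });
  let text = "";
  const reader = source.pipeThrough(new TextDecoderStream(label)).getReader();
  for (;;) {
    const { value, done } = await reader.read();
    if (done) return text;
    text += value;
  }
}

// Three different BufferSource shapes carrying the ASCII text "abcdef".
function exampleChunks() {
  const bytes = Uint8Array.of(0x61, 0x62, 0x63, 0x64, 0x65, 0x66);
  return [
    bytes.slice(0, 2),        // Uint8Array with its own buffer: "ab"
    bytes.slice(2, 4).buffer, // plain ArrayBuffer: "cd"
    bytes.subarray(4, 6),     // view with a non-zero byteOffset: "ef"
  ];
}
```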

MattiasBuelens · 2018-08-04T11:53:37Z

encoding.bs

+  <li><p>Let <var>controller</var> be <var>dec</var>'s
+  <a for=GenericTransformStream>transform</a>.\[[transformStreamController]].
+
+  <li><p>Call <a abstract-op>TransformStreamDefaultControllerEnqueue</a>(<var>controller</var>,


Skip this step if outputChunk is the empty string. In this case, the prollyfill does handle it correctly.

MattiasBuelens · 2018-08-04T11:54:04Z

encoding.bs

+     <li><p>Let <var>outputChunk</var> be <var>output</var>,
+     <a lt="serialize stream" for=TextDecoderStream>serialized</a>.
+
+     <li><p>Call <a abstract-op>TransformStreamDefaultControllerEnqueue</a>(<var>controller</var>,


Skip this step if outputChunk is the empty string. For example, this could happen if chunk is empty, or if the decoder is in the middle of a multi-byte character.
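
To make the suggestion concrete, here is a rough sketch of a decoder transform that skips empty output. It is a hypothetical stand-in built on `TextDecoder` and `TransformStream`, not the spec text and not the prollyfill's code; `sketchDecoderStream` and `runThrough` are illustrative names.

```javascript
// Hypothetical stand-in for TextDecoderStream that never enqueues "".
function sketchDecoderStream(label = "utf-8") {
  const decoder = new TextDecoder(label);
  return new TransformStream({
    transform(chunk, controller) {
      // stream: true keeps incomplete byte sequences for the next chunk.
      const text = decoder.decode(chunk, { stream: true });
      if (text !== "") controller.enqueue(text);
    },
    flush(controller) {
      // A final decode() flushes any bytes still held by the decoder.
      const text = decoder.decode();
      if (text !== "") controller.enqueue(text);
    },
  });
}

// Feed chunks through a transform and collect the output chunks.
async function runThrough(chunks, transform) {
  const source = new ReadableStream({
    start(controller) {
      for (const chunk of chunks) controller.enqueue(chunk);
      controller.close();
    },
  });
  const out = [];
  const reader = source.pipeThrough(transform).getReader();
  for (;;) {
    const { value, done } = await reader.read();
    if (done) return out;
    out.push(value);
  }
}
```

With the three UTF-8 bytes of "€" split as `[0xE2]` and `[0x82, 0xAC]`, the first chunk decodes to nothing and is dropped, so only one string chunk comes out.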

MattiasBuelens · 2018-08-04T12:08:53Z

encoding.bs

+    <ol>
+     <li><p>Convert <var>output</var> into a byte sequence.
+
+     <li><p>Let <var>chunk</var> be a {{Uint8Array}} object wrapping an {{ArrayBuffer}} containing


Once again, skip this step and the following enqueue when output is an empty byte sequence. For example, this could happen if chunk is empty, but also if chunk consists of one single high surrogate.
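
The encoder side can be sketched the same way. The following is an illustrative approximation only, assuming `TextEncoder` and `TransformStream` are available as globals; `sketchEncoderStream` and `runThrough` are hypothetical names. It holds back a trailing high surrogate so that a pair split across two strings is encoded as one code point, and it never enqueues an empty `Uint8Array`.

```javascript
// Hypothetical stand-in for TextEncoderStream.
function sketchEncoderStream() {
  const encoder = new TextEncoder();
  let pendingHighSurrogate = "";
  return new TransformStream({
    transform(chunk, controller) {
      let text = pendingHighSurrogate + String(chunk);
      pendingHighSurrogate = "";
      // If the text ends in a high surrogate, keep it for the next chunk.
      const last = text.charCodeAt(text.length - 1);
      if (last >= 0xd800 && last <= 0xdbff) {
        pendingHighSurrogate = text.slice(-1);
        text = text.slice(0, -1);
      }
      if (text !== "") controller.enqueue(encoder.encode(text));
    },
    flush(controller) {
      // An unpaired high surrogate at end of input is encoded as U+FFFD.
      if (pendingHighSurrogate !== "") {
        controller.enqueue(encoder.encode(pendingHighSurrogate));
      }
    },
  });
}

// Feed chunks through a transform and collect the output chunks.
async function runThrough(chunks, transform) {
  const source = new ReadableStream({
    start(controller) {
      for (const chunk of chunks) controller.enqueue(chunk);
      controller.close();
    },
  });
  const out = [];
  const reader = source.pipeThrough(transform).getReader();
  for (;;) {
    const { value, done } = await reader.read();
    if (done) return out;
    out.push(value);
  }
}
```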

ricea · 2018-08-06T14:52:13Z

@MattiasBuelens Thank you for the review.

We should avoid enqueuing empty strings (in TextDecoderStream) or empty Uint8Arrays (in TextEncoderStream).

I'm not completely sure about this. I originally preserved empty chunks on the assumption that users would expect that if they put an empty chunk in they would get an empty chunk out.

I implemented this at https://github.com/GoogleChromeLabs/text-encode-transform-prollyfill/compare/never-empty?expand=1. Please take a look at the test changes and see what you think of the changed semantics.

MattiasBuelens · 2018-08-06T16:05:03Z

@ricea I think I prefer the updated semantics.

I can't really think of a use case where you'd be interested in empty chunks/strings. I feel like it'd just make user code unnecessarily more complicated, e.g.:

stream
    .pipeThrough(new TextEncoderStream())
    .pipeThrough(new TransformStream({
        transform(chunk, controller) {
            if (chunk.byteLength === 0) {
                return; // what else would you do here?
            }
            // do stuff
        }
    }));

I'm not aware of any spec that enqueues empty chunks to a stream. Fetch doesn't do it:

12.1.1.1. If one or more bytes have been transmitted from response’s message body, then:
12.1.1.1.1. Let bytes be the transmitted bytes.
12.1.1.1.2. [...]
12.1.1.1.6. Enqueue a Uint8Array object wrapping an ArrayBuffer containing bytes to stream. [...]

MattiasBuelens · 2018-08-06T16:08:32Z

Regardless of whether we decide to allow enqueuing an empty chunk/string inside decode/encode and enqueue a chunk, we definitely should not enqueue an empty chunk inside flush and enqueue. 😉

Rename TextXcoderAlgorithms to TextXcoderCommon and move the "serialize stream" from TextDecoder to TextDecoderCommon so that it can be shared sanely with TextDecoderStream.

domenic · 2018-08-07T16:06:29Z

I think for ArrayBufferView we can lean on Web IDL more:

Let bufferSource be the result of converting chunk to a BufferSource.

Then "a copy of" works well.

ricea · 2018-08-08T05:21:27Z

I'm happy with this now.

@MattiasBuelens what do you think?

MattiasBuelens

Good call using convert to a BufferSource, looks much nicer this way! 😄

However, we now have an issue where detached buffers are no longer handled correctly. Should be easy enough to fix though.

MattiasBuelens · 2018-08-08T08:44:03Z

encoding.bs

+ <a lt="converted to an IDL value">converting</a> <var>chunk</var> to a {{BufferSource}}. If this
+ throws an exception, then return a promise rejected with that exception.
+
+ <li><p><a>Push</a> a <a lt="get a copy of the buffer source">copy of</a> <var>bufferSource</var> to


In 6a65ad5, we explicitly checked for IsDetachedBuffer(arrayBuffer) and returned a rejected promise if the buffer was detached. However, none of the three algorithms that make up convert to a BufferSource check for detached buffers.

When get a copy of the buffer source encounters a detached buffer, it throws a TypeError in step 7. This is not what we want: we want to return a rejected promise instead!

I think the easiest fix would be to add the following line to this step, similar to the previous step:

If this throws an exception, then return a promise rejected with that exception.

Great catch, thanks.

I did what you proposed. This has the added benefit that we are always in sync with the checks and behaviour provided by WebIDL.

We no longer have explicit language rejecting detached buffers; however, the "get a copy of the buffer source" algorithm will throw exceptions for them. Convert those exceptions to rejections to get the effect of the check.
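
The idea of turning exceptions from the copy step into rejections can be sketched roughly as follows. This is not the spec algorithm: `copyBufferSource` and `decodeAndEnqueue` are hypothetical helpers, the copy relies on the `Uint8Array` constructor throwing a `TypeError` for a detached `ArrayBuffer`, and shared buffers are not treated specially here.

```javascript
// Copy the bytes of a BufferSource. Throws a TypeError for anything else,
// and for detached buffers (the Uint8Array constructor rejects those).
function copyBufferSource(chunk) {
  if (chunk instanceof ArrayBuffer) {
    return new Uint8Array(chunk).slice();
  }
  if (ArrayBuffer.isView(chunk)) {
    return new Uint8Array(chunk.buffer, chunk.byteOffset, chunk.byteLength).slice();
  }
  throw new TypeError("chunk is not a BufferSource");
}

// Hypothetical "decode and enqueue a chunk": never throws synchronously.
// Any exception from the copy or the decode becomes a rejected promise.
function decodeAndEnqueue(decoder, chunk, enqueue) {
  try {
    const bytes = copyBufferSource(chunk);
    const text = decoder.decode(bytes, { stream: true });
    if (text !== "") enqueue(text);
    return Promise.resolve();
  } catch (error) {
    return Promise.reject(error);
  }
}
```

A caller can then treat every failure uniformly with `await` or `.catch()`, which is the effect the explicit detached-buffer check used to provide.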

MattiasBuelens

One tiny nitpick left, but looks good to go for me.

MattiasBuelens · 2018-08-08T11:41:05Z

encoding.bs

@@ -1485,7 +1485,8 @@ byteReadable
 throws an exception, then return a promise rejected with that exception.

 <li><p><a>Push</a> a <a lt="get a copy of the buffer source">copy of</a> <var>bufferSource</var> to
- <var>dec</var>'s <a for=TextDecoderStream>stream</a>.
+ <var>dec</var>'s <a for=TextDecoderStream>stream</a>. If this throws an exception, then return a
+ promise rejected with that exception


Tiny nit: missing a period at the end of this sentence. 😛

annevk · 2018-08-08T14:59:57Z

I trust that you got the various technical aspects correct here, especially after the careful review from others. I would like to ask, though, that you give the various <dfn> elements whose for attribute you have changed an id attribute (as needed) to preserve their existing ID. Not breaking those links seems worthwhile.

I also pushed a commit that fixes various other things I noticed while looking through the text.

annevk · 2018-08-08T15:01:35Z

I suggest we also add @MattiasBuelens to the Acknowledgments section for their help in reviewing.

Also stop explicitly importing "transform stream" as it is now exported from the streams standard.

ricea · 2018-08-09T15:03:55Z

I would like to ask you though that for the various elements which you have changed the for attribute of (as needed) to give them an id attribute to preserve their existing ID.

Please take a look at af33687 and let me know whether I have done this correctly.

hsivonen · 2018-08-10T11:54:39Z

Is there a reason why TextDecoderStream doesn't provide a mode with the semantics of the "decode" algorithm for BOM handling (i.e. use the encoding if there's no BOM but honor the BOM if there is one)? These semantics are necessary to consume text/plain and these semantics are hard for developers to get right in the streaming context.

Also, does one input chunk always result in one output chunk that corresponds to potential pending partial code unit sequence and the input chunk except for potential partial code unit sequence at the end of the input chunk? (If not, I'm a bit worried about Web content developing a dependency on browser-specific chunk boundaries where the boundaries are not supposed to mean anything.)

ricea · 2018-08-10T13:08:21Z

Is there a reason why TextDecoderStream doesn't provide a mode with the semantics of the "decode" algorithm for BOM handling

The short answer is "because TextDecoder doesn't". I don't know the historical reason why TextDecoder doesn't.

I agree that BOM sniffing is necessary to parse legacy content, and that it can be hard to get right. However, I'd like to discourage people from relying on this behaviour. I'd like to move away from any kind of heuristic behaviour and towards a world where everyone uses UTF-8, and that's what the default behaviour encourages.

If we find compelling use cases for easily reading legacy content we may need to support it in future, but my personal preference would be to leave it out until it is proven necessary.
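
For readers wondering what the BOM-honoring semantics would involve in a streaming context, here is a rough sketch of how a page could layer them on top of `TextDecoder` today. It is purely illustrative and not part of this PR or of the standard: `bomSniffingDecoderStream` and `decodeWithBomSniffing` are hypothetical names, and every incoming chunk is assumed to be a `Uint8Array`.

```javascript
// Hypothetical transform: use the encoding named by a leading BOM if there
// is one, otherwise fall back to fallbackLabel.
function bomSniffingDecoderStream(fallbackLabel = "utf-8") {
  let decoder = null;              // chosen once enough bytes have arrived
  let pending = new Uint8Array(0); // bytes held back while deciding

  // Return the encoding named by a leading BOM, or null if there is none.
  function sniff(bytes) {
    const [a, b, c] = bytes;
    if (a === 0xef && b === 0xbb && c === 0xbf) return "utf-8";
    if (a === 0xfe && b === 0xff) return "utf-16be";
    if (a === 0xff && b === 0xfe) return "utf-16le";
    return null;
  }

  return new TransformStream({
    transform(chunk, controller) {
      if (decoder === null) {
        // Hold bytes back until three have arrived (the longest BOM).
        const merged = new Uint8Array(pending.length + chunk.length);
        merged.set(pending);
        merged.set(chunk, pending.length);
        if (merged.length < 3) {
          pending = merged;
          return;
        }
        decoder = new TextDecoder(sniff(merged) ?? fallbackLabel);
        chunk = merged;
        pending = new Uint8Array(0);
      }
      const text = decoder.decode(chunk, { stream: true });
      if (text !== "") controller.enqueue(text);
    },
    flush(controller) {
      // Short input: decide with whatever bytes arrived.
      if (decoder === null) {
        decoder = new TextDecoder(sniff(pending) ?? fallbackLabel);
      }
      const text = decoder.decode(pending);
      if (text !== "") controller.enqueue(text);
    },
  });
}

// Convenience wrapper: decode all chunks and return one string.
async function decodeWithBomSniffing(chunks, fallbackLabel = "utf-8") {
  const source = new ReadableStream({
    start(controller) {
      for (const chunk of chunks) controller.enqueue(chunk);
      controller.close();
    },
  });
  let text = "";
  const reader = source
    .pipeThrough(bomSniffingDecoderStream(fallbackLabel))
    .getReader();
  for (;;) {
    const { value, done } = await reader.read();
    if (done) return text;
    text += value;
  }
}
```

Even this small sketch has to buffer across chunk boundaries before it can pick a decoder, which illustrates why the semantics are described above as hard to get right.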

Also, does one input chunk always result in one output chunk that corresponds to potential pending partial code unit sequence and the input chunk except for potential partial code unit sequence at the end of the input chunk?

Input chunk to output chunk correspondence is normally 1:1, except that we don't output empty chunks, and an extra chunk may be output at the end of the stream if we discover the input was incomplete.

(If not, I'm a bit worried about Web content developing a dependency on browser-specific chunk boundaries where the boundaries are not supposed to mean anything.)

The semantics are strictly greedy. Given the same chunks as input, every browser is required to have exactly the same chunk boundaries in the output.
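
A small sketch of what this means in practice, assuming a runtime with a global `TextDecoderStream` (a recent browser or Node.js 18+); `decodeToChunks` is a hypothetical helper. The same bytes split in two different ways produce different output chunk boundaries but the same text, so code should depend on the concatenated result and not on where the boundaries fall.

```javascript
// Pipe byte chunks through a TextDecoderStream and return the output
// chunks as an array, preserving chunk boundaries.
async function decodeToChunks(byteChunks) {
  const source = new ReadableStream({
    start(controller) {
      for (const chunk of byteChunks) controller.enqueue(chunk);
      controller.close();
    },
  });
  const out = [];
  const reader = source.pipeThrough(new TextDecoderStream()).getReader();
  for (;;) {
    const { value, done } = await reader.read();
    if (done) return out;
    out.push(value);
  }
}

// The UTF-8 bytes of "a€b" (61 E2 82 AC 62), split in two different ways.
const splitInTwo = [
  Uint8Array.of(0x61, 0xe2),
  Uint8Array.of(0x82, 0xac, 0x62),
];
const splitInFour = [
  Uint8Array.of(0x61),
  Uint8Array.of(0xe2, 0x82), // incomplete sequence: decodes to nothing yet
  Uint8Array.of(0xac),
  Uint8Array.of(0x62),
];
```

Each input chunk is decoded as far as possible as soon as it arrives, and a chunk that completes no character yields no output chunk at all.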

ricea · 2018-08-10T13:13:42Z

@annevk Sorry that stray call to decode() crept in. I just re-checked against the IDL we are testing in the idlharness.html test and it matches. Unfortunately I can't just copy-and-paste because idlharness.js doesn't support mixins.

annevk · 2018-08-10T13:18:53Z

@ricea it does support mixins I think? Lots of IDL in https://github.com/web-platform-tests/wpt/tree/master/interfaces uses them.

ricea · 2018-08-10T13:28:14Z

@annevk Oh, I should have tried it!

I've switched my work-in-progress copy to using the IDL from the standard verbatim and it does work.

hsivonen · 2018-08-10T14:35:10Z

If we find compelling use cases for easily reading legacy content we may need to support it in future, but my personal preference would be to leave it out until it is proven necessary.
...
The semantics are strictly greedy. Given the same chunks as input, every browser is required to have exactly the same chunk boundaries in the output.

OK. Thanks.

ricea · 2018-08-10T19:41:22Z

@annevk I have been refining the tests. Would you like me to land them before landing this?

annevk · 2018-08-11T11:15:29Z

Great, just need a pointer to them, ideally they aren't merged until this is fully agreed on.

The other thing that's remaining here is implementer bugs: https://github.com/whatwg/meta/blob/master/MAINTAINERS.md#handling-pull-requests.

The standard change that adds these classes is whatwg/encoding#149.

ricea · 2018-08-13T10:47:59Z

Tests are at web-platform-tests/wpt#12430.

The Chrome bug at least exists: https://bugs.chromium.org/p/chromium/issues/detail?id=845427 😃


annevk · 2018-08-14T13:46:24Z

I pushed some further nits. I think I can say that Mozilla is on board with this PR (per Henri, Till, and me), which brings us to two implementers.

@othermaciej @travisleithead any final feedback / concerns?

@ricea would you file implementation bugs against the other implementation as per my link above?

travisleithead · 2018-08-22T16:57:56Z

Looked it over and don't have any concerns. Great work! I searched for a bug in our external issue tracker, and didn't see one, but we do have Encoding API (the whole spec) noted as In Development, and you can perhaps track this request via: https://wpdev.uservoice.com/forums/257854-microsoft-edge-developer/suggestions/6558040-support-the-encoding-api if it helps :).

ricea · 2018-08-28T23:02:04Z

Sorry for the delay, I was on holiday.

Firefox issue: https://bugzilla.mozilla.org/show_bug.cgi?id=1486949
Safari issue: https://bugs.webkit.org/show_bug.cgi?id=189066

I'm going to assume that https://wpdev.uservoice.com/forums/257854-microsoft-edge-developer/suggestions/6558040-support-the-encoding-api covering the whole Encoding API is sufficient for Edge, at least until they have some of it implemented.


…derStream, a=testonly

Automatic update from web-platform-tests
Encoding: TextEncoderStream and TextDecoderStream

The standard change that adds these classes is whatwg/encoding#149.

--
wpt-commits: 3ede6629030918b00941c2fb7d176a18cbea16ea
wpt-pr: 12430

…derStream, a=testonly

Automatic update from web-platform-tests
Encoding: TextEncoderStream and TextDecoderStream

The standard change that adds these classes is whatwg/encoding#149.

--
wpt-commits: 3ede6629030918b00941c2fb7d176a18cbea16ea
wpt-pr: 12430

UltraBlame original commit: 515ad283f17891ca6a8f9f524faf2a3dfe6c52b0

This was referenced Jul 18, 2018

Make TextEncoder and TextDecoder be transform streams #127

Closed

Add Streams support #72

Closed

domenic reviewed Jul 31, 2018


ricea added 3 commits August 3, 2018 22:56

General fixes

886d920

* Remove unnecessary link-defaults
* Make headers sentence case
* Linkify GenericTransformStream
* Indicate the GenericTransformStream's *transform* is a TransformStream.
* Add [INFRA] to the reference to the Infra Standard.

Add "this object's" in three places

f04369c

MattiasBuelens suggested changes Aug 4, 2018


ricea added 3 commits August 7, 2018 21:45

Make TextDecoderStream accept ArrayBufferView types

6a65ad5

Never enqueue an empty chunk

343e148

Share the "serialize stream" algorithm

6623b0b

Rename TextXcoderAlgorithms to TextXcoderCommon and move the "serialize stream" from TextDecoder to TextDecoderCommon so that it can be shared sanely with TextDecoderStream.

Use WebIDL to convert chunks for decoding to BufferSource

0e64bf8

MattiasBuelens suggested changes Aug 8, 2018


Translate exceptions from "copy bytes" into rejections

407d85d

We no longer have explicit language rejecting detached buffers; however, the "get a copy of the buffer source" algorithm will throw exceptions for them. Convert those exceptions to rejections to get the effect of the check.

MattiasBuelens approved these changes Aug 8, 2018


ricea and others added 2 commits August 8, 2018 21:03

Add missing period.

60854d8

address various nits

e5b36e1

ricea added 4 commits August 9, 2018 23:41

Remove no-longer-needed ECMASCRIPT refs section

12b5161

Also stop explicitly importing "transform stream" as it is now exported from the streams standard.

Add one missing word "object"

dd65b39

Add Mattias Buelens to the acknowledgements

66aaab3

Add ids to dfns that have moved to TextXcoderCommon

af33687

annevk added 2 commits August 10, 2018 13:43

formatting nits

29fdba4

TextDecoderStream should not have decode()

256a138

ricea added a commit to ricea/web-platform-tests that referenced this pull request Aug 13, 2018

Tests for TextEncoderStream and TextDecoderStream

3f4d37d

The standard change that adds these classes is whatwg/encoding#149.

ricea mentioned this pull request Aug 13, 2018

Tests for TextEncoderStream and TextDecoderStream web-platform-tests/wpt#12430

Merged

ricea added a commit to ricea/web-platform-tests that referenced this pull request Aug 14, 2018

Tests for TextEncoderStream and TextDecoderStream

1e3736c

The standard change that adds these classes is whatwg/encoding#149.

more nits

b9dfb01

annevk pushed a commit to web-platform-tests/wpt that referenced this pull request Aug 29, 2018

Encoding: TextEncoderStream and TextDecoderStream

3ede662

The standard change that adds these classes is whatwg/encoding#149.

annevk merged commit c3e3887 into whatwg:master Aug 29, 2018

ricea deleted the new-stream-support branch September 3, 2018 12:45
