-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add TextEncoderStream and TextDecoderStream transform streams #149
Conversation
Integrate with the streams standard by adding TextEncoderStream and TextDecoderStream transform streams to the standard. These enable binary<>text conversions on a ReadableStream using the `pipeThrough()` method (see https://streams.spec.whatwg.org/#rs-pipe-through). A TextEncoderStream object can be used to transform a stream of strings to a stream of bytes in UTF-8 encoding. A TextDecoderStream object can be used to transform a stream of bytes in the encoding passed to the constructor to strings. There is a prollyfill and tests for the new functionality at https://github.com/GoogleChromeLabs/text-encode-transform-prollyfill. Closes whatwg#72.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So sorry on the delay on this. Will be more responsive in the future.
I left editorial comments about spec organization. The algorithms seem solid to me from the streams perspective, but I am not an expert on the encoding side of things.
Overall this looks pretty good. Hopefully @annevk can review as well.
encoding.bs
Outdated
@@ -15,6 +15,19 @@ Translate IDs: dictdef-textdecoderoptions textdecoderoptions,dictdef-textdecodeo | |||
spec:infra; type:dfn; | |||
text:code point | |||
text:ascii case-insensitive | |||
spec:streams; | |||
type:interface; text:ReadableStream | |||
type:dfn; text:chunk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should export these in streams so that this is not necessary, for the dfns at least. For the interface, that's a separate issue: whatwg/fetch#780 so we can leave it here for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All except "transform stream" were already exported. I created whatwg/streams#949 to export that too.
encoding.bs
Outdated
and <dfn for=TextDecoderAttributes>error mode</dfn>. | ||
|
||
<p>The <dfn attribute for=TextDecoderAttributes><code>encoding</code></dfn> attribute's getter must | ||
return <a for=TextDecoderAttributes>encoding</a>'s <a for=encoding>name</a> in <a>ASCII |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While moving things around maybe use the modern phrasing "this object's encoding's" or "this TextDecoderAttribute object's encoding's". Similarly for the rest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added it to the accessors, but I will probably need to add it some of the algorithms as well.
encoding.bs
Outdated
@@ -1038,6 +1051,33 @@ function decodeArrayOfStrings(buffer, encoding) { | |||
</div> | |||
|
|||
|
|||
<h3 id=interface-mixin-textdecoderattributes>Interface Mixin {{TextDecoderAttributes}}</h3> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: lowercase "mixin" in headings, here and below. (Headings are sentence case.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
encoding.bs
Outdated
</pre> | ||
|
||
<p>An object that includes GenericTransformStream always has an associated | ||
<dfn for=GenericTransformStream>transform</dfn>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably worth stating its type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
encoding.bs
Outdated
}; | ||
</pre> | ||
|
||
<p>An object that includes GenericTransformStream always has an associated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{{GenericTransformStream}}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
encoding.bs
Outdated
|
||
<p>A {{TextDecoderStream}} object also has an associated <dfn id=concept-tds-serialize | ||
for=TextDecoderStream>serialize stream</dfn> algorithm, that is identical to the <a | ||
for=TextDecoder>serialize stream</a> algorithm for the equivalent {{TextDecoder}}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this move to TextDecoderAttributes since it's shared? Not sure myself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I can do it if I rename TextDecoderAttributes to TextDecoderCommon and move more of the object's "slots" over. I won't have time to try it until next week.
<p>Typically this will be used via the {{ReadableStream/pipeThrough()}} method on a | ||
{{ReadableStream}} source. | ||
|
||
<pre class=example id=example-textdecoderstream-writable><code class=lang-javascript> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW in Bikeshed you can use any indentation/whitespacing you want.
encoding.bs
Outdated
promise rejected with a {{TypeError}} exception. | ||
|
||
<li><p><a>Push</a> a <a lt="get a copy of the buffer source">copy of</a> <var>chunk</var> to | ||
<var>dec</var>'s <a for=TextDecoderStream>stream</a>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm one day we should do a transfer variant, of this and all other web platform APIs...
encoding.bs
Outdated
<p class="note no-backref">A {{TextEncoderStream}} object offers no <var>label</var> argument as it | ||
only supports <a>UTF-8</a>. | ||
|
||
<hr> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not so sure about this <hr>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Me neither. I've taken it out for now.
encoding.bs
Outdated
|
||
<p class=note>This is equivalent to the "<a>convert</a> a <a>JavaScript string</a> into a <a>scalar | ||
value string</a>" algorithm from the Infra Standard, but allows for surrogate pairs that are split | ||
between strings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add [[INFRA]]
for a nice stylistic flourish
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* Remove unnecessary link-defaults * Make headers sentence case * Linkify GenericTransformStream * Indicate the GenericTransformStream's *transform* is a TransformStream. * Add [INFRA] to the reference to the Infra Standard.
Remove the encoding, ignore BOM flag and error mode slots from TextDecoder and TextDecoderStream and reference the versions in TextDecoderAttributes instead. Also remove the "transform" slot from TextDecoderStream and TextEncoderStream and reference GenericTransformStream's version instead.
Did a re-review and it all looks great, pending further explorations into sharing things more (especially with regard to the encoder classes). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I found a few issues. Quick overview:
- The documentation for
TextDecoderStream.writable
claims that you can enqueueBufferSource
chunks, but the spec text only allowsArrayBuffer
s. I believe the spec text should be updated. - We should avoid enqueuing empty strings (in
TextDecoderStream
) or emptyUint8Array
s (inTextEncoderStream
).
|
||
<dt><code><var>decoder</var> . <a attribute for=GenericTransformStream>writable</a></code> | ||
<dd> | ||
<p>Returns a <a>writable stream</a> which accepts {{BufferSource}} chunks and runs them through |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A BufferSource
is a ArrayBufferView
or an ArrayBuffer
. However, step 1 of decode and enqueue a chunk only accepts non-detached non-shared ArrayBuffer
s:
- If Type(chunk) is not Object, or chunk does not have an [[ArrayBufferData]] internal slot, or IsDetachedBuffer(chunk) is true, or IsSharedArrayBuffer(chunk) is true, then return a new promise rejected with a TypeError exception.
I believe this step should also accept an ArrayBufferView
chunk that has a non-detached non-shared backing ArrayBuffer
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this. My attempt to fix it (6a65ad5) is ugly, but hopefully between us we can find a way to say it nicely.
encoding.bs
Outdated
<li><p>Let <var>controller</var> be <var>dec</var>'s | ||
<a for=GenericTransformStream>transform</a>.\[[transformStreamController]]. | ||
|
||
<li><p>Call <a abstract-op>TransformStreamDefaultControllerEnqueue</a>(<var>controller</var>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Skip this step if outputChunk
is the empty string. In this case, the prollyfill does handle it correctly.
encoding.bs
Outdated
<li><p>Let <var>outputChunk</var> be <var>output</var>, | ||
<a lt="serialize stream" for=TextDecoderStream>serialized</a>. | ||
|
||
<li><p>Call <a abstract-op>TransformStreamDefaultControllerEnqueue</a>(<var>controller</var>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Skip this step if outputChunk
is the empty string. For example, this could happen if chunk
is empty, or if the decoder is in the middle of a multi-byte character.
encoding.bs
Outdated
<ol> | ||
<li><p>Convert <var>output</var> into a byte sequence. | ||
|
||
<li><p>Let <var>chunk</var> be a {{Uint8Array}} object wrapping an {{ArrayBuffer}} containing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once again, skip this step and the following enqueue when output
is an empty byte sequence. For example, this could happen if chunk
is empty, but also if chunk
consists of one single high surrogate.
@MattiasBuelens Thank you for the review.
I'm not completely sure about this. I originally preserved empty chunks on the assumption that users would expect that if they put an empty chunk in they would get an empty chunk out. I implemented this at https://github.com/GoogleChromeLabs/text-encode-transform-prollyfill/compare/never-empty?expand=1. Please take a look at the test changes and see what you think of the changed semantics. |
@ricea I think I prefer the updated semantics. I can't really think of a use case where you'd be interested in empty chunks/strings. I feel like it'd just make user code unnecessarily more complicated, e.g.: stream
.pipeThrough(new TextEncoderStream())
.pipeThrough(new TransformStream({
transform(chunk, controller) {
if (chunk.byteLength === 0) {
return; // what else would you do here?
}
// do stuff
}
})); I'm not aware of any spec that enqueues empty chunks to a stream. Fetch doesn't do it:
|
Regardless of whether we decide to allow enqueuing an empty chunk/string inside decode/encode and enqueue a chunk, we definitely should not enqueue an empty chunk inside flush and enqueue. 😉 |
Rename TextXcoderAlgorithms to TextXcoderCommon and move the "serialize stream" from TextDecoder to TextDecoderCommon so that it can be shared sanely with TextDecoderStream.
I think for ArrayBufferView we can lean on Web IDL more:
Then "a copy of" works well. |
I'm happy with this now. @MattiasBuelens what do you think? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call using convert to a BufferSource, looks much nicer this way! 😄
However, we now have an issue where detached buffers are no longer handled correctly. Should be easy enough to fix though.
<a lt="converted to an IDL value">converting</a> <var>chunk</var> to a {{BufferSource}}. If this | ||
throws an exception, then return a promise rejected with that exception. | ||
|
||
<li><p><a>Push</a> a <a lt="get a copy of the buffer source">copy of</a> <var>bufferSource</var> to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In 6a65ad5, we explicitly checked for IsDetachedBuffer(arrayBuffer)
and returned a rejected promise if the buffer was detached. However, none of the three algorithms that make up convert to a BufferSource check for detached buffers.
When get a copy of the buffer source encounters a detached buffer, it throws a TypeError
in step 7. This is not what we want: we want to return a rejected promise instead!
I think the easiest fix would be to add the following line to this step, similar to the previous step:
If this throws an exception, then return a promise rejected with that exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch, thanks.
I did what you proposed. This has the added benefit that we are always in sync with the checks and behaviour provided by WebIDL.
We no longer have explicit language rejecting detached buffers, however the "get a copy of the buffer source" algorithm will throw exceptions for them. Convert those exception to rejections to get the effect of the check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One tiny nitpick left, but looks good to go for me.
encoding.bs
Outdated
@@ -1485,7 +1485,8 @@ byteReadable | |||
throws an exception, then return a promise rejected with that exception. | |||
|
|||
<li><p><a>Push</a> a <a lt="get a copy of the buffer source">copy of</a> <var>bufferSource</var> to | |||
<var>dec</var>'s <a for=TextDecoderStream>stream</a>. | |||
<var>dec</var>'s <a for=TextDecoderStream>stream</a>. If this throws an exception, then return a | |||
promise rejected with that exception |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny nit: missing a period at the end of this sentence. 😛
I trust that you got the various technical aspects correctly here, especially after the careful review from others. I would like to ask you though that for the various I also pushed a commit that fixes various other things I noticed while looking through the text. |
I suggest we also add @MattiasBuelens to the Acknowledgments section for their help in reviewing. |
Also stop explicitly importing "transform stream" as it is now exported from the streams standard.
Please take a look at af33687 and let me know whether I have done this correctly. |
Is there a reason why Also, does one input chunk always result in one output chunk that corresponds to potential pending partial code unit sequence and the input chunk except for potential partial code unit sequence at the end of the input chunk? (If not, I'm a bit worried about Web content developing a dependency on browser-specific chunk boundaries where the boundaries are not supposed to mean anything.) |
The short answer is "because TextDecoder doesn't". I don't know the historical reason why TextDecoder doesn't. I agree that BOM sniffing is necessary to parse legacy content, and that it can be hard to get right. However, I'd like to discourage people from relying on this behaviour. I'd like to move away from any kind of heuristic behaviour and towards a world where everyone uses UTF-8, and that's what the default behaviour encourages. If we find compelling use cases for easily reading legacy content we may need to support it in future, but my personal preference would be to leave it out until it is proven necessary.
Input chunk to output chunk correspondance is normally 1:1, except that we don't output empty chunks, and an extra chunk may be output at the end of the stream if we discover the input was incomplete.
The semantics are strictly greedy. Given the same chunks as input, every browser is required to have exactly the same chunk boundaries in the output. |
@annevk Sorry that stray call to |
@ricea it does support mixins I think? Lots of IDL in https://github.com/web-platform-tests/wpt/tree/master/interfaces uses them. |
@annevk Oh, I should have tried it! I've switched my work-in-progress copy to using the IDL from the standard verbatim and it does work. |
OK. Thanks. |
@annevk I have been refining the tests. Would you like me to land them before landing this? |
Great, just need a pointer to them, ideally they aren't merged until this is fully agreed on. The other thing that's remaining here is implementer bugs: https://github.com/whatwg/meta/blob/master/MAINTAINERS.md#handling-pull-requests. |
The standard change that adds these classes is whatwg/encoding#149.
Tests are at web-platform-tests/wpt#12430. The Chrome bug at least exists: https://bugs.chromium.org/p/chromium/issues/detail?id=845427 😃 |
The standard change that adds these classes is whatwg/encoding#149.
I pushed some further nits. I think I can say that Mozilla is on board with this PR (per Henri, Till, and I), which brings us to two implementers. @othermaciej @travisleithead any final feedback / concerns? @ricea would you file implementation bugs against the other implementation as per my link above? |
Looked it over and don't have any concerns. Great work! I searched for a bug in our external issue tracker, and didn't see one, but we do have Encoding API (the whole spec) noted as In Development, and you can perhaps track this request via: https://wpdev.uservoice.com/forums/257854-microsoft-edge-developer/suggestions/6558040-support-the-encoding-api if it helps :). |
Sorry for the delay, I was on holiday. Firefox issue: https://bugzilla.mozilla.org/show_bug.cgi?id=1486949 I'm going to assume that https://wpdev.uservoice.com/forums/257854-microsoft-edge-developer/suggestions/6558040-support-the-encoding-api covering the whole Encoding API is sufficient for Edge, at least until they have some of it implemented. |
The standard change that adds these classes is whatwg/encoding#149.
…derStream, a=testonly Automatic update from web-platform-testsEncoding: TextEncoderStream and TextDecoderStream The standard change that adds these classes is whatwg/encoding#149. -- wpt-commits: 3ede6629030918b00941c2fb7d176a18cbea16ea wpt-pr: 12430
…derStream, a=testonly Automatic update from web-platform-testsEncoding: TextEncoderStream and TextDecoderStream The standard change that adds these classes is whatwg/encoding#149. -- wpt-commits: 3ede6629030918b00941c2fb7d176a18cbea16ea wpt-pr: 12430
…derStream, a=testonly Automatic update from web-platform-testsEncoding: TextEncoderStream and TextDecoderStream The standard change that adds these classes is whatwg/encoding#149. -- wpt-commits: 3ede6629030918b00941c2fb7d176a18cbea16ea wpt-pr: 12430 UltraBlame original commit: 515ad283f17891ca6a8f9f524faf2a3dfe6c52b0
…derStream, a=testonly Automatic update from web-platform-testsEncoding: TextEncoderStream and TextDecoderStream The standard change that adds these classes is whatwg/encoding#149. -- wpt-commits: 3ede6629030918b00941c2fb7d176a18cbea16ea wpt-pr: 12430 UltraBlame original commit: 515ad283f17891ca6a8f9f524faf2a3dfe6c52b0
…derStream, a=testonly Automatic update from web-platform-testsEncoding: TextEncoderStream and TextDecoderStream The standard change that adds these classes is whatwg/encoding#149. -- wpt-commits: 3ede6629030918b00941c2fb7d176a18cbea16ea wpt-pr: 12430 UltraBlame original commit: 515ad283f17891ca6a8f9f524faf2a3dfe6c52b0
Integrate with the streams standard by adding TextEncoderStream and
TextDecoderStream transform streams to the standard. These enable
binary<>text conversions on a ReadableStream using the
pipeThrough()
method (see https://streams.spec.whatwg.org/#rs-pipe-through).
A TextEncoderStream object can be used to transform a stream of strings
to a stream of bytes in UTF-8 encoding. A TextDecoderStream object can
be used to transform a stream of bytes in the encoding passed to the
constructor to strings.
There is a prollyfill and tests for the new functionality at
https://github.com/GoogleChromeLabs/text-encode-transform-prollyfill.
Closes #72.
Preview | Diff