-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Streams support #72
Comments
I'd love PRs for https://github.com/inexorabletash/text-encoding too. hint hint |
I think the first issue to resolve is whether the API should be var stringReadable = byteReadable.pipeThrough(TextDecoder.stream()); or var stringReadable = byteReadable.pipeThrough(new TextDecoder()); My spec patch currently implements the first one. The second one is clearly more readable, but it makes it unclear what this does: var decoder = new TextDecoder();
decoder.decode(new Uint8Array([0xE0]), {stream: true});
setTimeout(() => decoder.decode(new Uint8Array([0xA0]), {stream: true}), 1);
var stringReadable = byteReadable.pipeThrough(decoder); @jakearchibald in 9224c4c#commitcomment-19186596 suggested that I expressed some other concerns about the second option in 9224c4c#commitcomment-19186596. |
The main problem with the first option as I see is that As for the concern about sharing logic with Streams with the second option, that should be doable. We can refactor both standards as needed. That leaves the performance issue which I don't feel qualified to comment on. |
@ricea can you talk a bit more about the performance issue? If the piped stream is enqueued using JS, won't there be a similar de-opt? What about: const decoder = new TextDecoder();
jsCreatedStream.pipeTo(decoder.writable, {preventClose: true}).then(() => {
return fetchStream.pipeThrough(decoder);
}); How would that be different from: const decoder = new TextDecoder();
decoder.decode(whatever, {stream: true});
fetchStream.pipeThrough(decoder); |
@jakearchibald Both ReadableStream and WritableStream contain a queue. A natural optimisation for TransformStream is to eliminate one or both of those queues. This works even if both ends of the pipe are JS. If either end of the pipe is an inbuilt stream then more exciting optimisations become possible. The object returned by Concretely speaking, in Chrome the implementation of TextDecoder is in C++ but TransformStream is due to be implemented in Javascript. My expectation is that general piping optimisations will apply to TextDecoder but optimisations specific to TransformStream are unlikely to be ported to TextDecoder unless there is specific demand. I don't mean to suggest this is a showstopper. It is just something I am interested in feedback on. With regards to: jsCreatedStream.pipeTo(decoder.writable, {preventClose: true}).then(() => {
return fetchStream.pipeThrough(decoder);
}); I can describe how I think it will work in the optimisation proof-of-concept that @tyoshino has been working on. The |
Thanks for the detailed explanation!
It could be. If that isn't possible, there could be a behind-the-scenes Ideally |
Either a subclass or shared implementation seems doable. Subclass might be workable here, but not sure it is for all objects that need to interoperate with streams in due course. |
There is a way to create it: it directly uses the TransformStream constructor. There's no magic going on here, just an algorithm for assembling the constructor arguments. In general I think a clean break between the stream interface and the old interface would be preferable, instead of a confusing mismash which has both APIs on the same object, and which raises questions about what happens when you mix them. (Answerable questions, but the answers are not necessarily intuitive.) |
I am leaning slightly towards the "overloaded class" approach because it will look better on a slide. In other words, this syntax: var stringReadable = byteReadable.pipeThrough(new TextDecoder()); with I think I can spec it relatively simply by delegating to a TransformStream. Whether or not to inherit from TransformStream can then be postponed to a separate discussion. The spec for TransformStream is not ready yet, so we don't need to make a decision immediately. I expect it will be ready in the first half of November. Incidentally, I made minor updates to the Stream patch at http://htmlpreview.github.io/?https://github.com/ricea/encoding-streams/blob/master/patch.html, however it still uses the |
Is it correct that this proposal automatically takes buffer reuse into account as stated in #69? If so, we should probably duplicate that issue against this one. |
Hmm. I think right now we haven't quite integrated readable byte streams into transform streams. See whatwg/streams#601 |
I had doubts whether byte support in TransformStream was useful when it didn't exist in WritableStream. But #69 provides a clear use-case. |
I have updated the proposed patch for the spec to make the TextEncoder & TextDecoder objects behave as stream transformers directly as we discussed here. Internally they delegate to a TransformStream object which provides the glue logic. @yhirano identified an issue during review. Consider the following code: let dec = new TextDecoder();
let writer = dec.writable.getWriter();
writer.write(incompleteByteSequence);
writer.releaseLock();
await doSomethingWith(dec.readable);
let what = dec.decode(moreBytes); Here The issue is that the contents of Our proposed solution is to make TextDecoder lock into a particular mode the first time This doesn't make the above code deterministic, but it means that if Thoughts? [1] This is a gross simplification. |
Can't we lock the moment you invoke |
The Streams Standard doesn't track whether a stream has ever been locked, just whether it is locked at the moment. It also doesn't export any hooks to notify observers when a stream is locked, so the Encoding Standard cannot do the bookkeeping itself. |
Hmm, I don't know enough about the particulars, but poking at the internals of streams, e.g., using https://streams.spec.whatwg.org/#is-writable-stream-locked, should be okay. Fetch does so too. |
We can tell if it's locked, but not if it's unlocked but was previously locked.
Wouldn't |
Fetch API uses "disturbed" property which is turned on when the first read is made and will never turned off, although I'm not a big fan of the property. Regarding
, I think having decode_called boolean in the decoder and having the following logic in the decode function would work.
|
@jakearchibald Imagine that async function doSomethingWith(readable) {
let writable = getAwesomeNativeStream();
await readable.pipeTo(writable, {preventCancel: true});
return;
}
|
So with |
@jakearchibald Do you mean like let dec = new TextDecoder();
let writer = dec.writable.getWriter();
writer.write(incompleteByteSequence);
writer.releaseLock();
await doSomethingWith(dec.readable);
dec.readable.pipeTo(someOtherWritable()); ? In this case what is written to
The way @yhirano explained it to me was that the nondeterministic behaviour is caused by |
Ahh, this is where my understanding was lacking. I thought In that case, the "mode locking" proposals sound good. |
I uploaded a prollyfill for this functionality at https://github.com/GoogleChrome/text-encode-transform-prollyfill. "prollyfill" because it implements a proposed change to the standard rather than a standard itself. I have tested it in Chrome Canary (with the experimental flag) and Safari Technology Preview. Neither of these browsers have TransformStream, but you can get a polyfill here: https://github.com/whatwg/streams/blob/transform-stream-polyfill/reference-implementation/contrib/transform-stream-polyfill.js. It doesn't yet incorporate the changes to way concurrent use of both APIs is prevented that we've discussed here. This doesn't matter unless you are planning on holding it wrong. |
Regarding locking, to me the question is whether we want to say that either API is layered on top of the other, or whether they are two separate APIs that just happen to share the same object. If we say that the TextDecoder.prototype.decode = function (input, options) {
const writer = this.writable.getWriter();
const reader = this.readable.getReader();
this.writer.write(input);
const output = this.reader._readSync(); // private API I guess
writer.releaseLock();
reader.releaseLock();
return output;
}; In this case the code in #72 (comment) seems fine to me, nondeterminism included. Am I missing something? (If they are separate APIs that happen to share the same object, then some kind of locking/disturbedness makes sense, I guess.) |
The non-determinism creates a footgun which in my opinion means we should strongly discourage mixing the APIs. I'm also skeptical of explaining the behaviour of decode() in terms of a fictional synchronous streams API. I feel it doesn't mesh well with the very concrete way that Encoding and Streams are specified. Of course, we could really add a synchronous API to Streams, but I think that would be a lose-lose-lose proposition for standard authors, implementers and developers alike. |
I mean, this seems like any other case where you use both a high-level API and a low-level API together to operate on the same underlying data. The interactions will be subtle and perhaps unpredictable. But you can get nondeterminism when using streams in lots of ways. E.g. the code feels analogous to let dec = new TransformStream();
let writer = dec.writable.getWriter();
writer.write(chunk1);
await doSomethingWith(dec.readable);
writer = dec.writable.getWriter();
writer.write(chunk2);
let reader = dec.readable.getReader();
let what = await dec.reader.read(); To be clear, here the contents of I'm not sure why we're concerned about #72 (comment), but not concerned about the above.
I guess I was more getting at the idea that they both could conceptually operate on the same underlying stream queues. As opposed to them being completely separate, but occupying the same object. I wasn't proposing actually adding a sync API. |
I wrote an explainer for the Streams integration, and uploaded it as a pull request: #143. I can put it somewhere else if here is not appropriate. |
Use constructors named TextDecoderStream and TextEncoderStream for the streaming versions. This is based on the agreement in discussion on the Encoding Standard, particularly whatwg/encoding#72 (comment). The TextEncoder and TextDecoder prototypes are no longer modified, since the new constructors are completely distinct objects. Creation of the underlying TransformStream is no longer done lazily, as there's no longer any concern of regressing performance for the method interfaces. Update the tests to use the new constructors. Remove tests that involved the encode() and decode() methods. Update the README and design documents for the new change.
Here's a strawman proposal for the new IDL: dictionary TextDecoderOptions {
boolean fatal = false;
boolean ignoreBOM = false;
};
dictionary TextDecodeOptions {
boolean stream = false;
};
interface mixin HasEncoding {
readonly attribute DOMString encoding;
};
interface mixin IsDecoder {
readonly attribute boolean fatal;
readonly attribute boolean ignoreBOM;
};
IsDecoder includes HasEncoding;
[Constructor(optional DOMString label = "utf-8", optional TextDecoderOptions options),
Exposed=(Window,Worker)]
interface TextDecoder {
USVString decode(optional BufferSource input, optional TextDecodeOptions options);
};
TextDecoder includes IsDecoder;
[Constructor,
Exposed=(Window,Worker)]
interface TextEncoder {
[NewObject] Uint8Array encode(optional USVString input = "");
};
TextEncoder includes HasEncoding;
interface mixin GeneralTransformStream {
readonly attribute ReadableStream readable;
readonly attribute WritableStream writable;
};
[Constructor(optional DOMString label = "utf-8", optional TextDecoderOptions options),
Exposed=(Window,Worker)]
interface TextDecoderStream {
};
TextDecoderStream includes IsDecoder;
TextDecoderStream includes GeneralTransformStream;
[Constructor,
Exposed=(Window,Worker)]
interface TextEncoderStream {
};
TextEncoderStream includes HasEncoding;
TextEncoderStream includes GeneralTransformStream; Correction welcome. Questions:
|
Interface mixins cannot include interface mixins, yet: whatwg/webidl#537. I'd really much prefer inheritance over |
Subclassing the "real" TransformStream would be difficult. From a spec point of view, we have to make WebIDL know about TransformStream. The simplest way I can think of would be to add a From an implementation point of view, it's probably easy when TransformStream is implemented using IDL and difficult otherwise. In Chrome, the IDL parser knows nothing about TransformStream, so we'd have to either special-case it in the IDL bindings, reimplement TransformStream in C++, or implement TextDecoderStream without using IDL bindings at all. From a design point of view, inheritance of implementation has led to tight coupling and poor maintainability in code bases I have worked on and it's not a practice I want to encourage. |
Sorry for bringing up the inheritance of implementation thing. That's a much wider issue that should be discussed elsewhere. I've discussed this extensively with @domenic and we see TransformStream as a convenience factory for a We're not looking to extend TransformStream to have extra attributes beyond |
If we go with duck typing, why does this use string properties rather than symbols (as iterators do)? I thought promises using |
I don't think duck-typing and symbols are very related. There are many duck-types throughout the platform (and the history of computing) based on string names. For example, everything on the platform that accepts a dictionary argument instead of a class. (Concrete example: DOMPointInit.) Symbols are mostly used for syntax-triggered protocols like iteration. Although I'm not sure the dividing line is that clearly thought out. |
It seems like this approach would also require I guess I'm wishing we had a bit more precedent or more principled vision for these kind of objects. Is it worth asking TC39? |
Well, I think TC39 is pretty far on the side of duck-typing, more so than we are comfortable with on the web platform. But the way I think of this is just that pipeThrough accepts a dictionary. This is generally what you want, to enable using it with cases like https://streams.spec.whatwg.org/#example-both. Indeed there would be a lot of optimizations under the hood, but that's kind of built in to the whole piping idea. (And symbols vs. strings wouldn't really change that.) |
Yeah, inheritance would though, and has a lot of precedent on the platform too. |
I don't think so, at least, not the types of optimizations we're thinking of for Chrome. For example, you can use the specialized knowledge that TextEncoderStream is synchronous to do optimizations far beyond what you could do on any generic TransformStream. The optimizations you could do on a generic TransformStream are pretty limited, in fact, given that generically speaking, TransformStream instances call out to user JavaScript code all the time. So optimizations will generally be done on specific, recognized, branded pairs of { readable, writable }. That could even include a WebSocketDuplexStream in the style of #example-both if we ever got around to adding that to the platform. |
I guess the theoretical problem that remains is if you have something that wants to accept either a transform stream or something else and something else happens to have similarly named properties. But I suppose in that case you just provide two methods. That would be a bit different from established precedent in web platform APIs though. |
In practice using the APIs I find I don't pass around TransformStreams. I either supply them directly to
Iterators do use string properties. An iterator is an object that has a Iterables use symbol properties. I assume this is because they have to avoid name collisions and problems with enumeration. I think a transform stream is more like an iterator than an iterable in this respect.
Implementations are not prevented from sharing optimisation infrastructure because we don't inherit from TransformStream. Because On the other hand, inheritance can get in the way of implementations that want to optimise by bypassing TransformStream machinery, as @domenic alluded to. |
I think we've learned that overloading is in general not a great idea on the platform. I can't recall any overloads of the form (X or object), certainly. |
If it's actually a TransformStream it wouldn't be (X or object). And I think overloading as in (Blob or DOMString) is totally fine. It's changing the meaning or number of arguments that's a rather dubious practice. |
I guess I don't really understand the point you're trying to make, then. But regardless, taking a step back, it still seems like TextEncoderStream being its own standalone class that conforms to the { readable, writable } pattern (i.e., the pattern that all lowercase-transform-stream-accepting APIs accept) should be OK? |
The point is that if someone wanted (TransformStream or SomeOtherThing) that wouldn't work, but they could have separate methods instead. And yeah, the current design seems okay to me. Thanks for going through the alternatives. |
New attempt at the IDL, getting closer to what I'd like to standardise. I've attempted to improve the naming. PTAL. dictionary TextDecoderOptions {
boolean fatal = false;
boolean ignoreBOM = false;
};
dictionary TextDecodeOptions {
boolean stream = false;
};
interface mixin TextEncoderAttributes {
readonly attribute DOMString encoding;
};
interface mixin TextDecoderAttributes {
readonly attribute DOMString encoding;
readonly attribute boolean fatal;
readonly attribute boolean ignoreBOM;
};
[Constructor(optional DOMString label = "utf-8", optional TextDecoderOptions options),
Exposed=(Window,Worker)]
interface TextDecoder {
USVString decode(optional BufferSource input, optional TextDecodeOptions options);
};
TextDecoder includes TextDecoderAttributes;
[Constructor,
Exposed=(Window,Worker)]
interface TextEncoder {
[NewObject] Uint8Array encode(optional USVString input = "");
};
TextEncoder includes TextEncoderAttributes;
interface mixin GenericTransformStream {
readonly attribute ReadableStream readable;
readonly attribute WritableStream writable;
};
[Constructor(optional DOMString label = "utf-8", optional TextDecoderOptions options),
Exposed=(Window,Worker)]
interface TextDecoderStream {
};
TextDecoderStream includes TextDecoderAttributes;
TextDecoderStream includes GeneralTransformStream;
[Constructor,
Exposed=(Window,Worker)]
interface TextEncoderStream {
};
TextEncoderStream includes GeneralTransformStream; |
I missed TextEncoderStream includes TextEncoderAttributes; |
This is not a strong preference on my part, but this seems to be a more aggressive mixin-based factoring than we have in other parts of the platform. Personally I might just collapse all the interface mixins into their interfaces. They don't save much typing, and I don't think we have the usual motivation of wanting their target interfaces to be unaware of changes to the mixin. |
Hmm, it seems okay to me. (And whenever we don't do this, as with |
I'm using mixins based on the theory that having the mixins will make the standard text less repetitive. Of course, I haven't written the text yet. Maybe we can defer judgement until we see what it looks like? |
Integrate with the streams standard by adding TextEncoderStream and TextDecoderStream transform streams to the standard. These enable binary<>text conversions on a ReadableStream using the `pipeThrough()` method (see https://streams.spec.whatwg.org/#rs-pipe-through). A TextEncoderStream object can be used to transform a stream of strings to a stream of bytes in UTF-8 encoding. A TextDecoderStream object can be used to transform a stream of bytes in the encoding passed to the constructor to strings. There is a prollyfill and tests for the new functionality at https://github.com/GoogleChromeLabs/text-encode-transform-prollyfill. Closes whatwg#72.
I started a new pull request at #149 as the old one had become confusing. What I'd like feedback on first is whether the mixins are useful or not. I'm leaning towards saying they are not useful. |
Integrate with the streams standard by adding TextEncoderStream and TextDecoderStream transform streams to the standard. These enable byte<>string conversions on a ReadableStream using the pipeThrough() method (see https://streams.spec.whatwg.org/#rs-pipe-through). A TextEncoderStream object can be used to transform a stream of strings to a stream of bytes in UTF-8 encoding. A TextDecoderStream object can be used to transform a stream of bytes in the encoding passed to the constructor to strings. Tests: web-platform-tests/wpt#12430. There is a prollyfill and tests for the new functionality at https://github.com/GoogleChromeLabs/text-encode-transform-prollyfill. Closes #72.
In 9224c4c#commitcomment-19169480 @jakearchibald @tyoshino @ricea are having a discussion about adding https://streams.spec.whatwg.org/ to the Encoding Standard that I'd prefer us to continue in this issue (and the eventual pull request). That will make it easier to track for other folks participating in this repository. Hope that's okay.
The text was updated successfully, but these errors were encountered: