Support document.write #6

Closed
kmcallister opened this issue Jul 31, 2014 · 7 comments

@kmcallister
Contributor

See servo/servo#3704.

The argument to document.write is a sequence of UCS-2 code units and we need a way to interface this with the UTF-8 parser. My plan is:

(Edit: Largely superseded by this proposal)

  • Convert to UTF-8 as soon as possible.
  • Convert invalid surrogate sequences to U+FFFD 'REPLACEMENT CHARACTER'. This is a deviation from the spec, but nobody has objected strongly in the course of various discussions. There was even talk of amending the spec to allow this behavior, since it's currently written under the assumption that all parsers use UCS-2 natively.
  • If a document.write input ends with a leading surrogate, we can't convert it yet, so save this single u16 in the BufferQueue alongside the UTF-8 buffers.
  • If a document.write input starts with a trailing surrogate, and there's a saved leading surrogate in the BufferQueue, then replace both with the appropriate Unicode character as UTF-8.
  • If the parser receives any other input and there's a saved leading surrogate, drop the saved surrogate and prepend U+FFFD to the input. (This means that a script split an invalid surrogate sequence across multiple document.write calls, or wrote a lone leading surrogate and then finished.)
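
A minimal sketch of the steps above, not the actual html5ever code. The `WriteDecoder` type and its `write` and `flush` methods are hypothetical names invented for illustration, and the sketch returns a `String` per call rather than pushing into a real `BufferQueue`:

```rust
/// Hypothetical helper (not part of html5ever): converts successive
/// `document.write` inputs, given as 16-bit code units, to UTF-8.
struct WriteDecoder {
    /// A leading surrogate held back from the end of the previous write.
    pending_lead: Option<u16>,
}

impl WriteDecoder {
    fn new() -> Self {
        WriteDecoder { pending_lead: None }
    }

    /// Feed one `document.write` argument; returns the UTF-8 text that can
    /// be handed to the parser right away.
    fn write(&mut self, input: &[u16]) -> String {
        let mut units: Vec<u16> = Vec::with_capacity(input.len() + 1);
        // Put any saved leading surrogate back in front of the new input.
        units.extend(self.pending_lead.take());
        units.extend_from_slice(input);

        // A leading surrogate at the very end cannot be converted yet.
        if let Some(&last) = units.last() {
            if (0xD800..=0xDBFF).contains(&last) {
                self.pending_lead = units.pop();
            }
        }

        // Valid pairs become one character; every other surrogate
        // becomes U+FFFD.
        char::decode_utf16(units.iter().copied())
            .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
            .collect()
    }

    /// Call when non-script input arrives or the input ends: a saved
    /// leading surrogate is dropped and replaced by U+FFFD.
    fn flush(&mut self) -> Option<char> {
        self.pending_lead.take().map(|_| char::REPLACEMENT_CHARACTER)
    }
}
```

Writing `[0x0061, 0xD83D]` and then `[0xDE00]` yields `"a"` followed by U+1F600, while writing `[0xD83D]` and then `[0x0062]` yields an empty string followed by U+FFFD and `b`.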
@SimonSapin
Member

As much as I’d like to, I don’t know that we can convince other implementations to replace lone surrogates with U+FFFD. For those that use UCS-2 internally (every one but us), this is pure overhead and has a performance cost.

And it’s not just document.write. Lone surrogates can end up anywhere in the DOM through APIs, and other browsers happily keep them there.

Another solution could be WTF-8: rust-lang/rust#12056 (comment). It’s a superset of UTF-8 (like UTF-8 is a superset of ASCII) that allows surrogates, but only if they’re unpaired. (Concatenating two WTF-8 strings is not just concatenating the bytes: it also needs to check for newly-paired surrogates at the boundary and convert them to the UTF-8 representation of a single code point.)
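
A rough sketch of that boundary check, assuming both inputs are already well-formed WTF-8 byte strings. The `surrogate_at` and `wtf8_concat` functions are hypothetical names for illustration, not an existing API:

```rust
/// If `bytes` is exactly one 3-byte sequence encoding a surrogate
/// (U+D800..=U+DFFF), return that surrogate as a 16-bit code unit.
fn surrogate_at(bytes: &[u8]) -> Option<u16> {
    if bytes.len() == 3 && bytes[0] == 0xED && (0xA0..=0xBF).contains(&bytes[1]) {
        let hi = (bytes[1] & 0x3F) as u16;
        let lo = (bytes[2] & 0x3F) as u16;
        Some(0xD000 | (hi << 6) | lo)
    } else {
        None
    }
}

/// Concatenate two byte strings assumed to be well-formed WTF-8.
fn wtf8_concat(left: &[u8], right: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(left.len() + right.len());
    if left.len() >= 3 && right.len() >= 3 {
        let split = left.len() - 3;
        let lead = surrogate_at(&left[split..]);
        let trail = surrogate_at(&right[..3]);
        // A leading surrogate ending `left` and a trailing surrogate
        // starting `right` would form a pair: merge them into one
        // 4-byte UTF-8 sequence.
        if let (Some(l @ 0xD800..=0xDBFF), Some(t @ 0xDC00..=0xDFFF)) = (lead, trail) {
            let cp = 0x10000 + (((l as u32 - 0xD800) << 10) | (t as u32 - 0xDC00));
            let c = char::from_u32(cp).expect("a surrogate pair maps to a scalar value");
            let mut buf = [0u8; 4];
            out.extend_from_slice(&left[..split]);
            out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            out.extend_from_slice(&right[3..]);
            return out;
        }
    }
    // No new pair at the boundary: plain byte concatenation is enough.
    out.extend_from_slice(left);
    out.extend_from_slice(right);
    out
}
```

For example, joining `[0x61, 0xED, 0xA0, 0xBD]` (`a` plus lone U+D83D) with `[0xED, 0xB8, 0x80]` (lone U+DE00) gives the ordinary UTF-8 bytes of `a` followed by U+1F600.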

@kmcallister
Contributor Author

Is it out of the question that the spec would allow but not mandate U+FFFD replacement? When I brought this up before, people seemed to think it was enough of a corner case that we could get away with it (spec wording changes or no).

@SimonSapin
Member

“Allow but not mandate” sounds bad for interop on principle, though I don’t know how much it really matters here. But even if we replace in document.write, surrogates can still get in through DOM or CSSOM APIs.

When this was brought up in CSS WG to replace in CSSOM, the conclusion was "no change". (Though it’s not clear to me the arguments for change were well represented then. I was in the meeting remotely in audio only with very bad sound quality.)

@SimonSapin
Member

I’ve changed my mind on the above. I’d like Servo to try UTF-8 everywhere in the DOM and what you first suggested here for document.write.

http://www.mail-archive.com/[email protected]/msg00934.html

@kmcallister
Contributor Author

https://github.com/kmcallister/tendril encompasses my latest proposal.

@nox
Contributor

nox commented Nov 29, 2016

document.write landed.

@nox nox closed this as completed Nov 29, 2016