[Custom]: "are" custom element names ASCII characters, or MUST they be ASCII characters? #239

hayatoito · 2015-07-06T07:41:26Z

Title: [Custom]: "are" custom element names ASCII characters, or MUST they be ASCII characters? (bugzilla: 22056)

Migrated from: https://www.w3.org/Bugs/Public/show_bug.cgi?id=22056

comment: 0
comment_url: https://www.w3.org/Bugs/Public/show_bug.cgi?id=22056#c0
Dominic Cooney wrote on 2013-05-16 06:29:42 +0000.

"is a sequence of alphanumeric ASCII characters"

This is confusing. NCName [1] includes combining characters and extenders that are not ASCII characters. These should be allowed, because custom element names MUST match the NCName production and there is no restriction on the character set.

I think "is a sequence of alphanumeric ASCII characters" should be "MUST be a sequence of ASCII characters".

[1] http://www.w3.org/TR/1999/REC-xml-names-19990114/#NT-NCName

rniwa · 2016-03-01T05:04:38Z

The current specification doesn't even mention ASCII characters anywhere:

https://w3c.github.io/webcomponents/spec/custom/#dfn-custom-element-type

The custom element type identifies a custom element interface and is a sequence of characters that must match the NCName production [XML-NAMES], must contain a U+002D HYPHEN-MINUS character, and must not contain any uppercase ASCII letters [HTML].

I think it makes sense to restrict the tag name to ascii letters at least in v1.

rniwa · 2016-03-01T05:04:47Z

Any opinions? @annevk @travisleithead @hober

notwaldorf · 2016-03-01T06:25:11Z

I'd like to cast my vote for non-ascii in custom elements names! Emoji aside (https://jsbin.com/buzegi/edit?html,output) which is kind of cool, I think Kanji characters in tag names is a real use case 😊

Does the parser care if the tag name is non-ascii? Like, what makes it a hard problem (out of curiosity)?

rniwa · 2016-03-01T07:11:17Z

See the issue #177. There are disagreements on the exact set of characters allowed in tag names.

Someone needs to investigate the issue and come up with a safe/correct subset of characters that can be used in custom elements.

I'm suggesting we restrict it to only ASCII characters in v1, since it's always safe to expand the set of allowed characters later once someone has done that work, but not vice versa.

rniwa · 2016-03-01T08:02:27Z

By the way, if we're allowing exotic non-ASCII characters like emoji, we probably don't need the hyphen requirement in those tag names since the requirement exists for the forward compatibility with future HTML documents, and I don't think we'd ever add an HTML element with an emoji in its tag name.

For example, '-' almost never appears in Chinese/Japanese, and it would look absolutely awful between Hanzi/Kanji/Katakana/Hiragana/etc...: Bad: マイ-エレメント Good: マイエレメント. Alternatively, we should allow full-width equivalent of hyphen such as http://unicode-table.com/en/30FB/.

rniwa · 2016-03-01T08:30:48Z

Also, if we do allow accented characters, would we allow capital accented letters? e.g. è is allowed but È is disallowed in tag names? That would be rather confusing.

annevk · 2016-03-01T09:48:20Z

Per the HTML parser a tag name has to start with [a-zA-Z]. However, once you get to the "tag name state", anything goes, except for ASCII whitespace, "/", ">", and U+0000.

I would be okay with requiring ASCII lowercase (with at least one hyphen) as a start and then go from there. I would also be fine with allowing more, but I don't think we should do anything that requires changing the rule that it starts with an ASCII alpha.
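
To make the parser constraint above concrete, here is a rough sketch of it as a single test. It only approximates the rule as stated in this comment (ASCII alpha first, then anything except ASCII whitespace, "/", ">", and U+0000) and is not the full HTML tokenizer; `tokenizesAsTagName` is a hypothetical helper name.

```javascript
// Approximation of the rule described in the comment above, not the real tokenizer.
// tokenizesAsTagName is a hypothetical name used only for illustration.
function tokenizesAsTagName(name) {
  // First character: ASCII alpha. Remaining characters: anything except
  // ASCII whitespace (tab, LF, FF, CR, space), "/", ">", or U+0000.
  return /^[A-Za-z][^\t\n\f\r \/>\0]*$/.test(name);
}

console.log(tokenizesAsTagName('my-日本酒')); // true
console.log(tokenizesAsTagName('日本酒'));    // false (does not start with an ASCII alpha)
```

Under this reading, names such as x-джэц are fine for the parser as long as the first character stays ASCII.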

annevk · 2016-03-01T09:50:02Z

(See also whatwg/html#721 about making custom elements support self-closing syntax, just like SVG and MathML.)

chaals · 2016-03-01T15:43:31Z

I'm with Anne. Starting with something like x-джэц or my-日本酒 as legal seems reasonable enough, and leaving the HTML parser alone seems to be a Good Idea™ worth trying out in reality before we start messing with it.

rniwa · 2016-03-01T20:01:06Z

> Per the HTML parser a tag name has to start with [a-zA-Z]. However, once you get to the "tag name state", anything goes, except for ASCII whitespace, "/", ">", and U+0000.

That requirement doesn't exist in the XML parser so I'm inclined to say we should get rid of that requirement in the XML documents because it really doesn't meet the author expectation in non-European languages. This should be an important consideration in the parser extensibility issue #113.

Now, irrespective of HTML or XML documents, it doesn't make any sense to require - in the tag name when the tag name contains non-ASCII letters since there is no conceivable way that would become a forward compatibility problem with the future HTML specifications.

Again, my preference would be to require ASCII lowercase letters for the entire tag name in v1, and extend it carefully in the future, since in practice even authors in Japan, China, etc. are going to use alphanumeric tag names in HTML documents to be consistent with other built-in elements.

Having said all those things, I see two sensible options:

1. Require that all characters in a custom element tag name be ASCII lowercase.
2. Define a strict subset of what document.createElement, the HTML parser, and the XML parser support, and then require that a custom element tag name consist of only those letters with a leading ASCII character, with the additional requirement that - be present when the tag name only contains alphanumeric letters.

annevk · 2016-03-02T08:39:31Z

First you say you don't want to constrain XML by the rules of HTML but then you say you want to use a subset of both.

1 coupled with hyphens is definitely the easiest option here.

(XML is constrained by https://www.w3.org/TR/xml-names/#NT-QName whereas createElement() is constrained by https://www.w3.org/TR/xml/#NT-Name. I think the former is a subset of the latter. But XML is also not consistently implemented across engines due to the fifth edition debacle and everyone mostly stopped caring for it.)

@domenic thoughts?

rniwa · 2016-03-02T08:53:28Z

Well, that's because you said you don't want to remove the leading ASCII letter requirement. I would want to remove that requirement in XML documents if we're allowing non-ASCII letters, but I'd much rather come up with something everyone agrees on than keep debating this.

On that ground, lowercase ASCII letters with hyphens is the easiest one to spec. IMO, we should just go with that and move on. There are too many other important issues to tackle for v1.

annevk · 2016-03-02T09:09:48Z

Oh I think you misunderstood. I simply explained how the HTML parser is constrained and that I don't think we should change the HTML parser. I did not mean to imply that should similarly constrain the local name of custom elements. But I'm happy with the simplest thing that could possibly work.

rniwa · 2016-03-02T09:30:54Z

Oh I see. Thanks for the clarification. We should just settle on whatever safest subset we can all agree on for v1.

domenic · 2016-03-02T14:08:29Z

I tend to agree with @rniwa that a restriction to ASCII letters in v1 makes sense. On the other hand, I was about to say "we could wait until developers ask for an expanded set and add them in the future", but then I realized @notwaldorf in this thread is a developer doing exactly that. So maybe we should be more permissive.

Given how XML is a mess and I'd probably make document.defineElement just always fail in XML documents if I could, how about the following?

If context object is an XML document, validate that it contains a hyphen and only [a-zA-Z0-9]. (Should we disallow uppercase too?)
If context object is a HTML document, validate that:
- It matches https://www.w3.org/TR/xml/#NT-Name (createElement restriction)
- It either contains a hyphen, or contains a code point above 0xFF
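
A minimal sketch of the check proposed above, assuming the Name production from XML 1.0 Fifth Edition (the linked #NT-Name). The function name `proposalAllows` and the `'xml'` / `'html'` document-kind strings are hypothetical, chosen only for illustration, and the open question about uppercase in XML is left exactly as in the proposal.

```javascript
// Character ranges assumed from the XML 1.0 (Fifth Edition) NameStartChar and NameChar productions.
const nameStartChar = String.raw`:A-Z_a-z\u{C0}-\u{D6}\u{D8}-\u{F6}\u{F8}-\u{2FF}` +
  String.raw`\u{370}-\u{37D}\u{37F}-\u{1FFF}\u{200C}-\u{200D}\u{2070}-\u{218F}` +
  String.raw`\u{2C00}-\u{2FEF}\u{3001}-\u{D7FF}\u{F900}-\u{FDCF}\u{FDF0}-\u{FFFD}` +
  String.raw`\u{10000}-\u{EFFFF}`;
const nameChar = nameStartChar + String.raw`\-.0-9\u{B7}\u{300}-\u{36F}\u{203F}-\u{2040}`;
const xmlName = new RegExp(`^[${nameStartChar}][${nameChar}]*$`, 'u');

// Hypothetical helper: documentKind is 'xml' or 'html'.
function proposalAllows(name, documentKind) {
  if (documentKind === 'xml') {
    // XML documents: a hyphen is required and only [a-zA-Z0-9] otherwise.
    return name.includes('-') && /^[a-zA-Z0-9-]+$/.test(name);
  }
  // HTML documents: must match Name, and must either contain a hyphen
  // or contain at least one code point above 0xFF.
  const hasCodePointAboveFF = [...name].some((ch) => ch.codePointAt(0) > 0xFF);
  return xmlName.test(name) && (name.includes('-') || hasCodePointAboveFF);
}

console.log(proposalAllows('日本酒', 'html'));    // true
console.log(proposalAllows('form2', 'html'));     // false
console.log(proposalAllows('my-日本酒', 'xml'));  // false
```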

annevk · 2016-03-02T16:06:46Z

We should disallow uppercase in XML. If we want to allow more in HTML, we should use QName from xml-names per createElementNS() since Name allows things that cannot appear in browser-implemented XML. A code point that is not [a-z-] should be enough I think to make it "I am custom".

domenic · 2016-03-02T21:28:01Z

> If we want to allow more in HTML, we should use QName from xml-names per createElementNS() since Name allows things that cannot appear in browser-implemented XML.

I don't quite follow this reasoning. Why does stuff about browser-implemented XML impact what we do in HTML documents?

> A code point that is not [a-z-] should be enough I think to make it "I am custom".

So <form2> is custom? That's kind of neat.

In any case, you seem to have the best grasp on the restrictions here. With the guiding principles of:

- We don't care how restrictive we are in XML; restrict as much as you want
- We do want to allow as much as possible in HTML
- In HTML, we don't want to require the hyphen if we're using "unusual enough" characters so that we know it's a custom element anyway

would you mind taking over the writing of the exact algorithm? Maybe even do it as a PR after #405 lands.
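
One possible reading of the "not [a-z-]" idea quoted above, sketched for illustration only. It assumes the rule is a disjunction (the name either contains a hyphen or contains some code point outside [a-z-]), which is an interpretation rather than anything the thread pins down; `looksCustom` is a hypothetical name.

```javascript
// Hypothetical helper illustrating one interpretation of the "I am custom" marker.
function looksCustom(name) {
  const hasHyphen = name.includes('-');
  // Any character other than a lowercase ASCII letter or a hyphen.
  const hasUnusualCodePoint = /[^a-z-]/.test(name);
  return hasHyphen || hasUnusualCodePoint;
}

console.log(looksCustom('form'));  // false
console.log(looksCustom('form2')); // true, which is the <form2> case noted above
```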

rniwa · 2016-03-02T21:33:31Z

No, if we're allowing non-ASCII characters, I want to remove the restriction that the leading letter must be an ASCII lowercase letter in XML documents, because it just doesn't work well in languages that don't use the Latin alphabet.

domenic · 2016-03-02T21:33:52Z

@rniwa why do you care about XML documents?

rniwa · 2016-03-02T21:34:50Z

@domenic : I don't care whether I write HTML documents or XML documents. But, as an author writing in Japanese for example, I would rather use XML documents to get around the annoyance that the leading letter must be an ASCII lowercase letter. It just doesn't meet author expectations.

domenic · 2016-03-02T21:36:53Z

I'm confused. Why don't you just use HTML documents? That restriction doesn't exist there.

rniwa · 2016-03-02T21:37:49Z

@domenic : It totally does. The HTML5 parser requires the leading letter of every tag name to be ASCII, and that is not the case in the XML parser.

domenic · 2016-03-02T21:40:05Z

Ah I see, sorry, I was looking at DOM instead of HTML. That position makes sense... but I assume that restriction is in the parser for a good reason. Probably to deal with things like <! and <[space] and <% and <?. Maybe we can allow [A-Za-z] plus anything greater than U+007F. (Probably with more random small subsets excluded per Name or QName.)

annevk · 2016-03-03T10:16:32Z

sigh

I'm not sure I want to work on this, there's five sets of names, as far as I can tell, of which three are used (with two of them arguably wrong):

1. HTML parser names. [a-z] for the first letter followed by pretty much anything.
2. xml 4th edition Name. I think most browsers use this for createElement().
3. xml 5th edition Name. Technically what browsers should use for createElement(), but don't. This is what allows emojis.
4. xml-names 2nd edition NCName. Used for createElementNS() and elements in the XML parser.
5. xml-names 3rd edition NCName. Should be used for createElementNS() and elements in the XML parser.

I see two sane approaches here:

1. We restrict custom elements to ASCII alpha + ASCII hyphen.
2. We follow the restrictions from createElement() and createElementNS(), while ignoring that those are different from each other, from the HTML parser, and from what should be implemented for them per the latest XML specifications. (This requires no restriction to be specified and defers this mess to be cleaned up by the next generation, likely still us.)

annevk · 2016-03-03T10:18:20Z

There was talk at some point for trying to see if we could lift constraints on names altogether, but I don't think that ever happened. @foolip was the last to touch that potato.

rniwa · 2016-03-03T10:36:16Z

Option 2 seems rather risky because we could end up allowing names that can't be processed by HTML/XML parser and we may not even know about it. So I think we should go with option 1 for now. It's easy to expand the set of letters we can use later.

foolip · 2016-03-03T12:40:35Z

The previous discussion was in a "Mismatch between HTML parser and createElement() et al" thread on blink-dev, spawned from an "Inconsistency in characters allowed in attribute names between setAttribute and HTML syntax specs" spec bug.

ASCII alpha + ASCII hyphen seems like the safer option, really.

rniwa · 2016-03-03T17:59:51Z

We should probably file an issue in HTML and figure out "the one definition". I'm more than happy to use this definition once it's ready (even in v1) but I don't want to hold up the custom elements API on that.

annevk · 2016-03-03T18:03:09Z

"One or more a-z (lowercase), followed by a hyphen, followed by zero or more a-z (lowercase) or hyphen."

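The rule in the comment above translates almost directly into a regular expression. A minimal sketch, with `isAsciiCustomName` as a hypothetical name:

```javascript
// One or more a-z, then a hyphen, then zero or more a-z or hyphen.
// isAsciiCustomName is a hypothetical helper name.
function isAsciiCustomName(name) {
  return /^[a-z]+-[a-z-]*$/.test(name);
}

console.log(isAsciiCustomName('my-element')); // true
console.log(isAsciiCustomName('x-'));         // true (nothing is required after the hyphen)
console.log(isAsciiCustomName('myelement'));  // false (no hyphen)
console.log(isAsciiCustomName('My-element')); // false (uppercase)
```

Note that, as worded, this rule excludes digits as well as non-ASCII characters.
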
rniwa · 2016-03-03T18:06:04Z

Oh, I meant that the one definition that includes non-ASCII letters if that is even possible.

annevk · 2016-03-03T18:13:38Z

I see, it really depends on what the requirements are. Does it need to be supported by the HTML parser? Does it need to be supported by createElement()? Does it need to be supported by createElementNS()? Do we want emojis? If the answer to the first three is yes, you could have NCName, plus the limitation that the first code point is an ASCII alpha, plus that it must contain a hyphen. If you also want emojis, it might be good for browsers to implement the latest version of XML, etc.

rniwa · 2016-03-03T18:16:45Z

Well, I'm saying that we should probably figure this out for document.createElement and HTML/XML parser first (that is, define what a valid name is for all cases in single definition) before deciding what to do for custom elements. That's precisely why I've been suggesting to use ASCII lowercase + hyphen for now.

annevk · 2016-03-03T18:27:31Z

@rniwa oh, I don't think we can change HTML to allow non-a-z at the start. That would change the parsing of <†> and similar constructs. And we cannot change XML to require a-z at the start. We can have a common subset for custom elements, but we cannot have a common rule for all of them unless we start breaking things.

rniwa · 2016-03-03T18:56:42Z

I think you're still misunderstanding me. What I'm saying is that there should be one definition in one spec which defines what valid name means for HTML documents, which may refer to XML spec, and defines a set of valid names for HTML parser, XML parser, createElement in HTML documents, and createElement in XML documents. Hopefully there aren't many discrepancies between them but as you noted, they can't all be the same.

Now, if there is a known, definitely safe subset of all those four potentially distinct sets that we can use for custom elements, then I'm all for it. But it sounded like there isn't, or they aren't even well defined yet. So it seems that we need to do the exercise of determining those four sets first before expanding the set of valid names allowed in custom elements.

domenic · 2016-03-08T20:34:44Z

I went with a liberal-as-possible intersection set in 35086b3. See https://w3c.github.io/webcomponents/spec/custom/#valid-custom-element-name for the rendered output.

We can work toward centralizing all definitions into one place (presumably DOM) later, and I think they will indeed all be distinct, but the definitions are already out there. I guess either DOM or browsers have a bug since DOM specifies XML 5th edition and browsers use XML 4th edition for createElement(NS). But for now custom elements will just use XML 5th edition like DOM does, and if we want to change both at once to align with browser reality (instead of making browsers more liberal) we can definitely do so.
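
The production itself is not quoted in this thread. As a sketch for readers, the check below assumes the PotentialCustomElementName / PCENChar definition and the reserved hyphenated names as later published under "valid custom element name" in the HTML Standard; the exact ranges and the reserved list are assumptions to verify against the linked spec, and `isValidCustomElementName` is a hypothetical helper name.

```javascript
// PCENChar ranges assumed from the HTML Standard's "valid custom element name" definition.
const pcenChar = String.raw`\-.0-9_a-z\u{B7}\u{C0}-\u{D6}\u{D8}-\u{F6}\u{F8}-\u{37D}` +
  String.raw`\u{37F}-\u{1FFF}\u{200C}-\u{200D}\u{203F}-\u{2040}\u{2070}-\u{218F}` +
  String.raw`\u{2C00}-\u{2FEF}\u{3001}-\u{D7FF}\u{F900}-\u{FDCF}\u{FDF0}-\u{FFFD}` +
  String.raw`\u{10000}-\u{EFFFF}`;

// [a-z] (PCENChar)* '-' (PCENChar)*
const potentialCustomElementName = new RegExp(`^[a-z][${pcenChar}]*-[${pcenChar}]*$`, 'u');

// Hyphenated names already used by SVG and MathML (assumed list).
const reservedNames = new Set([
  'annotation-xml', 'color-profile', 'font-face', 'font-face-src',
  'font-face-uri', 'font-face-format', 'font-face-name', 'missing-glyph',
]);

// Hypothetical helper name.
function isValidCustomElementName(name) {
  return potentialCustomElementName.test(name) && !reservedNames.has(name);
}

console.log(isValidCustomElementName('my-日本酒'));      // true
console.log(isValidCustomElementName('マイ-エレメント')); // false (first character is not a-z)
console.log(isValidCustomElementName('font-face'));      // false (reserved)
```

This keeps the leading ASCII lowercase letter and the hyphen discussed earlier in the thread, while letting the rest of the name use a large non-ASCII range.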

annevk · 2016-03-08T20:45:44Z

Would it not be easier to say it needs to match NCName plus these other restrictions? I'm not sure introducing a whole new production is helpful here.

domenic · 2016-03-08T20:47:21Z

Hmm, I thought a production would be much easier to read/code against than taking a production and then using prose. The other restrictions get pretty hairy, to the extent that the new production is not really recognizable as an NCName.

annevk · 2016-03-09T07:43:08Z

The main thing is that browsers have code for an "NCName" check and everyone is vaguely familiar with it given createElementNS (and it's only a character different from createElement). So placing additional requirements beyond "NCName" makes it:

1. Easier to see the delta.
2. Likely easier to implement.

domenic · 2016-03-09T16:35:11Z

I've added a non-normative note that should make it easier to see the delta. Hope that's clearer.

annevk · 2016-03-09T16:50:31Z

Thanks, that helps.

trusktr · 2016-04-15T16:51:02Z

> By the way, if we're allowing exotic non-ASCII characters like emoji, we probably don't need the hyphen requirement in those tag names since the requirement exists for the forward compatibility with future HTML documents, and I don't think we'd ever add an HTML element with an emoji in its tag name.

Why are we so worried about custom element names conflicting with possible future tag names? (Not you specifically, @rniwa.)

I propose that we should be allowed to override any element we wish, and in a per-shadow-root basis (not just on document):

```js
// file1.js
import AwesomeImageElement from 'awesome-img'

const el = document.querySelector('#someEl')
const root = el.createShadowRoot()
root.registerElement('img', AwesomeImageElement)
const img = root.createElement('img') // creates an AwesomeImageElement instance
root.appendChild(img)
```

```js
// file2.js
const el = document.querySelector('#otherEl')
const root = el.createShadowRoot()
const img = root.createElement('img') // creates an HTMLImageElement instance
root.appendChild(img)
```

If we allow overriding of native elements, then there will be no problem introducing native elements in the future; existing apps will continue to work, having their custom elements in place. It will also give developers more freedom and flexibility.

Please see the following threads for more details and examples:

chaals · 2016-04-15T18:11:32Z

@trusktr this comment is not really relevant to this issue. Having already raised the issue in question, please keep the technical discussion there, and avoid filling other issues with repeats of that information.

Places like twitter, blogs, and public discussions of ideas are other relevant places to look for support or discussion of your proposal. Filling up issue discussion isn't.

(If it were more relevant, a simple pointer would be enough. In this case, even that would probably be spammy).

Chaals (as chair)

trusktr · 2016-04-15T23:25:02Z

Hello @chaals, thanks for the tip!

https://html.spec.whatwg.org/multipage/scripting.html#valid-custom-element-name See WICG/webcomponents#239 for background.

Instead of requiring a name to match an existing HTML element, this relaxes the restrictions to:
- starting with [a-zA-Z] (matching the HTML parser WICG/webcomponents#239 (comment))
- then continuing with anything other than a space, forward slash or closing angle bracket

This is similar to the fix to the following issue in the HTML syntax highlighting repo (and actually depends on the "derivative" syntax that was created for that issue): textmate/html.tmbundle#92

hayatoito mentioned this issue Jul 6, 2015

Migrate the bugs filed for Custom Elements from bugzilla to GitHub Issues, here. #119

Closed

hayatoito added custom-elements and removed custom-elements labels Jul 6, 2015

rniwa added the v1 label Mar 1, 2016

rniwa mentioned this issue Mar 1, 2016

[Custom]: Restrict custom elements to NCName (bugzilla: 20973) #177

Closed

annevk changed the title ~~[Custom]: "are" custom element names ASCII characters, or MUST they be ASCII characters? (bugzilla: 22056)~~ [Custom]: "are" custom element names ASCII characters, or MUST they be ASCII characters? Mar 1, 2016

annevk mentioned this issue Mar 2, 2016

Large custom element spec rewrite to implement some F2F decisions #405

Merged

domenic added the needs consensus label Mar 8, 2016

domenic closed this as completed in 35086b3 Mar 8, 2016

treshugart mentioned this issue Mar 15, 2016

Valid custom element name updates. skatejs/skatejs#509

Closed

mathiasbynens added a commit to mathiasbynens/validate-element-name that referenced this issue Jun 21, 2016

Match the updated specification

0b7940b

https://html.spec.whatwg.org/multipage/scripting.html#valid-custom-element-name See WICG/webcomponents#239 for background.

mathiasbynens mentioned this issue Jun 21, 2016

Match the updated specification sindresorhus/validate-element-name#8

Merged

sindresorhus pushed a commit to sindresorhus/validate-element-name that referenced this issue Jun 21, 2016

Match the updated specification (#8)

55a0b1c

https://html.spec.whatwg.org/multipage/scripting.html#valid-custom-element-name See WICG/webcomponents#239 for background.

mathiasbynens added a commit to mathiasbynens/mothereff.in that referenced this issue Jun 21, 2016

custom-element-name: Update to the latest spec

0017cda

https://html.spec.whatwg.org/multipage/scripting.html#valid-custom-element-name See WICG/webcomponents#239 for background.

domenic mentioned this issue Sep 7, 2016

Consider restricting custom element names to ASCII whatwg/html#1754

Closed

karlhorky mentioned this issue Aug 18, 2019

Relax highlighting rules on HTML element names microsoft/vscode-markdown-tm-grammar#54

Merged
