nbsp char in xml name allowed #578

wangyoutian · 2024-12-18T11:20:54Z

1. Description

If we append an NBSP character (0xA0) to an element name, the markup is parsed normally and no exception is thrown.

For example:

<constituent></constituent  >note in the end tag before this sentence, char 0xA0, not 0x20, is appended<constituent></constituent>

is parsed as one "constituent" element, not two.
The problem is silently suppressed, which makes it hard to debug, because 0xA0 is visually indistinguishable from 0x20.
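
One way to make the hidden character visible while debugging is sketched below in Python (chosen only for brevity; it is not tied to the library discussed here). The helper names `find_nbsp` and `reveal_nbsp` are hypothetical.

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE


def find_nbsp(markup):
    """Return the index of every U+00A0 character in the markup."""
    return [i for i, ch in enumerate(markup) if ch == NBSP]


def reveal_nbsp(markup):
    """Replace each NBSP with a visible escape so it stands out when printed."""
    return markup.replace(NBSP, "\\u00a0")


sample = "<constituent></constituent\u00a0><constituent></constituent>"
print(find_nbsp(sample))    # positions of any NBSP characters
print(reveal_nbsp(sample))  # same markup with NBSP shown as \u00a0
```

Running such a check on the input before parsing would at least tell the user that what looks like a space is not 0x20.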

2. Expectation

The HTML specification does not allow such characters in an element name:

https://dev.w3.org/html5/spec-LC/syntax.html#:~:text=HTML%20elements%20all%20have%20names,005A%20LATIN%20CAPITAL%20LETTER%20Z.

Nor does XML, as stipulated in:

http://w3.org/TR/REC-xml/#NT-NameStartChar

Without an error, the issue is hard to pin down.

Solution?

Should the documentation explicitly state that such characters are allowed, or should an exception be thrown?


JonathanMagnan · 2024-12-18T14:52:46Z

Hello @wangyoutian ,

Is it possible for you to reproduce this issue in .NET Fiddle?

I currently get 2 "constituent" elements on my side: https://dotnetfiddle.net/Yb9nBG

The end tag with the 0xA0 is simply ignored.

Best Regards,

Jon

wangyoutian · 2024-12-19T08:39:45Z

The issue is reproduced at https://dotnetfiddle.net/5WSaR2 (see "test 3" there), where:

var tex = "<c></c\u00a0><c></c>";

Another notable phenomenon: if we replace the letter 'c' with 'a', then two elements are parsed out, as expected; see "test 4" there.

(the above code can also be found at:

https://github.com/nilnul/nilnul._html_._TEST_/blob/nilnul-pub/el/content/parse/nbsp/UnitTest1.cs

)

JonathanMagnan · 2024-12-19T13:48:58Z

Thank you ;)

JonathanMagnan · 2024-12-24T13:51:53Z

Hello @wangyoutian ,

What kind of behavior are you expecting? We currently have the same behavior as browsers like Firefox and Chrome.

Since this is an "EndTag" without a corresponding "BeginTag", we simply ignore it and continue. A div tag can be nested inside a div, but an a tag cannot be nested inside an a tag, so the two cases produce a different number of elements.

But I'm not expecting any kind of error to be thrown.

Let me know more; at this moment, I believe it works as intended.

wangyoutian · 2024-12-24T16:27:51Z

In some text input fields on some web pages, such as a "textarea", a space (0x20) that you type is converted to NBSP (0xA0).

So if a user intends to enter some XML code in such a field and inadvertently types a space (0x20) after the end tag name, that space is converted into NBSP (0xA0).

The user would think it is still a space, as the NBSP is visually indistinguishable. And if it were indeed a space (0x20), then per the specification:

https://dev.w3.org/html5/spec-LC/syntax.html#:~:text=HTML%20elements%20all%20have%20names,005A%20LATIN%20CAPITAL%20LETTER%20Z

8.1.2.2 End tags
End tags must have the following format:

The first character of an end tag must be a U+003C LESS-THAN SIGN character (<).
The second character of an end tag must be a U+002F SOLIDUS character (/).
The next few characters of an end tag must be the element's tag name.
After the tag name, there may be one or more space characters.
Finally, end tags must be closed by a U+003E GREATER-THAN SIGN character (>).

and also per the XML specification:
https://www.w3.org/TR/REC-xml/#NT-ETag

[42] ETag ::= '</' Name S? '>'

where:
[3] S ::= (#x20 | #x9 | #xD | #xA)+

it should be parsed normally, and we should see two elements from the example above.

If it is NBSP, however, the specification disallows it, and an exception should be thrown to warn the user that the presumed space (0x20) is actually NBSP (0xA0). This is the expected behavior in my opinion.
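
As a rough illustration of the strict behavior described above, the following Python sketch checks a single end tag against the XML `ETag` production, using the `Name` and `S` character ranges from the XML 1.0 specification. The function name `check_end_tag` is hypothetical, and this is not how the library under discussion works; it only shows that a trailing 0x20 is accepted while 0xA0 is rejected.

```python
import re

# NameStartChar ranges from the XML 1.0 specification (production [4]).
NAME_START = (
    r":A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    r"\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    r"\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# NameChar adds "-", ".", digits and a few combining ranges (production [4a]).
NAME_CHAR = NAME_START + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

# [42] ETag ::= '</' Name S? '>'  with  [3] S ::= (#x20 | #x9 | #xD | #xA)+
ETAG = re.compile(rf"</([{NAME_START}][{NAME_CHAR}]*)[\x20\x09\x0D\x0A]*>")


def check_end_tag(tag):
    """Return the element name if tag is a valid XML end tag, else raise ValueError."""
    match = ETAG.fullmatch(tag)
    if match is None:
        hint = " (contains U+00A0 NBSP)" if "\u00a0" in tag else ""
        raise ValueError(f"not a valid XML end tag: {tag!r}{hint}")
    return match.group(1)


print(check_end_tag("</c >"))  # trailing 0x20 is allowed by S?
try:
    check_end_tag("</c\u00a0>")  # trailing 0xA0 is neither NameChar nor S
except ValueError as err:
    print(err)
```

The error message names the NBSP explicitly, which is the kind of warning that would let the user notice the substituted character.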

I am not sure how Firefox and Chrome handle this. But there may be a subtle difference between a browser, which renders content that the user can inspect, and a library, which treats the parsed document as data. That data might later be fed into a rendering process, and a UI will often suppress exceptions in order to show the user as much of the content as possible.

JonathanMagnan self-assigned this Dec 18, 2024
