nbsp char in xml name allowed #578

Open
wangyoutian opened this issue Dec 18, 2024 · 5 comments

wangyoutian commented Dec 18, 2024

1. Description

If a no-break space (NBSP, 0xA0) is appended to an element name, the markup is parsed normally and no exception is thrown.

For example:

<constituent></constituent  >note in the end tag before this sentence, char 0xA0, not 0x20, is appended<constituent></constituent>

is parsed as one "constituent" element, not two.
The problem is silently suppressed, which makes it hard to debug, because 0xA0 is visually indistinguishable from 0x20.

2. Expectation

The HTML specification does not allow such characters in an element name:

https://dev.w3.org/html5/spec-LC/syntax.html#:~:text=HTML%20elements%20all%20have%20names,005A%20LATIN%20CAPITAL%20LETTER%20Z.

Nor does the XML specification, as stipulated in:

http://w3.org/TR/REC-xml/#NT-NameStartChar

If the character is silently accepted, the issue is hard to pin down.
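
For comparison, a strict XML parser does reject the NBSP variant. The sketch below uses Python's standard `xml.etree.ElementTree`, not Html Agility Pack, so it only illustrates what the XML grammar requires. The `<root>` wrapper and the helper name `parses_as_xml` are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup: str) -> bool:
    """Return True if `markup`, wrapped in a hypothetical <root>, is well-formed XML."""
    try:
        ET.fromstring(f"<root>{markup}</root>")
        return True
    except ET.ParseError:
        return False


with_space = "<c></c ><c></c>"      # U+0020 before '>' is legal (S? in ETag)
with_nbsp = "<c></c\u00a0><c></c>"  # U+00A0 is neither S nor a name character

print(parses_as_xml(with_space))
print(parses_as_xml(with_nbsp))
```

An HTML parser follows different, more forgiving rules, so this does not by itself say what an HTML library must do.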

Solution?

Should the documentation explicitly state that such characters are allowed, or should an exception be thrown?

@JonathanMagnan JonathanMagnan self-assigned this Dec 18, 2024
@JonathanMagnan
Member

Hello @wangyoutian ,

Is it possible for you to reproduce this issue in .NET Fiddle?

I currently get 2 "constituent" elements on my side: https://dotnetfiddle.net/Yb9nBG

The end tag with the 0xA0 is simply ignored.

Best Regards,

Jon

@wangyoutian
Author

wangyoutian commented Dec 19, 2024

The issue is reproduced here (see "test 3" there):

https://dotnetfiddle.net/5WSaR2

where:
var tex = "<c></c\u00a0><c></c>";

One more notable observation: if the letter 'c' is replaced with 'a', two elements are parsed, as expected; see "test 4" there.

(The above code can also be found at:

https://github.com/nilnul/nilnul._html_._TEST_/blob/nilnul-pub/el/content/parse/nbsp/UnitTest1.cs

)

@JonathanMagnan
Member

Thank you ;)

@JonathanMagnan
Member

Hello @wangyoutian ,

What kind of behavior are you expecting? We currently have the same behavior as browsers like Firefox and Chrome.

Since this is an "EndTag" that doesn't have any corresponding "BeginTag", we simply ignore it and continue parsing. A div element can be nested inside another div, but an a element cannot be nested inside another a, so the two cases produce a different number of elements.

But I'm not expecting any kind of error to be thrown.

Let me know more; at the moment, I believe it works as intended.

@wangyoutian
Author

wangyoutian commented Dec 24, 2024

In some text input fields on some web pages, such as a "textarea", a typed space (0x20) is converted to NBSP (0xA0).

So if a user intends to enter XML code in such a field and inadvertently types a space (0x20) after the end tag name, that space is converted to NBSP (0xA0).

The user would think it is still a space, because NBSP is visually indistinguishable from it. If it were indeed a space (0x20), then per the specification:

https://dev.w3.org/html5/spec-LC/syntax.html#:~:text=HTML%20elements%20all%20have%20names,005A%20LATIN%20CAPITAL%20LETTER%20Z

8.1.2.2 End tags
End tags must have the following format:

The first character of an end tag must be a U+003C LESS-THAN SIGN character (<).
The second character of an end tag must be a U+002F SOLIDUS character (/).
The next few characters of an end tag must be the element's tag name.
After the tag name, there may be one or more space characters.
Finally, end tags must be closed by a U+003E GREATER-THAN SIGN character (>).

and also per the XML specification:
https://www.w3.org/TR/REC-xml/#NT-ETag

[42] ETag ::= '</' Name S? '>'

where:
[3] S ::= (#x20 | #x9 | #xD | #xA)+

it would be parsed normally, and we would see two elements in the example above.

If it is NBSP, the specifications disallow it, and an exception should be thrown to warn the user that the presumed space (0x20) is actually NBSP (0xA0). This is the expected behavior in my opinion.
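
As a stopgap on the caller's side, the input can be scanned before parsing. This is a minimal sketch in Python rather than an Html Agility Pack feature. The helper name `find_suspicious_end_tags` and the simple regular expression are assumptions for illustration, and the pattern is not a full tokenizer (it ignores comments, CDATA sections and attribute values).

```python
import re

# The four characters permitted by XML production [3] S.
XML_S = " \t\r\n"

# Rough match for an end tag: '</', anything except angle brackets, then '>'.
END_TAG = re.compile(r"</([^<>]*)>")


def find_suspicious_end_tags(markup):
    """Return (offset, tag_text) for end tags whose name part still contains
    a whitespace-like character after the legal trailing S? is stripped."""
    hits = []
    for match in END_TAG.finditer(markup):
        name_part = match.group(1).rstrip(XML_S)  # trailing S is allowed
        # Anything str.isspace() still reports here (e.g. U+00A0) cannot be
        # part of a Name and is not legal whitespace in this position.
        if any(ch.isspace() for ch in name_part):
            hits.append((match.start(), match.group(0)))
    return hits


for offset, tag in find_suspicious_end_tags("<c></c\u00a0><c></c>"):
    print(offset, repr(tag))
```

Printing with `repr` makes the otherwise invisible character show up as `\xa0`, which is the point of the check.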

I am not sure how Firefox and Chrome handle this. But there may be a subtle difference between a browser, which renders the content so that the user can inspect it, and a library, which treats the parsed document as data. That data might later be fed into a rendering process, which could then suppress the exception, as a UI usually does in order to show the user as much information as possible.
