parsing an erroneous HTML tag #2259

Open
donHenaro opened this issue Jan 14, 2025 · 2 comments

Comments

@donHenaro

Good day to all.
In jsoup 1.18.2 and 1.18.3, there is a problem with parsing a malformed HTML tag, for example <figcaption
when it is missing its closing '>'.

Example:
Document document = Jsoup.parse(content, "", Parser.xmlParser());

The input is an HTML fragment:

<figure>
  <img src="api/files/16f0c553-4d76-4411-84b7-5049fe01bbe0">
    <figcaption
      <span contenteditable="false">Image 3.</span>
    </figcaption>
</figure>

In previous versions (<= 1.18.1), the parser automatically fixed this, but in the current version it also cuts off the closing tag.
In the document we get:

<figure>
  <img src="api/files/16f0c553-4d76-4411-84b7-5049fe01bbe0">
    <figcaption
      <span contenteditable="false">Image 3.
    </figcaption<span>
  </img>
</figure>
@jhy
Owner

jhy commented Jan 15, 2025

Hi there,

I can't repro your specific output. Can you double-check your input and output, to make sure we're looking at the same thing?

Using your input:

<figure>
  <img src="api/files/16f0c553-4d76-4411-84b7-5049fe01bbe0">
    <figcaption
      <span contenteditable="false">Image 3.</span>
    </figcaption>
</figure>

With the HTML parser, we get:

  <figure>
   <img src="api/files/16f0c553-4d76-4411-84b7-5049fe01bbe0">
   <figcaption <span contenteditable="false">
    Image 3.
   </figcaption>
  </figure>

Note that this is a figcaption element with an attribute named <span. This changed in the Tokenizer in 1.18.2: we used to create a new tag at the <, but now allow it as part of the attribute name.

With the XML parser, we get:

<figure>
  <img src="api/files/16f0c553-4d76-4411-84b7-5049fe01bbe0">
    <figcaption _span="" contenteditable="false">Image 3.
    </figcaption>
</img></figure>

The changes to the HTML parser in 1.18.2 from the changelog:

  • Follow the current HTML specification in the tokenizer to allow < as part of a tag name, instead of emitting it as a character node. #2230
  • Similarly, allow a < as the start of an attribute name, vs creating a new element. The previous behavior was intended to parse closer to what we anticipated the author's intent to be, but that does not align to the spec or to how browsers behave. #1483
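
To illustrate the two changelog entries, here is a minimal, self-contained sketch of how a spec-style scanner reads names inside a start tag. It is not jsoup's Tokenizer: the class StartTagSketch and its scan method are hypothetical names made up for this example, and the sketch ignores end tags, character references, and error reporting. The point it shows is that a tag name only ends at whitespace, '/' or '>', and an attribute name only ends at whitespace, '/', '>' or '=', so a '<' in either position is kept as an ordinary character.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy start-tag scanner (hypothetical, not jsoup code): shows '<' kept inside names. */
class StartTagSketch {
    /** Tag name plus the attribute names found in one start tag. */
    static final class StartTag {
        final String name;
        final List<String> attributeNames = new ArrayList<>();
        StartTag(String name) { this.name = name; }
    }

    private static boolean isSpace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
    }

    /** Scans a single start tag; the input must begin with '<'. */
    static StartTag scan(String input) {
        int i = 1; // skip the opening '<'
        StringBuilder name = new StringBuilder();
        // Tag name: ends at whitespace, '/' or '>'. A '<' does NOT end it.
        while (i < input.length()) {
            char c = input.charAt(i);
            if (isSpace(c) || c == '/' || c == '>') break;
            name.append(Character.toLowerCase(c));
            i++;
        }
        StartTag tag = new StartTag(name.toString());
        // Attributes: keep scanning until the closing '>' or end of input.
        while (i < input.length() && input.charAt(i) != '>') {
            char c = input.charAt(i);
            if (isSpace(c) || c == '/') { i++; continue; }
            StringBuilder attr = new StringBuilder();
            // Attribute name: ends at whitespace, '/', '>' or '='. A leading '<' is kept.
            while (i < input.length()) {
                c = input.charAt(i);
                if (isSpace(c) || c == '/' || c == '>' || c == '=') break;
                attr.append(Character.toLowerCase(c));
                i++;
            }
            if (attr.length() > 0) tag.attributeNames.add(attr.toString());
            // Skip over an optional attribute value; the sketch only records names.
            if (i < input.length() && input.charAt(i) == '=') {
                i++;
                if (i < input.length() && (input.charAt(i) == '"' || input.charAt(i) == '\'')) {
                    char quote = input.charAt(i++);
                    while (i < input.length() && input.charAt(i) != quote) i++;
                    i++; // step past the closing quote
                } else {
                    while (i < input.length() && !isSpace(input.charAt(i)) && input.charAt(i) != '>') i++;
                }
            }
        }
        return tag;
    }

    public static void main(String[] args) {
        StartTag a = scan("<figcaption<span>Foo");
        System.out.println(a.name + " " + a.attributeNames);
        StartTag b = scan("<figcaption\n  <span contenteditable=\"false\">Image 3.");
        System.out.println(b.name + " " + b.attributeNames);
    }
}
```

With this sketch, the first input yields the tag name figcaption<span with no attributes, and the second yields the tag name figcaption with the attribute names <span and contenteditable, which mirrors the two cases discussed in this thread.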

For input of HTML (not your example, but related):

<figcaption<span>Foo

We get the element figcaption<span, so the serialization is:

<figcaption<span>Foo</figcaption<span>

That is weird, but it is what the HTML spec says and what current browsers do. #2230 switched to this behavior because, on balance, too many issues were created by deviating from the spec. The optimist in me hopes that, because browsers will render it differently from the author's intent, those folks will review and fix their HTML.

Now, I think it probably is an issue that the XML parser / serializer outputs the element as <figcaption<span>. It would be better as <figcaption_span>, like how we normalize the attribute name <span to _span.
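
As a rough sketch of the kind of normalization suggested here, the following replaces every character that is not allowed in a name with an underscore. This is an assumption-laden illustration, not jsoup's implementation: XmlNameSketch and toXmlName are hypothetical names, and the allowed set is a conservative ASCII-only subset of the XML name rules (real XML names also permit many non-ASCII characters).

```java
/** Hypothetical helper (not jsoup code): coerce a tag or attribute name into an XML-safe one. */
class XmlNameSketch {
    /** Replaces each character outside a conservative ASCII subset of XML name characters with '_'. */
    static String toXmlName(String name) {
        StringBuilder out = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            // The first character may be a letter, '_' or ':'.
            boolean startOk = letter || c == '_' || c == ':';
            // Later characters may also be digits, '-' or '.'.
            boolean laterOk = startOk || (c >= '0' && c <= '9') || c == '-' || c == '.';
            out.append((i == 0 ? startOk : laterOk) ? c : '_');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toXmlName("<span"));           // attribute name from the thread
        System.out.println(toXmlName("figcaption<span")); // element name from the thread
    }
}
```

Under these assumptions, <span becomes _span and figcaption<span becomes figcaption_span, matching the names mentioned above.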

@donHenaro
Author

donHenaro commented Jan 15, 2025

Thank you for looking into this. The problem is minor, and we have already fixed our editor.
My example didn't reproduce because of the pretty formatting.
Input HTML:

<figure>
  <img src="api/files/16f0c553-4d76-4411-84b7-5049fe01bbe0">
    <figcaption<span contenteditable="false">Image 3.</span></figcaption>
</figure>

we get:

<figure>
 <img src="api/files/16f0c553-4d76-4411-84b7-5049fe01bbe0">
  <figcaption<span contenteditable="false">
   Image 3.
  </figcaption<span>
 </img>
</figure>

Perhaps jsoup treats <figcaption<span> as a single tag and does not consider the "<" inside it to be an error.
