parsing an erroneous HTML tag #2259

donHenaro · 2025-01-14T13:56:13Z

Good day to all.
In jsoup 1.18.2 and 1.18.3, there is a problem with parsing a malformed HTML tag, for example <figcaption
(a tag that is missing its closing '>').

Example:
Document document = Jsoup.parse(content, "", Parser.xmlParser());

The input is an HTML fragment:

<figure>
  <img src="api/files/16f0c553-4d76-4411-84b7-5049fe01bbe0">
    <figcaption
      <span contenteditable="false">Image 3.</span>
    </figcaption>
</figure>

In previous versions (<= 1.18.1), the parser automatically fixed this, but in the current version it also cuts off the closing tag.
In the document we get:

<figure>
  <img src="api/files/16f0c553-4d76-4411-84b7-5049fe01bbe0">
    <figcaption
      <span contenteditable="false">Image 3.
    </figcaption<span>
  </img>
</figure>


jhy · 2025-01-15T01:07:26Z

Hi there,

I can't repro your specific output. Can you double-check your input and output, to make sure we're looking at the same thing?

Using your input:

<figure>
  <img src="api/files/16f0c553-4d76-4411-84b7-5049fe01bbe0">
    <figcaption
      <span contenteditable="false">Image 3.</span>
    </figcaption>
</figure>

With the HTML parser, we get:

  <figure>
   <img src="api/files/16f0c553-4d76-4411-84b7-5049fe01bbe0">
   <figcaption <span contenteditable="false">
    Image 3.
   </figcaption>
  </figure>

Note that this is a figcaption element with an attribute named <span. This changed in the Tokenizer in 1.18.2; we used to create a new tag at the <, but now allow it as part of the attribute name.

With the XML parser, we get:

<figure>
  <img src="api/files/16f0c553-4d76-4411-84b7-5049fe01bbe0">
    <figcaption _span="" contenteditable="false">Image 3.
    </figcaption>
</img></figure>

The changes to the HTML parser in 1.18.2 from the changelog:

Follow the current HTML specification in the tokenizer to allow < as part of a tag name, instead of emitting it as a character node. #2230

Similarly, allow a < as the start of an attribute name, vs creating a new element. The previous behavior was intended to parse closer to what we anticipated the author's intent to be, but that does not align to the spec or to how browsers behave. #1483

For input of HTML (not your example, but related):

<figcaption<span>Foo

We get the element figcaption<span, so the serialization is:

<figcaption<span>Foo</figcaption<span>

Which is weird, but it follows the HTML spec and matches what current browsers do. #2230 changed to this behavior from the previous one because, on balance, too many issues were created by deviating from the spec. The optimist in me hopes that, because browsers will render it differently from the author's intent, those folks will review and fix their HTML.
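
To illustrate why < ends up inside the element name, here is a minimal sketch of the HTML spec's "tag name" tokenizer state. It is not jsoup's Tokenizer: the class and method names (TagNameSketch, readTagName) are made up for this example, and it assumes the input starts with < followed by a letter. It also skips the spec's NULL-character handling.

```java
// Simplified sketch of the HTML spec's "tag name" tokenizer state.
// NOT jsoup's implementation; TagNameSketch and readTagName are hypothetical names.
final class TagNameSketch {

    // Reads the tag name that follows the opening '<' at index 0.
    static String readTagName(String input) {
        if (input.isEmpty() || input.charAt(0) != '<') {
            throw new IllegalArgumentException("input must start with '<'");
        }
        StringBuilder name = new StringBuilder();
        for (int i = 1; i < input.length(); i++) {
            char c = input.charAt(i);
            // Only whitespace, '/' and '>' terminate a tag name.
            if (c == '\t' || c == '\n' || c == '\f' || c == ' ' || c == '/' || c == '>') {
                break;
            }
            // ASCII upper-case letters are lower-cased.
            if (c >= 'A' && c <= 'Z') {
                c = (char) (c + 0x20);
            }
            // Anything else, including a second '<', is appended to the name.
            name.append(c);
        }
        return name.toString();
    }

    public static void main(String[] args) {
        // No whitespace before the second '<': it becomes part of the name.
        System.out.println(readTagName("<figcaption<span>Foo"));        // figcaption<span
        // Whitespace ends the name; "<span" is then read as an attribute name.
        System.out.println(readTagName("<figcaption <span contenteditable=\"false\">")); // figcaption
    }
}
```

The second call shows why the pretty-printed input behaves differently: the whitespace after figcaption ends the tag name, so the following <span is tokenized as an attribute rather than as part of the element name.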

That said, I think it probably is an issue that, when using the XML parser / serializer, we output the element as <figcaption<span>. It would be better as <figcaption_span>, like how we normalize the attribute <span to _span.
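
As a rough sketch of what that normalization could look like (this is an assumption for illustration, not jsoup's actual code): replace every character that is not valid at its position in an XML name with an underscore. The names XmlNameSketch and normalize are hypothetical, and for brevity the check only accepts the ASCII subset of XML name characters, so legitimate non-ASCII name characters would also be replaced.

```java
// Hypothetical sketch of normalizing a tag or attribute name for XML output.
// NOT jsoup's implementation; only the ASCII subset of XML name characters is accepted.
final class XmlNameSketch {

    static String normalize(String name) {
        StringBuilder out = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            // Characters allowed anywhere in the name.
            boolean startChar = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || c == '_' || c == ':';
            // Characters allowed only after the first position.
            boolean laterChar = (c >= '0' && c <= '9') || c == '-' || c == '.';
            boolean valid = startChar || (i > 0 && laterChar);
            out.append(valid ? c : '_');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("<span"));           // _span
        System.out.println(normalize("figcaption<span")); // figcaption_span
    }
}
```

Applying the same rule to element names as to attribute names would keep the XML output well-formed even when the HTML tokenizer accepted a < inside the name.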

donHenaro · 2025-01-15T13:54:38Z

Thank you for your concern. The problem is minor, and we have already fixed the editor.
My example didn't reproduce the issue because of the pretty-printed formatting.
Input HTML:

<figure>
  <img src="api/files/16f0c553-4d76-4411-84b7-5049fe01bbe0">
    <figcaption<span contenteditable="false">Image 3.</span></figcaption>
</figure>

We get:

<figure>
 <img src="api/files/16f0c553-4d76-4411-84b7-5049fe01bbe0">
  <figcaption<span contenteditable="false">
   Image 3.
  </figcaption<span>
 </img>
</figure>

Perhaps jsoup treats <figcaption<span> as a single tag and does not consider the "<" inside it to be an error.
