Wrap Datanodes in CDATA when producing XML #1720

Merged 3 commits on Oct 24, 2023
4 changes: 4 additions & 0 deletions CHANGES
@@ -6,6 +6,10 @@ Release 1.17.1 [PENDING]
`#replaceAll(operator)`. These methods update the original DOM, as well as the Elements list.
<https://github.com/jhy/jsoup/pull/2017>

* Bugfix: when outputting with XML syntax, HTML elements that were parsed as data nodes (<script> and <style>) should
be emitted as CDATA nodes, so that they can be parsed correctly by an XML parser.
<https://github.com/jhy/jsoup/pull/1720>

Release 1.16.2 [20-Oct-2023]
* Improvement: optimized the performance of complex CSS selectors, by adding a cost-based query planner. Evaluators
are sorted by their relative execution cost, and executed in order of lower to higher cost. This speeds the
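
To illustrate the 1.17.1 bugfix entry above: a bare `&&` inside `<script>` is not well-formed XML, while the same text wrapped in a CDATA section is. The following is a minimal sketch using only the JDK's built-in DOM parser. The class and method names are hypothetical and are not part of jsoup or of this PR.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of jsoup or of this PR.
final class CdataWellFormedness {

    /** Returns true if the JDK's DOM parser accepts the string as well-formed XML. */
    static boolean isWellFormedXml(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false; // e.g. a SAXParseException caused by a bare '&'
        }
    }

    public static void main(String[] args) {
        String raw = "<p><script>1 && 2</script></p>";
        String wrapped = "<p><script><![CDATA[1 && 2]]></script></p>";
        System.out.println(isWellFormedXml(raw));     // prints "false"
        System.out.println(isWellFormedXml(wrapped)); // prints "true"
    }
}
```

The first string mirrors what jsoup emitted in XML syntax mode before this change, and the second mirrors the output after it.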
12 changes: 11 additions & 1 deletion src/main/java/org/jsoup/nodes/DataNode.java
@@ -1,6 +1,7 @@
package org.jsoup.nodes;

import java.io.IOException;
import org.jsoup.nodes.Entities.EscapeMode;

/**
A data node, for contents of style, script tags etc, where contents should not show in text().
@@ -40,7 +41,16 @@ public DataNode setWholeData(String data) {

@Override
     void outerHtmlHead(Appendable accum, int depth, Document.OutputSettings out) throws IOException {
-        accum.append(getWholeData()); // data is not escaped in return from data nodes, so " in script, style is plain
+        if (out.syntax() == Document.OutputSettings.Syntax.xml) {
+            // In XML mode, output data nodes as CDATA, so can parse as XML
+            accum
+                .append("<![CDATA[")
+                .append(getWholeData())
+                .append("]]>");
+        } else {
+            // In HTML, data is not escaped in return from data nodes, so " in script, style is plain
+            accum.append(getWholeData());
+        }
     }

@Override
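
For experimenting with the branching in `outerHtmlHead` outside of jsoup, the following is a library-free sketch of the same idea. The `Syntax` enum, `appendData`, and `render` are hypothetical stand-ins for `Document.OutputSettings.Syntax` and the `DataNode` method; they are assumptions made for illustration, not the library's API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the DataNode output logic; not jsoup code.
final class DataNodeOutputSketch {

    // Stand-in for Document.OutputSettings.Syntax.
    enum Syntax { html, xml }

    /** Appends the data as-is for HTML, or wrapped in a CDATA section for XML. */
    static void appendData(Appendable accum, String data, Syntax syntax) throws IOException {
        if (syntax == Syntax.xml) {
            accum.append("<![CDATA[").append(data).append("]]>");
        } else {
            accum.append(data); // HTML: script and style contents are emitted unescaped
        }
    }

    /** Convenience wrapper that renders into a String. */
    static String render(String data, Syntax syntax) {
        StringBuilder sb = new StringBuilder();
        try {
            appendData(sb, data, syntax);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // StringBuilder does not throw in practice
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("1 && 2", Syntax.html)); // prints "1 && 2"
        System.out.println(render("1 && 2", Syntax.xml));  // prints "<![CDATA[1 && 2]]>"
    }
}
```

Like the diff, this sketch does not treat data that itself contains the sequence `]]>`, which would end the CDATA section early in an XML parser.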
13 changes: 13 additions & 0 deletions src/test/java/org/jsoup/helper/W3CDomTest.java
@@ -345,5 +345,18 @@ public void canOutputHtmlWithoutNamespace() {
org.jsoup.nodes.TextNode jText = (TextNode) jDiv.childNode(0).childNode(0);
assertEquals(jText, textNode.getUserData(W3CDom.SourceProperty));
}

@Test public void canXmlParseCdataNodes() throws XPathExpressionException {
String html = "<p><script>1 && 2</script><style>3 && 4</style> 5 &amp;&amp; 6</p>";
org.jsoup.nodes.Document jdoc = Jsoup.parse(html);
jdoc.outputSettings().syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml);
String xml = jdoc.body().html();
assertTrue(xml.contains("<script><![CDATA[")); // as asserted in ElementTest
Document doc = parseXml(xml, false);
NodeList list = xpath(doc, "//script");
assertEquals(1, list.getLength());
Node script = list.item(0); // will be the cdata node
assertEquals("1 && 2", script.getTextContent());
}

}
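
The check in `canXmlParseCdataNodes` can also be reproduced without the test's `parseXml` and `xpath` helpers. The following is a rough equivalent using only JDK classes, with a hard-coded string standing in for jsoup's XML output. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; approximates the W3CDomTest check with JDK APIs only.
final class CdataXPathSketch {

    /** Parses the XML and returns the text content of the first script element, or null if none. */
    static String firstScriptText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList list = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate("//script", doc, XPathConstants.NODESET);
            return list.getLength() == 0 ? null : list.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query XML", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the string jsoup produces in XML syntax mode after this PR.
        String xml = "<p><script><![CDATA[1 && 2]]></script>"
            + "<style><![CDATA[3 && 4]]></style> 5 &amp;&amp; 6</p>";
        System.out.println(firstScriptText(xml)); // prints "1 && 2"
    }
}
```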
22 changes: 22 additions & 0 deletions src/test/java/org/jsoup/nodes/ElementTest.java
@@ -2733,6 +2733,28 @@ void prettySerializationRoundTrips(Document.OutputSettings settings) {
assertEquals("Hello", parse.data());
}

@Test void datanodesOutputCdataInXhtml() {
String html = "<p><script>1 && 2</script><style>3 && 4</style> 5 &amp;&amp; 6</p>";
Document doc = Jsoup.parse(html); // parsed as HTML
String out = TextUtil.normalizeSpaces(doc.body().html());
assertEquals(html, out);
Element scriptEl = doc.expectFirst("script");
DataNode scriptDataNode = (DataNode) scriptEl.childNode(0);
assertEquals("1 && 2", scriptDataNode.getWholeData());

doc.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
String xml = doc.body().html();
assertEquals(
"<p><script><![CDATA[1 && 2]]></script><style><![CDATA[3 && 4]]></style> 5 &amp;&amp; 6</p>",
TextUtil.normalizeSpaces(xml));

Document xmlDoc = Jsoup.parse(xml, Parser.xmlParser());
assertEquals(xml, xmlDoc.html());
Element scriptXmlEl = xmlDoc.expectFirst("script");
CDataNode scriptCdata = (CDataNode) scriptXmlEl.childNode(0);
assertEquals(scriptCdata.text(), scriptDataNode.getWholeData());
}

@Test void outerHtmlAppendable() {
// tests not string builder flow
Document doc = Jsoup.parse("<div>One</div>");