Commit
fix: remove nested CDATA declarations in content (#231)
During OLX node creation, some XBlocks will wrap their entire contents with CDATA (e.g. the HTMLBlock in libraries). This can cause errors when that wrapped content contains its own CDATA sections, because CDATA sections cannot nest. This can happen in HTML documents where the contents of a <script> tag are sometimes enclosed in CDATA. Modern HTML does not require this, but it was a common practice when XHTML was an accepted standard, as this would make those documents easier to parse as XML. This commit will remove nested CDATA declarations while preserving their content, so that the top-level CDATA can work as expected.
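The failure mode described above can be reproduced with a minimal sketch. The markup strings and the `is_well_formed` helper are hypothetical and only illustrate the XML rule; they are not taken from cc2olx.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the standard-library XML parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical block content: a script whose body is already wrapped in CDATA.
inner_with_cdata = "<script><![CDATA[ alert('hi'); ]]></script>"
inner_without_cdata = "<script> alert('hi'); </script>"

# Wrapping the whole content in CDATA again nests the sections. The outer
# section ends at the first "]]>", so the remainder is parsed as markup.
nested = f"<html><![CDATA[{inner_with_cdata}]]></html>"
flat = f"<html><![CDATA[{inner_without_cdata}]]></html>"

print(is_well_formed(nested))
print(is_well_formed(flat))
```

Once the inner markers are gone, the outer section carries the script text verbatim, which is the behavior the commit message aims for.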
1 parent 11104d4, commit 8e239ad
Showing 9 changed files with 262 additions and 3 deletions.
@@ -0,0 +1 @@
CDATA_PATTERN = r"<!\[CDATA\[(?P<content>.*?)\]\]>"
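The body of `clean_from_cdata` is not visible in this view, so the following is only a sketch of how a pattern like this could be applied. The function name `strip_cdata_markers` and the `re.DOTALL` flag are assumptions: without that flag, `.` does not match newlines, and a multi-line section such as the one in the test fixture would not match.

```python
import re

# Pattern copied from the commit.
CDATA_PATTERN = r"<!\[CDATA\[(?P<content>.*?)\]\]>"


def strip_cdata_markers(xml_string: str) -> str:
    """Drop CDATA open/close markers and keep the enclosed text (hypothetical helper)."""
    # DOTALL is an assumption so that sections spanning several lines match.
    return re.sub(
        CDATA_PATTERN,
        lambda match: match.group("content"),
        xml_string,
        flags=re.DOTALL,
    )


sample = "<script>\n<![CDATA[\nvar x = 1;\n]]>\n</script>"
print(strip_cdata_markers(sample))
```

Because the match is non-greedy, each section ends at the first `]]>` that follows its opening marker, and the line breaks around the removed markers are kept as empty lines.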
@@ -0,0 +1,15 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>CDATA containing HTML document</title>
</head>
<body>
<script type="text/javascript">
<![CDATA[
var htmlContent = "<div>Hello, world!</div>";
alert(htmlContent);
]]>
</script>
</body>
</html>
tests/fixtures_data/html_files/cleaned-cdata-containing-html.html (15 changes: 15 additions & 0 deletions)
@@ -0,0 +1,15 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>CDATA containing HTML document</title>
</head>
<body>
<script type="text/javascript">

var htmlContent = "<div>Hello, world!</div>";
alert(htmlContent);

</script>
</body>
</html>
@@ -0,0 +1,13 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>HTML document without CDATA</title>
</head>
<body>
<script type="text/javascript">
var htmlContent = "<div>Hello, world!</div>";
alert(htmlContent);
</script>
</body>
</html>
@@ -0,0 +1,35 @@
from cc2olx.utils import clean_from_cdata


class TestXMLCleaningFromCDATA:
    """
    Test XML string cleaning from CDATA sections.
    """

    def test_cdata_containing_html_is_cleaned_successfully(
        self,
        cdata_containing_html: str,
        expected_cleaned_cdata_containing_html: str,
    ) -> None:
        """
        Test if CDATA tags are removed from HTML while their content is kept.

        Args:
            cdata_containing_html (str): HTML that contains CDATA tags.
            expected_cleaned_cdata_containing_html (str): Expected HTML after
                successful cleaning.
        """
        actual_cleaned_cdata_containing_html = clean_from_cdata(cdata_containing_html)

        assert actual_cleaned_cdata_containing_html == expected_cleaned_cdata_containing_html

    def test_html_without_cdata_remains_the_same_after_cleaning(self, html_without_cdata: str) -> None:
        """
        Test if HTML that doesn't contain CDATA tags remains the same.

        Args:
            html_without_cdata (str): HTML that doesn't contain CDATA tags.
        """
        actual_cleaned_html_without_cdata = clean_from_cdata(html_without_cdata)

        assert actual_cleaned_html_without_cdata == html_without_cdata