Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Follow WHATWG guidance for incorrectly-closed HTML comments #2058

Merged
merged 3 commits into from
Nov 21, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 6 additions & 5 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,12 +16,12 @@ We bump `Major.Minor.Patch` versions following this guidance:
`Minor`:

- Features and bugfixes.
- Dropping support for EOLed Ruby versions, updating vendored libraries.
- Updating vendored libraries for non-security-related reasons.
- Dropping support for EOLed Ruby versions, updating packaged libraries.
- Updating packaged libraries for non-security-related reasons.

`Patch`:

- Security updates, including updating vendored libraries for security-related reasons.
- Security updates, including updating packaged libraries for security-related reasons.
- Bugfixes.

---
Expand Down Expand Up @@ -91,6 +91,7 @@ This release ends support for:
* Avoid creation of unnecessary zero-length String objects. [[#1970](https://github.com/sparklemotion/nokogiri/issues/1970)] (Thanks, [@ashmaroli](https://github.com/ashmaroli)!)
* Always compile libxml2 and libxslt with '-O2' [[#2022](https://github.com/sparklemotion/nokogiri/issues/2022), [#2100](https://github.com/sparklemotion/nokogiri/issues/2100)] (Thanks, [@ilyazub](https://github.com/ilyazub)!)
* {HTML,XML}::Document#parse now accept `Pathname` objects. Previously this worked only if the referenced file was less than 4096 bytes long; longer files resulted in undefined behavior because the `read` method would be repeatedly invoked. [[#1821](https://github.com/sparklemotion/nokogiri/issues/1821), [#2110](https://github.com/sparklemotion/nokogiri/issues/2110)] (Thanks, [@doriantaylor](https://github.com/doriantaylor) and [@phokz](https://github.com/phokz)!)
* [CRuby] Packaged libxml2 follows WHATWG guidance for incorrectly-closed HTML comments. [[#2058](https://github.com/sparklemotion/nokogiri/issues/2058)] (Thanks to HackerOne user [mayflower](https://hackerone.com/mayflower?type=user) for reporting this!)
* [JRuby] Lots of code cleanup and performance improvements. [[#1934](https://github.com/sparklemotion/nokogiri/issues/1934)] (Thanks, [@kares](https://github.com/kares)!)
* [JRuby] Clean up deprecated calls into JRuby. [[#2027](https://github.com/sparklemotion/nokogiri/issues/2027)] (Thanks, [@headius](https://github.com/headius)!)

Expand Down Expand Up @@ -119,7 +120,7 @@ This release changes the information provided in
`Nokogiri::VersionInfo`, see [#1482](https://github.com/sparklemotion/nokogiri/issues/1482) and [#1974](https://github.com/sparklemotion/nokogiri/issues/1974) for background. Note that
the output of `nokogiri -v` will also reflect these changes.

`Nokogiri::VersionInfo` will no longer contain the following keys (previously these were set only when vendored libraries were being used)
`Nokogiri::VersionInfo` will no longer contain the following keys (previously these were set only when packaged libraries were being used)
* `libxml/libxml2_path`
* `libxml/libxslt_path`

Expand Down Expand Up @@ -149,7 +150,7 @@ These methods have been renamed and the return type changed from `String` to `Ge

`Nokogiri.uses_libxml?` now accepts an optional requirement string which is interpreted as a `Gem::Requirement` and tested against the loaded libxml2 version (the value in `VersionInfo` key `libxml/loaded`). This greatly simplifies much of the version-dependent branching logic in both the implementation and the tests.

To sum these changes up, the output from CRuby when using vendored libraries was something like:
To sum these changes up, the output from CRuby when using packaged libraries was something like:

```
# Nokogiri (1.10.7)
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
From 15b28d8a99224e16ae8a7e7774ee3b6b6769dca3 Mon Sep 17 00:00:00 2001
From: Mike Dalessio <[email protected]>
Date: Mon, 3 Aug 2020 17:36:05 -0400
Subject: [PATCH] htmlParseComment: treat `--!>` as if it closed the comment

See guidance provided on incorrectly-closed comments here:

https://html.spec.whatwg.org/multipage/parsing.html#parse-error-incorrectly-closed-comment

This is a minimal patch to try to avoid merge conflicts.
---
HTMLparser.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/HTMLparser.c b/HTMLparser.c
index 7b6d689..0046642 100644
--- a/HTMLparser.c
+++ b/HTMLparser.c
@@ -3301,6 +3301,7 @@ htmlParseComment(htmlParserCtxtPtr ctxt) {
int r, rl;
int cur, l;
xmlParserInputState state;
+ int possiblyIncorrectlyClosed = 0;

/*
* Check that there is a comment right here.
@@ -3346,6 +3347,16 @@ htmlParseComment(htmlParserCtxtPtr ctxt) {
buf = tmp;
}
COPY_BUF(ql,buf,len,q);
+
+ if (possiblyIncorrectlyClosed && cur == '>') {
+ htmlParseErr(ctxt, XML_ERR_COMMENT_NOT_FINISHED,
+ "Comment incorrectly closed by '--!>'", NULL, NULL);
+ buf[len-2] = 0;
+ goto recover ;
+ }
+
+ possiblyIncorrectlyClosed = (q == '-' && r == '-' && cur == '!') ? 1 : 0 ;
+
q = r;
ql = rl;
r = cur;
@@ -3359,6 +3370,8 @@ htmlParseComment(htmlParserCtxtPtr ctxt) {
}
}
buf[len] = 0;
+
+recover:
if (IS_CHAR(cur)) {
NEXT;
if ((ctxt->sax != NULL) && (ctxt->sax->comment != NULL) &&
--
2.25.1

Original file line number Diff line number Diff line change
@@ -0,0 +1,265 @@
From 73fb87d057c22abf7e2b417f559ad62ef7ed67c8 Mon Sep 17 00:00:00 2001
From: Mike Dalessio <[email protected]>
Date: Sun, 11 Oct 2020 13:50:56 -0400
Subject: [PATCH 1/2] rewrite htmlParseLookupSequence to accept an arbitrary
string

This simplifies the logic in the method and allows us to pass in
longer sequences.
---
HTMLparser.c | 95 +++++++++++++++++++---------------------------------
1 file changed, 34 insertions(+), 61 deletions(-)

diff --git a/HTMLparser.c b/HTMLparser.c
index 26b4c5d..123dff5 100644
--- a/HTMLparser.c
+++ b/HTMLparser.c
@@ -5128,25 +5128,19 @@ htmlCreateDocParserCtxt(const xmlChar *cur, const char *encoding) {
/**
* htmlParseLookupSequence:
* @ctxt: an HTML parser context
- * @first: the first char to lookup
- * @next: the next char to lookup or zero
- * @third: the next char to lookup or zero
+ * @sequence: a null-terminated array of bytes to lookup
* @comment: flag to force checking inside comments
*
- * Try to find if a sequence (first, next, third) or just (first next) or
- * (first) is available in the input stream.
+ * Try to find if a sequence of bytes is available in the input stream.
* This function has a side effect of (possibly) incrementing ctxt->checkIndex
* to avoid rescanning sequences of bytes, it DOES change the state of the
* parser, do not use liberally.
* This is basically similar to xmlParseLookupSequence()
*
- * Returns the index to the current parsing point if the full sequence
- * is available, -1 otherwise.
+ * Returns the index to the current parsing point if the full sequence is available, -1 otherwise.
*/
static int
-htmlParseLookupSequence(htmlParserCtxtPtr ctxt, xmlChar first,
- xmlChar next, xmlChar third, int iscomment,
- int ignoreattrval)
+htmlParseLookupSequence(htmlParserCtxtPtr ctxt, const char* sequence, int iscomment, int ignoreattrval)
{
int base, len;
htmlParserInputPtr in;
@@ -5154,14 +5148,18 @@ htmlParseLookupSequence(htmlParserCtxtPtr ctxt, xmlChar first,
int incomment = 0;
int invalue = 0;
char valdellim = 0x0;
+ int sequence_len = xmlStrlen((const xmlChar*)sequence);
+ int matched;

in = ctxt->input;
- if (in == NULL)
+ if (in == NULL) {
return (-1);
+ }

base = in->cur - in->base;
- if (base < 0)
+ if (base < 0) {
return (-1);
+ }

if (ctxt->checkIndex > base)
base = ctxt->checkIndex;
@@ -5174,11 +5172,7 @@ htmlParseLookupSequence(htmlParserCtxtPtr ctxt, xmlChar first,
len = xmlBufUse(in->buf->buffer);
}

- /* take into account the sequence length */
- if (third)
- len -= 2;
- else if (next)
- len--;
+ len = len + 1 - sequence_len;
for (; base < len; base++) {
if ((!incomment) && (base + 4 < len) && (!iscomment)) {
if ((buf[base] == '<') && (buf[base + 1] == '!') &&
@@ -5214,28 +5208,17 @@ htmlParseLookupSequence(htmlParserCtxtPtr ctxt, xmlChar first,
}
continue;
}
- if (buf[base] == first) {
- if (third != 0) {
- if ((buf[base + 1] != next) || (buf[base + 2] != third))
- continue;
- } else if (next != 0) {
- if (buf[base + 1] != next)
- continue;
+ for (matched = 0 ; matched < sequence_len ; matched++) {
+ if (buf[base+matched] != sequence[matched]) {
+ break;
}
+ }
+ if (matched == sequence_len) {
ctxt->checkIndex = 0;
#ifdef DEBUG_PUSH
- if (next == 0)
- xmlGenericError(xmlGenericErrorContext,
- "HPP: lookup '%c' found at %d\n",
- first, base);
- else if (third == 0)
- xmlGenericError(xmlGenericErrorContext,
- "HPP: lookup '%c%c' found at %d\n",
- first, next, base);
- else
- xmlGenericError(xmlGenericErrorContext,
- "HPP: lookup '%c%c%c' found at %d\n",
- first, next, third, base);
+ xmlGenericError(xmlGenericErrorContext,
+ "HPP: lookup '%s' found at %d\n",
+ sequence, base);
#endif
return (base - (in->cur - in->base));
}
@@ -5243,16 +5226,7 @@ htmlParseLookupSequence(htmlParserCtxtPtr ctxt, xmlChar first,
if ((!incomment) && (!invalue))
ctxt->checkIndex = base;
#ifdef DEBUG_PUSH
- if (next == 0)
- xmlGenericError(xmlGenericErrorContext,
- "HPP: lookup '%c' failed\n", first);
- else if (third == 0)
- xmlGenericError(xmlGenericErrorContext,
- "HPP: lookup '%c%c' failed\n", first, next);
- else
- xmlGenericError(xmlGenericErrorContext,
- "HPP: lookup '%c%c%c' failed\n", first, next,
- third);
+ xmlGenericError(xmlGenericErrorContext, "HPP: lookup '%s' failed\n", sequence);
#endif
return (-1);
}
@@ -5462,7 +5436,7 @@ htmlParseTryOrFinish(htmlParserCtxtPtr ctxt, int terminate) {
(UPP(6) == 'Y') && (UPP(7) == 'P') &&
(UPP(8) == 'E')) {
if ((!terminate) &&
- (htmlParseLookupSequence(ctxt, '>', 0, 0, 0, 1) < 0))
+ (htmlParseLookupSequence(ctxt, ">", 0, 1) < 0))
goto done;
#ifdef DEBUG_PUSH
xmlGenericError(xmlGenericErrorContext,
@@ -5508,7 +5482,7 @@ htmlParseTryOrFinish(htmlParserCtxtPtr ctxt, int terminate) {
if ((cur == '<') && (next == '!') &&
(in->cur[2] == '-') && (in->cur[3] == '-')) {
if ((!terminate) &&
- (htmlParseLookupSequence(ctxt, '-', '-', '>', 1, 1) < 0))
+ (htmlParseLookupSequence(ctxt, "-->", 1, 1) < 0))
goto done;
#ifdef DEBUG_PUSH
xmlGenericError(xmlGenericErrorContext,
@@ -5518,7 +5492,7 @@ htmlParseTryOrFinish(htmlParserCtxtPtr ctxt, int terminate) {
ctxt->instate = XML_PARSER_MISC;
} else if ((cur == '<') && (next == '?')) {
if ((!terminate) &&
- (htmlParseLookupSequence(ctxt, '>', 0, 0, 0, 1) < 0))
+ (htmlParseLookupSequence(ctxt, ">", 0, 1) < 0))
goto done;
#ifdef DEBUG_PUSH
xmlGenericError(xmlGenericErrorContext,
@@ -5532,7 +5506,7 @@ htmlParseTryOrFinish(htmlParserCtxtPtr ctxt, int terminate) {
(UPP(6) == 'Y') && (UPP(7) == 'P') &&
(UPP(8) == 'E')) {
if ((!terminate) &&
- (htmlParseLookupSequence(ctxt, '>', 0, 0, 0, 1) < 0))
+ (htmlParseLookupSequence(ctxt, ">", 0, 1) < 0))
goto done;
#ifdef DEBUG_PUSH
xmlGenericError(xmlGenericErrorContext,
@@ -5568,7 +5542,7 @@ htmlParseTryOrFinish(htmlParserCtxtPtr ctxt, int terminate) {
if ((cur == '<') && (next == '!') &&
(in->cur[2] == '-') && (in->cur[3] == '-')) {
if ((!terminate) &&
- (htmlParseLookupSequence(ctxt, '-', '-', '>', 1, 1) < 0))
+ (htmlParseLookupSequence(ctxt, "-->", 1, 1) < 0))
goto done;
#ifdef DEBUG_PUSH
xmlGenericError(xmlGenericErrorContext,
@@ -5578,7 +5552,7 @@ htmlParseTryOrFinish(htmlParserCtxtPtr ctxt, int terminate) {
ctxt->instate = XML_PARSER_PROLOG;
} else if ((cur == '<') && (next == '?')) {
if ((!terminate) &&
- (htmlParseLookupSequence(ctxt, '>', 0, 0, 0, 1) < 0))
+ (htmlParseLookupSequence(ctxt, ">", 0, 1) < 0))
goto done;
#ifdef DEBUG_PUSH
xmlGenericError(xmlGenericErrorContext,
@@ -5615,7 +5589,7 @@ htmlParseTryOrFinish(htmlParserCtxtPtr ctxt, int terminate) {
if ((cur == '<') && (next == '!') &&
(in->cur[2] == '-') && (in->cur[3] == '-')) {
if ((!terminate) &&
- (htmlParseLookupSequence(ctxt, '-', '-', '>', 1, 1) < 0))
+ (htmlParseLookupSequence(ctxt, "-->", 1, 1) < 0))
goto done;
#ifdef DEBUG_PUSH
xmlGenericError(xmlGenericErrorContext,
@@ -5625,7 +5599,7 @@ htmlParseTryOrFinish(htmlParserCtxtPtr ctxt, int terminate) {
ctxt->instate = XML_PARSER_EPILOG;
} else if ((cur == '<') && (next == '?')) {
if ((!terminate) &&
- (htmlParseLookupSequence(ctxt, '>', 0, 0, 0, 1) < 0))
+ (htmlParseLookupSequence(ctxt, ">", 0, 1) < 0))
goto done;
#ifdef DEBUG_PUSH
xmlGenericError(xmlGenericErrorContext,
@@ -5689,7 +5663,7 @@ htmlParseTryOrFinish(htmlParserCtxtPtr ctxt, int terminate) {
break;
}
if ((!terminate) &&
- (htmlParseLookupSequence(ctxt, '>', 0, 0, 0, 1) < 0))
+ (htmlParseLookupSequence(ctxt, ">", 0, 1) < 0))
goto done;

/* Capture start position */
@@ -5836,7 +5810,7 @@ htmlParseTryOrFinish(htmlParserCtxtPtr ctxt, int terminate) {
int idx;
xmlChar val;

- idx = htmlParseLookupSequence(ctxt, '<', '/', 0, 0, 0);
+ idx = htmlParseLookupSequence(ctxt, "</", 0, 0);
if (idx < 0)
goto done;
val = in->cur[idx + 2];
@@ -5863,7 +5837,7 @@ htmlParseTryOrFinish(htmlParserCtxtPtr ctxt, int terminate) {
(UPP(6) == 'Y') && (UPP(7) == 'P') &&
(UPP(8) == 'E')) {
if ((!terminate) &&
- (htmlParseLookupSequence(ctxt, '>', 0, 0, 0, 1) < 0))
+ (htmlParseLookupSequence(ctxt, ">", 0, 1) < 0))
goto done;
htmlParseErr(ctxt, XML_HTML_STRUCURE_ERROR,
"Misplaced DOCTYPE declaration\n",
@@ -5872,8 +5846,7 @@ htmlParseTryOrFinish(htmlParserCtxtPtr ctxt, int terminate) {
} else if ((cur == '<') && (next == '!') &&
(in->cur[2] == '-') && (in->cur[3] == '-')) {
if ((!terminate) &&
- (htmlParseLookupSequence(
- ctxt, '-', '-', '>', 1, 1) < 0))
+ (htmlParseLookupSequence(ctxt, "-->", 1, 1) < 0))
goto done;
#ifdef DEBUG_PUSH
xmlGenericError(xmlGenericErrorContext,
@@ -5883,7 +5856,7 @@ htmlParseTryOrFinish(htmlParserCtxtPtr ctxt, int terminate) {
ctxt->instate = XML_PARSER_CONTENT;
} else if ((cur == '<') && (next == '?')) {
if ((!terminate) &&
- (htmlParseLookupSequence(ctxt, '>', 0, 0, 0, 1) < 0))
+ (htmlParseLookupSequence(ctxt, ">", 0, 1) < 0))
goto done;
#ifdef DEBUG_PUSH
xmlGenericError(xmlGenericErrorContext,
@@ -5954,7 +5927,7 @@ htmlParseTryOrFinish(htmlParserCtxtPtr ctxt, int terminate) {
if (avail < 2)
goto done;
if ((!terminate) &&
- (htmlParseLookupSequence(ctxt, '>', 0, 0, 0, 1) < 0))
+ (htmlParseLookupSequence(ctxt, ">", 0, 1) < 0))
goto done;
htmlParseEndTag(ctxt);
if (ctxt->nameNr == 0) {
--
2.25.1

Loading