fix(ngSanitize): follow HTML parser rules for start tags / allow < in text content #8212
… text content

ngSanitize will now permit opening angle brackets in text content, provided they are not followed by either a forward slash, or by an ASCII letter (u+0041 - u+005A, u+0061 - u+007A), in compliance with rules of the parsing spec, without taking insertion mode into account.

BREAKING CHANGE

Previously, $sanitize would "fix" invalid markup in which a space preceded alphanumeric characters in a start-tag. Following this change, any opening angle bracket which is not followed by either a forward slash, or by an ASCII letter (a-z | A-Z) will not be considered a start tag delimiter, per the HTML parsing spec (http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html).
html = html.substring( match[0].length );
match[0].replace( START_TAG_REGEXP, parseStartTag );
// We only have a valid start-tag if there is a '>'.
if ( match[4] ) {
/cc @IgorMinar PTAL. This particular block is only here to make sure that we throw if we find an apparent start-tag without a trailing `>`. This might not be the right thing to do: if we don't have a trailing `>`, we could potentially just treat it as a text node. I'm not sure what the best thing to do in this case is.
I think it is better to treat it as a text node. IMO the sanitizer should be secure but tolerant.
I think that's fine.
it('should throw badparse if text content contains "<" followed by an ASCII letter without matching ">"', function() {
expect(function() {
htmlParser('foo <a bar', handler);
}).toThrowMinErr('$sanitize', 'badparse', 'The sanitizer was unable to parse the following block of html: <a bar');
Is this really a bad text string? I would let it go as a text block. For instance:
In my math project I found that a<b when b=10
As far as HTML parsing is concerned, `/<[a-zA-Z]/` is the start of a tag, so we shouldn't "fix" this, I think.
Although arguably we are not trying to "parse" html here, only sanitize text that may be inadvertently parsed by a browser later
I think that this is right. We shouldn't try to fix broken HTML.
Other than that LGTM
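To make the disagreement in this thread concrete, here is a minimal sketch of the stricter reading, in which `<` followed by an ASCII letter or `/` is always treated as a tag opener and a missing `>` is an error. The function name `classifyFirstAngleBracket` and its return labels are hypothetical and are not part of ngSanitize; the sketch only looks at the first `<` in the input.

```javascript
// Hypothetical classifier for illustration only; not the ngSanitize parser.
// It inspects the first '<' in the input and reports how the stricter rule
// discussed above would treat it.
function classifyFirstAngleBracket(html) {
  var index = html.indexOf('<');
  if (index === -1) {
    return 'no-bracket';
  }
  var next = html.charAt(index + 1);
  // '<' not followed by '/' or an ASCII letter is plain text content.
  if (next !== '/' && !/[a-zA-Z]/.test(next)) {
    return 'text';
  }
  // It looks like a tag: without a closing '>' it is unterminated.
  return html.indexOf('>', index + 1) === -1 ? 'unterminated-tag' : 'tag';
}

console.log(classifyFirstAngleBracket('a<b when b=10')); // unterminated-tag
console.log(classifyFirstAngleBracket('foo <a bar'));    // unterminated-tag
console.log(classifyFirstAngleBracket(' 10 < 100 '));    // text
```

Under this reading the math example `a<b when b=10` lands in the same bucket as `foo <a bar`, which is why the test expects badparse rather than a text node.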
it('should accept tag delimiters such as "<" inside real tags', function() {
// Assert that the < is part of the text node content, and not part of a tag name.
htmlParser('<p> 10 < 100 </p>', handler);
expect(text).toEqual(' 10 < 100 ');
Shouldn't this `<` be encoded just to be safe?
It is encoded in the real world; in the test, however, the chars handler just appends the value to a string.
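The difference between the two kinds of handler can be sketched as follows. This is an illustration under assumptions: `encodeText`, `createRawHandler`, and `createEncodingHandler` are hypothetical names, and the real sanitizer's encoder may cover more characters than the three shown here.

```javascript
// Hypothetical encoder for illustration; the real sanitizer's encoder may
// handle more characters than these three.
function encodeText(value) {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Test-style handler: appends text verbatim, like the handler in the spec.
function createRawHandler() {
  var text = '';
  return {
    chars: function(value) { text += value; },
    output: function() { return text; }
  };
}

// Writer-style handler: encodes text before appending it.
function createEncodingHandler() {
  var text = '';
  return {
    chars: function(value) { text += encodeText(value); },
    output: function() { return text; }
  };
}

var raw = createRawHandler();
raw.chars(' 10 < 100 ');
var encoded = createEncodingHandler();
encoded.chars(' 10 < 100 ');
console.log(JSON.stringify(raw.output()));     // " 10 < 100 "
console.log(JSON.stringify(encoded.output())); // " 10 &lt; 100 "
```

The parser hands both handlers the same text node; only the handler decides whether the `<` is written out raw or as an entity, which is why the assertion in the test sees an unencoded `<`.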
LGTM except for the one test where |
We're passing a handler to |
I see. Thanks for the explanation. LGTM then.
I still don't think that text containing a |
@petebacondarwin maybe we should see how people react. I agree that it kind of sucks
ngSanitize will now permit opening angle brackets in text content, provided
they are not followed by either a forward slash, or by an ASCII letter
(u+0041 - u+005A, u+0061 - u+007A), in compliance with rules of the parsing
spec, without taking insertion mode into account.
BREAKING CHANGE
Previously, $sanitize would "fix" invalid markup in which a space preceded
alphanumeric characters in a start-tag. Following this change, any opening
angle bracket which is not followed by either a forward slash, or by an
ASCII letter (a-z | A-Z) will not be considered a start tag delimiter, per
the HTML parsing spec
(http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html).
Closes #8193
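
The rule in the commit message can be illustrated with a small scanner. This is a sketch, not the code from the patch: `findTagDelimiters` is a hypothetical name, and the function only applies the "followed by a forward slash or an ASCII letter" test to each `<`, ignoring everything else a real parser does (comments, attributes, insertion modes).

```javascript
// Hypothetical helper for illustration only; not the implementation in
// ngSanitize. Returns the positions of every '<' that would count as a tag
// delimiter, i.e. one followed by '/' or an ASCII letter.
function findTagDelimiters(html) {
  var positions = [];
  for (var i = 0; i < html.length; i++) {
    if (html.charAt(i) !== '<') {
      continue;
    }
    var next = html.charAt(i + 1);
    if (next === '/' || /[a-zA-Z]/.test(next)) {
      positions.push(i);
    }
  }
  return positions;
}

// The '<' at index 7 is followed by a space, so it stays text content.
console.log(findTagDelimiters('<p> 10 < 100 </p>')); // [ 0, 13 ]
console.log(findTagDelimiters('1 < 2'));             // []
```

Markup such as `< p>` falls in the second group: the space after `<` means it is no longer read as a start tag, which is the breaking change described above.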