Skip to content
This repository has been archived by the owner on Apr 12, 2024. It is now read-only.

fix(ngSanitize): follow HTML parser rules for start tags / allow < in text content #8212

Closed
wants to merge 2 commits into from

Conversation

caitp
Copy link
Contributor

@caitp caitp commented Jul 16, 2014

ngSanitize will now permit opening braces in text content, provided they
are not followed by either an unescaped backslash, or by an ASCII letter
(u+0041 - u+005A, u+0061 - u+007A), in compliance with rules of the parsing
spec, without taking insertion mode into account.

BREAKING CHANGE

Previously, $sanitize would "fix" invalid markup in which a space preceded
alphanumeric characters in a start-tag. Following this change, any opening
angle bracket which is not followed by either a forward slash, or by an
ASCII letter (a-z | A-Z) will not be considered a start tag delimiter, per
the HTML parsing spec
(http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html).

Closes #8193

sylvain-hamel and others added 2 commits July 15, 2014 13:05
… text content

ngSanitize will now permit opening braces in text content, provided they are not followed by either
an unescaped backslash, or by an ASCII letter (u+0041 - u+005A, u+0061 - u+007A), in compliance with
rules of the parsing spec, without taking insertion mode into account.

BREAKING CHANGE

Previously, $sanitize would "fix" invalid markup in which a space preceded alphanumeric characters
in a start-tag. Following this change, any opening angle bracket which is not followed by either a
forward slash, or by an ASCII letter (a-z | A-Z) will not be considered a start tag delimiter, per
the HTML parsing spec (http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html).
html = html.substring( match[0].length );
match[0].replace( START_TAG_REGEXP, parseStartTag );
// We only have a valid start-tag if there is a '>'.
if ( match[4] ) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @IgorMinar PTAL --- This particular block is only here to make sure that we throw if we find an apparent start-tag without a trailing >

This might not be the right thing to do --- if we don't have a trailing >, we could potentially just treat it as a text node. I'm not sure what the best thing to do in this case is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to treat as a text node. IMO the sanitizer should be secure but tolerant

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's fine.

it('should throw badparse if text content contains "<" followed by an ASCII letter without matching ">"', function() {
expect(function() {
htmlParser('foo <a bar', handler);
}).toThrowMinErr('$sanitize', 'badparse', 'The sanitizer was unable to parse the following block of html: <a bar');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I this really a bad text string? I would let it go as a text block. For instance:

In my math project I found that a<b when b=10

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as HTML parsing is concerned, /</[a-zA-Z/ is the start of a tag, so we shouldn't "fix" this, I think

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although arguably we are not trying to "parse" html here, only sanitize text that may be inadvertently parsed by a browser later

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this is right. we shouldn't try to fix broken html.

@petebacondarwin
Copy link
Contributor

Other than that LGTM

it('should accept tag delimiters such as "<" inside real tags', function() {
// Assert that the < is part of the text node content, and not part of a tag name.
htmlParser('<p> 10 < 100 </p>', handler);
expect(text).toEqual(' 10 < 100 ');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this < be encoded just to be safe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is encoded in the real world, however in the test, the chars handler just appends the value to a string

@IgorMinar
Copy link
Contributor

LGTM except for the one test where < is not encoded. how is it different from the test on line 87?

@caitp
Copy link
Contributor Author

caitp commented Jul 16, 2014

We're passing a handler to htmlParser() which just appends the parsed text to a string, to assert that a certain substring was treated as text content. That's why it's not encoded here. In the case of expectHTML(), where we're using the real parser, the value is encoded because of the handler used by $sanitize

@IgorMinar
Copy link
Contributor

I see. Thanks for the explanation. LGTM then.

@petebacondarwin
Copy link
Contributor

I still don't think that text containing a some b<a thing should throw an exception. It should just be sanitized to some b&lt;a thing

@caitp caitp closed this in f6681d4 Jul 16, 2014
@caitp
Copy link
Contributor Author

caitp commented Jul 16, 2014

@petebacondarwin maybe we should see how people react. I agree that it kind of sucks

caitp added a commit that referenced this pull request Jul 16, 2014
… text content

ngSanitize will now permit opening braces in text content, provided they are not followed by either
an unescaped backslash, or by an ASCII letter (u+0041 - u+005A, u+0061 - u+007A), in compliance with
rules of the parsing spec, without taking insertion mode into account.

BREAKING CHANGE

Previously, $sanitize would "fix" invalid markup in which a space preceded alphanumeric characters
in a start-tag. Following this change, any opening angle bracket which is not followed by either a
forward slash, or by an ASCII letter (a-z | A-Z) will not be considered a start tag delimiter, per
the HTML parsing spec (http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html).

Closes #8212
Closes #8193
caitp added a commit to caitp/angular.js that referenced this pull request Jul 16, 2014
… text content

ngSanitize will now permit opening braces in text content, provided they are not followed by either
an unescaped backslash, or by an ASCII letter (u+0041 - u+005A, u+0061 - u+007A), in compliance with
rules of the parsing spec, without taking insertion mode into account.

BREAKING CHANGE

Previously, $sanitize would "fix" invalid markup in which a space preceded alphanumeric characters
in a start-tag. Following this change, any opening angle bracket which is not followed by either a
forward slash, or by an ASCII letter (a-z | A-Z) will not be considered a start tag delimiter, per
the HTML parsing spec (http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html).

Closes angular#8212
Closes angular#8193
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants