
[WIP] Ignore entities. #273

Open · wants to merge 2 commits into main
Conversation

@benubois (Contributor)

This change will make Mercury ignore HTML entities and special characters.

Here's an example of the output change.

This is a big change in behavior and breaks about 80 tests. I'm happy to go through and update the tests, but wanted to make sure that's the right direction.

For example, a lot of the tests rely on the old default behavior, where cheerio would convert HTML entities back into characters. Take the title:

AT&amp;T just declared war on an open internet (and us)

Would now come through as exactly that, instead of the old behavior where it would result in:

AT&T just declared war on an open internet (and us)

I think this is a better behavior, but it would be surprising if you were relying on the old way.

The big advantage here is that now everything that was getting converted to entities (just about every non-ASCII character) will be left alone.
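As a minimal illustration of the difference (a sketch, not Mercury's actual code; the tiny decoder below stands in for cheerio's default `decodeEntities: true` and handles only a few named entities):

```javascript
// Stand-in for cheerio's default entity decoding (illustrative only).
const decodeEntities = (html) =>
  html
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, '&'); // decode &amp; last to avoid double-decoding

const title = 'AT&amp;T just declared war on an open internet (and us)';

// Old default behavior: entities are converted back into characters.
console.log(decodeEntities(title)); // AT&T just declared war on an open internet (and us)

// Proposed behavior (decodeEntities: false): the text is left alone.
console.log(title); // AT&amp;T just declared war on an open internet (and us)
```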

@adampash (Contributor)

Hmm, this is an interesting question, and you're right, it certainly changes expectations. One other thing, just by looking at the test output, is that it's also complicated by the browser version, where that cheerio option isn't relevant. (In the browser, cheerio is replaced by a slightly shimmed version of jQuery.)

> The big advantage here is that now everything that was getting converted to entities (just about every non-ASCII character) will be left alone.

In terms of the output, what are the cons of the current behavior if you were to decode the one time, on initial load?

Also, it's definitely a big decision to put the decoding on the user. Ideally we'd want Mercury to return the output as human-friendly as possible, since, even though Mercury is for computers, the goal is more or less to let our software see the page more like humans do.

I'm very interested in your thoughts on this, though, and really appreciate the perspective.

@benubois (Contributor, Author)

> One other thing, just by looking at the test output, is that it's also complicated by the browser version, where that cheerio option isn't relevant. (In the browser, cheerio is replaced by a slightly shimmed version of jQuery.)

I think the browser version (assuming the Mercury Reader extension for Chrome is using that) already works without any entity encoding conversions.

(screenshot from the Mercury Reader extension, 2019-02-13)

> In terms of the output, what are the cons of the current behavior if you were to decode the one time, on initial load?

Actually, it looks like passing decodeEntities: false only to cheerio.load() and not $.html() has the same effect as passing it to both.

Doing the reverse, passing decodeEntities: false only on output, causes the most surprising behavior. I parsed this XSS Filter Evasion Cheat Sheet with mercury, and entities that were already HTML-encoded were converted back into unencoded characters. For example, &gt; got converted to >. On this page this is especially bad because it contains so many JavaScript code samples. These became executable once they went through mercury.
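The hazard described here can be sketched in isolation (illustrative only; the tiny decoder below stands in for what an output-time entity decode does):

```javascript
// Illustrative stand-in for an output-time entity decode (not Mercury's code).
const decode = (s) =>
  s.replace(/&lt;/g, '<').replace(/&gt;/g, '>').replace(/&amp;/g, '&');

// An article quoting a script tag as escaped, inert text:
const escaped = '&lt;script&gt;alert(1)&lt;/script&gt;';

// Decoding on output turns it back into live markup:
console.log(decode(escaped)); // <script>alert(1)</script>
```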

> Also, it's definitely a big decision to put the decoding on the user. Ideally we'd want Mercury to return the output as human-friendly as possible, since, even though Mercury is for computers, the goal is more or less to let our software see the page more like humans do.

Maybe a hybrid would be best? What if only the content key adopted the new behavior of decodeEntities: false, and all other keys retained the current behavior. I would assume content always gets used in an HTML context (browsers/web views) but other keys like title might be used in other contexts (like native views) where HTML encoded characters would look wrong unless the authors knew to HTML decode them.
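A rough sketch of what that hybrid could look like (a hypothetical post-processing helper, not Mercury's API; field names are illustrative):

```javascript
// Hypothetical post-processing step: decode entities in the plain-text
// fields, but leave the HTML `content` field untouched.
const decodeText = (s) =>
  s
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, '&');

function normalizeResult(result) {
  const out = { ...result };
  for (const key of Object.keys(out)) {
    if (key !== 'content' && typeof out[key] === 'string') {
      out[key] = decodeText(out[key]);
    }
  }
  return out;
}

const r = normalizeResult({
  title: 'AT&amp;T just declared war',
  content: '<p>AT&amp;T just declared war</p>',
});
// r.title   -> 'AT&T just declared war'            (decoded for non-HTML contexts)
// r.content -> '<p>AT&amp;T just declared war</p>' (left encoded for HTML)
```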

@adampash (Contributor)

> Maybe a hybrid would be best? What if only the content key adopted the new behavior of decodeEntities: false, and all other keys retained the current behavior. I would assume content always gets used in an HTML context (browsers/web views) but other keys like title might be used in other contexts (like native views) where HTML encoded characters would look wrong unless the authors knew to HTML decode them.

This seems to me like a fair compromise. With the exception of content, everything else is just plain text.

The only other thing to consider is the newly added contentType option, which allows the user to request that the content be converted to markdown or plain text:

https://github.com/postlight/mercury-parser/blob/b044cfa958014beae03c6f69a3b32499abe765af/src/mercury.js#L87-L92

At that point it might be sensible to also adopt the new behavior before returning.

@adampash (Contributor)

@benubois Thanks again for looking into this. Just let me know when you've got something you'd like a review on.

@benubois (Contributor, Author)

@adampash actually I'm a bit stuck on how best to proceed and could use your help! My latest attempt was to decode entities back into characters for the text keys.

However, this resulted in a number of side effects. For example, some extractors look for HTML attributes that contain JSON. When decodeEntities: false is passed, these attributes look like this:

value="{&quot;postID&quot;:34002,&quot;authorType&quot;:&quot;Fan Contributor&quot;,&quot;isVideo&quot;:false,&quot;postType&quot;:&quot;post&quot;}"

But in order for the extraction to work, it needs decodeEntities: true so the value can be parsed as JSON:

value='{"postID":34002, "authorType": "Fan Contributor", "isVideo": false, "postType": "post"}'
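One possible workaround for the attribute case (hypothetical, not what the extractors do today): decode just the attribute value before handing it to JSON.parse.

```javascript
// Hypothetical helper: decode only the entities that block JSON.parse.
const decodeAttr = (s) => s.replace(/&quot;/g, '"').replace(/&amp;/g, '&');

const raw = '{&quot;postID&quot;:34002,&quot;isVideo&quot;:false}';
const parsed = JSON.parse(decodeAttr(raw));
console.log(parsed.postID); // 34002
```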

It also causes some problems with date parsing. Maybe because some encoded entities look like years? i.e. &#8212;

I can think of a few options for how to proceed:

  1. Continue with this idea of decoding on output. There are about 9 failing tests, but some of them seem tricky to resolve.
  2. Revert back to decodeEntities: true as the default, but also do a second extraction just for the content with decodeEntities: false. I'm not really sure what would be needed to make this happen; it seems like a big change.

Any thoughts on these approaches or alternatives that come to mind?

@adampash (Contributor)

Hmm, you're right, this is a tough one to resolve. Even in the failing tests for the images, there are query parameters in the URL that actually make a difference in what the image looks like. (I.e., cropping params.)

I'm leaning toward option 2, though like you said, it's potentially a big change, and loading cheerio twice is its own potential performance hit (though it may be negligible; in my experience, cheerio is pretty fast).

If we went that route, Resource.create could return two documents: the decoded and non-decoded docs. Then you'd pass both to the RootExtractor, which would just need to make sure to pass the latter specifically to the content extractor.

All things considered, it doesn't seem like it'd be a huge change (unless I'm missing something or being overly optimistic). If you have time, do you want to look into it? If you could also take a stab at whether or not you're noticing a performance hit, that would be interesting to see. (I'm also optimistic that it might not be that big of a deal.)
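The shape of option 2 might look something like this (names are illustrative, not Mercury's actual API; a simple string transform stands in for the two cheerio.load calls):

```javascript
// Stand-in for cheerio.load with the two decodeEntities settings
// (the real calls would be cheerio.load(html, { decodeEntities })).
const loadDoc = (html, { decodeEntities }) =>
  decodeEntities ? html.replace(/&amp;/g, '&') : html;

// Hypothetical Resource.create returning both parses: the decoded doc
// for plain-text fields, the raw doc for the content extractor only.
function createResource(html) {
  return {
    decodedDoc: loadDoc(html, { decodeEntities: true }),  // title, author, excerpt, ...
    rawDoc: loadDoc(html, { decodeEntities: false }),     // content
  };
}

const { decodedDoc, rawDoc } = createResource('AT&amp;T');
// decodedDoc -> 'AT&T'
// rawDoc     -> 'AT&amp;T'
```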

Either way, let me know what you think! And thanks again for looking into this.

@benubois (Contributor, Author)

Before going further with this I wanted to check if it's needed.

Since the content is displayed as HTML, what are the downsides if the characters are left encoded?

Would be interested to hear your thoughts @HenryQW.

@HenryQW commented Feb 21, 2019

It’s probably fine for small use cases; I just recalled my old solution to this problem:

https://github.com/HenryQW/OpenCC.henry.wang/blob/07909d16abe4644efd068a01cf8acdef136871a0/route.js#L7-L21

The downside is the performance: I’ve tested it, and running replace() with multiple regexes over full HTML is really, really computationally expensive.
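One common mitigation for that cost (a general technique, not taken from the linked code): collapse the chained replace() calls into a single pass with one alternation and a lookup table, so the document is scanned once instead of once per entity.

```javascript
// Single-pass entity replacement: one regex alternation plus a map,
// instead of N chained .replace() calls that each rescan the string.
const ENTITY_MAP = { '&lt;': '<', '&gt;': '>', '&quot;': '"', '&amp;': '&' };
const decodeOnce = (s) =>
  s.replace(/&(?:lt|gt|quot|amp);/g, (m) => ENTITY_MAP[m]);

console.log(decodeOnce('&lt;b&gt;AT&amp;T&lt;/b&gt;')); // <b>AT&T</b>
```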

My original plan was to use Mercury in RSSHub.

@adampash (Contributor) commented Feb 26, 2019

@benubois If this is only a problem for reading the HTML before it's actually rendered on a page, I agree that maybe it's not a huge concern. The one use case where I might want to handle it more gracefully is the new alternate output options for content type:

https://github.com/postlight/mercury-parser/blob/ed14203e9788ca3975060d6eb1f1d33642ac15d7/src/mercury.js#L87-L92

What do you think about that?

@mtashley mtashley closed this Aug 21, 2019
@mtashley mtashley reopened this Aug 23, 2019