[WIP] Ignore entities. #273
base: main
Hmm, this is an interesting question, and you're right, it certainly changes expectations. One other thing, just from looking at the test output, is that it's also complicated by the browser version, where that cheerio option isn't relevant. (In the browser, cheerio is replaced by a slightly shimmed version of jQuery.)
In terms of the output, what are the cons of the current behavior if you were to decode the one time, on initial load? Also, it's definitely a big decision to put the decoding on the user. Ideally we'd want Mercury to return the output as human-friendly as possible, since, even though Mercury is for computers, the goal is more or less to let our software see the page more like humans do. I'm very interested in your thoughts on this, though, and really appreciate the perspective.
I think the browser version (assuming the Mercury Reader extension for Chrome is using that) already works without any entity encoding conversions.
Actually, it looks like passing Doing the reverse, passing
Maybe a hybrid would be best? What if only the |
This seems to me like a fair compromise. With the exception of The only other thing to consider is the newly added At that point it might be sensible to also adopt the new behavior before returning. |
@benubois Thanks again for looking into this. Just let me know when you've got something you'd like a review on.
@adampash actually I'm a bit stuck on how best to proceed and could use your help! My latest attempt was to decode entities back into characters for the text keys. However, this resulted in a number of side effects. For example, some extractors look for HTML attributes that contain JSON. When value="{"postID":34002,"authorType":"Fan Contributor","isVideo":false,"postType":"post"}" But in order for the extraction to work, it needs value='{"postID":34002, "authorType": "Fan Contributor", "isVideo": false, "postType":"post"} It also causes some problems with date parsing. Maybe because some encoded entities look like years? i.e. I can think of a few options on how to proceed:
Any thoughts on these approaches or alternatives that come to mind?
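To make the attribute side effect above concrete, here is a minimal sketch in plain JavaScript. It does not use Mercury or cheerio: `decodeBasicEntities` and `readValueAttribute` are hypothetical helpers written only for this illustration, and the markup is a shortened version of the example attribute. It assumes the decoding is applied to the markup string before the attribute is read, which may not match exactly what the attempt did.

```javascript
// Hypothetical helper (not a Mercury or cheerio API): decodes a few named entities.
function decodeBasicEntities(text) {
  const named = { '&quot;': '"', '&#39;': "'", '&lt;': '<', '&gt;': '>' };
  // Decode &amp; last so "&amp;quot;" becomes "&quot;" rather than a quote.
  return text
    .replace(/&(?:quot|#39|lt|gt);/g, (match) => named[match])
    .replace(/&amp;/g, '&');
}

// Hypothetical helper: naive read of a double-quoted value attribute.
function readValueAttribute(markup) {
  const match = /value="([^"]*)"/.exec(markup);
  return match ? match[1] : null;
}

// Returns the parsed object, or null when the text is not valid JSON.
function tryParse(text) {
  try {
    return JSON.parse(text);
  } catch (err) {
    return null;
  }
}

const rawMarkup =
  '<input value="{&quot;postID&quot;:34002,&quot;isVideo&quot;:false}">';

// Decoding the markup first puts unescaped quotes inside the attribute,
// so the attribute value is cut off at the first inner quote.
const decodedMarkup = decodeBasicEntities(rawMarkup);
const fromDecodedMarkup = tryParse(readValueAttribute(decodedMarkup));

// Reading the attribute first and decoding only its value keeps the JSON intact.
const fromRawMarkup = tryParse(decodeBasicEntities(readValueAttribute(rawMarkup)));

console.log(fromDecodedMarkup); // null
console.log(fromRawMarkup.postID); // 34002
```

The point of the sketch is only the ordering: decoding is safe on an extracted attribute value or text node, but not on the serialized markup that extractors still need to query.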
Hmm, you're right, this is a tough one to resolve. Even in the failing tests for the images, there are query parameters in the URL that actually make a difference in what the image looks like. (I.e., cropping params.) I'm leaning toward option 2, though like you said, it's potentially a big change, and loading cheerio twice is its own potential performance hit (though it may be negligible; in my experience, cheerio is pretty fast). If we went that route, the All things considered, it doesn't seem like it'd be a huge change (unless I'm missing something or being overly optimistic). If you have time, do you want to look into it? If you could also take a stab at whether or not you're noticing a performance hit, that would be interesting to see. (I'm also optimistic that it might not be that big of a deal.) Either way, let me know what you think! And thanks again for looking into this. |
Before going further with this I wanted to check if it's needed. Since the content is displayed as HTML, what are the downsides if the characters are left encoded? Would be interested to hear your thoughts @HenryQW.
It’s probably fine with small use cases, I just recalled my old solution to this problem: The downside is the performance, as I’ve tested My original plan was to use Mercury in RSSHub. |
@benubois If this is only a problem for reading the HTML before it's actually rendered on a page, I agree that maybe it's not a huge concern. The one use case I might want to handle it more gracefully for would be the new alternate output options for content type: What do you think about that?
This change will make Mercury ignore HTML entities and special characters.
Here's an example of the output change.
This is a big change in behavior and breaks about 80 tests. I'm happy to go through and update the tests, but wanted to make sure that's the right direction.
A lot of the tests rely on the old default behavior, where cheerio would convert HTML entities back to characters:
For example, the title:
Would now come through as exactly that, instead of the old behavior where it would result in:
I think this is better behavior, but it would be surprising to anyone relying on the old way.
The big advantage here is that now everything that was getting converted to entities (just about every non-ASCII character) will be left alone.
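As a rough illustration of what "converted to entities" means for non-ASCII text, here is a small sketch in plain JavaScript. `encodeNonAscii` is a hypothetical stand-in written for this example, not cheerio's serializer, and the exact entity format cheerio emitted under the old behavior may differ from the hexadecimal form used here.

```javascript
// Hypothetical stand-in for the old serialization step: every character
// outside the ASCII range becomes a hexadecimal character reference.
function encodeNonAscii(text) {
  return Array.from(text, (ch) => {
    const codePoint = ch.codePointAt(0);
    return codePoint > 127
      ? `&#x${codePoint.toString(16).toUpperCase()};`
      : ch;
  }).join('');
}

const title = 'Café “au lait”';

// Old-style output: non-ASCII characters are rewritten as entities.
const encodedTitle = encodeNonAscii(title);

// New-style output: the string is passed through untouched.
const untouchedTitle = title;

console.log(encodedTitle); // Caf&#xE9; &#x201C;au lait&#x201D;
console.log(untouchedTitle); // Café “au lait”
```

With the change in this PR, the second form is what callers would receive, so accented letters and typographic quotes no longer need to be decoded again downstream.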