Publish: HTML5 entities not supported due to lxml/libxml2. Should we switch to a different parser? #1535
Comments
Interesting! Looking at the Bridgy log, the mf2 parser ended up with two
{
"type": [
"h-entry"
],
"properties": {
"url": [
"https://jp.caruana.fr/notes/2023/08/10/not-sure-this-post-https-jp/"
],
"content": [
"Not sure. This post https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/ is 501 chars long and https://indieweb.social/@jpcaruana/110836774445084366 does not contain the last sentence.\nWhen was the production deploy for this update? Maybe I just posted this before? It posted it on Aug 5th, 13:29 CEST (4:29 PDT if I didn\u2019t mess up with time zones).\nI haven\u2019t posted anything longer than 500 chars since… I\u2019ll give it another try.",
{
"html": "<h1></h1><p>Not sure. This post <a href=\"https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/\">https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/</a> is 501 chars long and <a href=\"https://indieweb.social/@jpcaruana/110836774445084366\">https://indieweb.social/@jpcaruana/110836774445084366</a> does not contain the last sentence.</p><p>When was the production deploy for this update? Maybe I just posted this <em>before</em>? It posted it on Aug 5th, 13:29 CEST (4:29 PDT if I didn\u2019t mess up with time zones).</p><p>I haven\u2019t posted anything longer than 500 chars since&mldr; I\u2019ll give it another try.</p>",
"value": "Not sure. This post https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/ is 501 chars long and https://indieweb.social/@jpcaruana/110836774445084366 does not contain the last sentence.\nWhen was the production deploy for this update? Maybe I just posted this before? It posted it on Aug 5th, 13:29 CEST (4:29 PDT if I didn\u2019t mess up with time zones).\nI haven\u2019t posted anything longer than 500 chars since… I\u2019ll give it another try."
}
"..."
] |
A few more data points:
|
Oops, I was wrong. I looked at https://jp.caruana.fr/notes/2023/08/10/not-sure-this-post-https-jp/ in browser dev tools, and evidently that shows me the contents after decoding HTML entities. curl and Python requests both show that the response actually has multiple HTML entities, notably both (Btw @jpcaruana it looks like the reason you have multiple values for |
Looks like this is a BeautifulSoup thing. Bridgy and granary fetch posts with requests: Lines 66 to 71 in b26c885
...then parse it manually with BeautifulSoup, then pass that to mf2py:
if isinstance(input, requests.Response):
    content_type = input.headers.get('content-type') or ''
    input = input.text if 'charset' in content_type else input.content
return bs4.BeautifulSoup(input, **kwargs)

def parse_mf2(input, url=None, id=None):
    ...
    return mf2py.parse(url=url, doc=input, img_with_alt=True)
When I do this outside of Bridgy/granary, BeautifulSoup converts
>>> resp = util.requests_get('https://jp.caruana.fr/notes/2023/08/10/not-sure-this-post-https-jp/')
>>> print(resp.text)
...
<div class="p-name p-content e-content"><h1></h1><p>Not sure. This post <a href=https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/>https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/</a> is 501 chars long and <a href=https://indieweb.social/@jpcaruana/110836774445084366>https://indieweb.social/@jpcaruana/110836774445084366</a> does not contain the last sentence.</p><p>When was the production deploy for this update? Maybe I just posted this <em>before</em>? It posted it on Aug 5th, 13:29 CEST (4:29 PDT if I didn’t mess up with time zones).</p><p>I haven’t posted anything longer than 500 chars since… I’ll give it another try.</p></div>
...
>>> soup = bs4.BeautifulSoup(resp.text)
>>> print(soup)
...
<div class="p-name p-content e-content"><h1></h1><p>Not sure. This post <a href="https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/">https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/</a> is 501 chars long and <a href="https://indieweb.social/@jpcaruana/110836774445084366">https://indieweb.social/@jpcaruana/110836774445084366</a> does not contain the last sentence.</p><p>When was the production deploy for this update? Maybe I just posted this <em>before</em>? It posted it on Aug 5th, 13:29 CEST (4:29 PDT if I didn’t mess up with time zones).</p><p>I haven’t posted anything longer than 500 chars since&mldr; I’ll give it another try.</p></div>
...
@capjamesg @angelogladding any thoughts here? |
Hmm, maybe it's an lxml thing?
>>> from bs4.diagnose import diagnose
>>> diagnose(resp.text)
Diagnostic running on Beautiful Soup 4.12.2
Python version 3.9.16 (main, Dec 7 2022, 10:06:04)
[Clang 14.0.0 (clang-1400.0.29.202)]
Found lxml version 4.9.3.0
Found html5lib version 1.1
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
...
I haven’t posted anything longer than 500 chars since… I’ll give it another try...
--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
...
I haven’t posted anything longer than 500 chars since… I’ll give it another try...
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
...
I haven’t posted anything longer than 500 chars since&mldr; I’ll give it another try.
... |
Looks like maybe yes.
>>> print(bs4.BeautifulSoup(text, 'html.parser'))
...
<p>I haven’t posted anything longer than 500 chars since… I’ll give it another try.</p>
...
>>> print(bs4.BeautifulSoup(text, 'html5lib'))
...
<p>I haven’t posted anything longer than 500 chars since… I’ll give it another try.</p>
...
>>> print(bs4.BeautifulSoup(text, 'lxml'))
...
<p>I haven’t posted anything longer than 500 chars since&mldr; I’ll give it another try.</p>
... |
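For comparison, the html.parser result above can be approximated with the standard library alone, since BeautifulSoup's 'html.parser' backend builds on Python's own `html.parser` module. This is only a minimal sketch, not Bridgy or granary code: `TextCollector` is a hypothetical helper, and the sample markup is a shortened paragraph whose entities are assumed to match the ones in the post.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects text nodes with character references decoded."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode named and numeric
        # character references before handing text to handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Assumed sample markup, shortened from the post discussed in this thread.
markup = (
    "<p>I haven&rsquo;t posted anything longer than 500 chars "
    "since&mldr; I&rsquo;ll give it another try.</p>"
)

collector = TextCollector()
collector.feed(markup)
collector.close()
text = "".join(collector.chunks)
print(text)
```

If this behaves the same way on your Python version, both `&rsquo;` and the HTML5-only `&mldr;` come out as real characters, which matches the html.parser and html5lib results in the thread.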
I see |
@snarfed I was thinking about this and my first intuition was to try another parser like |
Thank you. I know I had strange issues back in the day, so I ended up duplicating IndieWeb class names. I'll take a look at it. Which one is, in your opinion, the "one to keep"? |
Generally |
Crystal clear, thank you. Fixed on my side :) |
...aaand filed https://gitlab.gnome.org/GNOME/libxml2/-/issues/580 |
Evidently the root cause is that libxml2 only supports HTML4, not HTML5, even 15y later 🤦. They have a 2y old issue tracking adding HTML5 support, with some discussion, but no obvious progress. Sigh. https://gitlab.gnome.org/GNOME/libxml2/-/issues/211 |
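To illustrate the HTML4 versus HTML5 gap, Python's `html.entities` module happens to ship both entity tables: `name2codepoint` (HTML 4) and `html5` (HTML5 named character references). The sketch below does not inspect libxml2's own table; it only uses Python's tables as a stand-in, and `entity_support` is a hypothetical helper name.

```python
import html
from html.entities import html5, name2codepoint


def entity_support(name):
    """Return (known_in_html4, known_in_html5) for an entity name without '&' or ';'."""
    return name in name2codepoint, f"{name};" in html5


for name in ("hellip", "mldr", "rsquo"):
    in_html4, in_html5 = entity_support(name)
    print(f"&{name}; HTML4={in_html4} HTML5={in_html5}")

# html.unescape() uses the HTML5 table, so it decodes &mldr; to U+2026.
decoded = html.unescape("since&mldr;")
print(decoded)
```

Under these tables, `&hellip;` and `&mldr;` both name U+2026, but only `&hellip;` exists in HTML 4, which is consistent with an HTML4-only parser leaving `&mldr;` undecoded.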
Added this to the Bridgy docs: b37f45c, https://brid.gy/about#html-entities |
Wow. That is surprising. I think switching to a different parser sounds wise; the user shouldn't have to bear the burden of seeing malformed markup. |
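Short of switching parsers, one possible stopgap is to rewrite HTML5-only named entities as numeric character references before the markup reaches an HTML4-only parser. This is a hedged sketch and not something Bridgy, granary, or mf2py is known to do: `numericize_html5_only_entities` is a hypothetical function, and it uses Python's `html.entities` tables as an approximation of what an HTML4 parser recognizes.

```python
import re
from html.entities import html5, name2codepoint

# Matches well-formed named references such as &mldr; (trailing semicolon required).
_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def numericize_html5_only_entities(markup):
    """Replace named entities known only to HTML5 with numeric references.

    Entities already in the HTML 4 table (e.g. &amp;, &rsquo;) and unknown
    names are left untouched.
    """

    def replace(match):
        name = match.group(1)
        if name in name2codepoint:
            return match.group(0)  # HTML4 parsers already understand this one
        chars = html5.get(name + ";")
        if chars is None:
            return match.group(0)  # not a known entity at all; leave as-is
        # Some HTML5 entities expand to more than one code point.
        return "".join(f"&#x{ord(ch):X};" for ch in chars)

    return _NAMED_ENTITY.sub(replace, markup)


sample = "since&mldr; I&rsquo;ll &amp; &bogus;"
print(numericize_html5_only_entities(sample))
```

Note that a regex pass like this runs over the whole document, including script and style contents, so it is a rough workaround rather than a substitute for an HTML5-aware parser.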
Do you know of any alternatives out there? |
You can use |
I found but did not test
|
@jpcaruana Looks interesting!
|
Thanks for the sleuthing, guys! Also I think BeautifulSoup's claims are something like 10y old or more. So they may still be true, but they may well not be. 🤷 |
I remember sknebel doing a bunch of performance work a while back https://github.com/microformats/mf2py/issues?q=label%3Aperformance+ - 5 years ago and some BeautifulSoup refactoring before that by kyle. It may be time to look at backing out of BeautifulSoup in favour of a modern parser like html5-parser, which would be more like the way the Go parser works, but that would be a big refactor. |
Thanks @kevinmarks! Not a top priority for me personally, but moving away from BeautifulSoup might only be a medium sized refactor, spread across a few packages, and we could do them independently. |
libxml2 reported a bunch of progress on HTML5 support recently: https://gitlab.gnome.org/GNOME/libxml2/-/issues/211#note_2241407 |
Hi,
I posted https://jp.caruana.fr/notes/2023/08/10/not-sure-this-post-https-jp/ and Bridgy posted it to mastodon https://indieweb.social/@jpcaruana/110864216016868215.
In the process, the … char is displayed as &mldr; (unicode U+2026, Horizontal Ellipsis) in the mastodon post. Emojis work fine though, see https://indieweb.social/@jpcaruana/110849548714309169 for instance.
How could I help here?