Publish: HTML5 entities not supported due to lxml/libxml2. Should we switch to a different parser? #1535

jpcaruana · 2023-08-10T07:52:48Z

Hi,

I posted https://jp.caruana.fr/notes/2023/08/10/not-sure-this-post-https-jp/ and Bridgy posted it to mastodon https://indieweb.social/@jpcaruana/110864216016868215.

In the process, the … character (Unicode U+2026, Horizontal Ellipsis) ended up displayed as &mldr; in the Mastodon post.

Emojis work fine though, see https://indieweb.social/@jpcaruana/110849548714309169 for instance.

How could I help here?

snarfed · 2023-08-10T17:49:55Z

Interesting!

Looking at the Bridgy log, the mf2 parser ended up with two content values from your post, one plain text with the &mldr; HTML entity, one html + text:

{
      "type": [
        "h-entry"
      ],
      "properties": {
        "url": [
          "https://jp.caruana.fr/notes/2023/08/10/not-sure-this-post-https-jp/"
        ],
        "content": [
          "Not sure. This post https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/ is 501 chars long and https://indieweb.social/@jpcaruana/110836774445084366 does not contain the last sentence.\nWhen was the production deploy for this update? Maybe I just posted this before? It posted it on Aug 5th, 13:29 CEST (4:29 PDT if I didn\u2019t mess up with time zones).\nI haven\u2019t posted anything longer than 500 chars since&mldr; I\u2019ll give it another try.",
          {
            "html": "<h1></h1><p>Not sure. This post <a href=\"https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/\">https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/</a> is 501 chars long and <a href=\"https://indieweb.social/@jpcaruana/110836774445084366\">https://indieweb.social/@jpcaruana/110836774445084366</a> does not contain the last sentence.</p><p>When was the production deploy for this update? Maybe I just posted this <em>before</em>? It posted it on Aug 5th, 13:29 CEST (4:29 PDT if I didn\u2019t mess up with time zones).</p><p>I haven\u2019t posted anything longer than 500 chars since&amp;mldr; I\u2019ll give it another try.</p>",
            "value": "Not sure. This post https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/ is 501 chars long and https://indieweb.social/@jpcaruana/110836774445084366 does not contain the last sentence.\nWhen was the production deploy for this update? Maybe I just posted this before? It posted it on Aug 5th, 13:29 CEST (4:29 PDT if I didn\u2019t mess up with time zones).\nI haven\u2019t posted anything longer than 500 chars since&mldr; I\u2019ll give it another try."
          }
          "..."
        ]

snarfed · 2023-08-10T18:17:46Z

A few more data points:

Looking at https://jp.caruana.fr/notes/2023/08/10/not-sure-this-post-https-jp/ right now, it has the … Unicode character directly in the content, inline
I tried previewing a publish on https://brid.gy/mastodon/@jpcaruana@indieweb.social, and it rendered correctly with the same … character inline
Interestingly, the log for that preview showed the same &mldr; character in the parsed content as above
Neither https://pin13.net/mf2/?url=https://jp.caruana.fr/notes/2023/08/10/not-sure-this-post-https-jp/ nor https://python.microformats.io/?url=https://jp.caruana.fr/notes/2023/08/10/not-sure-this-post-https-jp/ have the &mldr; in content, they both have it as escaped \u2026 Unicode instead
...which implies the &mldr; is maybe Bridgy's fault somehow

snarfed · 2023-08-10T18:45:07Z

Oops, I was wrong. I looked at https://jp.caruana.fr/notes/2023/08/10/not-sure-this-post-https-jp/ in browser dev tools, and evidently that shows me the contents after decoding HTML entities. curl and Python requests both show that the response actually has multiple HTML entities, notably both &rsquo; and &mldr; inside *-content. Interestingly, we decode &rsquo; ok, but not &mldr;.

(Btw @jpcaruana it looks like the reason you have multiple values for content is that you're using both p-content and e-content: <div class="p-name p-content e-content">. You probably only want one of those.)

snarfed · 2023-08-10T19:44:32Z

Looks like this is a BeautifulSoup thing. Bridgy and granary fetch posts with requests:

bridgy/webmention.py, lines 66 to 71 in b26c885:

    try:
      resp = util.requests_get(url)
      resp.raise_for_status()
    except werkzeug.exceptions.HTTPException:
      # raised by us, probably via self.error()
      raise

...then parse it manually with BeautifulSoup, then pass that to mf2py:

https://github.com/snarfed/webutil/blob/63be8a763a618d43e957c6d414c0f6de8f298184/util.py#L1917-L1985

  if isinstance(input, requests.Response):
    content_type = input.headers.get('content-type') or ''
    input = input.text if 'charset' in content_type else input.content

  return bs4.BeautifulSoup(input, **kwargs)

def parse_mf2(input, url=None, id=None):
  ...
  return mf2py.parse(url=url, doc=input, img_with_alt=True)

When I do this outside of Bridgy/granary, BeautifulSoup converts &rsquo; to ’ but doesn't recognize &mldr;:

>>> resp = util.requests_get('https://jp.caruana.fr/notes/2023/08/10/not-sure-this-post-https-jp/')
>>> print(resp.text)
...
<div class="p-name p-content e-content"><h1></h1><p>Not sure. This post <a href=https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/>https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/</a> is 501 chars long and <a href=https://indieweb.social/@jpcaruana/110836774445084366>https://indieweb.social/@jpcaruana/110836774445084366</a> does not contain the last sentence.</p><p>When was the production deploy for this update? Maybe I just posted this <em>before</em>? It posted it on Aug 5th, 13:29 CEST (4:29 PDT if I didn&rsquo;t mess up with time zones).</p><p>I haven&rsquo;t posted anything longer than 500 chars since&mldr; I&rsquo;ll give it another try.</p></div>
...
>>> soup = bs4.BeautifulSoup(resp.text)
>>> print(soup)
...
<div class="p-name p-content e-content"><h1></h1><p>Not sure. This post <a href="https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/">https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/</a> is 501 chars long and <a href="https://indieweb.social/@jpcaruana/110836774445084366">https://indieweb.social/@jpcaruana/110836774445084366</a> does not contain the last sentence.</p><p>When was the production deploy for this update? Maybe I just posted this <em>before</em>? It posted it on Aug 5th, 13:29 CEST (4:29 PDT if I didn’t mess up with time zones).</p><p>I haven’t posted anything longer than 500 chars since&amp;mldr; I’ll give it another try.</p></div>
...

@capjamesg @angelogladding any thoughts here?

snarfed · 2023-08-10T20:01:27Z

Hmm, maybe it's an lxml thing?

>>> from bs4.diagnose import diagnose
>>> diagnose(resp.text)
Diagnostic running on Beautiful Soup 4.12.2
Python version 3.9.16 (main, Dec  7 2022, 10:06:04)
[Clang 14.0.0 (clang-1400.0.29.202)]
Found lxml version 4.9.3.0
Found html5lib version 1.1

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
...
        I haven’t posted anything longer than 500 chars since… I’ll give it another try...
--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
...
        I haven’t posted anything longer than 500 chars since… I’ll give it another try...
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
...
        I haven’t posted anything longer than 500 chars since&amp;mldr; I’ll give it another try.
...

snarfed · 2023-08-10T20:04:10Z

Looks like maybe yes.

>>> print(bs4.BeautifulSoup(text, 'html.parser'))
...
<p>I haven’t posted anything longer than 500 chars since… I’ll give it another try.</p>
...
>>> print(bs4.BeautifulSoup(text, 'html5lib'))
...
<p>I haven’t posted anything longer than 500 chars since… I’ll give it another try.</p>
...
>>> print(bs4.BeautifulSoup(text, 'lxml'))
...
<p>I haven’t posted anything longer than 500 chars since&amp;mldr; I’ll give it another try.</p>
...
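The html.parser result above can be reproduced without BeautifulSoup or lxml installed. This is only a minimal sketch using Python's standard library `html.parser.HTMLParser`, which is the parser behind BeautifulSoup's 'html.parser' option; the `TextCollector` class and `extract_text` helper are names made up for this illustration, and the sample is a shortened copy of the post's last paragraph.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes. With convert_charrefs=True (the default),
    character references are decoded before handle_data is called."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup: str) -> str:
    """Hypothetical helper: return the concatenated text of the markup."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text that is still buffered
    return ''.join(parser.chunks)


sample = ('<p>I haven&rsquo;t posted anything longer than 500 chars '
          'since&mldr; I&rsquo;ll give it another try.</p>')
text = extract_text(sample)
print(text)
```

Both &rsquo; and &mldr; come out as real Unicode characters here, which matches the html.parser and html5lib rows and points at lxml as the odd one out.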

snarfed · 2023-08-10T20:13:36Z

I see mldr in WHATWG's list of entities and in the 2011 HTML spec. I don't see anything obvious searching lxml's bug tracker for mldr or for entities. So I guess the next step is to file a bug with them?
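For a quick local check of which entity names belong to which standard, Python's standard library ships both tables: `html.entities.name2codepoint` covers the HTML 4 names and `html.entities.html5` covers the HTML5 (WHATWG) names. A small sketch, where `entity_support` is a hypothetical helper written just for this comparison:

```python
from html.entities import html5, name2codepoint


def entity_support(name: str) -> tuple:
    """Return (known_in_html4, known_in_html5) for an entity name
    given without the leading '&' and trailing ';'."""
    return (name in name2codepoint, f'{name};' in html5)


for name in ('rsquo', 'hellip', 'mldr'):
    in_html4, in_html5 = entity_support(name)
    print(f'&{name}; HTML4={in_html4} HTML5={in_html5}')

# Both names resolve to the same character in the HTML5 table.
print(html5['mldr;'] == html5['hellip;'])
```

&rsquo; and &hellip; are in both tables, while &mldr; only appears in the HTML5 one, which is consistent with a parser that decodes &rsquo; but leaves &mldr; alone.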

capjamesg · 2023-08-10T20:23:51Z

@snarfed I was thinking about this and my first intuition was to try another parser like html5lib. It seems like an lxml issue.

snarfed · 2023-08-10T21:11:04Z

Filed https://bugs.launchpad.net/lxml/+bug/2031045

jpcaruana · 2023-08-11T09:37:43Z

> (Btw @jpcaruana it looks like the reason you have multiple values for content is that you're using both p-content and e-content: <div class="p-name p-content e-content">. You probably only want one of those.)

Thank you. I know I had strange issues back in the day, so I ended up duplicating IndieWeb class names. I'll take a look at it.

Which one is, in your opinion, the "one to keep"?

snarfed · 2023-08-11T16:23:18Z

Generally e-content, which preserves the inner HTML tags. Only use p-content if you want to collapse the value to plain text.

jpcaruana · 2023-08-12T09:47:38Z

Crystal clear, thank you. Fixed on my side :)

snarfed · 2023-08-13T20:26:43Z

...aaand filed https://gitlab.gnome.org/GNOME/libxml2/-/issues/580

snarfed · 2023-08-14T20:43:33Z

Evidently the root cause is that libxml2 only supports HTML4, not HTML5, even 15y later 🤦. They have a 2y old issue tracking adding HTML5 support, with some discussion, but no obvious progress. Sigh. https://gitlab.gnome.org/GNOME/libxml2/-/issues/211
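One possible stopgap while libxml2 stays HTML4-only, which is an assumption on my part and not something Bridgy, granary, or lxml does, would be to rewrite HTML5-only named entities as numeric character references before parsing, since numeric references don't depend on the parser's entity table. A rough sketch using only the standard library; `html5_entities_to_numeric` is a hypothetical name:

```python
import re
from html.entities import html5, name2codepoint

# Matches named character references like &mldr; (name captured without & and ;).
_NAMED_REF = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')


def html5_entities_to_numeric(markup: str) -> str:
    """Hypothetical helper: replace named entities that exist only in the
    HTML5 table with numeric references, leaving HTML 4 names untouched."""

    def replace(match):
        name = match.group(1)
        if name in name2codepoint:
            return match.group(0)  # HTML 4 name, an HTML4 parser knows it
        chars = html5.get(name + ';')
        if chars is None:
            return match.group(0)  # unknown name, leave as-is
        # Some HTML5 entities expand to more than one code point.
        return ''.join(f'&#{ord(c)};' for c in chars)

    return _NAMED_REF.sub(replace, markup)


sample = 'I haven&rsquo;t posted since&mldr; I&rsquo;ll try.'
converted = html5_entities_to_numeric(sample)
print(converted)
```

This works on the raw markup, so it would also rewrite matches inside script or style elements. It is a blunt workaround rather than a replacement for a real HTML5 parser.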

snarfed · 2023-08-15T02:09:00Z

Added this to the Bridgy docs: b37f45c, https://brid.gy/about#html-entities

capjamesg · 2023-08-15T09:16:18Z

Wow. That is surprising. I think switching to a different parser sounds wise; the user shouldn't have to foot the burden and see malformed markup.

jpcaruana · 2023-08-15T09:52:38Z

> Wow. That is surprising. I think switching to a different parser sounds wise; the user shouldn't have to foot the burden and see malformed markup.

Do you know of any alternatives out there?

capjamesg · 2023-08-15T09:58:01Z

You can use html5lib with BeautifulSoup for HTML5 parsing, but the BeautifulSoup documentation describes this parser as "Very slow".

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

jpcaruana · 2023-08-15T10:00:38Z

I found html5-parser but have not tested it. They claim it is fast:

> A fast implementation of the HTML 5 parsing spec for Python

https://html5-parser.readthedocs.io/en/latest/

capjamesg · 2023-08-15T13:28:29Z

@jpcaruana Looks interesting!

> There is a benchmark script named benchmark.py that compares the parse times for parsing a large (~ 5.7MB) HTML document in html5lib and html5-parser. The results on my system (using python 3) show a speedup of 37x.

snarfed · 2023-08-15T15:06:33Z

Thanks for the sleuthing, guys! Also, I think BeautifulSoup's claims are something like 10y old or more. So they may still be true, but they may well not be. 🤷

kevinmarks · 2023-08-15T15:25:03Z

I remember sknebel doing a bunch of performance work a while back https://github.com/microformats/mf2py/issues?q=label%3Aperformance+ - 5 years ago and some BeautifulSoup refactoring before that by kyle.

It may be time to look at backing out of BeautifulSoup in favour of a modern parser like html5-parser, which would be more like the way the Go parser works, but that would be a big refactor.

snarfed · 2023-08-15T15:36:23Z

Thanks @kevinmarks!

Not a top priority for me personally, but moving away from BeautifulSoup might only be a medium sized refactor, spread across a few packages, and we could do them independently.

snarfed · 2024-10-08T19:32:21Z

libxml2 reported a bunch of progress on HTML5 support recently: https://gitlab.gnome.org/GNOME/libxml2/-/issues/211#note_2241407

snarfed changed the title from "Some chars are not encoded correctly" to "Publish: HTML5 entities not supported due to lxml/libxml2. Should we switch to a different parser?" on Aug 15, 2023

snarfed added the backfeed label Mar 18, 2024
