MSN article returns NULL #926

Open
bayramn opened this issue Dec 7, 2024 · 1 comment

bayramn commented Dec 7, 2024

Readability simply returns null, without any errors, for this MSN article: https://www.msn.com/en-us/news/world/south-korean-president-apologizes-for-declaring-martial-law-as-he-faces-impeachment-vote/ar-AA1vpHO2

I'm new to the library. Is this expected when an article isn't readable, or is it a bug?

@danielnixon

I'm not a Readability maintainer but I've done a bit of scraping. You're going to have two main problems with msn.com:

  1. It loads the article text dynamically (with JS, after the main page has loaded). If you use a naive method of downloading the article (e.g. fetch, curl, wget), you'll only ever get the skeleton HTML, not the article content itself. The content comes in a subsequent request, as a JSON blob; watch the network tab in your browser's dev tools and you'll spot it.
  2. It puts the article text in a shadow DOM element (which Readability doesn't seem to extract).

To get past problem 1, you can use Playwright or similar browser automation (you may already be doing this, not sure). You'll need to wait for the page to have loaded the article JSON and written it into the DOM. Using waitUntil: "networkidle" in page.goto is discouraged but gets the job done; once that's working, you'll probably be better off waiting for a selector that you know only appears on the page once the article has loaded.
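For example, a minimal Playwright sketch (articleUrl and the commented-out selector are placeholders, not anything MSN-specific I've verified):

    const { chromium } = require("playwright");

    (async () => {
      const browser = await chromium.launch();
      const page = await browser.newPage();

      // "networkidle" works but is discouraged; prefer waiting for a selector
      // that only shows up once the article JSON has been written into the DOM.
      await page.goto(articleUrl, { waitUntil: "networkidle" });
      // e.g. await page.waitForSelector("some-article-selector"); // placeholder

      const html = await page.content(); // see below for the shadow DOM caveat

      await browser.close();
    })();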

Problem 2 is trickier. To solve it, you'll need to run a script like this one to extract the content from the shadow DOM nodes. You can run the script in the context of the Playwright page with page.evaluate(thatScript). I use a slight variation of that script that can handle being passed the document node, so I can just call return extractHTML(document); to serialize the whole page, including the html and head elements. The main addition you'd need to make to that script is basically:

    // If handed the whole Document, recurse into its root <html> element.
    if (node instanceof Document) {
      return extractHTML(node.documentElement);
    }

just before the "// beyond here, only deal with element nodes" bail-out.
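For illustration only, here's a rough sketch of the same idea (not the linked script): a recursive serializer that descends into shadow roots where they exist.

    function extractHTML(node) {
      if (node instanceof Document) {
        return extractHTML(node.documentElement);
      }
      if (node.nodeType === Node.TEXT_NODE) {
        return node.textContent;
      }
      // Beyond here, only deal with element nodes.
      if (node.nodeType !== Node.ELEMENT_NODE) {
        return "";
      }
      // Prefer the shadow content when the element has a shadow root attached.
      const children = node.shadowRoot ? node.shadowRoot.childNodes : node.childNodes;
      const inner = Array.from(children).map(extractHTML).join("");
      const attrs = Array.from(node.attributes)
        .map((a) => ` ${a.name}="${a.value}"`)
        .join("");
      const tag = node.tagName.toLowerCase();
      return `<${tag}${attrs}>${inner}</${tag}>`;
    }

In practice you'd run the whole thing inside page.evaluate so it executes in the browser and returns the serialized string.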

Once you've got that HTML back from Playwright, pass it to Readability and it will work.
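Assuming you're in Node, that hand-off could look something like this (jsdom shown as one option; html and articleUrl come from the earlier sketch):

    const { JSDOM } = require("jsdom");
    const { Readability } = require("@mozilla/readability");

    // html is the string serialized out of the Playwright page.
    const dom = new JSDOM(html, { url: articleUrl });
    const article = new Readability(dom.window.document).parse();
    console.log(article && article.title);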

You may want to remove elements matching this selector from the DOM first: ".article-video-slot, video-card". Otherwise, you'll see (useless) video player elements at the top of the article produced by Readability. But that's just a minor point relative to everything else.
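In the page (or on the jsdom document), that's just something like:

    document
      .querySelectorAll(".article-video-slot, video-card")
      .forEach((el) => el.remove());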

Problem 1 isn't really a problem for Readability to solve. Problem 2 arguably might be. It would be nice if Readability were able to dive into shadow DOM elements. Maybe _getNextNode should check node.shadowRoot && node.shadowRoot.firstElementChild as well as just node.firstElementChild.
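As a sketch of the idea only (not an actual patch; the real traversal also walks siblings and parent chains), the child-step could prefer shadow content when it's present:

    // Hypothetical helper illustrating the suggestion, not Readability code.
    function firstChildIncludingShadow(node) {
      if (node.shadowRoot && node.shadowRoot.firstElementChild) {
        return node.shadowRoot.firstElementChild;
      }
      return node.firstElementChild;
    }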
