MSN article returns NULL #926

Open
bayramn opened this issue Dec 7, 2024 · 1 comment

bayramn commented Dec 7, 2024

Readability simply returns null, without any errors, for this MSN article: https://www.msn.com/en-us/news/world/south-korean-president-apologizes-for-declaring-martial-law-as-he-faces-impeachment-vote/ar-AA1vpHO2

I'm new to the library. Is this expected when an article isn't readable, or is it a bug?

@danielnixon

I'm not a Readability maintainer but I've done a bit of scraping. You're going to have two main problems with msn.com:

  1. It loads the article text dynamically (with JS, after the main page has loaded). If you use a naive method of downloading the article (e.g. fetch, curl, wget), you'll only ever get the skeleton HTML, not the article content itself. The content comes in a subsequent request, as a JSON blob; watch the network tab in your browser's dev tools and you'll spot it.
  2. It puts the article text in a shadow DOM element (which Readability doesn't seem to extract).

To get past problem 1, you can use Playwright or similar browser automation (you may already be doing this, not sure). You'll need to wait for the page to have loaded the article JSON and written it into the DOM. Using waitUntil: "networkidle" in page.goto is discouraged but gets the job done; once that's working, you'll probably be better off waiting for a selector that you know only appears on the page once the article has loaded.
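For example, a minimal Playwright sketch (articleUrl and the commented-out selector are placeholders, not anything MSN-specific I've verified):

    const { chromium } = require("playwright");

    (async () => {
      const browser = await chromium.launch();
      const page = await browser.newPage();

      // "networkidle" works but is discouraged; prefer waiting for a selector
      // that only shows up once the article JSON has been written into the DOM.
      await page.goto(articleUrl, { waitUntil: "networkidle" });
      // e.g. await page.waitForSelector("some-article-selector"); // placeholder

      const html = await page.content(); // see below for the shadow DOM caveat

      await browser.close();
    })();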

Problem 2 is trickier. To solve it, you'll need to run a script like this one to extract the content from the shadow DOM nodes. You can run the script in the context of the Playwright page with page.evaluate(thatScript). I use a slight variation of that script that can handle being passed the document node, so I can just call return extractHTML(document); to serialize the whole page, including the html and head elements. The main addition you'd need to make to that script is basically:

    // If handed the whole Document, recurse into its root <html> element.
    if (node instanceof Document) {
      return extractHTML(node.documentElement);
    }

just before the "// beyond here, only deal with element nodes" bail-out.
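For illustration only, here's a rough sketch of the same idea (not the linked script): a recursive serializer that descends into shadow roots where they exist.

    function extractHTML(node) {
      if (node instanceof Document) {
        return extractHTML(node.documentElement);
      }
      if (node.nodeType === Node.TEXT_NODE) {
        return node.textContent;
      }
      // Beyond here, only deal with element nodes.
      if (node.nodeType !== Node.ELEMENT_NODE) {
        return "";
      }
      // Prefer the shadow content when the element has a shadow root attached.
      const children = node.shadowRoot ? node.shadowRoot.childNodes : node.childNodes;
      const inner = Array.from(children).map(extractHTML).join("");
      const attrs = Array.from(node.attributes)
        .map((a) => ` ${a.name}="${a.value}"`)
        .join("");
      const tag = node.tagName.toLowerCase();
      return `<${tag}${attrs}>${inner}</${tag}>`;
    }

In practice you'd run the whole thing inside page.evaluate so it executes in the browser and returns the serialized string.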

Once you've got that HTML back from Playwright, pass it to Readability and it will work.
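Assuming you're in Node, that hand-off could look something like this (jsdom shown as one option; html and articleUrl come from the earlier sketch):

    const { JSDOM } = require("jsdom");
    const { Readability } = require("@mozilla/readability");

    // html is the string serialized out of the Playwright page.
    const dom = new JSDOM(html, { url: articleUrl });
    const article = new Readability(dom.window.document).parse();
    console.log(article && article.title);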

You may want to remove elements matching this selector from the DOM first: ".article-video-slot, video-card". Otherwise, you'll see (useless) video player elements at the top of the article produced by Readability. But that's just a minor point relative to everything else.
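In the page (or on the jsdom document), that's just something like:

    document
      .querySelectorAll(".article-video-slot, video-card")
      .forEach((el) => el.remove());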

Problem 1 isn't really a problem for Readability to solve. Problem 2 arguably might be. It would be nice if Readability were able to dive into shadow DOM elements. Maybe _getNextNode should check node.shadowRoot && node.shadowRoot.firstElementChild as well as just node.firstElementChild.
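As a sketch of the idea only (not an actual patch; the real traversal also walks siblings and parent chains), the child-step could prefer shadow content when it's present:

    // Hypothetical helper illustrating the suggestion, not Readability code.
    function firstChildIncludingShadow(node) {
      if (node.shadowRoot && node.shadowRoot.firstElementChild) {
        return node.shadowRoot.firstElementChild;
      }
      return node.firstElementChild;
    }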
