MSN article returns NULL #926
Comments
I'm not a Readability maintainer but I've done a bit of scraping. You're going to have two main problems with msn.com:
To get past 1, you can use Playwright or similar browser automation. You may already be doing this; I'm not sure. You'll need to wait until the page has loaded the article JSON and written it into the DOM.
Problem 2 is trickier. To solve it, you'll need to run a script like this one to extract the content from the shadow DOM nodes. You can run the script in the context of the Playwright page with
if (node instanceof Document) {
return extractHTML(node.documentElement);
}
just before the
Once you've got that HTML from Playwright, pass it to Readability and it will work. You may want to remove elements matching this selector from the DOM first:
Problem 1 isn't really a problem for Readability to solve. Problem 2 arguably might be. It might be nice if Readability were able to dive into shadow DOM elements. Maybe
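For readers who can't see the linked script, here is a minimal sketch of the general idea (flattening open shadow roots into plain HTML). It is not the script referenced above: the name `serializeWithShadow` and the helper functions are made up for illustration. It also makes simplifying assumptions: closed shadow roots are unreachable, `<slot>` projection is ignored (a host's shadow content replaces its light DOM children), and `<script>`/`<style>` elements are dropped.

```javascript
// Node type constants (same values as Node.ELEMENT_NODE etc.),
// inlined so the sketch can also run outside a browser.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;
const DOCUMENT_NODE = 9;

// Elements that never have a closing tag in HTML.
const VOID_TAGS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
]);

// Elements skipped entirely (assumption: not needed for article text).
const SKIPPED_TAGS = new Set(["script", "style"]);

function escapeText(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function escapeAttr(s) {
  return escapeText(s).replace(/"/g, "&quot;");
}

// Hypothetical helper: serialize a node to HTML, descending into
// open shadow roots instead of stopping at the shadow host.
function serializeWithShadow(node) {
  // Same idea as the Document check quoted above: start at the root element.
  if (node.nodeType === DOCUMENT_NODE) {
    return serializeWithShadow(node.documentElement);
  }
  if (node.nodeType === TEXT_NODE) {
    return escapeText(node.nodeValue);
  }
  if (node.nodeType !== ELEMENT_NODE) {
    return ""; // comments, processing instructions, etc.
  }

  const tag = node.localName;
  if (SKIPPED_TAGS.has(tag)) {
    return "";
  }

  const attrs = Array.from(node.attributes || [])
    .map((a) => ` ${a.name}="${escapeAttr(a.value)}"`)
    .join("");
  if (VOID_TAGS.has(tag)) {
    return `<${tag}${attrs}>`;
  }

  // If the element hosts an open shadow root, serialize that instead
  // of the light DOM children.
  const source = node.shadowRoot || node;
  const inner = Array.from(source.childNodes).map(serializeWithShadow).join("");
  return `<${tag}${attrs}>${inner}</${tag}>`;
}
```

To use something like this with browser automation, the whole snippet (constants and helpers included) would need to be injected into the page, since the function depends on definitions outside its own body. The resulting string can then be parsed into a document and handed to Readability.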
MSN article https://www.msn.com/en-us/news/world/south-korean-president-apologizes-for-declaring-martial-law-as-he-faces-impeachment-vote/ar-AA1vpHO2 simply returns null without errors.
I'm new to the library. Is this expected when an article is not readable, or is this a bug?