readability-selenium

Use Readability.js and Selenium to extract the useful text from a web page.

After trying many options, I found that approaches which avoid launching a real web browser fail on too many sites that rely heavily on JavaScript to load their content. To work on the real web, I needed to automate a real browser.

This project injects Readability.js into the browser, executes it, and fetches the results. It is essentially the same as visiting the page in Firefox and clicking the reader view button in the URL bar.
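The inject-and-execute step can be sketched roughly as follows. This is a minimal illustration, not necessarily how this repo implements it: the helper names `load_readability`, `build_script`, and `extract_article` are hypothetical, and it assumes a local copy of Readability.js whose source defines a `Readability` constructor with a `parse()` method, as Mozilla's library does.

```python
from pathlib import Path


def load_readability(path="Readability.js"):
    """Read a local copy of Readability.js (the path is an assumption)."""
    return Path(path).read_text(encoding="utf-8")


def build_script(readability_source):
    """Append a call that runs Readability on a clone of the page.

    Selenium wraps the script in a function body, so the Readability
    definition and the return statement share one scope. Parsing a clone
    keeps Readability from modifying the live DOM.
    """
    return (
        readability_source
        + "\nreturn new Readability(document.cloneNode(true)).parse();"
    )


def extract_article(driver, url, readability_source):
    """Load `url` in the browser, run Readability, and return its result.

    `driver` is any Selenium WebDriver. The result is whatever parse()
    returned, converted by Selenium: a dict with keys such as "title" and
    "textContent", or None if Readability could not parse the page.
    """
    driver.get(url)
    return driver.execute_script(build_script(readability_source))
```

Given a connected driver, `extract_article(driver, url, load_readability())["textContent"]` would hold the plain text of the article, assuming parsing succeeded.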

To run the example, just clone the repo, place your own copy of Readability.js alongside example.py, and run:

    python example.py https://github.com/mattblaha/readability-selenium

The simplest way I know of to set up a Selenium server that works with the example is with Docker (or Podman):

    docker run -p 4444:4444 -v /dev/shm:/dev/shm selenium/standalone-chrome:3.141.59
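Once the container is running, a Python client can connect to it along these lines. This is a sketch under assumptions: the helper names `server_url` and `connect` are hypothetical, the `/wd/hub` path is the endpoint Selenium 3 standalone servers expose, and the installed `selenium` package is assumed to be compatible with that server version.

```python
def server_url(host="localhost", port=4444):
    """Build the WebDriver endpoint for a Selenium 3 standalone server."""
    return f"http://{host}:{port}/wd/hub"


def connect(url=None):
    """Open a remote Chrome session on the Selenium server.

    The import is local so this module loads even when selenium is not
    installed; calling connect() does require it.
    """
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    return webdriver.Remote(
        command_executor=url or server_url(),
        options=options,
    )


# Typical usage (requires the container above to be running):
#     driver = connect()
#     try:
#         driver.get("https://github.com/mattblaha/readability-selenium")
#         print(driver.title)
#     finally:
#         driver.quit()  # always release the browser session
```

Calling `driver.quit()` in a `finally` block matters here because the standalone container only allows a limited number of concurrent sessions, and abandoned sessions can block later runs.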
