[html2text]

html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).

This code is based on Aaron Swartz's html2text.py, which you can find at https://github.com/aaronsw/html2text. Aaron Swartz is a great programmer, and I admire him a lot.

The input is a root directory. In this version you have to set it by editing the code. The program recursively finds every index.html under the given root directory. Then, using Aaron Swartz's html2text.py, I get the plain text and store it in index.txt, in the same location as index.html. I ignore links and images contained in the HTML file by passing arguments to html2text.py.
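The traversal described above can be sketched as follows. This is a minimal illustration, not the code in main.py: it assumes Python 3, the function names `find_index_files` and `convert_tree` are hypothetical, and the HTML-to-text step is passed in as a callable so the sketch does not depend on any particular html2text version.

```python
import os


def find_index_files(root):
    """Yield the path of every index.html under root, recursively."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if "index.html" in filenames:
            yield os.path.join(dirpath, "index.html")


def convert_tree(root, convert):
    """Write an index.txt next to each index.html found under root.

    `convert` is any callable that maps an HTML string to plain text
    (hypothetical parameter, introduced for this sketch).
    Returns the list of index.txt paths that were written.
    """
    written = []
    for html_path in find_index_files(root):
        # Undecodable bytes are dropped, mirroring the "ignore the error" approach.
        with open(html_path, encoding="utf-8", errors="ignore") as f:
            html = f.read()
        txt_path = os.path.join(os.path.dirname(html_path), "index.txt")
        with open(txt_path, "w", encoding="utf-8") as f:
            f.write(convert(html))
        written.append(txt_path)
    return written
```

With the html2text package from PyPI, a converter could be built from an `html2text.HTML2Text()` instance with `ignore_links` and `ignore_images` set to `True`, passing its `handle` method as `convert`. The html2text.py bundled in this repository may expose these options differently.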

To delete short sentences, I use nltk to tokenize the text, so you have to install nltk. After that, you need to download the English Punkt tokenizer data, following this Stack Overflow answer: http://stackoverflow.com/questions/26570944/resource-utokenizers-punkt-english-pickle-not-found
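A rough sketch of the short-sentence filter is shown below. The function name `drop_short_sentences`, the `min_words` threshold of 5, and the whitespace-based word count are all assumptions made for illustration; the actual rule used in main.py may differ. The tokenizer is injectable so the function can be tried without the nltk data installed.

```python
def drop_short_sentences(text, min_words=5, sent_tokenize=None):
    """Return text with sentences shorter than min_words removed.

    min_words is a hypothetical threshold chosen for this sketch.
    """
    if sent_tokenize is None:
        # Imported lazily; requires nltk and its Punkt tokenizer data.
        import nltk
        sent_tokenize = nltk.sent_tokenize
    # Count words by splitting on whitespace (a simplification).
    kept = [s for s in sent_tokenize(text) if len(s.split()) >= min_words]
    return " ".join(kept)
```

Called with only `text`, the function falls back to `nltk.sent_tokenize`, which fails with a resource lookup error if the tokenizer data has not been downloaded.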

Usage: python main.py

One important thing to mention: if you don't have the chardet package installed, you should install it. It's easy to install with pip:

pip install chardet

Test files are in the test directory.

TODO: I've found some UnicodeDecodeError cases in practical use. In this version I just ignore the error.
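One way to reduce these errors, sketched here under assumptions, is to guess the encoding with chardet before decoding and to drop any bytes that still cannot be decoded. The helper name `decode_html` and the UTF-8 fallback are hypothetical, and this is not necessarily how html2text.py or main.py uses chardet.

```python
def decode_html(raw, fallback="utf-8"):
    """Decode raw bytes to str, guessing the encoding when chardet is available."""
    try:
        import chardet
        # chardet.detect returns a dict; "encoding" may be None for unclear input.
        guess = chardet.detect(raw)["encoding"]
    except ImportError:
        guess = None
    try:
        # errors="ignore" silently drops bytes that do not fit the encoding.
        return raw.decode(guess or fallback, errors="ignore")
    except LookupError:
        # chardet can name an encoding Python has no codec for.
        return raw.decode(fallback, errors="ignore")
```

Dropping undecodable bytes keeps the run going but can lose characters, so logging the affected file paths may be a useful next step.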

UPDATE:
