This is an experiment on using a CRF (conditional random field) for article content extraction. When you are trying to get clean data from a website, the extraction step usually gets in the way. For example, I want to extract news data from certain online media. It has reached the point where I need automatic content extraction instead of defining an XPath for every website out there. That is the goal of this tool: to extract article content easily, with minimal errors.
Install all the needed requirements first.
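For example, assuming the repository ships a `requirements.txt` (an assumption; check the repo for the actual dependency list):

```
pip install -r requirements.txt
```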
To create a model, use this command:

```
python generate_model.py
```

It will generate the model in the `model` folder, as well as pickled training-ready data (built from the dataset) in the `pickle` folder.
To use it, run:

```
python extract.py --url some.url.com
```
Note that this is not production-ready; it needs more work before it is ready for real use.
I use CRFsuite through its Python binding (python-crfsuite) for the CRF implementation, with L-BFGS as the training algorithm. The training set is only 25 pages, the validation set 10 pages, and the test set 5 pages from websites never seen in the training data. While the dataset is really small, it has decent performance overall.
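For reference, training with python-crfsuite and L-BFGS typically looks like the sketch below. The toy data, the parameter values, and the `model/article.crfsuite` path are illustrative assumptions, not the exact contents of `generate_model.py`.

```python
import pycrfsuite

# Toy data in the shape the trainer expects: one page is one sequence of
# per-node feature dicts plus a matching sequence of labels.
X_train = [[
    {"tag": "h1", "parent_tag": "div", "word_count": 6},
    {"tag": "p", "parent_tag": "div", "word_count": 120},
    {"tag": "a", "parent_tag": "footer", "word_count": 2},
]]
y_train = [["content", "content", "ignore"]]

trainer = pycrfsuite.Trainer(algorithm="lbfgs", verbose=False)
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

# L1/L2 regularization and an iteration cap; these values are guesses.
trainer.set_params({"c1": 0.1, "c2": 0.01, "max_iterations": 200})
trainer.train("model/article.crfsuite")  # hypothetical path; folder must exist

# Tagging reuses the same feature shape on unseen nodes.
tagger = pycrfsuite.Tagger()
tagger.open("model/article.crfsuite")
print(tagger.tag([{"tag": "p", "parent_tag": "div", "word_count": 80}]))
```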
The features are: tag, parent tag, tag chain (tag and parent tag), length of the text before the node, length of the text after the node, length of the node's text content, and word count.
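As a rough illustration, a per-node feature extractor along those lines could look like this. This is a hedged sketch using lxml; the helper name `page_features` and the exact way "text before/after" is measured are my assumptions, not necessarily what the repo does.

```python
import lxml.html

def page_features(html):
    """Turn one page into a sequence of per-node feature dicts."""
    root = lxml.html.fromstring(html)
    # Keep real elements only (skip comments and processing instructions).
    nodes = [n for n in root.iter() if isinstance(n.tag, str)]
    total_len = len(root.text_content())
    features, seen = [], 0
    for node in nodes:
        text = (node.text or "").strip()
        parent = node.getparent()
        parent_tag = parent.tag if parent is not None else "root"
        features.append({
            "tag": node.tag,
            "parent_tag": parent_tag,
            "tag_chain": f"{parent_tag}>{node.tag}",  # tag and parent tag
            "len_text_before": seen,                  # text seen so far
            "len_text_after": total_len - seen - len(text),
            "len_text_content": len(text),
            "word_count": len(text.split()),
        })
        seen += len(text)
    return features
```

Sequences of these dicts can be fed directly to the trainer and tagger shown above.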
Compared to the similar CRF experiment in Victor: the Web-Page Cleaning Tool, this one has greater performance (based on precision and recall) and fewer features, which makes it more general (the test data contains 4 different languages). But since the dataset here is really small, I can't guarantee the comparison holds.

| Metric | % |
|---|---|
| Precision | 96 |
| Recall | 86 |
| F1 | 91 |

| actual \ predicted | content | ignore |
|---|---|---|
| content | 97 | 16 |
| ignore | 4 | 9419 |

| Metric | % |
|---|---|
| Precision | 91 |
| Recall | 93 |
| F1 | 92 |

| actual \ predicted | content | ignore |
|---|---|---|
| content | 71 | 5 |
| ignore | 7 | 2640 |