DiscoverPagination

A Python package for discovering numbered page boundaries in documents.

Repository

https://github.com/wharton/DiscoverPagination

Background

In the Research and Analytics Department we are asked to handle several kinds of text-processing assignments. These usually take the form of "please extract the X section from Y document type 10k times." Some of these documents have a Table of Contents, but it is difficult to use the ToC because we do not know where each page begins and ends in the text.

This package is designed to discover where pages are marked, and then use those page numbers to retrieve sections of text. Much of the work we do involves SEC filings, which use an XML-like markup. The package is optimized for that type of document, but should also work for other documents that meet the assumptions listed under Methods.

Requirements

Python 3.6
fuzzywuzzy: Fuzzy matching
python-Levenshtein: Speeds up fuzzy matching library

Quickstart

Install

$ pip install DiscoverPagination

Usage

$ python
>>> from discoverpagination import PaginatedDocument
>>> with open('./example_texts/0001193125-08-010038.txt') as inputfile:
...     doc = PaginatedDocument(inputfile, clean_xml=True)
...
>>> pages = doc[20:22]
>>> print(pages)
[' <P><FONT>19 </FONT></P>\n', '\n', '\n', '<p>\n', '<HR>\n', '\n', ' <P><FONT>The ...

Methods

The way the pages are discovered takes several steps and relies on a few assumptions.

Assumptions

Pages are marked
Page markings are in sequential order
Page markings use numeric characters
Page numbers appear at the end of each page
Page numbers are not mixed with other text on the same line. (The package attempts to handle this case.)
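
To make the assumptions above concrete, here is a minimal sketch of a check for whether a single line looks like a page marker. It is an illustration only, not the package's code: the names `TAG_RE` and `marker_number` are hypothetical, and the rule "strip the tags, then require the line to be nothing but digits" is an assumed reading of "numeric characters" and "not mixed with text".

```python
import re

# Hypothetical helper, not part of the DiscoverPagination API.
TAG_RE = re.compile(r"<[^>]+>")  # matches one XML/HTML-style tag


def marker_number(line):
    """Return the page number if the line is only digits once tags are removed, else None."""
    text = TAG_RE.sub("", line).strip()
    if re.fullmatch(r"[0-9]+", text):
        return int(text)
    return None


print(marker_number(" <P><FONT>19 </FONT></P>\n"))       # 19
print(marker_number("<P>See page 19 for details</P>\n"))  # None
```

A line such as "See page 19 for details" is rejected because the number is mixed with text, which is the case the last assumption warns about.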

Steps

The document is read from file.
(Optional) XML documents are cleaned of tag attributes.
The document is scanned line by line for page markers, starting with "1" (configurable).
As each number is found, its line index and text are stored in a dict keyed by page number.
The expected page number is incremented after each match, and scanning continues until no document lines remain.
The document is rescanned in reverse order to find page markers.
Page markers found at the same or nearby positions are kept.
A common "best_match" format is determined by ranking each type of marker line.
Missing page numbers are searched for with this "best_match" format in the areas where they should be. E.g., a missing page 5 is searched for between pages 4 and 6 using the best pattern.
If pages are still missing, fuzzy matching is used to guess their markers based on placement and pattern.
The document is returned and can be referenced by slicing: doc[10:12] gets the lines for pages 10 to 12.
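
The forward scan, reverse scan, and agreement steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the package's implementation: all function names are hypothetical, a marker is assumed to be a line that contains only digits once tags are stripped, the reverse scan is assumed to start from the highest page the forward scan found, and `tolerance` is an invented parameter standing in for "nearby".

```python
import re

# Hypothetical helpers for illustration; none of these names come from the package.
TAG_RE = re.compile(r"<[^>]+>")


def line_number(line):
    """Return the integer on a digits-only line (after stripping tags), else None."""
    text = TAG_RE.sub("", line).strip()
    return int(text) if re.fullmatch(r"[0-9]+", text) else None


def forward_scan(lines, start=1):
    """Scan top to bottom, recording the line index of each expected page number."""
    found, expected = {}, start
    for index, line in enumerate(lines):
        if line_number(line) == expected:
            found[expected] = index
            expected += 1
    return found


def reverse_scan(lines, last_page, start=1):
    """Scan bottom to top, counting down from last_page (an assumed starting point)."""
    found, expected = {}, last_page
    for index in range(len(lines) - 1, -1, -1):
        if expected < start:
            break
        if line_number(lines[index]) == expected:
            found[expected] = index
            expected -= 1
    return found


def agreed_markers(lines, start=1, tolerance=0):
    """Keep pages whose forward and reverse positions differ by at most tolerance lines."""
    forward = forward_scan(lines, start)
    if not forward:
        return {}
    backward = reverse_scan(lines, max(forward), start)
    return {
        page: index
        for page, index in forward.items()
        if page in backward and abs(backward[page] - index) <= tolerance
    }


sample = [
    "text a",
    "<P>1</P>",
    "2",          # stray number, e.g. a table cell
    "text b",
    "<P>2</P>",   # the real marker for page 2
    "text c",
    "<P>3</P>",
]
print(forward_scan(sample))    # {1: 1, 2: 2, 3: 6}
print(agreed_markers(sample))  # {1: 1, 3: 6}
```

In the sample, the forward scan is fooled by the stray "2" while the reverse scan finds the real marker, so page 2 is dropped from the agreed set. In the steps above, gaps like this are what the later "best_match" and fuzzy-matching passes try to fill.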

Tests

$ python setup.py test

Reference

fuzzywuzzy
python-Levenshtein
SEC EDGAR

Contributors

Douglas H. King

License

MIT
