This project aims to independently source XBRL and PDF data from the Companies House accounts website and produce a merged, processed data set for build and archive. The high-level flow for each process is as follows:
- Web-scrape all XBRL data from the Companies House accounts website.
- Unpack the XBRL files.
- Process and parse the XML/HTML data (using BeautifulSoup), converting the files into their CSV equivalents.
- Append the XBRL CSV files on an annual basis.
- Convert the XBRL melt (long-format) tables to pivot tables.
- Filter the CSV files to produce subsets of the XBRL tags that are required by various internal ONS stakeholders.
- Web-scrape filed accounts from Companies House as PDF data.
- Convert each page of each PDF into a separate image.
- Create a Cascade Classifier model.
- Apply the Classifier to identify and extract the cover page and the balance sheet from the converted filed accounts data.
- Implement Classifier performance metrics to determine the accuracy and precision of the Classifier.
- Apply Optical Character Recognition (OCR) to the images that were classified in step 5 to convert them into text data.
- Apply Natural Language Processing (NLP) to the text data extracted in step 7 to extract patterns from the raw text.
- Merge the processed XBRL data from step 2 with the data generated in step 8.
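The melt-to-pivot and tag-filtering steps in the XBRL flow can be illustrated with a minimal, standard-library-only sketch. It is not the project's implementation: the column names (`company_id`, `tag`, `value`), the tag names, and the helper functions `melt_to_pivot` and `filter_tags` are all hypothetical and chosen purely for illustration. pandas provides `DataFrame.pivot` for the same reshape on larger data.

```python
from collections import defaultdict

# Hypothetical long-format ("melt") rows: one row per company, tag and value.
# Column and tag names are illustrative only, not the project's actual schema.
melt_rows = [
    {"company_id": "00000001", "tag": "CashBankInHand", "value": "1200"},
    {"company_id": "00000001", "tag": "NetAssets", "value": "5300"},
    {"company_id": "00000002", "tag": "CashBankInHand", "value": "80"},
]


def melt_to_pivot(rows, index="company_id", column="tag", value="value"):
    """Reshape long-format rows into one wide record per index value."""
    wide = defaultdict(dict)
    for row in rows:
        # A repeated (index, column) pair is overwritten by the later row.
        wide[row[index]][row[column]] = row[value]
    # Collect every tag seen so that each output record has the same columns.
    columns = sorted({row[column] for row in rows})
    return [
        {index: key, **{col: cells.get(col) for col in columns}}
        for key, cells in sorted(wide.items())
    ]


def filter_tags(pivot_rows, wanted, index="company_id"):
    """Keep only the index column and the requested tag columns."""
    return [
        {k: v for k, v in row.items() if k == index or k in wanted}
        for row in pivot_rows
    ]


pivot = melt_to_pivot(melt_rows)
subset = filter_tags(pivot, wanted={"NetAssets"})
```

Here `pivot` holds one record per company with a column for every tag (missing tags become `None`), and `subset` keeps only the tags a given stakeholder asked for.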
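For the Classifier performance metrics step, the sketch below shows how accuracy and precision can be computed from true and predicted labels. It uses the standard definitions only and is not the project's `BinaryClassifierMetrics` code. The function names and the example labels are assumptions made for illustration, with `1` standing for "page is a balance sheet" and `0` for "page is not".

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and not actual:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Share of all pages that were labelled correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def precision(y_true, y_pred):
    """Share of pages flagged as positive that really are positive."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical labels for eight pages.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

acc = accuracy(y_true, y_pred)
prec = precision(y_true, y_pred)
```

Accuracy alone can be misleading when most pages are negatives (a filed account has many pages but only one balance sheet), which is why precision is reported alongside it.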
The web scraping in this project is done using Scrapy and strictly adheres to the ONS Web Scraping Policy. For further information on Scrapy please see the following links:
Note that the deployment of 'spiders' in this implementation of Scrapy is automated, so this document does not cover how to initialise them.
Use the package manager pip to install this project's required modules and dependencies.
pip3 install {module}
Load the main pipeline and call the subsidiary modules.
from src.data_processing.cst_data_processing import DataProcessing
from src.classifier.cst_classifier import Classifier
from src.performance_metrics.binary_classifier_metrics import BinaryClassifierMetrics
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.