Companies House Accounts

This project aims to independently source XBRL and PDF data from the Companies House accounts website, and produce a merged data set of processed data for build and archive. The high-level flow for each process is as follows:

XBRL

1. Web-scrape all XBRL data from the Companies House accounts website.
2. Unpack the XBRL files.
3. Process and parse the XML data using the HTML parsing method (BeautifulSoup), converting the files into their CSV equivalents.
4. Append the XBRL CSV files on an annual basis.
5. Convert the XBRL melt tables to pivot tables.
6. Filter the CSV files to produce the subsets of XBRL tags that are required by various internal ONS stakeholders.
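
The melt-to-pivot and tag-filtering steps can be sketched as follows. This is only an illustration using the standard library, not the project's own implementation: the column names (`doc_id`, `tag`, `value`), the sample rows, and the helper functions `pivot_melted` and `filter_tags` are all assumptions made for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical long-format ("melted") data: one row per (document, tag, value).
MELTED_CSV = """doc_id,tag,value
A001,Turnover,1200
A001,NetAssets,300
A002,Turnover,950
A002,Employees,12
"""


def pivot_melted(rows, index="doc_id", column="tag", value="value"):
    """Turn long-format rows into one wide dict of tag -> value per document."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row[index]][row[column]] = row[value]
    return dict(wide)


def filter_tags(wide, wanted_tags):
    """Keep only the requested tags; tags a document lacks become empty strings."""
    return {
        doc: {tag: values.get(tag, "") for tag in wanted_tags}
        for doc, values in wide.items()
    }


rows = list(csv.DictReader(io.StringIO(MELTED_CSV)))
wide = pivot_melted(rows)
subset = filter_tags(wide, ["Turnover", "Employees"])
```

Here `wide` holds one record per document with every tag as a column, and `subset` keeps only the tags a given stakeholder asked for. For large annual files a dataframe library would likely be a better fit than plain dictionaries.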

PDF

1. Web-scrape filed accounts from Companies House as PDF data.
2. Convert each page of the PDF into a separate image.
3. Create a Cascade Classifier model.
4. Apply the Classifier to identify and extract the cover page and the balance sheet from the converted filed accounts data.
5. Implement Classifier performance metrics to determine the accuracy and precision of the Classifier.
6. Apply Optical Character Recognition (OCR) to the images classified in step 4 to convert them into text data.
7. Apply Natural Language Processing (NLP) to the text data extracted in step 6 to extract patterns from the raw text.
8. Merge the processed XBRL data with the data generated in step 7.
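
As an illustration of the performance-metrics step, the sketch below computes accuracy and precision for a binary classifier from true and predicted labels. It is a minimal example under stated assumptions: the function name `binary_metrics`, the 0/1 label encoding, and the sample labels are hypothetical, and it is not the interface of the project's `BinaryClassifierMetrics` class.

```python
def binary_metrics(y_true, y_pred):
    """Return confusion counts, accuracy and precision for 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = len(pairs)
    # Accuracy: share of all pages labelled correctly.
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: share of pages flagged as positive that really are positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "precision": precision}


# Hypothetical labels: 1 = page is a balance sheet, 0 = page is not.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

Reporting the raw confusion counts alongside the ratios makes it easier to see whether errors come from missed pages or from false detections.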

Webscraping Policy

The web scraping in this project is done using Scrapy and strictly adheres to the ONS Web Scraping Policy. For further information on Scrapy, please see the Scrapy documentation.

Note that the deployment of 'spiders' in this implementation of Scrapy is automated, so we do not cover how to initialise them.

Installation

Use the package manager pip to install this project's required modules and dependencies.

pip3 install {module}

Usage

Load the main pipeline and call the subsidiary modules.

from src.data_processing.cst_data_processing import DataProcessing
from src.classifier.cst_classifier import Classifier
from src.performance_metrics.binary_classifier_metrics import BinaryClassifierMetrics

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT
