CEF Nominal Rolls

Thanks to the members of the CEFSG, the Nominal Rolls of the Canadian Expeditionary Force are available in digital format for researchers to peruse. Unfortunately, unlike the Australian Nominal Rolls, they are images, and thus not easily searchable or readable by computer.

This repository contains code for processing and analyzing the Nominal Rolls. If everything goes according to plan, it will also become a simpler way to browse or search them.

Note: Actual Nominal Rolls are not contained in this repository because they would far exceed GitHub's hosting quota. Please see the CEFSG site for information on obtaining the PDFs.

Requirements

Nominal Rolls are stored locally using git annex but, because of their size, are not available here. If you wish to analyze them, you will need to download them yourself from the CEFSG site mentioned above.

Scripts are written in Python and may require the following:

Most of these can be installed via your package manager or Python distribution.

Available Scripts

abbyy2pdf.py converts an ABBYY XML file into a PDF. This script is merely a test to see what sort of results the OCR process produced. It is not intended to produce a professional PDF. Characters may be the wrong size or off the baseline, but the output should give a general idea of how well the OCR worked.
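To give a feel for the input, here is a minimal sketch of reading character boxes out of ABBYY-style XML with the standard library. It is not the code in abbyy2pdf.py. The sample document, its placeholder namespace, and the names `SAMPLE`, `local_name`, and `read_char_boxes` are all assumptions made for illustration; the sketch only relies on each recognized character being a `charParams` element with `l`, `t`, `r`, `b` pixel attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample. Real exports are much larger and use
# ABBYY's own schema namespace rather than this placeholder.
SAMPLE = """<document xmlns="urn:example:abbyy-sample">
  <page width="2000" height="3000">
    <block blockType="Text">
      <text><par><line>
        <formatting>
          <charParams l="100" t="200" r="120" b="230">A</charParams>
          <charParams l="122" t="200" r="140" b="230">B</charParams>
        </formatting>
      </line></par></text>
    </block>
  </page>
</document>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_char_boxes(xml_text):
    """Return one dict per recognized character with its bounding box."""
    root = ET.fromstring(xml_text)
    boxes = []
    for elem in root.iter():
        if local_name(elem.tag) != "charParams":
            continue
        boxes.append({
            "char": elem.text or " ",  # empty elements are treated as spaces
            "left": int(elem.attrib["l"]),
            "top": int(elem.attrib["t"]),
            "right": int(elem.attrib["r"]),
            "bottom": int(elem.attrib["b"]),
        })
    return boxes


boxes = read_char_boxes(SAMPLE)
for box in boxes:
    print(box["char"], box["left"], box["top"], box["right"], box["bottom"])
```

Matching on the local tag name keeps the sketch independent of whichever schema version a given export declares.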

abbyy2csv.py converts an ABBYY XML file into a CSV. This script uses the clustering algorithms of scikit-learn to partition the text data into rows and columns, and produces a tabular CSV result. The default parameters will likely produce terrible results, though, so expect to tune them for each roll.
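The row-and-column idea can be sketched as follows. This is not the algorithm in abbyy2csv.py: to stay runnable with the standard library alone, it swaps scikit-learn's clustering for a plain gap threshold on sorted coordinates. The word list is made-up sample data, and `group_by_gap`, `words_to_table`, `row_gap`, and `col_gap` are hypothetical names and parameters.

```python
import csv
import io


def group_by_gap(values, max_gap):
    """Label each value; sorted neighbours within max_gap share a label."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    label = 0
    for pos, idx in enumerate(order):
        if pos > 0 and values[idx] - values[order[pos - 1]] > max_gap:
            label += 1  # a large jump starts a new row or column
        labels[idx] = label
    return labels


def words_to_table(words, row_gap=15, col_gap=80):
    """Place each word in a grid cell chosen by its top and left coordinates."""
    if not words:
        return []
    rows = group_by_gap([w["top"] for w in words], row_gap)
    cols = group_by_gap([w["left"] for w in words], col_gap)
    table = [[""] * (max(cols) + 1) for _ in range(max(rows) + 1)]
    # Visit words left to right so text within a cell keeps reading order.
    for word, r, c in sorted(zip(words, rows, cols), key=lambda t: t[0]["left"]):
        table[r][c] = (table[r][c] + " " + word["text"]).strip()
    return table


# Made-up sample: two slightly skewed lines of a three-column page.
words = [
    {"text": "12345", "left": 100, "top": 200},
    {"text": "Private", "left": 300, "top": 203},
    {"text": "Smith,", "left": 520, "top": 198},
    {"text": "John", "left": 590, "top": 199},
    {"text": "12346", "left": 102, "top": 250},
    {"text": "Sergeant", "left": 298, "top": 252},
    {"text": "Jones,", "left": 518, "top": 251},
    {"text": "Alfred", "left": 588, "top": 249},
]

table = words_to_table(words)
buffer = io.StringIO()
csv.writer(buffer).writerows(table)
print(buffer.getvalue())
```

The two gap thresholds play the role of clustering parameters: if `col_gap` is too small, a surname and given name split into separate columns, and if it is too large, neighbouring columns merge. That sensitivity is one reason untuned defaults can produce poor tables.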

csv2web.py converts the CSV files into a format suitable for publishing on GitHub.

Licensing

Scripts are licensed under the terms of the GPL v3. The CSV database is made available under the Open Database License, whose full text can be found at http://opendatacommons.org/licenses/odbl/ or here. Any rights in individual contents of the database are licensed under the Database Contents License, whose text can be found at http://opendatacommons.org/licenses/dbcl/ or here.