The JRIMS CAL Extractor is a Python script created in Visual Studio Code using the PyPDF2 and Openpyxl libraries to automate extraction of CAL alignments values from pre-existing JRIMS documents.
One of the gaps in DHS's ability to properly assess the impacts of new JRIMS documents is in identifying potential overlap with capability needs identified in previous JRIMS documents. To address this issue, CAL alignment values must be efficiently extracted from these aforementioned documents. Although initial attempts involved manual extraction of the data, the need to expedite the process soon became apparent. With 700+ JRIMS documents left to extract capability data from, this Python-based solution that automated CAL extraction was created by Justin Oh, a CTOD intern.
Throughout these instructions, Anaconda Navigator will be used due to its compatibility with DHS laptops. Many commonly used Python libraries are pre-installed with Anaconda, although PyPDF2 is a noteworthy exception. All of the following directions are possible through the conventional method of downloading Python libraries (PyPDF2, openpyxl) through PIP and running the script through Visual Studio Code's terminal.
Above: Several packages available through the Anaconda Navigator.
1. Download the latest version of the JRIMS CAL Extractor and store them in the same file directory as the JRIMS PDF documents that require scanning. If available, store an Excel file of official CAL alignment values in this directory as well.
Above: The Python script, CAL Alignments Excel file and three JRIMS documents are stored in the same directory.
Above: Revised code containing the details needed for the scanning process of three JRIMS documents.
3. After saving the revised Python file, open the Anaconda Command Prompt. Change the directory using the "cd" command to the location of the files.
Above: Anaconda Command Prompt in action.
4. Finally, run the command "python [file name].py" to extract the CAL alignment values for each JRIMS document.
Above: CAL alignment values and their corresponding descriptions are outputted per file. Data is censored due to sensitive nature.
Before widespread usage, the Python script must address an area of concern presented in its output accuracy and also adopt necessary changes to allow outputs to be automatically stored on a separate database for further analysis.
The Python script looks for series of four numbers (separated by periods) with similar configurations as CAL alignment values. However, "Table of Contents" values may sometimes qualify under the criteria of the code and be unintentionally outputted. Although the script cross-references an Excel file with official CAL values, there is still a considerable margin of error that must be reduced. A potential idea is to exclude pages with excessive values from the scan to mitigate the "Table of Contents" issue.
Storing the outputs into databases such as Mobius is the end goal of this automation project but is by far the most important future objective. If achieved, the potential visualizations that could be created out of the stored data would be incredibly valuable to DHS.