Auto-CORPus pipeline developed by a University of Leicester and Imperial College London collaboration to standardize text and table data extracted from full text publications.


tb143/Auto-CORPus

 
 


AutoCORPus recognises 3 types of file:

  • Full text HTML documents covering the entire article
  • HTML files which describe a single table
  • Images of tables.
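As a rough illustration of the three file types, the sketch below sorts file names into those categories. It is not the pipeline's own logic: the function name `classify_input`, the extension sets, and the reliance on a `_table_` marker in the file name (borrowed from names such as PMC1_table_1.html) are all assumptions made for the example.

```python
from pathlib import PurePosixPath

# Assumed extension sets; the extensions the pipeline really accepts may differ.
HTML_EXTENSIONS = {".html", ".htm"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def classify_input(name):
    """Return a hypothetical category label for one input file name."""
    path = PurePosixPath(name)
    extension = path.suffix.lower()  # treat ".HTML" and ".html" alike
    if extension in HTML_EXTENSIONS:
        # Assumption: table files carry "_table_" in their name.
        if "_table_" in path.stem.lower():
            return "table_html"
        return "full_text_html"
    if extension in IMAGE_EXTENSIONS:
        return "table_image"
    return "unknown"
```

Under these assumptions, `classify_input("PMC1_table_2.png")` gives `"table_image"` and `classify_input("PMC1.html")` gives `"full_text_html"`.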

If a single file path is passed, that file is processed in the most suitable manner. If a directory is passed, autoCORPus first groups the files within each directory by common elements in their file names and processes all related files together. Related files in separate directories are not processed together. Files processed together are written to the same output files. An example input and output directory is shown below:

input:

PMC1.html
PMC1_table_1.html
PMC1_table_2.png
/subdir
    PMC1_table_3.HTML
    PMC1_table_4.png

output:

PMC1_bioc.json
PMC1_abbreviations.json
PMC1_tables.json (contains tables 1 & 2 and any tables described within the main text)
/subdir
    PMC1_tables.json (contains tables 3 & 4 only)
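The per-directory grouping described above can be sketched as follows. This is a minimal illustration rather than the pipeline's implementation: the function name `group_related_files` and the rule that the shared element is everything before an optional `_table_N` suffix are assumptions based only on the example file names.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed naming rule: "<prefix>_table_<number>" marks a table file for <prefix>.
_TABLE_SUFFIX = re.compile(r"_table_\d+$", re.IGNORECASE)


def group_related_files(paths):
    """Group file paths by (directory, shared file-name prefix)."""
    groups = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        prefix = _TABLE_SUFFIX.sub("", path.stem)
        # The directory is part of the key, so files in different
        # directories never land in the same group.
        groups[(str(path.parent), prefix)].append(path.name)
    return {key: sorted(names) for key, names in groups.items()}


example_input = [
    "PMC1.html",
    "PMC1_table_1.html",
    "PMC1_table_2.png",
    "subdir/PMC1_table_3.HTML",
    "subdir/PMC1_table_4.png",
]
groups = group_related_files(example_input)
```

With the example input, this produces one group for the top-level directory (the main text plus tables 1 and 2) and a separate group for subdir (tables 3 and 4), which matches the two PMC1_tables.json files in the example output.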

$ git clone git@github.com:Tom-Shorter/autoCORPus.git

$ cd autoCORPus

$ python3 -m venv env

or, for Windows users (where v is the version of Python used):

$ py -[v] -m venv env

$ source env/bin/activate

or, for Windows users:

$ path/to/env/Scripts/activate.bat

$ pip install .

You might get the error ModuleNotFoundError: No module named 'skbuild' at this step. If you do, run

$ pip install --upgrade pip

Alternatively, you might first need to install the Microsoft Build Tools for Visual Studio (see https://www.scivision.dev/python-windows-visual-c-14-required for the minimal installation requirements needed to install the python-Levenshtein package), and then re-run

$ pip install .

Run the command below to process a single file:

$ python run_app.py -c "configs/config_pmc.json" -t "output" -f "path/to/html/file" -o JSON

Run the command below to process a directory of files:

$ python run_app.py -c "configs/config_pmc.json" -t "output" -f "path/to/directory/of/html/files" -o JSON

Available arguments:

-f (input file path) - file or directory to run autoCORPus on.

-o (output type) - either JSON or XML (defaults to JSON)

-c (config) - which config file to use
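For readers who want to see how flags like these fit together, here is a minimal argparse sketch. It is not the contents of run_app.py: the function name `build_parser` and the `dest` names are hypothetical. The `-t` flag appears in the example commands but is not described in the list above, so it is accepted here without assigning it any particular meaning.

```python
import argparse


def build_parser():
    """Build a hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(
        description="Illustrative stand-in for the autoCORPus command line."
    )
    parser.add_argument("-f", dest="filepath", required=True,
                        help="file or directory to run autoCORPus on")
    parser.add_argument("-o", dest="output_format", choices=["JSON", "XML"],
                        default="JSON", help="output type (defaults to JSON)")
    parser.add_argument("-c", dest="config", required=True,
                        help="which config file to use")
    # "-t" is used in the example commands but not documented in the list
    # above; its purpose is not assumed here.
    parser.add_argument("-t", dest="t_value", default=None,
                        help="undocumented flag from the example commands")
    return parser


# Parse the arguments from the single-file example, leaving out "-o"
# to show the JSON default.
args = build_parser().parse_args(
    ["-c", "configs/config_pmc.json", "-t", "output", "-f", "path/to/html/file"]
)
```

Restricting `-o` with `choices` means a value other than JSON or XML is rejected up front instead of failing later in processing.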
