MARC pipeline for quality assessment preparation. The purpose of this project is to provide an automatic way to convert binary MARC or MARCXML files to JSON files ready to be processed by Apache Spark. It
- transforms binary MARC files to MARCXML (with yaz-marcdump)
- normalizes the UTF-8 encoding (with uconv)
- transforms MARCXML to JSON (with Catmandu)
- reformats the JSON files
The final JSON contains one record per line, which is how Apache Spark ingests JSON files. Other differences between the JSON Catmandu produces and the JSON this project produces:
- the order of the components is the same in every record (in Librecat output the order of components varies)
- the datafield's subfield component is always an array of objects (in Librecat output it is a single object if there is only one subfield)
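The two differences above can be illustrated with a minimal Python sketch. It is not the project's formatCatmanduOutput.php script. The record shape (the keys leader, controlfield, datafield and subfield), the key order in KEY_ORDER, and the function names are assumptions made for illustration only.

```python
import json

# Assumed component order; the order the project actually uses may differ.
KEY_ORDER = ("leader", "controlfield", "datafield")


def as_list(value):
    """Wrap a single object in a list; leave lists untouched."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def normalize_record(record):
    """Return a copy with a fixed key order and list-valued subfields."""
    normalized = {}
    for key in KEY_ORDER:
        if key in record:
            normalized[key] = record[key]
    # Keep any other keys, sorted so the output order is deterministic.
    for key in sorted(record):
        if key not in normalized:
            normalized[key] = record[key]
    if "datafield" in normalized:
        fields = []
        # A single datafield object is tolerated too (an extra assumption).
        for field in as_list(normalized["datafield"]):
            field = dict(field)
            field["subfield"] = as_list(field.get("subfield"))
            fields.append(field)
        normalized["datafield"] = fields
    return normalized


def to_json_lines(records):
    """Serialize records as one JSON document per line."""
    return "\n".join(
        json.dumps(normalize_record(record), ensure_ascii=False)
        for record in records
    )
```

Writing one compact JSON document per line is what allows a line-oriented reader to treat every line as a complete record.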
- Catmandu (http://librecat.org/)
- yaz-marcdump (manual, usage examples)
- uconv (manual)
Catmandu requires a special installation; the other two tools are available as standard *nix tools.
- one-file-to-json.sh: converts XML to JSON with Catmandu
- one-json-to-formatted.sh: changes the JSON format generated by Catmandu with the formatCatmanduOutput.php script
- marc-to-xml.sh: converts binary MARC files in the marc directory to XML with yaz-marcdump, then splits the files with split-xml.php. Each new file contains at most 10,000 records.
- to-utf8.sh: converts each XML file in a directory to normalized UTF-8 with the uconv tool. The MARC to XML converters do not handle decomposed characters, so this step is needed if the accented characters in the XML remain decomposed (such as a + ¨ instead of ä). See Unicode normalization and Combining and precomposed characters.
- split-xml.sh: splits the MARCXML files in the marc directory and places the new files into splitted. The script uses split-xml.php. Each new file contains at most 10,000 records. If you start with binary MARC you don't have to apply this step, because marc-to-xml.sh already includes it.
- xml-to-json.sh: converts the XML files in the splitted directory with Catmandu. Moves the converted XML files to converted and the .json files to json/raw.
- format-json.sh: converts the .json files in json/raw into a more convenient JSON format. Saves the new files into the json/formatted directory and moves the source files into json/processed.
- marc: put the original binary MARC or MARCXML files here
- splitted: the scripts put the split XML files here temporarily
- converted: after JSON conversion, the scripts move the split XML files here
- json/raw: holds the Catmandu-generated JSON files before formatting
- json/processed: the final place of the Catmandu-generated JSON files
- json/formatted: the formatted JSON files. This is the end result of the process. If everything went correctly, you can delete the contents of the other directories.
Edit crontab with the crontab -e command and add the following line:
*/1 * * * * cd /to/working/directory && php toJsonLauncher.php >> launch-report.log
This script runs the one-file-to-json.sh script on each file listed in the to-json-setlist.txt file.
Edit crontab with the crontab -e command and add the following line:
*/1 * * * * cd /to/working/directory && php toFormattedLauncher.php >> launch-report.log
This script runs the one-json-to-formatted.sh script on each file listed in the to-formatted-setlist.txt file.
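Both launchers follow the same pattern: read a setlist file and run one script per listed file. The Python sketch below illustrates that pattern only. It is not a port of toJsonLauncher.php or toFormattedLauncher.php, whose internals are not described here. The function names and the one-file-name-per-line setlist format are assumptions, and any bookkeeping the real launchers may do (such as removing finished entries from the setlist) is not modeled.

```python
import subprocess
from pathlib import Path


def parse_setlist(text):
    """Return the non-empty, stripped lines of a setlist (assumed format)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def launch(names, script, run=subprocess.run):
    """Run `script` once per file name and collect (name, return code) pairs."""
    results = []
    for name in names:
        completed = run([script, name])
        results.append((name, completed.returncode))
    return results


if __name__ == "__main__":
    setlist = Path("to-json-setlist.txt")
    if setlist.exists():
        names = parse_setlist(setlist.read_text(encoding="utf-8"))
        for name, code in launch(names, "./one-file-to-json.sh"):
            print(name, code)
```

Passing the runner as a parameter makes it possible to try the loop with a stub before pointing it at the real scripts.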