
Toolchain: CONTENTdm newspapers

Brandon Weigel edited this page Oct 4, 2018 · 22 revisions

Overview

This toolchain creates Islandora import packages of newspaper issues by retrieving issue metadata and content files directly from CONTENTdm. This content can then be merged with master images for the newspaper pages. The resulting import packages can be ingested into Islandora using the Islandora Newspaper Batch module.

This toolchain can be configured so that all the datastreams other than page-level OBJ files are retrieved from CONTENTdm. The Islandora Newspaper Batch module will load files in the page-level directories whose names correspond to datastream IDs. This, in combination with enabling Islandora's "Defer derivative generation during ingest" setting, can speed ingestion of newspaper content substantially.

Preparing the content files

The file getter used in this toolchain (CdmNewspapers) makes two assumptions about the files that correspond to the OBJ datastream for newspaper pages (TIFF, JPEG 2000, or JPG). First, all files for a newspaper issue are within a single directory whose name is the publication date of the issue in yyyy-mm-dd format. Second, each page file is named using the issue date (again in yyyy-mm-dd format) with a page sequence number appended to it (-01, -02, etc.), like this:

├── 1930-05-01
│   ├── 1930-05-01-01.tif
│   ├── 1930-05-01-02.tif
│   ├── 1930-05-01-03.tif
│   ├── 1930-05-01-04.tif
│   ├── 1930-05-01-05.tif
│   ├── 1930-05-01-06.tif
│   ├── 1930-05-01-07.tif
│   ├── 1930-05-01-08.tif
│   ├── 1930-05-01-09.tif
│   ├── 1930-05-01-10.tif
│   ├── 1930-05-01-11.tif
│   └── 1930-05-01-12.tif
├── 1930-05-10
│   ├── 1930-05-10-01.tif
│   ├── 1930-05-10-02.tif
│   ├── 1930-05-10-03.tif
│   ├── 1930-05-10-04.tif
│   ├── 1930-05-10-05.tif
│   ├── 1930-05-10-06.tif
│   ├── 1930-05-10-07.tif
│   ├── 1930-05-10-08.tif
│   ├── 1930-05-10-09.tif
│   ├── 1930-05-10-10.tif
│   ├── 1930-05-10-11.tif
│   └── 1930-05-10-12.tif

These issue-level directories can be arranged in any order under the top-level directories specified in the [FILE_GETTER] section's input_directories setting. In other words, they do not need to follow a specific hierarchical arrangement.
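The naming convention described above can be checked before MIK is run. The following is a minimal sketch, not part of MIK: the function name check_issue_directory, the list of accepted file extensions, and the choice to accept page numbers of two or more digits are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Issue directories are named with the publication date, yyyy-mm-dd.
ISSUE_DIR = re.compile(r"^\d{4}-\d{2}-\d{2}$")
# Page files are the issue date plus a sequence number (-01, -02, ...).
# The extension list is an assumption based on "TIFF, JPEG 2000, or JPG".
PAGE_FILE = re.compile(
    r"^(\d{4}-\d{2}-\d{2})-(\d{2,})\.(tif|tiff|jp2|jpg|jpeg)$", re.IGNORECASE
)


def check_issue_directory(issue_dir):
    """Return a list of naming problems found in one issue-level directory.

    Only the shape of the names is checked, not whether the date is a
    real calendar date.
    """
    issue_dir = Path(issue_dir)
    problems = []
    if not ISSUE_DIR.match(issue_dir.name):
        problems.append(f"{issue_dir.name}: directory name is not yyyy-mm-dd")
    for page in sorted(issue_dir.iterdir()):
        if not page.is_file():
            continue
        match = PAGE_FILE.match(page.name)
        if match is None:
            problems.append(f"{page.name}: not named yyyy-mm-dd-NN with an image extension")
        elif match.group(1) != issue_dir.name:
            problems.append(f"{page.name}: date does not match directory {issue_dir.name}")
    return problems
```

An empty list means the directory follows the convention; each string in a non-empty list names one file or directory to rename.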

Preparing the configuration file

All MIK configuration files are standard INI files which contain the following sections: [SYSTEM], [CONFIG], [FETCHER], [METADATA_PARSER], [FILE_GETTER], [WRITER], [MANIPULATORS], and [LOGGING]. Entries are required unless indicated otherwise below.

Commented lines begin with a semicolon. Values that contain whitespace or special characters (equals, semicolon, etc.) should be wrapped in double quotation marks. If in doubt, use the quotation marks. The order of the sections, and of the entries within each section, does not matter.
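As a quick sanity check on a new configuration file, the section list above can be compared against the file's contents. This is a sketch only, not how MIK itself reads its configuration; the function name missing_sections is hypothetical. It scans for section headers with a regular expression rather than using Python's configparser, because repeated PHP-style keys such as datastreams[] raise a duplicate-option error in configparser's default strict mode.

```python
import re

# Section names as listed in "Preparing the configuration file".
EXPECTED_SECTIONS = [
    "SYSTEM", "CONFIG", "FETCHER", "METADATA_PARSER",
    "FILE_GETTER", "WRITER", "MANIPULATORS", "LOGGING",
]

SECTION_HEADER = re.compile(r"^\s*\[([^\]]+)\]\s*$")


def missing_sections(ini_text):
    """Return the expected section names that do not appear in ini_text."""
    found = set()
    for line in ini_text.splitlines():
        # Lines beginning with a semicolon are comments.
        if line.lstrip().startswith(";"):
            continue
        match = SECTION_HEADER.match(line)
        if match:
            found.add(match.group(1).strip())
    return [name for name in EXPECTED_SECTIONS if name not in found]
```

Calling missing_sections(open("my_config.ini").read()) on a hypothetical my_config.ini returns an empty list when every section is present.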

The SYSTEM section

This section of the configuration file sets or overrides configuration settings for PHP and the various third-party PHP components used by MIK. It can contain the following entries:

  • date_default_timezone: Optional. Provide a default timezone if date.timezone is null in the PHP INI. You will know if you need to use this setting because Monolog will throw MIK exceptions and halt MIK. Set to one of the valid PHP timezone values listed at http://php.net/manual/en/timezones.php.
  • verify_ca: Optional. OS X's default PHP configuration uses Apple's Secure Transport rather than OpenSSL, causing issues with Certificate Authority verification in Guzzle requests against websites that use HTTPS. This setting allows Guzzle to skip CA verification. You will know if you need to use this setting because Guzzle will write entries in your mik.log complaining about CA verification. Set to false to ignore CA verification.

Note: if you set verify_ca to false, MIK no longer verifies the remote website's certificate, so it cannot confirm that it is talking to the intended server and the HTTPS connection is open to interception. Use at your own risk.

Example

[SYSTEM]
date_default_timezone = 'America/Vancouver'
verify_ca = false

The CONFIG section

Key-value pairs of configuration entries in this section are simply written to the top of the log file specified in the [LOGGING] section's path_to_log setting. You can add whatever values you want, but they are static (that is, they can't be dynamically derived at runtime). Therefore, all entries in this section are optional.

Example

[CONFIG]
config_id = newspaper_test
last_updated_on = "2016-02-01"
last_update_by = "Mark Jordan"

The FETCHER section

This section of the configuration file contains the following entries:

  • class: Required. Must be 'Cdm'.
  • alias: Required. The CONTENTdm alias (collection string) for the source collection, without the leading /.
  • temp_directory: Required. Full path to the directory where the fetchers write data for use later in the toolchain.
  • ws_url: Required. The full URL to your CONTENTdm server's web services API endpoint.
  • use_cache: Optional; set to false in automated tests (in other words, you will not need to use this unless you are writing automated tests for this fetcher).
  • record_key: Required. Must be 'pointer'.

Example

[FETCHER]
class = Cdm
alias = ctimes
temp_directory = "m:\production_loads\chinesetimes\temp"
ws_url = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
record_key = pointer
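To make the role of ws_url and alias concrete, the sketch below builds a request URL of the form used by the CONTENTdm web services API, where a function name and its arguments are appended to the ?q= endpoint. It assumes the API's standard dmGetItemInfo function (alias, pointer, format); whether MIK issues exactly this request is not stated here, the helper name build_ws_request is hypothetical, and the pointer value is illustrative only. No request is actually sent.

```python
def build_ws_request(ws_url, function, *args, response_format="json"):
    """Join a web services function name and its arguments onto ws_url."""
    segments = [function, *(str(arg) for arg in args), response_format]
    return ws_url + "/".join(segments)


# ws_url and alias are taken from the [FETCHER] example; the pointer is illustrative.
WS_URL = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
item_info_url = build_ws_request(WS_URL, "dmGetItemInfo", "ctimes", 17583)
print(item_info_url)
```

This shows why alias is given without the leading /: it is used as one path segment between the function name and the pointer.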

The METADATA_PARSER section

This section of the CONTENTdm Newspapers toolchain's configuration file contains the following entries:

  • class: Required. Must be 'mods\CdmToMods' or 'templated\Templated'. Use the former if simple source field-to-MODS-element mappings are sufficient for your needs, the latter if your source metadata requires complex logic to be converted to MODS.
  • alias: Required. CONTENTdm alias (collection string) for the source collection, without the leading /.
  • ws_url: Required. The full URL to your CONTENTdm server's web services API endpoint.
  • mapping_csv_path: The path, either full or relative to the mik script, where the metadata mappings file is located.
  • include_migrated_from_uri: Required. If set, adds an <identifier> element to the object's MODS XML that indicates the source object's reference URL in CONTENTdm. For example: <identifier type="uri" invalid="yes" displayLabel="Migrated From">http://content.lib.sfu.ca/cdm/ref/collection/CT_1930-34/id/17583</identifier>.
    • To set: Use your CONTENTdm instance's base URL for browsing objects. In the above example, you would enter include_migrated_from_uri = 'http://content.lib.sfu.ca/cdm/ref/collection/'.
    • To skip creating this element, leave the value empty: include_migrated_from_uri =.
  • repeatable_wrapper_elements: Optional. By default MIK reduces repeated top-level wrapper MODS elements (same element name with the same attributes) down to a single instance of the element. This setting lets you indicate which elements may be repeated (i.e., occur more than once) in your MODS. The most common use for this setting is to allow repeated <extension> elements.

Example

[METADATA_PARSER]
class = mods\CdmToMods
alias = ctimes
ws_url = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
; Path to the csv file that contains the CONTENTdm to MODS mappings.
mapping_csv_path = 'extras/sfu/mappings_files/chinesetimes_mappings.csv'
; Include the migrated from uri into your generated metadata (e.g., MODS)
include_migrated_from_uri = 'http://content.lib.sfu.ca/cdm/ref/collection/'
repeatable_wrapper_elements[] = extension
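The relationship between the include_migrated_from_uri base URL and the resulting <identifier> element can be sketched as follows. This is an illustration inferred from the sample identifier shown in the entry list, not MIK's actual code: the function name migrated_from_identifier is hypothetical, and the assumption is that the reference URL is the base URL followed by the collection alias, /id/, and the object's pointer.

```python
import xml.etree.ElementTree as ET


def migrated_from_identifier(base_url, alias, pointer):
    """Build a MODS-style <identifier> holding the CONTENTdm reference URL."""
    element = ET.Element(
        "identifier",
        {"type": "uri", "invalid": "yes", "displayLabel": "Migrated From"},
    )
    # Assumed pattern: <base_url><alias>/id/<pointer>
    element.text = f"{base_url}{alias}/id/{pointer}"
    return element


identifier = migrated_from_identifier(
    "http://content.lib.sfu.ca/cdm/ref/collection/", "CT_1930-34", 17583
)
print(ET.tostring(identifier, encoding="unicode"))
```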

The FILE_GETTER section

This section of the CONTENTdm Newspapers toolchain's configuration file contains the following entries:

  • class: Required. Must be 'CdmNewspapers'.
  • input_directories: Required. A multivalued list of full paths to the directories where the content files are located. The files should be named as described in the "Preparing the content files" section above. Note that the more specific the paths in this option, the faster MIK will be; using multiple specific paths instead of a single high-level directory path is recommended. You can also use the FilterCdmNewspaperMasterPaths file getter manipulator to help specify the set of paths in which MIK looks for master files.
  • temp_directory: Required. Full path to the directory where the file getter will write data for use later in the toolchain. Can be the same as the temp_directory value used in the [FETCHER] section.
  • ws_url: Required. The full URL to your CONTENTdm server's web services API endpoint.
  • utils_url: Required. The full URL to your CONTENTdm server's web "utilities" directory. More information is available in the API entries listed under the "CONTENTdm Website API Reference — utils" section of the CONTENTdm API documentation.

Example

[FILE_GETTER]
class = CdmNewspapers
input_directories[] = "z:\Chinese Times"
input_directories[] = "z:\Chinese Times\1987"
input_directories[] = "z:\Chinese Times\1988"
input_directories[] = "z:\Chinese Times\1989"
alias = ctimes
ws_url = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
utils_url = "http://content.lib.sfu.ca/utils/"

The WRITER section

This section of the CONTENTdm Newspapers toolchain's configuration file contains the following entries:

  • class: Required. Must be 'CdmNewspapers'.
  • output_directory: Required. The full path to the directory where output packages are written.
  • metadata_filename: Required. Must be 'MODS.xml'.
  • postwritehooks: Optional. A multivalued list of post-write hook scripts. Values have two parts, the full path to the PHP, Python, or shell executable, and the full path to the script itself.
  • datastreams: Optional. A multivalued list of datastream files that you want MIK to create. If included, only the indicated datastream files are generated; if not included, MIK creates all the files that the various file getter, metadata parser, and writer classes used in the toolchain can create. Most useful for testing metadata generation: for example, datastreams[] = "MODS" tells MIK to generate only a MODS.xml file for each object.

Example

[WRITER]
class = CdmNewspapers
alias = ctimes
output_directory = "m:\production_loads\chinesetimes"
metadata_filename = 'MODS.xml'
postwritehooks[] = "php extras/scripts/postwritehooks/validate_mods.php"
postwritehooks[] = "php extras/scripts/postwritehooks/generate_fits.php"
postwritehooks[] = "php extras/scripts/postwritehooks/object_timer.php"
datastreams[] = MODS
datastreams[] = OBJ
; datastreams[] = JP2
; datastreams[] = TN
; datastreams[] = JPG
; datastreams[] = OCR

The MANIPULATORS section

This section of the CONTENTdm Newspapers toolchain's configuration file defines which manipulators should be used. Multiple manipulators can be defined for each type (fetchermanipulators, filegettermanipulators, metadatamanipulators) as illustrated below. The value of each entry is the manipulator class name plus any pipe-separated parameters that the manipulator may require. Entries in this section are optional.

Example

[MANIPULATORS]
; fetchermanipulators[] = "RandomSet|2"
fetchermanipulators[] = "SpecificSet|configs\ctimes_pointers.txt"
metadatamanipulators[] = "FilterModsTopic|subject"
metadatamanipulators[] = "AddContentdmData"
metadatamanipulators[] = "AddUuidToMods"
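Each manipulator value is a class name followed by zero or more parameters, with the pipe character (|) as the separator. The sketch below splits an entry into those parts; the function name parse_manipulator_entry is hypothetical and this is not MIK's own parsing code.

```python
def parse_manipulator_entry(entry):
    """Split 'ClassName|param1|param2' into (class_name, [params])."""
    class_name, *params = entry.split("|")
    return class_name, params


# Entries taken from the [MANIPULATORS] example above.
print(parse_manipulator_entry(r"SpecificSet|configs\ctimes_pointers.txt"))
print(parse_manipulator_entry("RandomSet|2"))
print(parse_manipulator_entry("AddContentdmData"))
```

Parameters come back as strings, so a numeric parameter such as the 2 in RandomSet|2 would need converting by whatever consumes it.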

Manipulators that you may find useful with this toolchain include:

The CdmNewspapers writer creates a basic MODS file for each newspaper page, like this:

<?xml version="1.0"?>
<mods xmlns="http://www.loc.gov/mods/v3" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
  <titleInfo>
    <title>Page 8</title>
  </titleInfo>
  <identifier type="uri" invalid="yes" displayLabel="Migrated From">http://content.lib.sfu.ca/cdm/ref/collection/CT_1930-34/id/17578</identifier>
  <identifier type="uuid">7b040613-e537-4b47-ad8f-0077cf2226c8</identifier>
</mods>

This MODS assigns the page's title and also includes any other elements added via configuration options in the .ini file, such as <identifier> elements containing the URL of the corresponding page in CONTENTdm or a UUID.
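When spot-checking output packages, it can help to read the title and identifiers back out of a page-level MODS file. The following is a minimal sketch using Python's standard library; the function name summarize_page_mods is hypothetical, and the sample document is a trimmed copy of the page MODS shown above.

```python
import xml.etree.ElementTree as ET

# The page MODS declares the MODS v3 namespace as its default namespace.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}


def summarize_page_mods(xml_text):
    """Return (title, {identifier type: identifier value}) for a page MODS."""
    root = ET.fromstring(xml_text)
    title = root.findtext("mods:titleInfo/mods:title", namespaces=MODS_NS)
    identifiers = {
        element.get("type"): element.text
        for element in root.findall("mods:identifier", MODS_NS)
    }
    return title, identifiers


SAMPLE = """<?xml version="1.0"?>
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo>
    <title>Page 8</title>
  </titleInfo>
  <identifier type="uri" invalid="yes" displayLabel="Migrated From">http://content.lib.sfu.ca/cdm/ref/collection/CT_1930-34/id/17578</identifier>
  <identifier type="uuid">7b040613-e537-4b47-ad8f-0077cf2226c8</identifier>
</mods>"""

title, identifiers = summarize_page_mods(SAMPLE)
print(title, identifiers)
```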

The LOGGING section

This section of the CONTENTdm Newspapers toolchain's configuration file contains the following entries:

  • path_to_log: Required. The full path to the standard log generated by MIK.
  • path_to_manipulator_log: Required. The full path to the log that the manipulators write status and error messages to.

Example

[LOGGING]
path_to_log = "/tmp/newspaper_output/mik.log"
path_to_manipulator_log = "/tmp/newspaper_output/manipulator.log"