Version number for the software agent in generated METS #147

Closed
kba opened this issue Jul 20, 2018 · 7 comments
@kba
Member

kba commented Jul 20, 2018

No description provided.

@wrznr
Contributor

wrznr commented Sep 12, 2018

@tboenig As our METS guru, please make a proposal here. Consider the following example from DDR presseportal:

    <mets:metsHdr RECORDSTATUS="production" LASTMODDATE="2013-01-14T08:55:09.109+01:00" CREATEDATE="2013-01-14T08:55:09.109+01:00">
        <mets:agent TYPE="ORGANIZATION" ROLE="CREATOR">
            <mets:name>Fraunhofer IAIS</mets:name>
        </mets:agent>
        <mets:agent OTHERTYPE="SOFTWARE" TYPE="OTHER" ROLE="CREATOR">
            <mets:name>Doculib Article Archive METS.SBB Exporter</mets:name>
        </mets:agent>
    </mets:metsHdr>

@wrznr
Contributor

wrznr commented Sep 13, 2018

What we need:

  • information on the single processing steps and their type
  • software versions
  • ...

Related to OCR-D/spec#64

Maybe the tool json is a good starting point.

@tboenig

tboenig commented Sep 14, 2018

Here is a proposal for the issue. It is one part of a future multipart METS definition and has already been written in the METS profile format.

The specification of the elements (see below) is based on the METS schema documentation and quotes from it. The values and contents of the attributes RECORDSTATUS and OTHERTYPE and of the element name were developed by the OCR-D project.
See:
METSPrimer: http://www.loc.gov/standards/mets/METSPrimer.pdf
METS schema: http://www.loc.gov/standards/mets/mets.xsd
METS schema documentation: http://www.loc.gov/standards/mets/docs/mets.v1-9.html
(Version 1.9)

The sample METS-Header:

    <mets:metsHdr
        RECORDSTATUS="PRODUCTION"
        LASTMODDATE="2013-01-14T08:55:09.109+01:00"
        CREATEDATE="2013-01-14T08:55:09.109+01:00">
        <mets:agent ID="ID0001"
                    TYPE="ORGANIZATION"
                    ROLE="CREATOR">
            <mets:name>OCR-D</mets:name>
        </mets:agent>
        <mets:agent ID="ID0002"
                    TYPE="OTHER"
                    OTHERTYPE="SOFTWARE"
                    ROLE="CREATOR">
            <mets:name>Doculib Article Archive METS.SBB Exporter</mets:name>
        </mets:agent>
        <mets:agent ID="ID0003"
                    TYPE="OTHER"
                    OTHERTYPE="ACTIVITIES"
                    ROLE="CREATOR">
            <mets:name>preprocessing/optimization/binarization</mets:name>
        </mets:agent>
    </mets:metsHdr>

For the specification of the elements and attributes, see here:

<?xml version="1.0" encoding="UTF-8"?>
<METS_Profile xmlns:xlink="http://www.w3.org/1999/xlink"
 xmlns:xhtml="http://www.w3.org/1999/xhtml"
 xmlns="http://www.loc.gov/METS_Profile/v2"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.loc.gov/METS_Profile/v2 http://www.loc.gov/standards/mets/profile_docs/mets.profile.v2-0.xsd" STATUS="provisional" REGISTRATION="unregistered">


<structural_requirements>
 <metsHdr>
  <xhtml:p>The documentation of the elements is based on the METS schema documentation and quotes from it. The values and contents of the attributes RECORDSTATUS and OTHERTYPE and of the element name were developed by the OCR-D project.</xhtml:p>
<xhtml:p>See: 
<xhtml:ul>
  <xhtml:li>METSPrimer: http://www.loc.gov/standards/mets/METSPrimer.pdf</xhtml:li>
  <xhtml:li>METS schema: http://www.loc.gov/standards/mets/mets.xsd</xhtml:li>
  <xhtml:li>METS schema documentation: http://www.loc.gov/standards/mets/docs/mets.v1-9.html 
                 (Version 1.9)</xhtml:li>
</xhtml:ul>
</xhtml:p>
  
<requirement ID="metsHdr">
   <xhtml:p>The metsHdr element (the metsHdr container) must contain the following elements and attributes.</xhtml:p>
   <xhtml:h1>Sample of the full METS header</xhtml:h1>
   <xhtml:p><![CDATA[ 
   <mets:metsHdr 
                 RECORDSTATUS="PRODUCTION"
                 LASTMODDATE="2013-01-14T08:55:09.109+01:00" 
                 CREATEDATE="2013-01-14T08:55:09.109+01:00">
       <mets:agent ID="ID0001"
                   TYPE="ORGANIZATION"
                   ROLE="CREATOR">
       <mets:name>OCR-D</mets:name>
      </mets:agent>
      <mets:agent ID="ID0002"
                   TYPE="OTHER"
                   OTHERTYPE="SOFTWARE"
                   ROLE="CREATOR">
       <mets:name>Doculib Article Archive METS.SBB Exporter</mets:name>
      </mets:agent>
      <mets:agent ID="ID0003"
                   TYPE="OTHER"
                   OTHERTYPE="ACTIVITIES"
                   ROLE="CREATOR">
       <mets:name>preprocessing/optimization/binarization</mets:name>
      </mets:agent>
   </mets:metsHdr>
   ]]>
   </xhtml:p>
   <xhtml:h1>Documentation of the element metsHdr and its attributes</xhtml:h1>
   <xhtml:h2>The identifier ID and ADMID</xhtml:h2>
   <xhtml:p><strong>ID (ID/O):</strong> This attribute uniquely identifies the element within the METS
         document, and would allow the element to be referenced unambiguously from another element or
         document via an IDREF or an XPTR. For more information on using ID attributes for internal and 
         external linking see Chapter 4 of the METS Primer.</xhtml:p>
   <xhtml:p>
    <strong>ADMID (IDREFS/O):</strong> Contains the ID attribute values of the &lt;techMD&gt;, &lt;sourceMD&gt;, &lt;rightsMD&gt; and/or &lt;digiprovMD&gt; elements within the &lt;amdSec&gt; of the METS document that contain administrative metadata pertaining to the METS document itself.  For more information on using METS IDREFS and IDREF type attributes for internal linking, see Chapter 4 of the METS Primer.
   </xhtml:p>
   
   
   <xhtml:h2>RECORDSTATUS</xhtml:h2>
   <xhtml:p><xhtml:ul>
    <xhtml:li><strong>RECORDSTATUS</strong> (string/O): Specifies the status of the METS document. 
                    It is used for internal processing purposes.</xhtml:li>
    <xhtml:li>PRODUCTION: Use if the METS metadata set describes the main usage.</xhtml:li>  
    <xhtml:li>SUPPLEMENT: Use if the METS metadata set represents a supplement to another 
                    metadata set, or an addition, add-on, or appendix to the currently described object.</xhtml:li>
    <xhtml:li>REPLACEMENT: Use if the METS metadata set replaces an earlier METS metadata set.</xhtml:li>
    <xhtml:li>TEST: Use if the METS metadata set is used for testing.</xhtml:li>
    <xhtml:li>OTHER: Use OTHER if none of the preceding values pertain.</xhtml:li>
    </xhtml:ul>
   </xhtml:p>
   <xhtml:h2>LASTMODDATE and CREATEDATE</xhtml:h2>
   <xhtml:p>
    <xhtml:ul>
     <xhtml:li><strong>LASTMODDATE</strong> (dateTime/O): Is used to indicate the date/time the 
                    METS document was last modified.</xhtml:li>
    <xhtml:li><strong>CREATEDATE</strong> (dateTime/O): Records the date/time the METS document 
                    was created.</xhtml:li></xhtml:ul></xhtml:p>
    
    <xhtml:h3>LASTMODDATE and CREATEDATE date and time format:</xhtml:h3>
    <xhtml:p>The time format<br/>
     YYYY = four-digit year<br/>
     MM   = two-digit month (01=January, etc.)<br/>
     DD   = two-digit day of month (01 through 31)<br/>
     hh   = two digits of hour (00 through 23) (am/pm NOT allowed)<br/>
     mm   = two digits of minute (00 through 59)<br/>
     ss   = two digits of second (00 through 59)<br/>
     s    = one or more digits representing a decimal fraction of a second<br/>
     TZD  = time zone designator (Z or +hh:mm or -hh:mm)
    </xhtml:p>
   
  </requirement>
  <requirement ID="metsHdr_agent">
   <xhtml:h1>Documentation of the element agent and its attributes</xhtml:h1>
  
  <xhtml:p>It is also possible to use a number of agents for different activities and processes.</xhtml:p>
  <xhtml:h2>The identifier ID</xhtml:h2>
  <xhtml:p><strong>ID (ID/O):</strong> This attribute uniquely identifies the element within the METS document, and would allow the element to be referenced unambiguously from another element or document via an IDREF or an XPTR. For more information on using ID attributes for internal and external linking see Chapter 4 of the METS Primer.</xhtml:p>
  
  <xhtml:h2>TYPE</xhtml:h2>
  <xhtml:p>
    <strong>TYPE (string/O):</strong> is used to specify the type of AGENT. It must be one of the 
                   following values:<br/>
    INDIVIDUAL: Use if an individual has served as the agent.<br/>
    ORGANIZATION: Use if an institution, corporate body, association, non-profit enterprise, government, 
                               religious body, etc. has served as the agent.<br/>
    OTHER: Use OTHER if none of the preceding values pertain and clarify the type of agent specifier being 
                used in the OTHERTYPE attribute
    </xhtml:p>
   
   <xhtml:h2>OTHERTYPE</xhtml:h2>
       <xhtml:p><strong>OTHERTYPE (string/O):</strong> Specifies the type of agent when the value 
       OTHER is indicated in the TYPE attribute.<br/>
       SOFTWARE: a program or system that has been involved in the process of the IP<br/>
       ACTIVITIES: Use if a single processing step or a number of steps are performed. Note: if you use 
       this value, the mets:name element must take its value from the default list of activities. This list is published 
       at: https://ocr-d.github.io/glossary#activities
  </xhtml:p>
   
   
  <xhtml:h2>ROLE</xhtml:h2>
  <xhtml:p>
   <strong>
   ROLE (string/R):</strong> Specifies the function of the agent with respect to the METS 
                 record. The allowed values are:<br/>
   CREATOR: The person(s) or institution(s) responsible for the METS document.<br/>
   EDITOR: The person(s) or institution(s) that prepares the metadata for encoding.<br/>
   ARCHIVIST: The person(s) or institution(s) responsible for the document/collection.<br/>
   PRESERVATION: The person(s) or institution(s) responsible for preservation functions.<br/>
   DISSEMINATOR: The person(s) or institution(s) responsible for dissemination functions.<br/>
   CUSTODIAN: The person(s) or institution(s) charged with the oversight of a document/collection.<br/>
   IPOWNER: Intellectual Property Owner: The person(s) or institution holding copyright, trade or service 
                 marks or other intellectual property rights for the object.<br/>
   OTHER: Use OTHER if none of the preceding values pertains and clarify the type and location specifier 
                 being used in the OTHERROLE attribute (see below).</xhtml:p>
   <xhtml:h2>OTHERROLE</xhtml:h2>
  <xhtml:p>OTHERROLE (string/O): Denotes a role not contained in the allowed values set if OTHER is 
                 indicated in the ROLE attribute.</xhtml:p>
  </requirement>
  <requirement ID="metsHdr_metsname">
   <xhtml:h2>name</xhtml:h2>
   <xhtml:p>
    The element &lt;name&gt; can be used to record the full name of the document agent.
   </xhtml:p>
   <xhtml:p>
    If you use the value ACTIVITIES in the OTHERTYPE attribute of the mets:agent element, the 
   mets:name element must take its value from the default list of activities. This list is published at: 
   https://ocr-d.github.io/glossary#activities
   </xhtml:p>
  </requirement>
 </metsHdr>
</structural_requirements>
</METS_Profile>
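
The value lists in the profile above lend themselves to a mechanical check. The following is only a minimal sketch in Python, assuming the standard METS namespace `http://www.loc.gov/METS/`; the function name `check_mets_header` and the wording of the messages are hypothetical, and treating SOFTWARE and ACTIVITIES as the only OTHERTYPE values is an assumption taken from this proposal, not from the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
NS = {"mets": METS_NS}

# Value lists copied from the profile text above.
RECORDSTATUS_VALUES = {"PRODUCTION", "SUPPLEMENT", "REPLACEMENT", "TEST", "OTHER"}
AGENT_TYPES = {"INDIVIDUAL", "ORGANIZATION", "OTHER"}
AGENT_ROLES = {"CREATOR", "EDITOR", "ARCHIVIST", "PRESERVATION",
               "DISSEMINATOR", "CUSTODIAN", "IPOWNER", "OTHER"}
# Assumption: only the two OTHERTYPE values named in this proposal are accepted.
OTHERTYPE_VALUES = {"SOFTWARE", "ACTIVITIES"}


def check_mets_header(xml_text):
    """Return a list of problems found in a serialized mets:metsHdr (hypothetical helper)."""
    hdr = ET.fromstring(xml_text)
    problems = []

    # RECORDSTATUS is optional, but if present it must come from the list.
    status = hdr.get("RECORDSTATUS")
    if status is not None and status not in RECORDSTATUS_VALUES:
        problems.append(f"unknown RECORDSTATUS: {status!r}")

    for agent in hdr.findall("mets:agent", NS):
        label = agent.get("ID", "<agent without ID>")
        # ROLE is required (string/R in the profile).
        role = agent.get("ROLE")
        if role not in AGENT_ROLES:
            problems.append(f"{label}: missing or unknown ROLE: {role!r}")
        # TYPE is optional, but restricted to three values.
        agent_type = agent.get("TYPE")
        if agent_type is not None and agent_type not in AGENT_TYPES:
            problems.append(f"{label}: unknown TYPE: {agent_type!r}")
        # OTHERTYPE qualifies TYPE="OTHER".
        if agent_type == "OTHER" and agent.get("OTHERTYPE") not in OTHERTYPE_VALUES:
            problems.append(f"{label}: TYPE is OTHER but OTHERTYPE is missing or unknown")
    return problems


SAMPLE = """<mets:metsHdr xmlns:mets="http://www.loc.gov/METS/"
    RECORDSTATUS="PRODUCTION"
    LASTMODDATE="2013-01-14T08:55:09.109+01:00"
    CREATEDATE="2013-01-14T08:55:09.109+01:00">
  <mets:agent ID="ID0001" TYPE="ORGANIZATION" ROLE="CREATOR">
    <mets:name>OCR-D</mets:name>
  </mets:agent>
  <mets:agent ID="ID0003" TYPE="OTHER" OTHERTYPE="ACTIVITIES" ROLE="CREATOR">
    <mets:name>preprocessing/optimization/binarization</mets:name>
  </mets:agent>
</mets:metsHdr>"""

print(check_mets_header(SAMPLE))
```

The sketch does not check that a `mets:name` under an ACTIVITIES agent comes from the published activities list, since that list is not reproduced in this thread.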

@wrznr
Contributor

wrznr commented Sep 14, 2018

@tboenig Maybe a solution in the form of "real" code (i.e. a pull request) would be easier to review and comment on...

@kba
Member Author

kba commented Sep 14, 2018

For this field I had something simpler in mind, like

      <mets:agent ID="ID0002"
                   TYPE="OTHER"
                   OTHERTYPE="SOFTWARE"
                   ROLE="CREATOR">
       <mets:name>ocrd_tesserocr v0.1.1 (ocrd/core v0.72)</mets:name>
      </mets:agent>

Like the User-Agent HTTP field or similar. A short string that ideally should have some recommended syntax but can be free-form and help humans later guess what happened here.

For true provenance (all the steps modeled as activities, logging, individual workflow task configurations etc.), this will need to be more complex and should be aligned with the process data repo (@VolkerHartmann) and the long-term preservation archive (@krvoigt).
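
To make the short User-Agent-like string concrete, here is a minimal sketch in Python of how such an agent element could be generated. It is not the code from OCR-D/core; the helper names `software_agent_name` and `make_software_agent` are hypothetical, and the string syntax simply mirrors the example above.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
# Serialize elements with the conventional "mets:" prefix.
ET.register_namespace("mets", METS_NS)


def software_agent_name(tool, tool_version, core_version):
    """Build a free-form, human-readable agent string (hypothetical syntax)."""
    return f"{tool} v{tool_version} (ocrd/core v{core_version})"


def make_software_agent(agent_id, name):
    """Create a mets:agent element describing a software agent."""
    agent = ET.Element(
        f"{{{METS_NS}}}agent",
        {"ID": agent_id, "TYPE": "OTHER", "OTHERTYPE": "SOFTWARE", "ROLE": "CREATOR"},
    )
    ET.SubElement(agent, f"{{{METS_NS}}}name").text = name
    return agent


agent = make_software_agent(
    "ID0002", software_agent_name("ocrd_tesserocr", "0.1.1", "0.72")
)
print(ET.tostring(agent, encoding="unicode"))
```

Keeping the string construction separate from the element construction means a recommended syntax could be tightened later without touching the METS-writing code.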

@VolkerHartmann

I would also prefer the name of the tool and its version instead of the processing step.
The processing step would also be nice as additional information, but as far as I can see there is no additional attribute/element for this.
The dependencies may be a problem, as there may be recursive dependencies.
I have no clue how to solve this.

@kba
Member Author

kba commented Oct 30, 2018

Fixed by OCR-D/spec#89 and #191

@kba kba closed this as completed Oct 30, 2018