RFC: ocrd-sanitize script to preprocess/postprocess OCR-D workspaces #544

Open
kba opened this issue Jul 23, 2020 · 12 comments

@kba
Member

kba commented Jul 23, 2020

METS/PAGE/ALTO provided by digitization workflow software or repositories will not always adhere to the conventions we have in OCR-D. On the other hand, the workspaces that result from OCR-D workflows contain a lot of redundant information that is either not relevant for ingestion into production systems or contradicts the local conventions of the production system.

Also, our conventions have been shifting and will continue to do so to meet the needs of users and developers.

Many users have therefore developed scripts to preprocess the input and postprocess the output of OCR-D.

OCR-D/core should provide a processor ocrd-sanitize which is only concerned with "housekeeping" of workspaces. Possible actions include:

  • Pruning of mets:fileGrp, either by allowlist or denylist, i.e. removing every mets:fileGrp that is not required anymore, together with the mets:file elements it contains (and the files on disk)
  • regex-based replacement of all xlink:href to match local conventions
  • Removing all but the lowest level of page:TextEquiv information in PAGE-XML
  • Approximating polygons with bounding boxes in PAGE-XML to support full-text-indexing
  • Upgrading older PAGE-XML namespaces to the latest version (bashlib/ocrd_wrap_cli_processor: show help if METS does not exist #503)
  • Assigning persistent identifiers to work, pages, files ...
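
To make one of these actions more concrete, the regex-based replacement of xlink:href could look roughly like the following sketch. It only uses Python's standard library. The function name rewrite_hrefs, the sample METS and the example.org URL are assumptions made for illustration, they are not part of OCR-D/core.

```python
import re
import xml.etree.ElementTree as ET

# Clark notation for the xlink:href attribute
HREF = "{http://www.w3.org/1999/xlink}href"


def rewrite_hrefs(root, pattern, replacement):
    """Apply re.sub() to every xlink:href at or below root; return how many values changed."""
    changed = 0
    for el in root.iter():
        old = el.get(HREF)
        if old is None:
            continue
        new = re.sub(pattern, replacement, old)
        if new != old:
            el.set(HREF, new)
            changed += 1
    return changed


# Hypothetical one-file METS, just enough to exercise the function
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001">
        <mets:FLocat LOCTYPE="URL" xlink:href="file:///data/ws/OCR-D-IMG/0001.tif"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""

root = ET.fromstring(SAMPLE)
changed = rewrite_hrefs(root, r"^file:///data/ws/", "https://example.org/ws/")
new_href = next(el.get(HREF) for el in root.iter() if HREF in el.attrib)
print(changed, new_href)
```

Iterating over all elements rather than only mets:FLocat means that mets:mptr or other linking elements would be rewritten as well, which may or may not be desired.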

These are just some ideas, we'd love to hear yours. Please share your pre-processing/post-processing scripts or feature requests for such a tool so we can develop a solution together for common tasks.

@kba
Member Author

kba commented Jul 23, 2020

@mikegerber

here is my collection of METS/PAGE file fixer scripts, as mentioned in the call: https://github.com/mikegerber/sbb-useful-hacks/tree/master/mets-fixers - not to be used lightly, no warranty, you have been warned 🚧 🚨 🚧

@mikegerber
Contributor

mikegerber commented Jul 23, 2020

I don't know if I missed the point a bit, but I do see two different groups of use cases here:

  1. Sanitizing/Repairing/maintaining invalid or outdated METS/workspaces:
  2. Other post-processing
  • Pruning of mets:fileGrp, either by allowlist or denylist, i.e. removing every mets:fileGrp that is not required anymore, together with the mets:file elements it contains (and the files on disk)
  • Removing all but the lowest level of page:TextEquiv information in PAGE-XML
  • Approximating polygons with bounding boxes in PAGE-XML to support full-text-indexing
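
As an illustration of the last item, approximating polygons with bounding boxes only needs the points attribute of each Coords element. This is a minimal sketch using the standard library. The function names and the sample page are assumptions, and elements are matched by local name so that the PAGE namespace version does not matter.

```python
import xml.etree.ElementTree as ET


def bbox_points(points):
    """Turn a PAGE points string ("x1,y1 x2,y2 ...") into the 4 corners of its bounding box."""
    pairs = [pair.split(",") for pair in points.split()]
    xs = [int(x) for x, _ in pairs]
    ys = [int(y) for _, y in pairs]
    x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
    # corners in clockwise order, starting at the top left
    return f"{x0},{y0} {x1},{y0} {x1},{y1} {x0},{y1}"


def polygons_to_bboxes(page_root):
    """Replace @points of every Coords element in place, regardless of the PAGE namespace."""
    for el in page_root.iter():
        if isinstance(el.tag, str) and el.tag.rsplit("}", 1)[-1] == "Coords" and el.get("points"):
            el.set("points", bbox_points(el.get("points")))


# Hypothetical PAGE document with a single, slightly skewed region
PAGE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="0001.tif" imageWidth="100" imageHeight="100">
    <TextRegion id="r1"><Coords points="10,10 50,5 60,40 20,45"/></TextRegion>
  </Page>
</PcGts>"""

page_root = ET.fromstring(PAGE)
polygons_to_bboxes(page_root)
coords = next(el for el in page_root.iter() if el.tag.endswith("}Coords"))
print(coords.get("points"))
```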

Should these use case groups maybe be put into two separate processors/tools?

@kba
Member Author

kba commented Jul 23, 2020

Should these use case groups maybe be put into two separate processors/tools?

Yes, probably. Or even task-specific processors (ocrd-sanitize-prune-filegroups, ocrd-sanitize-textequiv ...)
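
To give an idea of how small such a task-specific tool could be, here is a rough sketch of what something like ocrd-sanitize-textequiv might do: drop the TextEquiv of every element that has TextEquiv further down in its subtree. The function names and the sample page are assumptions, and the level list (TextRegion, TextLine, Word, Glyph) is a simplification of the PAGE hierarchy.

```python
import xml.etree.ElementTree as ET

# Simplified PAGE text hierarchy, from coarse to fine
LEVELS = ("TextRegion", "TextLine", "Word", "Glyph")


def local(el):
    """Return the tag name without its namespace, or '' for comments and the like."""
    return el.tag.rsplit("}", 1)[-1] if isinstance(el.tag, str) else ""


def prune_textequiv(page_root):
    """Remove TextEquiv from elements that have TextEquiv on a lower level; return the count."""
    removed = 0
    # iterate over a snapshot, because the tree is modified along the way
    for el in list(page_root.iter()):
        if local(el) not in LEVELS:
            continue
        own = [child for child in el if local(child) == "TextEquiv"]
        deeper = any(
            local(desc) == "TextEquiv"
            for child in el if local(child) in LEVELS
            for desc in child.iter()
        )
        if own and deeper:
            for child in own:
                el.remove(child)
            removed += len(own)
    return removed


# Hypothetical page with the same text annotated on region, line and word level
PAGE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"><Page>
<TextRegion id="r1">
  <TextLine id="l1">
    <Word id="w1"><TextEquiv><Unicode>foo</Unicode></TextEquiv></Word>
    <TextEquiv><Unicode>foo</Unicode></TextEquiv>
  </TextLine>
  <TextEquiv><Unicode>foo</Unicode></TextEquiv>
</TextRegion>
</Page></PcGts>"""

page_root = ET.fromstring(PAGE)
removed = prune_textequiv(page_root)
remaining = sum(1 for el in page_root.iter() if local(el) == "TextEquiv")
print(removed, remaining)
```

A line without any Word children keeps its TextEquiv in this sketch, so no text is lost on pages that were only recognized at line level.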

@kba
Member Author

kba commented Jul 23, 2020

Of interest in this context: https://github.com/tboenig/AletheiaTools

@kba
Member Author

kba commented Jul 23, 2020

Another useful operation: Assign pcGtsId from the mets:file/@ID
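
A minimal sketch of that operation, assuming that PAGE files are identified in the METS by the MIME type application/vnd.prima.page+xml. The helper names are hypothetical, and the PAGES dictionary merely stands in for parsing the referenced files from disk.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
HREF = "{http://www.w3.org/1999/xlink}href"
PAGE_MIME = "application/vnd.prima.page+xml"


def page_files(mets_root):
    """Yield (ID, href) for every mets:file that carries the PAGE-XML MIME type."""
    for f in mets_root.iterfind(".//mets:file", NS):
        flocat = f.find("mets:FLocat", NS)
        if f.get("MIMETYPE") == PAGE_MIME and flocat is not None:
            yield f.get("ID"), flocat.get(HREF)


def set_pcgtsid(page_root, file_id):
    """Set @pcGtsId on the PcGts root element, overwriting any previous value."""
    if page_root.tag.rsplit("}", 1)[-1] != "PcGts":
        raise ValueError("not a PAGE-XML document")
    page_root.set("pcGtsId", file_id)


# Hypothetical workspace: one PAGE file registered in the METS
METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec><mets:fileGrp USE="OCR-D-SEG">
    <mets:file ID="OCR-D-SEG_0001" MIMETYPE="application/vnd.prima.page+xml">
      <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="OCR-D-SEG/0001.xml"/>
    </mets:file>
  </mets:fileGrp></mets:fileSec>
</mets:mets>"""

# In-memory stand-in for the files on disk (href -> parsed PAGE root)
PAGES = {
    "OCR-D-SEG/0001.xml": ET.fromstring(
        '<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" pcGtsId="old"/>'
    )
}

for file_id, href in page_files(ET.fromstring(METS)):
    set_pcgtsid(PAGES[href], file_id)
print(PAGES["OCR-D-SEG/0001.xml"].get("pcGtsId"))
```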

@mikegerber
Contributor

mikegerber commented Jul 23, 2020

Another useful operation: Assign pcGtsId from the mets:file/@ID

https://github.com/mikegerber/sbb-useful-hacks/blob/master/mets-fixers/fix-page-pcgtsid-to-be-mets-file-id

@M3ssman
Contributor

M3ssman commented Jul 24, 2020

Something related: extract METS/MODS from xml_doc created from OAI-Response like this:

mets_root_el = xml_root.find('.//mets:mets', XMLNS)
if mets_root_el is not None:
    return ET.ElementTree(mets_root_el)
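
The fragment above relies on names defined elsewhere. A self-contained variant might look like the following sketch, assuming that ET is xml.etree.ElementTree and XMLNS is a prefix map for the METS namespace. The function name extract_mets and the sample OAI-PMH response are made up for illustration.

```python
import xml.etree.ElementTree as ET

XMLNS = {"mets": "http://www.loc.gov/METS/"}
METS_TAG = "{http://www.loc.gov/METS/}mets"


def extract_mets(xml_root):
    """Return the first mets:mets element at or below xml_root as a tree, or None."""
    if xml_root.tag == METS_TAG:
        # find() only searches descendants, so handle a bare METS document here
        return ET.ElementTree(xml_root)
    mets_root_el = xml_root.find(".//mets:mets", XMLNS)
    if mets_root_el is None:
        return None
    return ET.ElementTree(mets_root_el)


# Hypothetical, heavily shortened GetRecord response
OAI = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <mets:mets xmlns:mets="http://www.loc.gov/METS/"><mets:fileSec/></mets:mets>
  </metadata></record></GetRecord>
</OAI-PMH>"""

tree = extract_mets(ET.fromstring(OAI))
print(tree.getroot().tag)
```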

@kba
Member Author

kba commented Jul 24, 2020

Something related: extract METS/MODS from xml_doc created from OAI-Response like this:

mets_root_el = xml_root.find('.//mets:mets', XMLNS)
if mets_root_el is not None:
    return ET.ElementTree(mets_root_el)

Let's keep OAI-PMH in a separate issue, cf. #539. Also, if you want to extract METS from a GetRecord OAI-PMH request on the command line with xmlstarlet, see #453 (comment)

@M3ssman
Contributor

M3ssman commented Jul 24, 2020

Snippet for pruning METS/MODS fileGrp entries, using a whitelist/blacklist (wl/bl) approach:

# namespace map used by the XPath expressions below
XMLNS = {'mets': 'http://www.loc.gov/METS/'}


def clear_fileGroups(xml_root, black_list=None, white_list=None):
    """Drop every mets:fileGrp whose USE is in black_list or missing from white_list."""
    file_sections = xml_root.findall('.//mets:fileSec', XMLNS)
    if not file_sections:
        raise ValueError('invalid XML data: no mets:fileSec found')

    for file_section in file_sections:
        # iterate over a copy, since groups get removed from the section on the fly
        for sub_group in list(file_section):
            subgroup_label = sub_group.attrib['USE']
            blocked = bool(black_list) and subgroup_label in black_list
            not_allowed = bool(white_list) and subgroup_label not in white_list
            # one combined test, so that no group is removed twice
            if blocked or not_allowed:
                file_section.remove(sub_group)
                sanitize_physical_structMap(xml_root, subgroup_label)


def sanitize_physical_structMap(xml_root, file_ref):
    """Remove the mets:fptr of every physical page whose FILEID contains file_ref."""
    pages = xml_root.findall('.//mets:structMap[@TYPE="PHYSICAL"]/mets:div/mets:div[@TYPE="page"]', XMLNS)
    for page in pages:
        # collect first, then remove, to avoid mutating the page while iterating over it;
        # note that this is a substring match on FILEID, as in the original version
        removals = [fptr for fptr in page if file_ref in fptr.attrib.get('FILEID', '')]
        for removal in removals:
            page.remove(removal)

@M3ssman
Contributor

M3ssman commented Aug 5, 2020

Also convenient: re-index all METS-Filegroups after any undesired reference entries were dropped.

@bertsky
Collaborator

bertsky commented Jun 24, 2021

My largest demand for a sanitizer would be ensuring ingest into Kitodo.Presentation / DFG-Viewer works.

According to this we are already close, but...

  • our ALTO must be v2.0 currently (see this issue). Unfortunately, the DFG-Viewer profile does not say much more, although we already know that SP/newlines are an issue and that /alto/Layout/Page/@WIDTH is extremely important: Kitodo.Presentation needs to add the DFG footer (which comes in multiples of 1000px width IIUC), so it scales the images and needs to know by what factor to scale the ALTO coordinates accordingly
  • that means the XSLT from ocr-filetransform will not in general give correct results for OCR-D generated PAGE, so we should switch to page-to-alto and recommend/document it
  • our METS itself needs to conform to the DFG-Viewer profile, which notably means that
    • images must be in the DEFAULT fileGrp (whether by alias to another, existing fileGrp or by renaming I am not sure)
    • ALTO must be in the FULLTEXT fileGrp (not sure what to do if multiple versions are available) and MIMETYPE="text/html" (not application/alto+xml!)
    • files must be of LOCTYPE="URL" (but not sure about the kind of response the webserver needs to give, esp. whether it must understand and convey the correct Content-Type MIME or may omit it or use some nonsense like application/octet-stream)
    • for every mets:file there must be exactly one FLocat (which was already discussed within the remote-local bookkeeping and partial manifestation idea)
    • there must be a structMap of TYPE="PHYSICAL" with a mets:div of TYPE="physSequence" in it and at least one mets:div in that with TYPE="page" (i.e. at least one page) and an ORDER label
    • there must be a structMap of TYPE="LOGICAL" with a mets:div of some TYPE in it ("the name is not important") and at least one mets:div in that with TYPE among these labels
    • there must be a structLink linking each physical page to at least one logical element
    • there must be a mets:dmdSec with at least some MODS or TEIHDR metadata
    • there must be a mets:amdSec with at least some mets:techMD or external namespace metadata and some mets:rightsMD (with various dv:rights specs) and mets:digiprovMD (with dv:reference)
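
Most of the METS-level requirements above can be checked mechanically. The following is only a shallow sketch with the standard library, not a validator for the DFG-Viewer profile: the function name check_dfg_viewer and the wording of the messages are assumptions, and the contents of dmdSec, amdSec and structLink are not inspected, only their presence.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}


def check_dfg_viewer(root):
    """Return a list of problems found; an empty list means every check below passed."""
    problems = []
    uses = {grp.get("USE") for grp in root.iterfind(".//mets:fileGrp", NS)}
    for use in ("DEFAULT", "FULLTEXT"):
        if use not in uses:
            problems.append(f"no fileGrp with USE={use}")
    for f in root.iterfind(".//mets:file", NS):
        flocats = f.findall("mets:FLocat", NS)
        if len(flocats) != 1:
            problems.append(f"{f.get('ID')}: expected exactly one FLocat, found {len(flocats)}")
        elif flocats[0].get("LOCTYPE") != "URL":
            problems.append(f"{f.get('ID')}: FLocat is not LOCTYPE=URL")
    pages = root.findall(
        './/mets:structMap[@TYPE="PHYSICAL"]/mets:div[@TYPE="physSequence"]/mets:div[@TYPE="page"]',
        NS,
    )
    if not pages:
        problems.append("no page in the physical structMap")
    elif any(page.get("ORDER") is None for page in pages):
        problems.append("page without ORDER")
    if root.find('.//mets:structMap[@TYPE="LOGICAL"]/mets:div/mets:div', NS) is None:
        problems.append("no nested mets:div in a logical structMap")
    for name in ("structLink", "dmdSec", "amdSec"):
        if root.find(f".//mets:{name}", NS) is None:
            problems.append(f"no mets:{name}")
    return problems


# Hypothetical METS as a fresh OCR-D workspace might have it: images only, local files
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="OTHER" xlink:href="OCR-D-IMG/0001.tif"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="PHYS_0001"><mets:fptr FILEID="IMG_0001"/></mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""

problems = check_dfg_viewer(ET.fromstring(SAMPLE))
for problem in problems:
    print(problem)
```

A sanitizer could run such a check first and then only apply the repairs that are actually needed, for example renaming or aliasing the image fileGrp to DEFAULT.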

@bertsky
Collaborator

bertsky commented Jun 25, 2021

I stand corrected: as this example by @stefanCCS (METS and ALTO) shows, MIMETYPE="application/alto+xml" and ALTO v4.1 actually do work. (That is, newer features are simply ignored.)
