Loosen expectation of XML structure when finding the pageId #453

Closed

mikegerber
Contributor

@mikegerber mikegerber commented Mar 5, 2020

I encountered METS embedded in an OAI-PMH response. Processing the result with OCR-D works somewhat, but it fails to find the pageIds for every file in the METS.

Example OAI-PMH with METS:
https://digital.staatsbibliothek-berlin.de/oai?verb=GetRecord&metadataPrefix=mets&identifier=oai%3Adigital.staatsbibliothek-berlin.de%3APPN719671574

When saving that as mets.xml, ocrd workspace validate reports lots of errors like this one:

  <error>File 'FILE_0001_PRESENTATION' does not manifest any physical page.</error>

Fix this by loosening the expectation of the XML structure when finding the pageId. (There are more XPath strings in the code that could be reviewed, I think.)
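
A minimal sketch of what "loosening" could mean here, not the actual change to `ocrd_mets.py`: a lookup anchored at the document root only works when `mets:mets` is the root element, while a descendant search also finds the page when the METS is wrapped in an OAI-PMH envelope. The function names, the sample documents, and the XPath strings below are assumptions made for illustration, and the standard library's `xml.etree.ElementTree` stands in for whatever XML library the project uses.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical, heavily reduced METS document with one physical page.
METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="PHYS_0001">
        <mets:fptr FILEID="FILE_0001_PRESENTATION"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""

# The same METS wrapped in a reduced OAI-PMH GetRecord envelope.
OAI = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<GetRecord><record><metadata>" + METS + "</metadata></record></GetRecord>"
    "</OAI-PMH>"
)

# Strict: the path starts at the root and expects it to be mets:mets.
STRICT = (
    "mets:structMap[@TYPE='PHYSICAL']"
    "/mets:div[@TYPE='physSequence']"
    "/mets:div[@TYPE='page']"
)
# Loose: search for the physical structMap and its pages at any depth.
LOOSE = ".//mets:structMap[@TYPE='PHYSICAL']//mets:div[@TYPE='page']"


def find_page_id(root, file_id, xpath):
    """Return the ID of the page div that points to file_id, or None."""
    for page in root.findall(xpath, NS):
        for fptr in page.findall("mets:fptr", NS):
            if fptr.get("FILEID") == file_id:
                return page.get("ID")
    return None


def page_id_strict(root, file_id):
    return find_page_id(root, file_id, STRICT)


def page_id_loose(root, file_id):
    return find_page_id(root, file_id, LOOSE)
```

With the plain METS document both lookups find the page. With the OAI-PMH envelope only the loose lookup does, which matches the "does not manifest any physical page" symptom described above.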

@codecov-io

codecov-io commented Mar 5, 2020

Codecov Report

Merging #453 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #453   +/-   ##
=======================================
  Coverage   81.82%   81.82%           
=======================================
  Files          39       39           
  Lines        2316     2316           
  Branches      427      427           
=======================================
  Hits         1895     1895           
  Misses        348      348           
  Partials       73       73
Impacted Files Coverage Δ
ocrd_models/ocrd_models/ocrd_mets.py 94.54% <ø> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 126d0d8...3ce82c1.

@kba
Member

kba commented Mar 6, 2020

> When saving that as mets.xml

Why would you save an OAI-PMH response as a METS document? Can't you do

http GET 'https://digital.staatsbibliothek-berlin.de/oai?verb=GetRecord&metadataPrefix=mets&identifier=oai%3Adigital.staatsbibliothek-berlin.de%3APPN719671574' \
| xmlstarlet sel -t -c '//*[local-name()="mets"]' \
> mets.xml

to get the actual METS document?
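
The same extraction can be sketched in Python for cases where `xmlstarlet` is not at hand. This is an illustration under assumptions, not part of OCR-D: the function name `extract_mets` is hypothetical, the input is expected to be the response body as a string, and the match is on the METS namespace `http://www.loc.gov/METS/` rather than on the local name alone, as the `xmlstarlet` call does.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

# Keep the conventional "mets:" prefix when serializing.
ET.register_namespace("mets", METS_NS)


def extract_mets(oai_xml: str) -> str:
    """Return the first METS element found in oai_xml, serialized as XML."""
    root = ET.fromstring(oai_xml)
    mets_tag = "{%s}mets" % METS_NS
    if root.tag == mets_tag:
        mets = root  # input was already a bare METS document
    else:
        mets = root.find(".//" + mets_tag)
    if mets is None:
        raise ValueError("no mets element found in input")
    mets.tail = None  # drop whitespace that followed the element in the envelope
    return ET.tostring(mets, encoding="unicode")
```

Writing the returned string to `mets.xml` would give a document whose root is the METS element, so that root-anchored lookups apply again.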

@kba
Member

kba commented Apr 9, 2020

Closing for now. Please comment if this is not solved and I'll reopen.
