Invalid bagit due to xml embedded in bag-info.txt? #420

Open
netsensei opened this issue Aug 27, 2024 · 1 comment

netsensei commented Aug 27, 2024

Hi!

When I try to ingest a basic bag created with RODA-In into a RODA Community Edition instance, the ingest fails with this error in the UI:

[image: screenshot of the ingest error in the RODA UI]

I'm following these steps:

  1. Open roda-in
  2. Open a directory containing a directory named prefix-0001234 with a single PDF file.
  3. Pick "create classification scheme" in the middle panel.
  4. I drag the directory containing the single PDF to the middle panel.
  5. I choose "One information package for each selected files or folders".
  6. I choose "Create new metadata from template" > "Dublin core".
  7. Hit "confirm"
  8. Select the package and then go to the metadata panel.
  9. I add my name as a creator in the creator field of the form.
  10. I hit "Create SIPs"
  11. In the subsequent form, I choose "Export all items", leaving all other items disabled. I also choose "BagIt" as the export format.

The result is a ZIP file which I then try to upload into RODA following these steps:

  1. I go to "ingest" > "transfer".
  2. Pick "Upload" from the dropdown, and upload the ZIP file via the form.
  3. I check the uploaded ZIP and pick the "Start new process" option from the dropdown.
  4. I then choose "Default ingest workflow" > "BagIt" as input format for the SIP > "Create" to start the process.
  5. After waiting, I see the workflow failing with the above error.

Looking inside the bag, I notice this structure in bag-info.txt:

metadata.dc.xml: <?xml version="1.0" encoding="UTF-8"?>
<simpledc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:noNamespaceSchemaLocation="../schemas/dc.xsd">
   <title>prefix-0001234</title>
   <identifier>uuid-42e36734-fa04-4251-adf5-b0743830ddfe</identifier>
   <creator>Netsensei</creator>
   <language>English</language>
</simpledc>

level: item
id: uuid-42e36734-fa04-4251-adf5-b0743830ddfe
title: prefix-0001234
vendor: commons-ip
Payload-Oxum: 87674859.1
Bagging-Date: 2024-08-27

Is embedding XML in a bag-info.txt file correct / valid? I've tried validating the bag with Bagger and bagit-python. Both fail to verify the bag, but then again, both validate against version 0.97 of the specification, while bagit.txt contains BagIt-Version: 1.0.

RODA-In version: 2.7.3

Thank you for looking into this.

netsensei commented Sep 11, 2024

Having looked into this further, I've noticed that the problem is with the generated bag-info.txt file. Both RODA and RODA-In use the commons-ip library. After writing a quick Java program to test this library, I was able to read the output of the BagitSIP.getValidationReport function.

Turns out that validation fails due to these errors:

Line [] does not meet the Bagit specification for a bag tag file. Perhaps you meant to indent it by a space or a tab? Or perhaps you didn't use a colon to separate the key from the value? It must follow the form of : or if continuing from another line must be indented by a space or a tab.

and

Line [] does not meet the Bagit specification for a bag tag file. Perhaps you meant to indent it by a space or a tab? Or perhaps you didn't use a colon to separate the key from the value? It must follow the form of : or if continuing from another line must be indented by a space or a tab.
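The rule quoted in these messages can be approximated in a few lines. The following is a minimal sketch, not the check that commons-ip or its underlying BagIt library actually performs: it assumes a line is acceptable if it either starts with a space or tab (a continuation) or contains a colon (a key/value pair). The function name `find_invalid_tag_lines` is hypothetical.

```python
def find_invalid_tag_lines(text):
    """Return (line_number, line) pairs that break the simplified tag-file rule.

    Simplified assumption: a line is valid if it is indented by a space or
    tab (continuation of the previous value) or contains a colon separating
    a key from a value. Real validators may apply stricter checks.
    """
    invalid = []
    for number, line in enumerate(text.splitlines(), start=1):
        is_continuation = line.startswith((" ", "\t"))
        has_separator = ":" in line
        if not (is_continuation or has_separator):
            invalid.append((number, line))
    return invalid


if __name__ == "__main__":
    # Hypothetical path; point it at the bag-info.txt of an unpacked bag.
    with open("bag-info.txt", encoding="utf-8") as handle:
        for number, line in find_invalid_tag_lines(handle.read()):
            print(f"line {number}: {line!r}")
```

Under this rule, an unindented closing tag such as `</simpledc>` and an empty line are both rejected, while the same closing tag indented by a space or tab passes.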

Changing bag-info.txt to the following and re-packaging the bag before uploading fixes the validation errors in RODA:

metadata.dc.xml: <?xml version="1.0" encoding="UTF-8"?>
    <simpledc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:noNamespaceSchemaLocation="../schemas/dc.xsd">
      <title>prefix-0001234</title>
      <identifier>uuid-42e36734-fa04-4251-adf5-b0743830ddfe</identifier>
      <creator>Netsensei</creator>
      <language>English</language>
    </simpledc>
level: item
id: uuid-42e36734-fa04-4251-adf5-b0743830ddfe
title: prefix-0001234
vendor: commons-ip
Payload-Oxum: 87674859.1
Bagging-Date: 2024-08-27
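The manual fix above amounts to indenting every continuation line of a multi-line value and dropping the blank line after it. A minimal sketch of how a writer could do this automatically is shown below. It is an illustration only, not how commons-ip writes bag-info.txt; the function name `format_tag` and the four-space indent are assumptions.

```python
def format_tag(key, value, indent="    "):
    """Format one bag-info.txt entry, indenting continuation lines.

    Blank lines inside the value are dropped, because an empty line is
    neither a key/value pair nor an indented continuation.
    """
    lines = [line for line in value.splitlines() if line.strip()]
    if not lines:
        return f"{key}: "
    head, *tail = lines
    # The first line follows the key; every later line gets the indent prefix
    # while keeping its own relative indentation.
    return "\n".join([f"{key}: {head.strip()}"] + [indent + line for line in tail])


if __name__ == "__main__":
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<simpledc>\n"
        "   <title>prefix-0001234</title>\n"
        "</simpledc>\n"
    )
    print(format_tag("metadata.dc.xml", xml))
    print(format_tag("level", "item"))
```

Single-line values such as `level: item` pass through unchanged, so the same helper could be used for every entry in the file.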
