Feature: send a BagIt bag to Archivematica for preservation #805

Closed
jraddaoui opened this issue Dec 6, 2023 · 12 comments · Fixed by #1007

@jraddaoui
Collaborator

Is your feature request related to a problem? Please describe.

Currently, all transfers started in Archivematica use the zipfile transfer type:

https://github.com/artefactual-sdps/enduro/blob/main/internal/am/start_transfer.go#L49

This is not an issue in the current implementation, where the transfer is always bundled as a ZIP file. However, it limits the extensibility of the workflow. Consider the SFA fork, where the transfer is transformed into a zipped bag in the pre-processing activities:

https://github.com/artefactual-sdps/enduro-sfa/pull/4/files#diff-ae98fc39bbc9e053ec8d1d2ed56184cd9ba7ea280d3e72975617da81c3cfadd3

Describe the solution you'd like

Provide a configuration setting like the one used for the processing configuration:

https://github.com/artefactual-sdps/enduro/blob/main/enduro.toml#L99

Describe alternatives you've considered

Allow the transfer type value to be changed in the workflow. If child workflows are used to manage that extensibility, another option would be to indicate the transfer type in the child workflow result.

@djjuhasz
Collaborator

djjuhasz commented Dec 6, 2023

@jraddaoui I agree it would be better to allow different transfer types to be sent to Archivematica, but in the current processing workflow the bundle activity will convert an incoming Bag transfer into a standard transfer which is then zipped and sent to AM (or a3m). Allowing a Bag transfer to be sent to Archivematica will require removing the bundle activity from the AM workflow or updating it to support multiple output transfer types.

@djjuhasz
Collaborator

djjuhasz commented Dec 6, 2023

Note: the conversion of Bags -> standard transfer is a decision that was made for the a3m preservation engine, and I decided to retain this convention when adding Archivematica as a preservation engine option.

@jraddaoui
Collaborator Author

I'll create another issue about the bundle activity. This is all looking ahead to an extensible pre-processing option, and it will help if we have a child workflow for those activities later on. Then we should discuss where the bundle activity should be located (if it is needed at all); looking at the conceptual design, bundling seems like a responsibility of pre-processing. In the SFA fork we are skipping the bundle activity right now.

@djjuhasz
Collaborator

djjuhasz commented Dec 6, 2023

@jraddaoui okay, but I don't see any point in making the AM transfer type configurable without addressing the bundle activity - Enduro will always deliver a zipped standard transfer to AM. In the SFA case you've already modified the Enduro code, so just changing the transfer type in the code is a simpler solution than adding a config variable.

@sallain
Collaborator

sallain commented Mar 5, 2024

Note from today's meeting: @djjuhasz, @jraddaoui, and @sallain to review this issue and decide what pieces of work need to be completed to support SFA and MoMA.

@djjuhasz
Collaborator

djjuhasz commented Mar 6, 2024

I have a proposal for how to handle the SIP format delivered to the preservation system by Enduro. My proposal is based on the supposition that a BagIt Bag is the best SIP format for Enduro to send to the preservation system, but recognizes that a3m currently can't process Bagged SIPs.

I believe a BagIt Bag should be the preferred SIP format because:

  1. It's an open standard, and Artefactual prefers implementing open standards where possible.
  2. There is existing tooling to create and validate Bags, which includes validating file checksums and the package contents vs. manifest. Using the existing Bag tools saves us the work of implementing and maintaining our own SIP creation and validation tools.

My proposed solution for the Enduro SIP type

  1. Remove the Bundle activity from Enduro, and make repackaging the SIP (when necessary) a concern of Preprocessing. I think it makes sense for any code that modifies the structure or contents of a transfer to be implemented in Preprocessing.
  2. Specify that Preprocessing must deliver a BagIt Bag to Enduro upon successful completion. This allows us to run a Bag validator to confirm that the SIP produced by Preprocessing meets the expectations of Enduro (and the preservation system).
  3. Move the unbag function from the Bundle activity to a standalone "Unbag Activity", which is only run in the a3m workflow. The Unbag Activity will convert the BagIt SIP from Preprocessing to an Archivematica standard transfer. If a3m implements Bag processing in the future, the Unbag Activity can be removed from Enduro.
  4. In the Archivematica preservation workflow, always send a Bagged SIP to Archivematica for preservation. Update the start transfer API request "Type" value to "zipped bag".
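
A minimal sketch of how steps 3 and 4 could branch, written in Python for brevity even though Enduro is a Go project. The `Engine` enum, the step labels, and `package_request` are hypothetical names, not Enduro code. The request field names (`name`, `type`, `path`) and the base64 path encoding are assumptions about the Archivematica package API that should be checked against its documentation; only the "zipfile" and "zipped bag" type values come from this thread.

```python
import base64
from enum import Enum


class Engine(Enum):
    """Preservation engines discussed in this thread (hypothetical enum)."""
    ARCHIVEMATICA = "archivematica"
    A3M = "a3m"


def preservation_steps(engine: Engine) -> list[str]:
    """List the steps that follow preprocessing for the given engine."""
    # Preprocessing always hands over a bag, so validation is shared.
    steps = ["validate bag"]
    if engine is Engine.A3M:
        # a3m cannot process bags yet, so the bag is converted first.
        steps += ["unbag to standard transfer", "submit to a3m"]
    else:
        steps += ["zip bag", "start Archivematica transfer as zipped bag"]
    return steps


def package_request(name: str, location_id: str, relative_path: str) -> dict:
    """Build a start-transfer request body that uses the "zipped bag" type.

    Assumption: the path is sent as base64("<location id>:<relative path>").
    """
    raw = f"{location_id}:{relative_path}".encode("utf-8")
    return {
        "name": name,
        "type": "zipped bag",  # the current code sends "zipfile"
        "path": base64.b64encode(raw).decode("ascii"),
    }
```

Keeping the unbag step confined to the a3m branch means it can be deleted in one place if a3m gains bag support.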

@sallain @jraddaoui what do you think? If you have a counter-proposal or any suggested modifications to my proposal, I'd love to hear your ideas.

@sallain
Collaborator

sallain commented Mar 12, 2024

I think that this is a good idea for the following reasons:

  • The output from pre-processing becomes a predictable entity - no matter the structure of the data submitted to pre-processing, the output is always a bag
  • Enduro establishes some control over the contents being preserved before the contents are sent to the preservation application through the creation of checksums and bag-info.txt
  • It allows us to confirm that no data corruption occurred between the pre-processing application and Archivematica, as Archivematica will check the bag's checksums upon receipt
  • The bag standard is widely used and we (Artefactual) have a lot of experience both creating and working with bags
  • The transformation will not impact how users create and upload their materials to Enduro, nor how the AIP is ultimately generated
  • Archivematica already handles bags, including checking checksums and parsing bag-info.txt
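
To make the checksum point concrete, here is a rough sketch of the kind of check a bag validator performs, using only the Python standard library. It is not the validator Enduro or Archivematica uses. It assumes a SHA-256 payload manifest named `manifest-sha256.txt`, and it skips parts of the BagIt spec such as tag manifests, percent-encoded paths, and other digest algorithms. `validate_bag` and `sha256_of` are hypothetical names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_bag(bag_dir: Path) -> list[str]:
    """Return the problems found; an empty list means these checks passed."""
    problems = []
    if not (bag_dir / "bagit.txt").is_file():
        problems.append("missing bagit.txt")
    manifest = bag_dir / "manifest-sha256.txt"
    if not manifest.is_file():
        return problems + ["missing manifest-sha256.txt"]

    listed = set()
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        listed.add(rel_path)
        target = bag_dir / rel_path
        if not target.is_file():
            problems.append(f"listed but missing: {rel_path}")
        elif sha256_of(target) != expected.lower():
            problems.append(f"checksum mismatch: {rel_path}")

    # Payload files that the manifest does not mention also fail validation.
    data_dir = bag_dir / "data"
    if data_dir.is_dir():
        for path in sorted(data_dir.rglob("*")):
            rel = path.relative_to(bag_dir).as_posix()
            if path.is_file() and rel not in listed:
                problems.append(f"not in manifest: {rel}")
    return problems
```

Running the same check on both sides of the hand-off is what would reveal corruption between pre-processing and the preservation system.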

I also completely agree that this should all occur in pre-processing.

Here are a few things to consider:

  • What if an institution uploads a bag? Rebagging should be avoided, in my opinion, partly due to redundancy but also because the bagfiles should be respected and used as intended (e.g. the originally-uploaded checksum file should be the one that is checked in Archivematica/a3m; the original bag-info.txt should be parsed by Archivematica into the METS)
  • a3m cannot currently accept bags, as noted above, and immediately having to run the unbag activity is a bit silly. Assuming we adopt this proposal, I'd like to see the implementation of bag processing in a3m as soon as possible.
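
One way to avoid rebagging, sketched in Python with the standard library only: detect an existing bag and pass it through untouched, and only build a new bag when the input is a plain directory. `is_bag` and `ensure_bag` are hypothetical helpers, the detection rule (a `bagit.txt` file plus a `data/` directory) is a simplification, and the new bag gets a SHA-256 manifest but no `bag-info.txt`.

```python
import hashlib
import shutil
from pathlib import Path


def is_bag(path: Path) -> bool:
    """Treat a directory as a bag if it has bagit.txt and a data/ payload."""
    return (path / "bagit.txt").is_file() and (path / "data").is_dir()


def ensure_bag(src: Path, dest: Path) -> Path:
    """Return the path of a bag holding the contents of src.

    An uploaded bag is returned as-is so that the depositor's manifests and
    bag-info.txt are the ones checked downstream. Otherwise src is copied
    into a new bag at dest, which must not exist yet.
    """
    if is_bag(src):
        return src

    shutil.copytree(src, dest / "data")  # also creates dest
    lines = []
    for f in sorted((dest / "data").rglob("*")):
        if f.is_file():
            # Reads each file fully into memory; fine for a sketch.
            digest = hashlib.sha256(f.read_bytes()).hexdigest()
            lines.append(f"{digest}  {f.relative_to(dest).as_posix()}\n")
    (dest / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    (dest / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    return dest
```

Copying into a new directory rather than bagging in place keeps the original upload intact, at the cost of extra disk space.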

I'm sure that there are other considerations as well, but for the most part I think that this is a solid proposal.

@Diogenesoftoronto
Contributor

Diogenesoftoronto commented Mar 13, 2024

> I'm sure that there are other considerations as well, but for the most part I think that this is a solid proposal.

I would like to raise one consideration that is missing here: our current way of validating bags uses a very early and not well-tested BagIt library in Go. See https://github.com/nyudlts/go-bagit and nyudlts/go-bagit#7 (comment). It would require some work to make it a fully featured bag validator that is compliant with the spec.

@djjuhasz
Collaborator

djjuhasz commented Mar 13, 2024

@sallain I agree that we should avoid rebagging a transfer that is submitted as a Bag and that adding Bag processing to a3m ASAP would avoid having to unbag the bag we just bagged. :P

@Diogenesoftoronto yes, good points about the https://github.com/nyudlts/go-bagit library. I was assuming we would use https://github.com/LibraryOfCongress/bagit-python for Bag validation, but it being a Python tool definitely makes it more challenging to integrate than a native Go library. It also looks like bagit-python is not being actively maintained, and it requires Python 2, which was sunset in January 2020.

@sallain
Collaborator

sallain commented Mar 15, 2024

I was discussing this with @fiver-watson and he pointed out that there may be circumstances where a user submits a bag, but other activities in the pre-processing application (e.g. transforming or adding metadata files) make the original bag invalid, meaning that the bag WOULD have to be rebagged. Just something to consider.
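
A sketch of how that case could be handled, in Python with the standard library only: compare the payload with the manifest after pre-processing and rewrite the manifest only if they have diverged. All function names are hypothetical, a SHA-256 manifest is assumed, and a real implementation would also need to refresh any tag manifest and `bag-info.txt` fields such as the payload size.

```python
import hashlib
from pathlib import Path


def payload_digests(bag_dir: Path) -> dict[str, str]:
    """Map each payload path (relative to the bag root) to its SHA-256."""
    digests = {}
    for f in sorted((bag_dir / "data").rglob("*")):
        if f.is_file():
            rel = f.relative_to(bag_dir).as_posix()
            digests[rel] = hashlib.sha256(f.read_bytes()).hexdigest()
    return digests


def manifest_digests(bag_dir: Path) -> dict[str, str]:
    """Parse manifest-sha256.txt into a path-to-digest mapping."""
    entries = {}
    text = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in text.splitlines():
        if line.strip():
            digest, rel_path = line.split(maxsplit=1)
            entries[rel_path] = digest.lower()
    return entries


def rebag_if_stale(bag_dir: Path) -> bool:
    """Rewrite the payload manifest only when the payload no longer matches.

    Returns True if the manifest was rewritten, False if it was still valid.
    """
    current = payload_digests(bag_dir)
    if current == manifest_digests(bag_dir):
        return False  # Untouched bag: keep the depositor's manifest.
    lines = [f"{digest}  {path}\n" for path, digest in current.items()]
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return True
```

Rewriting the manifest discards the depositor's original checksums, so the uploaded bag would need to be validated before any pre-processing step modifies it.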

@sallain
Collaborator

sallain commented Apr 5, 2024

We spent some time last week hashing out a workflow diagram. This is the result. It can be found on the Implementation Services team Miro board.

[Image: workflow diagram]

@djjuhasz
Collaborator

djjuhasz commented Apr 8, 2024

@sallain the workflow diagram looks good to me. 👍

@sallain sallain added this to Enduro May 29, 2024
@sallain sallain moved this to To do in Enduro May 29, 2024
@sallain sallain moved this from To do to In Progress in Enduro May 29, 2024
@djjuhasz djjuhasz moved this from ⏳ In Progress to 👍 Ready in Enduro May 30, 2024
@djjuhasz djjuhasz changed the title from "Problem: Archivematica transfer type can't be configured" to "Feature: send a BagIt bag to Archivematica for preservation" May 30, 2024
@djjuhasz djjuhasz removed their assignment Jun 25, 2024
@djjuhasz djjuhasz self-assigned this Aug 9, 2024
@djjuhasz djjuhasz moved this from 👍 Ready to ⏳ In Progress in Enduro Aug 9, 2024
djjuhasz added a commit that referenced this issue Aug 16, 2024
Fixes #805

- Change the package type to "zipped bag" when starting a transfer via
  the Archivematica API
- Bag the PIP before sending it to Archivematica, if it's not already a
  bag
- Add "TransferSourcePath" config value to specify the API path to the
  Transfer Source directory where PIPs are uploaded
djjuhasz added a commit that referenced this issue Aug 16, 2024
Fixes #805

- Change the package type to "zipped bag" when starting a transfer via
  the Archivematica API
- Bag the PIP before sending it to Archivematica (if it's not already a
  bag)
- Add a "TransferSourcePath" config value to specify the API path to the
  Transfer Source directory where PIPs are uploaded
djjuhasz added a commit that referenced this issue Aug 22, 2024
Fixes #805

- Move the bundle activity to the a3m branch of the processing workflow
- Change the package type to "zipped bag" when starting a transfer via
  the Archivematica API
- Bag the PIP before sending it to Archivematica (if it's not already a
  bag)
- Add a "TransferSourcePath" config value to specify the API path to the
  Transfer Source directory where PIPs are uploaded
@github-project-automation github-project-automation bot moved this from ⏳ In Progress to 🎉 Done in Enduro Aug 22, 2024