Feature: Improve validation scripts to use Go and package XSDs #22
Comments
This is already done as part of #11. However, they were generating the Bag manually, writing the manifest with the checksums from the metadata file and then validating the final Bag, which allowed them to calculate the checksums only once and validate them at the same time. Now we'll be generating the Bag automatically in the child workflow and validating it in Enduro, so we will be ignoring those metadata checksums.

To avoid generating the checksums twice, @fiver-watson suggested validating them after Bag creation, taking the generated checksums from the Bag manifest and checking them against those in metadata.xml (or UpdatedAreldaMetadata.xml); I added a TODO comment at the end of the workflow code to do that (sketch below). That will imply knowing the transformations made to the SIP during the workflow so we can match the file paths from both sources, which should be easy to do. But it adds a validation activity at the end of the workflow, after transformation and Bag creation, and I wonder if that is a good choice, considering all the other validation tasks will happen before it and will show up in the UI in that order. They are using MD5 checksums, which are not as expensive to calculate as the SHA algorithms, so I wonder if validating them before transformation makes more sense, even if that requires calculating them twice. @fiver-watson, @sallain, thoughts?
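To make that concrete, a minimal sketch of the post-bagging comparison - the manifest format follows the BagIt spec (`<checksum>  <path>` per line), and `mapPath` is a hypothetical stand-in for whatever path mapping the workflow's transformations require:

```go
package workflow

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// parseBagManifest reads a BagIt manifest (e.g. manifest-md5.txt),
// where each line is "<checksum>  <path>", into a path→checksum map.
func parseBagManifest(path string) (map[string]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	sums := make(map[string]string)
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 {
			sums[fields[1]] = fields[0]
		}
	}
	return sums, sc.Err()
}

// validateChecksums compares the checksums declared in the metadata
// file against those in the Bag manifest. mapPath is hypothetical: it
// must undo the SIP transformations applied during the workflow.
func validateChecksums(manifest, metadata map[string]string, mapPath func(string) string) error {
	for path, want := range metadata {
		got, ok := manifest[mapPath(path)]
		if !ok {
			return fmt.Errorf("%s: missing from Bag manifest", path)
		}
		if !strings.EqualFold(got, want) {
			return fmt.Errorf("%s: checksum mismatch (metadata %s, bag %s)", path, want, got)
		}
	}
	return nil
}
```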
@jraddaoui I would be okay with doing it that way. However, another factor for consideration (cc @sallain to make sure I am getting this right): during our most recent metadata WG meeting, SFA shared what they would ideally like to see us do with the metadata.xml / UpdatedArelda file - i.e. extract the useful information and turn it into other AM inputs to be included in the SIP, as well as create a version to send to the AIS (details on that to come; we are waiting on a copy of their current XSD and some example input/output files). See slide 2 in this deck from SFA:

If we proceed with this work, we will need to find and extract all the MD5 checksums early in the workflow anyway, before anything is sent to Archivematica or bagged. If that's the case, then it makes sense to me to use the checksum file we generated to quickly validate them as well, since we would now have a file relating each checksum to an object (per Archivematica's requirements for the checksum file). I haven't done all the analysis yet on these potential new ingest extraction tasks, but wanted to mention this early in case it influences how we approach this issue!
What's the point of that checksum file if we are passing a Bag? I think the Bag will be validated twice already (Enduro and AM), including the manifest files with the checksums.
A) They want to keep track of the original checksums for chain of custody. AFAIK the bagging is our process, not something they care about; they want the same checksums that were generated during the digitization process to be maintained. I'm not sure if there's a way to use the existing MD5 checksums in bag creation, but for this particular case, if we are going to bag the content, that would be an option too.
We are keeping the original metadata XML file.
We/They could use the ones in the metadata XML file too, or the ones from the Bag.
lol, just passing on what i heard - let's discuss during sync
When we talked to SFA during that metadata WG meeting, I forgot that Archivematica will be receiving zipped bags and therefore the bag's payload manifest will contain checksums, meaning that the checksum.md5 file is not needed. What SFA needs is for the incoming checksums to be validated and to be able to pass those same checksum values to AIS. This is what I propose:
Once Archivematica receives the bag, it will run a fixity check on the bag checksums, which Enduro will have already confirmed match the metadata.xml/UpdatedAreldaMetadata.xml checksums. Archivematica will generate two PREMIS events.

This actually feels pretty solid to me - checksums are validated at upload, at bag generation, when sent to Archivematica, and just before storage. As to which source Enduro then uses to pass the checksums to AIS, it doesn't really matter as long as the same algorithm is used throughout (SFA is an md5 shop) - the checksums all match, so whether AIS picks them up from the METS, the metadata file, or another source should be irrelevant. Any concerns/is there anything I've missed?
For the schema validation I thought we could try the following approach: https://github.com/lestrrat-go/libxml2#xsd-validation

This package is a Go interface binding the libxml2 C library (there are other options too); a sketch of its XSD validation API is below.
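A minimal sketch of that approach, following the XSD validation example in the package's README - the file paths are placeholders:

```go
package main

import (
	"log"
	"os"

	"github.com/lestrrat-go/libxml2"
	"github.com/lestrrat-go/libxml2/xsd"
)

func main() {
	// Parse the XSD shipped with the SIP (placeholder path).
	xsdBuf, err := os.ReadFile("sip/header/xsd/arelda.xsd")
	if err != nil {
		log.Fatal(err)
	}
	schema, err := xsd.Parse(xsdBuf)
	if err != nil {
		log.Fatal(err)
	}
	defer schema.Free()

	// Parse the metadata document to validate (placeholder path).
	xmlBuf, err := os.ReadFile("sip/header/metadata.xml")
	if err != nil {
		log.Fatal(err)
	}
	doc, err := libxml2.Parse(xmlBuf)
	if err != nil {
		log.Fatal(err)
	}
	defer doc.Free()

	// Validate and report every schema violation.
	if err := schema.Validate(doc); err != nil {
		for _, e := range err.(xsd.SchemaValidationError).Errors() {
			log.Printf("validation error: %s", e.Error())
		}
		os.Exit(1)
	}
	log.Print("document is valid")
}
```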
|
We discussed this at some point but I'm adding a note here for reference. I had a chat with @sevein about using C bindings in Go (cgo); he mentioned that it has several implications.
Considering all that, we may be better off as we are, with the benefits of running a subprocess (he mentioned the memory issues in AM running lxml/libxml2). Additionally, he suggested that we could experiment with creating a WebAssembly module and running it with wazero; a rough sketch of that idea follows.
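That experiment could look roughly like this - a sketch that assumes a hypothetical validator module (`xmlvalidate.wasm`, e.g. an xmllint-style tool compiled to WASI) taking a schema and a document as arguments; none of those names exist yet:

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/tetratelabs/wazero"
	"github.com/tetratelabs/wazero/imports/wasi_snapshot_preview1"
)

func main() {
	ctx := context.Background()

	// Create a runtime: no cgo involved, the module runs sandboxed.
	r := wazero.NewRuntime(ctx)
	defer r.Close(ctx)

	// Provide WASI so the module can use stdio and the filesystem.
	wasi_snapshot_preview1.MustInstantiate(ctx, r)

	// xmlvalidate.wasm is hypothetical: a validator compiled to WASI.
	wasmBytes, err := os.ReadFile("xmlvalidate.wasm")
	if err != nil {
		log.Fatal(err)
	}

	// Mount the SIP directory into the guest and pass placeholder args.
	cfg := wazero.NewModuleConfig().
		WithArgs("xmlvalidate", "--schema", "/sip/header/xsd/arelda.xsd", "/sip/header/metadata.xml").
		WithFSConfig(wazero.NewFSConfig().WithDirMount("sip", "/sip")).
		WithStdout(os.Stdout).
		WithStderr(os.Stderr)

	// Instantiating runs the module's main; a non-zero exit is an error.
	if _, err := r.InstantiateWithConfig(ctx, wasmBytes, cfg); err != nil {
		log.Fatalf("validation failed: %v", err)
	}
}
```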
Is your feature request related to a problem? Please describe.
Pilot project work made use of Python scripts provided by SFA. In PoC#1 we did a bit of cleanup, adding some Go wrappers and handling. However, as SFA's needs evolve, the original scripts need some updating. Rather than updating the Python and having to maintain that dependency, we should rewrite the scripts in Go so that they are aligned with Enduro's core language.
A second aspect of the validation scripts that we should amend is their use of locally stored XSDs. This was acceptable when we were only validating one type of SIP, but now we must support multiple transfer types, each of which may comply with a different version of the XSDs. We should use the XSDs included in each SIP to ensure that the correct XSD version is used; a sketch of that lookup follows.
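As a sketch, resolving the packaged XSDs could be as simple as the following - assuming the XSDs live under the SIP's header/xsd directory (that layout is an assumption and may vary by transfer type):

```go
package validate

import (
	"fmt"
	"path/filepath"
)

// findPackagedXSDs returns the XSD files shipped inside the SIP so
// they can be used for validation instead of locally stored copies.
// The header/xsd layout is an assumption and may vary by SIP type.
func findPackagedXSDs(sipRoot string) ([]string, error) {
	matches, err := filepath.Glob(filepath.Join(sipRoot, "header", "xsd", "*.xsd"))
	if err != nil {
		return nil, err
	}
	if len(matches) == 0 {
		return nil, fmt.Errorf("no packaged XSDs found in %s", sipRoot)
	}
	return matches, nil
}
```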
Describe the solution you'd like
This is a catch-all card for updating the scripts in general. The following list is a start:
More tasks can be added to this card as needed!
Describe alternatives you've considered
None
Additional context
Add any other context or screenshots about the feature request here.