Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: validate file checksums from XML metadata #40

Closed
jraddaoui opened this issue Aug 7, 2024 · 4 comments · Fixed by #54
Closed

Feature: validate file checksums from XML metadata #40

jraddaoui opened this issue Aug 7, 2024 · 4 comments · Fixed by #54

Comments

@jraddaoui
Copy link
Contributor

Is your feature request related to a problem? Please describe.

The initial Python script used for Bag creation (here) was creating the Bag manually, using the checksums from the XML metadata file to create the manifest, and then validating the final Bag with bagit-python.

We changed the Bag creation process to use go-bagit, that creates the manifest automatically so we are loosing that validation.

We are adding an activity to check the existence of the files from XML metadata in the SIP (here), but that is not currently validating those checksums either.

Describe the solution you'd like

Add checksum validation in a new activity or include it in the activity mentioned above.

Additional context

See the discussion about where should we place that validation, we may want to do it alongside all the validation tasks (generating new checksums) or after Bag creation (re-using the checksums generated in that process):

#22 (comment)

@jraddaoui jraddaoui added this to Enduro Aug 7, 2024
@jraddaoui jraddaoui moved this to 👍 Ready in Enduro Aug 7, 2024
@djjuhasz djjuhasz self-assigned this Sep 4, 2024
@djjuhasz djjuhasz moved this from 👍 Ready to ⏳ In Progress in Enduro Sep 4, 2024
@djjuhasz
Copy link
Contributor

djjuhasz commented Sep 4, 2024

I think we should verify the manifest checksums against the actual file checksums right after verifying the file manifest. While checking the checksums early requires generating the checksums from the file contents an extra time (in addition to when the bag is created), I think it's still better because if the checksums don't match the manifest we can stop processing before transforming the SIP. Stopping the workflow early saves the work of restructuring the SIP, writing the premis.xml file, and bagging the SIP. Stopping early also means the original SIP will be sent to the "failed" bucket rather than a modified version of the SIP with a partially complete premis.xml file.

It might be better to not parse the metadata.xml / UpdatedAreldaMetatdata.xml file twice (in "verify manifest" and "verify checksum" activities), but I'll have to consider alternatives.

One option would be for the "verify metadata" activity to write a metadata.csv file in the format recognized by Archivematica, and then "verify checksum" would read that file. The downside of this solution is verify checksums still has to parse metadata.csv, and it's a pretty arbitrary format considering that Archivematica would never receive this particular metadata.csv as we are returning a bag, with bag checksum metadata.

Another possible option would be to return a file list with checksums from the "verify manifest" activity, though this could run into activity payload limit for very large SIPs. 🤔

@jraddaoui
Copy link
Contributor Author

I think we should verify the manifest checksums against the actual file checksums right after verifying the file manifest. While checking the checksums early requires generating the checksums from the file contents an extra time (in addition to when the bag is created), I think it's still better because if the checksums don't match the manifest we can stop processing before transforming the SIP. Stopping the workflow early saves the work of restructuring the SIP, writing the premis.xml file, and bagging the SIP. Stopping early also means the original SIP will be sent to the "failed" bucket rather than a modified version of the SIP with a partially complete premis.xml file.

Sounds good to me, reusing the Bag checksums could be an optimization to do later on, when/if is needed. One small detail, right now if the SIP fails in preprocessing, we send the original SIP to the failed bucket.


It might be better to not parse the metadata.xml / UpdatedAreldaMetatdata.xml file twice (in "verify manifest" and "verify checksum" activities), but I'll have to consider alternatives.

One option would be for the "verify metadata" activity to write a metadata.csv file in the format recognized by Archivematica, and then "verify checksum" would read that file. The downside of this solution is verify checksums still has to parse metadata.csv, and it's a pretty arbitrary format considering that Archivematica would never receive this particular metadata.csv as we are returning a bag, with bag checksum metadata.

Another possible option would be to return a file list with checksums from the "verify manifest" activity, though this could run into activity payload limit for very large SIPs. 🤔

Why not do it all from the same "verify manifest" activity?


Another thought, after what point having all the errors returned by the activity becomes unnecessary? Looking at options to reduce payload size, as a user I think I'd never read more than 5 to 10 missing file or wrong checksum errors. I'd check with @fiver-watson to see if we could format these errors differently. For example, if there are 1000 missing files, just return the first 100 followed by "... and 900 missing files more". And the same for missing metadata and wrong checksum errors.

@djjuhasz
Copy link
Contributor

djjuhasz commented Sep 5, 2024

Sounds good to me, reusing the Bag checksums could be an optimization to do later on, when/if is needed. One small detail, right now if the SIP fails in preprocessing, we send the original SIP to the failed bucket.

Ah, good to know. I still think it's better to avoid doing unnecessary SIP transformation work when the checksums don't validate.

Why not do it all from the same "verify manifest" activity?

🤯 Uh, I guess that would work. 😆 I wanted to verify checksums in a separate activity based on the principle of one function doing one thing, but I guess that doesn't always make sense for Temporal activities where there's an overhead for scheduling and orchestration. I also thought a separate verify checksum activity would be a candidate for making a general activity we could use in other projects, but the SFA requirements are very specific to their metadata format. Integrating verifying the checksums with the verify manifest activity will make it harder to move or remove the verify checksum portion in the future, but I think the benefits outweigh this small detriment.

Another thought, after what point having all the errors returned by the activity becomes unnecessary? Looking at options to reduce payload size, as a user I think I'd never read more than 5 to 10 missing file or wrong checksum errors. I'd check with @fiver-watson to see if we could format these errors differently. For example, if there are 1000 missing files, just return the first 100 followed by "... and 900 missing files more". And the same for missing metadata and wrong checksum errors.

Sounds reasonable to me. I'll wait for @fiver-watson or @sallain to weigh in on this option.

djjuhasz added a commit that referenced this issue Sep 6, 2024
Fixes #40

- Add checksum verification to the "verify manifest" activity
- Move "internal/workflow/testdata/little-Test-AIP-Digitization" to
  "internal/testdata/little-Test-AIP-Digitization"
- Update "tesdata/little-Test-AIP-Digitization" file hashes to match
  file contents
- Add a verify checksums event to the preprocessing workflow
djjuhasz added a commit that referenced this issue Sep 6, 2024
Fixes #40

- Add checksum verification to the "verify manifest" activity
- Move "internal/workflow/testdata/little-Test-AIP-Digitization" to
  "internal/testdata/little-Test-AIP-Digitization"
- Update "tesdata/little-Test-AIP-Digitization" file hashes to match
  file contents
- Add a verify checksums event to the preprocessing workflow
djjuhasz added a commit that referenced this issue Sep 6, 2024
Fixes #40

- Add checksum verification to the "verify manifest" activity
- Move "internal/workflow/testdata/little-Test-AIP-Digitization" to
  "internal/testdata/little-Test-AIP-Digitization"
- Update "tesdata/little-Test-AIP-Digitization" file hashes to match
  file contents
- Add a verify checksums event to the preprocessing workflow
djjuhasz added a commit that referenced this issue Sep 6, 2024
Fixes #40

- Switch to `manifest.Files()` to parse SFA manifest files more
  efficiently
- Add checksum verification to the "verify manifest" activity
- Move "internal/workflow/testdata/little-Test-AIP-Digitization" to
  "internal/testdata/little-Test-AIP-Digitization"
- Update "tesdata/little-Test-AIP-Digitization" file hashes to match
  file contents
- Add a verify checksums event to the preprocessing workflow
djjuhasz added a commit that referenced this issue Sep 6, 2024
Fixes #40.

- Switch to `manifest.Files()` to parse SFA manifest files more
  efficiently
- Add checksum verification to the "verify manifest" activity
- Move "internal/workflow/testdata/little-Test-AIP-Digitization" to
  "internal/testdata/little-Test-AIP-Digitization"
- Update "tesdata/little-Test-AIP-Digitization" file hashes to match
  file contents
- Add a verify checksums event to the preprocessing workflow
djjuhasz added a commit that referenced this issue Sep 6, 2024
Fixes #40.

- Switch to `manifest.Files()` to parse SFA manifest files more
  efficiently
- Add checksum verification to the "verify manifest" activity
- Move "internal/workflow/testdata/little-Test-AIP-Digitization" to
  "internal/testdata/little-Test-AIP-Digitization"
- Update "tesdata/little-Test-AIP-Digitization" file hashes to match
  file contents
- Add a verify checksums event to the preprocessing workflow
djjuhasz added a commit that referenced this issue Sep 11, 2024
Fixes #40.

- Switch to `manifest.Files()` to parse SFA manifest files more
  efficiently
- Add checksum verification to the "verify manifest" activity
- Move "internal/workflow/testdata/little-Test-AIP-Digitization" to
  "internal/testdata/little-Test-AIP-Digitization"
- Update "tesdata/little-Test-AIP-Digitization" file hashes to match
  file contents
- Add a verify checksums event to the preprocessing workflow
djjuhasz added a commit that referenced this issue Sep 11, 2024
Fixes #40.

- Switch to `manifest.Files()` to parse SFA manifest files more
  efficiently
- Add checksum verification to the "verify manifest" activity
- Move "internal/workflow/testdata/little-Test-AIP-Digitization" to
  "internal/testdata/little-Test-AIP-Digitization"
- Update "tesdata/little-Test-AIP-Digitization" file hashes to match
  file contents
- Add a verify checksums event to the preprocessing workflow
@github-project-automation github-project-automation bot moved this from ⏳ In Progress to 🎉 Done in Enduro Sep 11, 2024
@aseles13
Copy link

aseles13 commented Oct 2, 2024

Tested and Enduro is now logging materials where the checksum is missing or incorrect

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants