Problem: hidden files interfere with bag validation #850

Closed
sallain opened this issue Feb 1, 2024 · 8 comments

Comments

@sallain
Collaborator

sallain commented Feb 1, 2024

Is your feature request related to a problem? Please describe.

When submitting a transfer in the bag format, the bag validation check in Archivematica/a3m will fail if there are hidden files in the bag because the hidden files are not included in the bag manifest. This is a particular issue for Mac users, since Macs often add dotfiles (e.g. .DS_Store). This can be remedied by the user by manually removing hidden files from the bags before they are transferred; however, this is both cumbersome and limiting, since the dotfiles can be created every time the user interacts with a file.

A bag validation failure in Archivematica stops the transfer process altogether, so the user has to identify the bag that errored out, remove the hidden files, and restart the ingest.

Describe the solution you'd like

I'd like to prevent any hidden files from being transferred with the bag. The solution should check for and remove hidden files before the bag transfer is ingested into Archivematica/a3m.

In Legacy Enduro, this is done at the point when the bag is copied from the transfer source location to the processing location. Any file beginning with a . is not copied.
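A minimal sketch of the "do not copy dotfiles" approach for an unzipped bag, using only the Python standard library. This is an illustration, not Legacy Enduro's actual code; the function name `copy_bag_without_dotfiles` is hypothetical.

```python
import shutil
from pathlib import Path


def copy_bag_without_dotfiles(src, dst):
    """Copy the directory tree at src to dst, skipping hidden entries.

    Assumption: "hidden" means any file or directory whose name starts
    with a dot. Hidden directories are skipped along with their contents.
    """
    # ignore_patterns takes fnmatch-style globs; ".*" matches names
    # that begin with a dot, at every level of the tree.
    ignore_hidden = shutil.ignore_patterns(".*")
    return Path(shutil.copytree(src, dst, ignore=ignore_hidden))
```

Note that this skips every dotfile, including any that were present when the bag was created and are therefore listed in its manifest.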

This feature should be configurable, so that users can keep hidden files if they choose. It would also be preferable to let users edit the list of files that should be removed/ignored, as is done in Archivematica/a3m's Remove hidden files and directories and Remove unneeded files jobs.
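One way the configurable behaviour could look, sketched in Python. The `RemoveFilesConfig` class, its field names, and the default pattern are all assumptions made for illustration; they do not reflect Enduro's or Archivematica's real configuration.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass
class RemoveFilesConfig:
    # Hypothetical settings: a switch to keep hidden files, plus a
    # user-editable list of glob patterns for names to remove.
    enabled: bool = True
    patterns: list[str] = field(default_factory=lambda: [".*"])


def should_remove(name: str, cfg: RemoveFilesConfig) -> bool:
    """Return True if a file with this name should be removed or skipped."""
    if not cfg.enabled:
        return False
    return any(fnmatch(name, pattern) for pattern in cfg.patterns)
```

For example, a user who also wants to drop Windows thumbnail caches could set `patterns=[".*", "Thumbs.db"]`, while `enabled=False` keeps everything.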

Describe alternatives you've considered

The manual method mentioned above does work but it is susceptible to human error, and might need to be repeated should the user have a need to look at the files in the bag.

Additional context

Note that this is only a requirement for bagged transfers. For standard and other non-bagged transfer types, Archivematica/a3m remove hidden files as a matter of course during the early stages of processing.

The client for whom this has been an issue uses unzipped bags, which both lends itself to the problem manifesting AND provides the easy solution of simply not copying dotfiles, as is done in Legacy Enduro. I'm not sure how the issue would be dealt with in a zipped bag, where the whole bag is copied as a single entity. Perhaps focusing on the unzipped bag example is the easiest starting point.

@sallain sallain changed the title from "Feature: prevent hidden files from interfering with bag validation" to "Problem: hidden files interfere with bag validation" Feb 2, 2024
@jhsimpson

@sallain should there be a premis event recorded for the file removal?

@djjuhasz
Collaborator

djjuhasz commented Feb 5, 2024

Note that, due to issue #845, Enduro cannot currently process unzipped Bags that are uploaded via MinIO. Enduro should be able to process unzipped Bags using the filesystem watcher, but I've never tested this option to confirm that it works.

@djjuhasz
Collaborator

djjuhasz commented Feb 5, 2024

@sallain should there be a premis event recorded for the file removal?

I think this raises an interesting point. If a hidden file (e.g. .DS_Store) is present when the Bag is created, then I think BagIt will add the file to the Bag manifest and checksum files, and in this case we should not remove the hidden file because it will cause validation to fail. If a hidden file is added after the Bag is created, then it will have no record in the Bag manifest or checksum files, so it must be removed for the Bag to validate.

In the second case, I don't think there is any need to add a PREMIS event about the removal of the hidden file - the file clearly was not meant to be part of the transfer payload.
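The distinction above (keep hidden files that the Bag lists, remove those added afterwards) could be sketched as follows. This is a rough illustration under stated assumptions, not the implementation used by Enduro: the function names are hypothetical, manifest lines are assumed to be `checksum path` separated by whitespace, and percent-encoded paths are not decoded.

```python
from pathlib import Path


def listed_paths(bag_dir):
    """Collect every path named in the bag's manifest and tag manifest files."""
    listed = set()
    # Matches both manifest-<alg>.txt and tagmanifest-<alg>.txt.
    for manifest in Path(bag_dir).glob("*manifest-*.txt"):
        for line in manifest.read_text(encoding="utf-8").splitlines():
            parts = line.split(None, 1)  # checksum, then the rest is the path
            if len(parts) == 2:
                listed.add(parts[1].strip())
    return listed


def remove_unlisted_hidden_files(bag_dir):
    """Delete dotfiles that no manifest lists; return their relative paths."""
    bag_dir = Path(bag_dir)
    listed = listed_paths(bag_dir)
    removed = []
    # sorted() materialises the matches before anything is deleted.
    for path in sorted(bag_dir.rglob(".*")):
        rel = path.relative_to(bag_dir).as_posix()
        if path.is_file() and rel not in listed:
            path.unlink()
            removed.append(rel)
    return removed
```

The returned list of removed paths could feed a log entry or an event record if the removal needs to be documented. Only files whose own name starts with a dot are considered; non-dot files inside hidden directories are left alone in this sketch.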

@sallain
Collaborator Author

sallain commented Feb 5, 2024

@djjuhasz @jhsimpson The use case as described certainly falls into the latter category, and you could go further and say that the hidden files are both unexpected and unwanted. However, in my opinion removing them is still a material change to the bag as deposited, regardless of whether the user intended the hidden file to be there. So the question is: do we need to record that the system made this change in order for the system to be a responsible steward of this data?

Along with removing files, though, in order to record a PREMIS event the system would also need to ADD a file to the transfer. The current mechanism for recording an external PREMIS event is to create a premis.xml file, which contains the event. The premis.xml file is stored in the transfer's metadata directory, and the contents are parsed into the METS file. This prompts more questions:

  • Is it worth it to further disrupt the bag by adding a file? Because we are dealing with bags here, we would need to create a new metadata directory, if needed, and create and add the premis.xml file and then rebag the contents so that the manifests are up to date.
  • If the file no longer exists, what will the PREMIS event connect to? We currently have to write PREMIS events at the file level; we cannot write them at the intellectual entity level (PREMIS 3 permits this but our PREMIS implementation does not). The hidden files would be removed before the transfer gets to a3m/Archivematica, so the preservation system will not know about them. I don't believe that we can currently include fileSec information in the premis.xml file.

So, is there enough value added by recording the file removal to say yes to question 1 and figure out a solution to question 2? Would love to hear your thoughts.

We don't need to copy the Archivematica way of doing it, but it's worth noting that Archivematica does not create a PREMIS event for the removal of hidden files.

@djjuhasz djjuhasz self-assigned this Feb 8, 2024
@djjuhasz
Collaborator

djjuhasz commented Feb 22, 2024

I've done a bunch of work on this issue on branch dev/issue-850-remove-hidden-files, but it's turned into a giant PR. I'm going to start again from main, and break up the changes into a number of smaller issues and PRs:

@djjuhasz djjuhasz removed their assignment Mar 28, 2024
@djjuhasz
Collaborator

@Diogenesoftoronto developed https://github.com/artefactual-sdps/remove-files-activity to remove hidden files from a transfer. Because we are planning to use that script in preprocessing as a child workflow, there's no need to re-implement it here.

@aseles13

aseles13 commented Apr 3, 2024

Does this issue need to be closed? Or are there things that still need to be done? @sallain

@djjuhasz
Collaborator

djjuhasz commented Apr 11, 2024

I've created artefactual-sdps/temporal-activities#2 for a new implementation of the remove hidden files activity.
