Preservation bug fix: properly handle rapid updates #96

Merged: 13 commits, Aug 7, 2019

RayPlante
Collaborator

Two behaviors of the preservation service set up a race condition that can cause an update to an AIP to lose data:

  • When the preservation service has finished serializing its output bags, it copies the bag files to a storage directory for long-term storage and sends the NERDm record to the repository ingest service. In practice, the storage directory is a staging area from which files are eventually replicated to AWS-S3; this replication can take 15-30 minutes.
  • When the preservation service is directed to update a dataset, it tries to pull in the last published head bag as the seed for the update. To do this, it queries the public distribution service, which in turn looks for the bag in AWS-S3 storage.

If the update arrives before the previously saved bags have migrated to S3, the preservation service will either fail to pull over a head bag or pull over the wrong one (i.e. not the latest). The data (or updated metadata) from the missing bag is then not represented in the update and is effectively lost. When this has occurred in the past, all of the data was lost. Further, the output version and sequence number would be wrong, and the new output bags would overwrite previously saved ones.

(In practice, the data has been recoverable because versioning is turned on in the S3 bucket.)

This PR addresses this bug with three major changes:

  • The code for retrieving the previous bag and setting it up as a seed was reworked; in particular, it now looks first for the latest bag in the storage directory used for staging to S3. Thus, even if the data has not yet been migrated, we still get the latest version.
  • A check on the output bag (pre-splitting) ensures that every distribution listed in the NERDm metadata with a distribution service download URL is available, either stored in that bag itself or locatable in a previous bag.
  • The preservation service can no longer overwrite a bag of the same name in the storage directory.
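
The three changes above can be illustrated with a minimal sketch. This is not the code in this PR: the function names, the simplified `<aipid>-<seq>.zip` bag file naming, the `fetch_from_distribution` callback, and the assumption that distributions appear as NERDm `components` entries with `downloadURL` and `filepath` fields are all hypothetical and used only for illustration.

```python
import re
import shutil
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def latest_bag_in(directory: Path, aipid: str) -> Optional[Path]:
    """Return the bag with the highest sequence number for aipid, or None.

    Assumes a simplified, hypothetical naming scheme: <aipid>-<seq>.zip
    """
    if not directory.is_dir():
        return None
    pattern = re.compile(re.escape(aipid) + r"-(\d+)\.zip")
    best, best_seq = None, -1
    for path in directory.iterdir():
        m = pattern.fullmatch(path.name)
        if m and int(m.group(1)) > best_seq:
            best, best_seq = path, int(m.group(1))
    return best


def find_seed_bag(aipid: str, staging_dir: Path,
                  fetch_from_distribution: Callable[[str], Optional[Path]]
                  ) -> Optional[Path]:
    """Change 1: look in the staging directory before asking the
    distribution service, so a bag not yet replicated to S3 is still found."""
    staged = latest_bag_in(staging_dir, aipid)
    if staged is not None:
        return staged
    return fetch_from_distribution(aipid)


def missing_distributions(nerdm: dict, in_this_bag: Iterable[str],
                          in_previous_bags: Iterable[str],
                          dist_base_url: str) -> List[str]:
    """Change 2: list file paths that the metadata advertises via the
    distribution service but that no bag (current or previous) contains."""
    available = set(in_this_bag) | set(in_previous_bags)
    missing = []
    for comp in nerdm.get("components", []):
        url = comp.get("downloadURL", "")
        if url.startswith(dist_base_url) and comp.get("filepath") not in available:
            missing.append(comp.get("filepath"))
    return missing


def store_bag(bagfile: Path, staging_dir: Path) -> Path:
    """Change 3: copy a bag into the staging directory, refusing to
    overwrite an existing file of the same name."""
    dest = staging_dir / bagfile.name
    # Mode "xb" fails with FileExistsError if dest already exists.
    with open(bagfile, "rb") as src, open(dest, "xb") as out:
        shutil.copyfileobj(src, out)
    return dest
```

In this sketch, a non-empty result from `missing_distributions` or a `FileExistsError` from `store_bag` would be treated as a reason to abort the preservation request rather than publish an incomplete or clobbering update.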

@RayPlante
Collaborator Author

RayPlante commented Aug 7, 2019

Self-tested under oar-docker demo:

  1. Reproduced the error with the previous release of oar-pdr by removing files from the distribution service's data directory.
  2. With this PR branch deployed, ran under the same conditions and confirmed that the error is avoided.

Will merge for further testing on testdata.

@RayPlante RayPlante merged commit 71c9c67 into integration Aug 7, 2019