Create packages with uniform checksum #261

Closed
gothub opened this issue Oct 1, 2020 · 3 comments

@gothub
Collaborator

gothub commented Oct 1, 2020

Quick synopsis: all DataPackage members must use the same checksum algorithm in order to facilitate serialization to the BagIt format.

The details:
In order to serialize a DataPackage to BagIt format, all package members must use the same checksum algorithm. BagIt does not mandate a specific algorithm, but each payload manifest must list every file in the bag with its checksum, all computed using the same algorithm.

In order to facilitate this serialization, a DataPackage must have all package members store their member checksum using the same checksum algorithm.
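
A minimal sketch of the uniformity requirement, assuming the package members can be summarized as a data frame of identifiers and sysmeta checksum algorithms. The data frame, the identifiers, and the helper names (normalizeAlgorithm, hasUniformChecksum, membersToRecalculate, manifestFileName) are hypothetical and not part of the dataone API. The manifest file name follows the BagIt convention manifest-<algorithm>.txt.

```r
# Hypothetical stand-in for package members: each row holds an identifier and
# the checksum algorithm recorded in that member's system metadata.
members <- data.frame(
  identifier = c("urn:uuid:aaa", "urn:uuid:bbb", "urn:uuid:ccc"),
  checksumAlgorithm = c("SHA-256", "MD5", "SHA-256"),
  stringsAsFactors = FALSE
)

# Normalize names so that "SHA-256", "sha256" and "SHA256" compare equal.
normalizeAlgorithm <- function(x) toupper(gsub("[^[:alnum:]]", "", x))

# TRUE when every member uses the same algorithm.
hasUniformChecksum <- function(members) {
  length(unique(normalizeAlgorithm(members$checksumAlgorithm))) <= 1
}

# Identifiers of members whose algorithm differs from the requested one.
membersToRecalculate <- function(members, algorithm = "SHA-256") {
  differs <- normalizeAlgorithm(members$checksumAlgorithm) !=
    normalizeAlgorithm(algorithm)
  members$identifier[differs]
}

# BagIt payload manifest file name for a given algorithm.
manifestFileName <- function(algorithm) {
  paste0("manifest-", tolower(normalizeAlgorithm(algorithm)), ".txt")
}

uniform <- hasUniformChecksum(members)
pending <- membersToRecalculate(members, "SHA-256")
manifest <- manifestFileName("SHA-256")
```

Here one member still carries an MD5 checksum, so the package is not yet uniform and that member would need a SHA-256 checksum before a single manifest could list all files.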

All workflows for creating a DataPackage must support a consistent checksum algorithm:

  1. Create a package from local files
  2. Create a package by downloading from DataONE
    a. an entire package can be downloaded via getDataPackage()
    b. individual members can be downloaded via getDataObject() and added to a DataPackage
    c. either a. or b. can specify lazy-loading, such that the sysmeta is downloaded but not the data bytes
  3. Create a package using a combination of 1. and 2.

One way to support all these workflows is to add a checksumAlgorithm parameter to

  • getDataObject()
  • getDataPackage()

with the default value being SHA-256. When DataObjects are newly created, this algorithm will be used.
When DataObjects are created from objects downloaded from DataONE, if the sysmeta of the existing object uses a different checksum algorithm, then the checksum will be recalculated with the requested algorithm and stored in the sysmeta of the DataObject. If the DataObject was lazy-loaded, then a request is sent to DataONE to calculate the desired checksum for the pid, and the returned value is stored in that DataObject's sysmeta.

Here are the proposed signatures for the modified methods:

setMethod("getDataObject", "D1Client", function(x, identifier, lazyLoad=FALSE, limit="1MB", quiet=TRUE,
                                                checksumAlgorithm="SHA-256")
setMethod("getDataPackage", "D1Client", function(x, identifier, lazyLoad=FALSE, limit="1MB", quiet=TRUE,
                                                 checksumAlgorithm="SHA-256")
@gothub gothub added this to the 2.2.0 milestone Oct 1, 2020
@gothub gothub self-assigned this Oct 1, 2020
@ThomasThelen
Member

ThomasThelen commented Oct 1, 2020

Edit: Relative to Metacat's implementation
I think that creating data packages that use a uniform checksum algorithm in their system metadata is in general a good idea. I don't think that designing the BagIt stuff to expect/need them to be uniform is a good idea. Even when this change is made, we're still left with n data packages whose system metadata uses a mix of checksum algorithms (which means we'll have to write code to support them anyway).

What we get from a package with uniform checksumming is the performance boost of not having to checksum the files (if we think that's a good idea).

My plan (discussed later in the V2 call) is to checksum every file as it's leaving Metacat by creating the checksum on the fly (see DigestInputStream). Checksumming the leaving bytes is fairly cheap, keeps the code less complex, gives a 'more accurate' checksum (what if there was bit rot since the time of submission), allows for flexibility in the bag checksum algorithm, and gives us the ability to (if we want/can) perform a final validation of streamed checksums vs what exists in the system metadata.
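
The final validation step mentioned above can be sketched in R, with some caveats. This is not the Metacat approach (which uses Java's DigestInputStream on the outgoing stream): it hashes a file already on disk, and it uses MD5 only because base R ships tools::md5sum. The file content and the helper name verifyChecksum are hypothetical.

```r
# Stand-in for an object's bytes on disk (hypothetical content).
path <- tempfile(fileext = ".csv")
writeBin(charToRaw("site,count\nA,1\nB,2\n"), path)

# Checksum as it would have been recorded in the system metadata at submission.
recorded <- unname(tools::md5sum(path))

# Compare the checksum of the bytes as they are now with the recorded value.
verifyChecksum <- function(path, recorded) {
  identical(unname(tools::md5sum(path)), recorded)
}

intactBefore <- verifyChecksum(path, recorded)

# Simulate bit rot by flipping one bit in the first byte, then verify again.
bytes <- readBin(path, what = "raw", n = file.size(path))
bytes[1] <- as.raw(bitwXor(as.integer(bytes[1]), 1L))
writeBin(bytes, path)

intactAfter <- verifyChecksum(path, recorded)
```

A mismatch like the second check is the case that a checksum computed on the leaving bytes would catch and a checksum copied from the system metadata would not.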

@gothub
Collaborator Author

gothub commented Nov 12, 2020

@ThomasThelen the changes to rdataone mentioned here involve using the DataONE MNRead.get() call.

Aren't the changes that you are making to Metacat related to the MNPackage.getPackage() call?
If this is true, then the changes mentioned here aren't affected by your changes.

@gothub
Collaborator Author

gothub commented Nov 18, 2020

After testing this a bit more, I realized that the default case should be to NOT recalculate the checksum of each package member as it is downloaded from DataONE. Previously, the default for getDataPackage() was to recalculate checksums as "SHA-256" (the new dataone package default) if they were not already using this checksum algorithm.

The new default is to not re-calculate the checksums when using getDataPackage() or getDataObject(), as this could cause long processing delays for packages with many members. If using one or both of these functions, users will have to specify a checksum algorithm to use if they wish to have the checksum recalculated and stored in the sysmeta of package members.
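
The resulting behavior can be sketched as a small decision function. This is an illustration only, not the dataone implementation: checksumAction is a hypothetical helper, and using NA to mean "no algorithm specified" is an assumption, since the thread does not say which sentinel the package uses.

```r
# Hypothetical helper, not part of the dataone package API.
# currentAlgorithm:   algorithm recorded in the member's sysmeta
# requestedAlgorithm: value of checksumAlgorithm; NA (assumed sentinel) means
#                     the user did not ask for recalculation
# lazyLoaded:         TRUE when only the sysmeta was downloaded, not the bytes
checksumAction <- function(currentAlgorithm, requestedAlgorithm = NA,
                           lazyLoaded = FALSE) {
  normalize <- function(x) toupper(gsub("[^[:alnum:]]", "", x))
  # Default: leave the existing checksum untouched.
  if (is.na(requestedAlgorithm)) return("keep")
  # Nothing to do when the member already uses the requested algorithm.
  if (normalize(currentAlgorithm) == normalize(requestedAlgorithm)) return("keep")
  # Without local bytes, the checksum has to be requested from DataONE.
  if (lazyLoaded) return("request-from-DataONE")
  "recalculate-locally"
}

defaultCase <- checksumAction("MD5")
localCase <- checksumAction("MD5", "SHA-256")
lazyCase <- checksumAction("MD5", "SHA-256", lazyLoaded = TRUE)
sameCase <- checksumAction("sha256", "SHA-256")
```

Only the second and third cases cost extra processing or an extra request, which is why leaving the algorithm unspecified avoids delays for packages with many members.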

The use case for this functionality, as described above, is to create a package in which all members use the same checksum algorithm, or an algorithm different from the original, to allow serialization to BagIt. Note that updates to BagIt serialization will be added in the next dataone release, so this recalculating functionality might not be exercised by users until then (but it's ready now).

Note: I have not found a way using httr to have the checksum calculated as bytes are streamed to the client. If anyone knows of a way to do this, please post here.

@gothub gothub closed this as completed Nov 18, 2020