Create packages with uniform checksum #261
Comments
Edit: Relative to Metacat's implementation: what we get from a package with uniform checksumming is the performance boost of not having to checksum the files (if we think that's a good idea). My plan (discussed later in the V2 call) is to checksum every file as it leaves Metacat by computing the checksum on the fly (see DigestInputStream). Checksumming the outgoing bytes is fairly cheap, keeps the code less complex, gives a 'more accurate' checksum (it would reflect any bit rot since the time of submission), allows flexibility in the bag checksum algorithm, and lets us (if we want to and can) perform a final validation of the streamed checksums against what exists in the system metadata.
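For illustration, here is a minimal sketch of the on-the-fly approach using `java.security.DigestInputStream`. It is not Metacat's actual code: the class name `StreamingChecksum` and the method `copyWithChecksum` are hypothetical, and the caller is assumed to supply the object's input stream and the response output stream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Metacat: checksums bytes while they are copied out.
final class StreamingChecksum {

    // Copies `in` to `out` and returns the lowercase hex digest of the copied bytes.
    // `algorithm` is any name accepted by MessageDigest, e.g. "SHA-256" or "MD5".
    static String copyWithChecksum(InputStream in, OutputStream out, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance(algorithm);
        try (DigestInputStream dis = new DigestInputStream(in, digest)) {
            byte[] buffer = new byte[8192];
            int n;
            // Every read through the DigestInputStream also updates the digest.
            while ((n = dis.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return toHex(digest.digest());
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

The returned value could then be compared with the checksum recorded in the system metadata when both use the same algorithm, which is the optional final validation mentioned above.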
@ThomasThelen the changes to rdataone mentioned here involve using the DataONE MNRead.get() call. Aren't the changes that you are making to Metacat related to the MNPackage.getPackage() call?
After testing this a bit more, I realized that the default case should be to NOT recalculate the checksum of each package member as it is downloaded from DataONE. The default case for
The new default is to not re-calculate the checksums when using
The use case for this functionality, as described above, is to create a package in which all package members use the same checksum algorithm, or an algorithm different from the original one, to allow serialization to BagIt. Note that updates to BagIt serialization will be added in the next
Note: I have not found a way using httr to have the checksum calculated as bytes are streamed to the client. If anyone knows of a way to do this, please post here.
Quick synopsis: all DataPackage members must use the same checksum algorithm in order to facilitate serialization to the BagIt format.
The details:
In order to serialize a DataPackage to BagIt format, all package members must be checksummed with the same algorithm. BagIt does not mandate a specific checksum algorithm, but each payload manifest must list every file in the bag with its checksum, all computed with the same algorithm.
To facilitate this serialization, every member of a DataPackage must store its checksum using the same checksum algorithm.
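To make the constraint concrete, here is a small sketch of how a payload manifest could be built. It assumes the BagIt convention of naming the file after its algorithm (for example `manifest-sha256.txt`) and listing one `checksum path` pair per line, with payload files under `data/`. The `BagManifest` and `Member` classes are hypothetical and are not part of datapack or rdataone.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: builds a BagIt payload manifest from members that share one algorithm.
final class BagManifest {

    // Minimal stand-in for a package member: payload path plus its recorded checksum.
    static final class Member {
        final String path;
        final String algorithm;
        final String checksum;

        Member(String path, String algorithm, String checksum) {
            this.path = path;
            this.algorithm = algorithm;
            this.checksum = checksum;
        }
    }

    // "SHA-256" -> "manifest-sha256.txt"
    static String fileName(String algorithm) {
        return "manifest-" + normalize(algorithm) + ".txt";
    }

    // Returns the manifest body; rejects any member recorded with a different algorithm.
    static String body(List<Member> members, String algorithm) {
        StringBuilder sb = new StringBuilder();
        for (Member m : members) {
            if (!normalize(m.algorithm).equals(normalize(algorithm))) {
                throw new IllegalArgumentException(
                        m.path + " uses " + m.algorithm + ", expected " + algorithm);
            }
            sb.append(m.checksum).append("  data/").append(m.path).append('\n');
        }
        return sb.toString();
    }

    private static String normalize(String algorithm) {
        return algorithm.toLowerCase(Locale.ROOT).replace("-", "");
    }
}
```

A package whose members mix MD5 and SHA-256 checksums fails here, which is why the members need a uniform algorithm before serialization.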
All workflows for creating a DataPackage must support a consistent checksum:
a. an entire package can be downloaded via getDataPackage()
b. individual members can be downloaded via getDataObject() and added to a DataPackage
c. either a. or b. can specify lazy-loading, such that the sysmeta is downloaded but not the data bytes
One way to support all these workflows is to add a checksumAlgorithm parameter to getDataObject() and getDataPackage(), with the default value being SHA-256. When DataObjects are newly created, this algorithm will be used.
When DataObjects are created from objects downloaded from DataONE, if the sysmeta of the existing object has a different checksum algorithm, then the checksum will be recalculated and stored in the sysmeta of the DataObject. If the DataObject was lazy-loaded, then a request is sent to DataONE to calculate the desired checksum for the pid, and the returned value is stored in that DataObject's sysmeta.
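The recalculation rule for downloaded objects can be sketched as follows. This is only an illustration in Java, not the proposed R signatures: `ChecksumNormalizer`, `Checksum`, and `ensureAlgorithm` are hypothetical names, and the sketch covers only the case where the data bytes are already in memory (the lazy-loaded case would need a request to DataONE instead).

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical sketch: keep the recorded checksum if its algorithm matches, else recompute.
final class ChecksumNormalizer {

    // Stand-in for the checksum fields of a system metadata record.
    static final class Checksum {
        final String algorithm;
        final String value;

        Checksum(String algorithm, String value) {
            this.algorithm = algorithm;
            this.value = value;
        }
    }

    // Returns `recorded` unchanged when it already uses `wanted`;
    // otherwise computes a new checksum over `data` with the wanted algorithm.
    static Checksum ensureAlgorithm(Checksum recorded, byte[] data, String wanted)
            throws NoSuchAlgorithmException {
        if (normalize(recorded.algorithm).equals(normalize(wanted))) {
            return recorded;
        }
        byte[] raw = MessageDigest.getInstance(wanted).digest(data);
        StringBuilder hex = new StringBuilder(raw.length * 2);
        for (byte b : raw) {
            hex.append(String.format("%02x", b));
        }
        return new Checksum(wanted, hex.toString());
    }

    // Treats "SHA-256", "sha-256" and "SHA256" as the same algorithm name.
    private static String normalize(String algorithm) {
        return algorithm.toUpperCase(Locale.ROOT).replace("-", "");
    }
}
```

Returning the recorded checksum untouched when the algorithms already match avoids reading the bytes a second time for objects that need no change.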
Here are the proposed signatures for the modified methods: