Fixity in Sufia with Fedora 4
Fedora 4 provides Fixity checks out of the box. Below is a description of the steps to update Sufia's implementation of Fixity checks to use Fedora 4's built-in mechanism.
In Sufia we use a queue to schedule Fixity checks on documents. The queueing and execution are done via Resque.
The code to queue a job is in sufia-models/app/models/concerns/sufia/generic_file/audit.rb
and it is (conditionally?) executed when users view a file
def audit(version, force = false)
  Sufia.queue.push(AuditJob.new(version.pid, version.dsid, version.versionID))
end
A background worker picks jobs from the queue and executes them. This code is in sufia-models/app/jobs/audit_job.rb:
def run
  datastream = generic_file.datastreams[datastream_id]
  version = datastream.versions.select { |v| v.versionID == version_id }.first
  log = run_audit(version)
end

def run_audit(version)
  object.class.run_audit(version)
end
The call to object.class.run_audit(version)
goes back to sufia-models/app/models/concerns/sufia/generic_file/audit.rb
to perform the actual Fixity check via dsChecksumValid:
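# Class-level method, invoked as object.class.run_audit(version)
def run_audit(version)
  if version.dsChecksumValid
    # ... (omitted; sets passing for a valid checksum)
  else
    # ... (omitted; sets passing for an invalid checksum)
  end
  check = ChecksumAuditLog.create!(pass: passing, pid: version.pid, dsid: version.dsid, version: version.versionID)
  check
end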
The call to version.dsChecksumValid
in turn goes to Rubydora https://github.com/projecthydra/rubydora/blob/master/lib/rubydora/datastream.rb#L54
def dsChecksumValid
profile(:validateChecksum=>true)['dsChecksumValid']
end
The profile method is also defined in Rubydora's Datastream class https://github.com/projecthydra/rubydora/blob/master/lib/rubydora/datastream.rb#L256-268
# Retrieve the datastream profile as a hash (and cache it)
# @param opts [Hash] :validateChecksum if you want fedora to validate the checksum
# @return [Hash] see Fedora #getDatastream documentation for keys
def profile opts= {}
  if @profile && !(opts[:validateChecksum] && !@profile.has_key?('dsChecksumValid'))
    ## Force a recheck of the profile if they've passed :validateChecksum and we don't have dsChecksumValid
    return @profile
  end
  return @profile = {} unless digital_object.respond_to? :repository
  @profile = repository.datastream_profile(pid, dsid, opts[:validateChecksum], asOfDateTime)
end
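The caching condition in profile is easy to misread, so here is a minimal restatement of it as a standalone predicate. The name use_cached_profile? is hypothetical and is not part of Rubydora; it only mirrors the if condition above.

```ruby
# Hypothetical helper (not part of Rubydora): returns true when the cached
# profile can be reused, false when a fresh request to Fedora is needed.
def use_cached_profile?(cached, opts = {})
  # Nothing cached yet, so the profile has to be fetched.
  return false if cached.nil?

  # Refetch only when the caller asked for checksum validation and the
  # cached profile does not already contain the dsChecksumValid key.
  !(opts[:validateChecksum] && !cached.key?('dsChecksumValid'))
end

puts use_cached_profile?({ 'dsChecksumValid' => true }, { validateChecksum: true })
puts use_cached_profile?({ 'dsLabel' => 'content' }, { validateChecksum: true })
```

In other words, the first call to dsChecksumValid on a datastream whose profile was cached without validation goes back to Fedora, and later calls reuse the cached answer.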
The following quote describes how the checksumming process works in Fedora 2.2 and seems to be relevant for Fedora 3:
When automatic checksumming is enabled, whenever a object is ingested into Fedora, as each datastream is processed, all of the bytes comprising the content of the datastream are passed to the appropriate checksumming algorithm. This algorithm will compute and return a digital signature for the content of the datastream. [...] These computed datastream checksums will then be stored in the XML representation of the digital object. Additionally, whenever a new datastream is added to an existing object (via addDatastream), and whenever a existing datastream is modified (via modifyDatastreamByValue or modifyDatastreambyReference) a new checksum will be computed and stored in the object. http://www.fedora.info/download/2.2/userdocs/server/features/checksumming.html
TODO: I am not sure yet how dsChecksumValid in Rubydora (Fedora 3?) is different from the new Fixity endpoint in Fedora 4.
Fedora 4 provides a built-in mechanism to perform Fixity checks.
If your repository is configured to retain multiple copies of binary content, when you request a fixity check of that content, Fedora will run fixity checks against each copy it stores. It will also "self-heal" all copies of the content, if it has a good copy available. https://wiki.duraspace.org/display/FF/Durability
We can execute these checks via calls to the HTTP API as described on this page https://wiki.duraspace.org/display/FF/RESTful+HTTP+API#RESTfulHTTPAPI-Fixity. The basic call is an HTTP GET request to the path of a document plus the "/fcr:fixity" suffix, for example:
HTTP GET http://someserver/rest/path/to/document/fcr:fixity
The response will include the result of the check, either SUCCESS or BAD_CHECKSUM (Question: are there kinds of failures other than BAD_CHECKSUM?)
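As a rough sketch of how we might issue this call from Ruby, using only the standard library: the helper names fixity_uri and request_fixity are hypothetical, and the Accept header is an assumption on my part (I have not confirmed which content types Fedora 4 will negotiate for this endpoint).

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build the fixity URL for a datastream path by
# appending the "/fcr:fixity" suffix described above.
def fixity_uri(base_url, datastream_path)
  base = base_url.sub(%r{/+\z}, '')
  path = datastream_path.gsub(%r{\A/+|/+\z}, '')
  URI.parse("#{base}/#{path}/fcr:fixity")
end

# Hypothetical helper: issue the GET and return [http_status_code, body].
def request_fixity(base_url, datastream_path)
  uri = fixity_uri(base_url, datastream_path)
  request = Net::HTTP::Get.new(uri)
  # Assumption: asking for RDF/XML yields a response like the samples on this page.
  request['Accept'] = 'application/rdf+xml'
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  [response.code.to_i, response.body]
end

puts fixity_uri('http://someserver/rest', 'path/to/document')
```

Keeping the URL construction separate from the network call makes the former easy to test without a running Fedora instance.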
Below is a partial XML+RDF response for a successful Fixity check. Notice the SUCCESS text in the status element.
<rdf:Description rdf:about="http://localhost:8080/rest/good_datastream#fixity/1408035440423">
<status xmlns="http://fedora.info/definitions/v4/repository#" rdf:datatype="http://www.w3.org/2001/XMLSchema#string">SUCCESS</status>
<hasMessageDigest xmlns="http://www.loc.gov/premis/rdf/v1#" rdf:resource="urn:sha1:fef212288b3c2423ab2c43a39f73031de6c0b057"/>
<hasSize xmlns="http://www.loc.gov/premis/rdf/v1#" rdf:datatype="http://www.w3.org/2001/XMLSchema#int">38</hasSize>
</rdf:Description>
Below is a partial XML+RDF response for a Fixity check that found errors. Notice the BAD_CHECKSUM text in the status element.
<rdf:Description rdf:about="http://localhost:8080/rest/bad_datastream#fixity/1408035437845">
<status xmlns="http://fedora.info/definitions/v4/repository#" rdf:datatype="http://www.w3.org/2001/XMLSchema#string">BAD_CHECKSUM</status>
<hasMessageDigest xmlns="http://www.loc.gov/premis/rdf/v1#" rdf:resource="urn:sha1:1ad61d5f1cdacac66c10051ab55c37130aa849c1"/>
<hasSize xmlns="http://www.loc.gov/premis/rdf/v1#" rdf:datatype="http://www.w3.org/2001/XMLSchema#int">27</hasSize>
</rdf:Description>
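To read the status out of responses like the two above, something along these lines might work. It is only a sketch: fixity_statuses is a hypothetical name, it relies on REXML, and it assumes the body is a complete RDF/XML document in which every check result appears as a status element in the Fedora repository namespace shown in the samples.

```ruby
require 'rexml/document'

# Namespace of the <status> element, taken from the sample responses.
FEDORA_NS = 'http://fedora.info/definitions/v4/repository#'

# Hypothetical helper: return the text of every Fedora <status> element
# found in an RDF/XML fixity response, e.g. ["SUCCESS"] or ["BAD_CHECKSUM"].
def fixity_statuses(rdf_xml)
  doc = REXML::Document.new(rdf_xml)
  REXML::XPath.match(doc, '//f:status', { 'f' => FEDORA_NS }).map { |el| el.text.to_s.strip }
end
```

Returning a list rather than a single value leaves room for responses that report more than one result, for example when Fedora checks several stored copies of the same content.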
The API returns HTTP 200 status code for both successful and failed Fixity checks.
If the path/to/document in the HTTP GET points to a non-existing document the response will have an HTTP 404 status code and the body of the response will be an HTML page (that we probably want to ignore.)
You can only execute Fixity checks on "datastream" nodes. Attempting to execute a Fixity check on an "object" node will result in an HTTP 404 response. This shouldn't be a problem for us since we should only need to execute Fixity checks on datastreams. It would be nicer if Fedora would return HTTP 400 (Bad Request) or 415 (Unsupported Media Type) in these instances, though.
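Since the HTTP status code alone does not tell us whether a check passed, a caller has to combine it with the status values from the body. A minimal sketch of that decision follows. The function name and the outcome symbols are hypothetical, statuses is assumed to be the list of status strings already extracted from the response body, and any status other than SUCCESS is treated as a failure because it is still an open question whether values besides BAD_CHECKSUM exist.

```ruby
# Hypothetical helper: map an HTTP status code and the status strings found
# in the response body to a single outcome symbol.
def classify_fixity_response(http_code, statuses = [])
  case http_code
  when 200
    # 200 is returned for both passing and failing checks, so look at the body.
    return :unknown if statuses.empty?
    statuses.all? { |s| s == 'SUCCESS' } ? :passed : :failed
  when 404
    # Either the path does not exist or it points to an "object" node.
    :not_checkable
  else
    :error
  end
end

puts classify_fixity_response(200, ['SUCCESS'])
puts classify_fixity_response(200, ['BAD_CHECKSUM'])
puts classify_fixity_response(404)
```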
A word of caution: on Fedora version 4.0.0-beta-01, executing the "check fixity" option in the REST API web page does not seem to report the result of the check, even though the API supports the functionality.
According to the documentation, Fedora 4 will include a fixity queueing and reporting service, but as of August 2014 the current implementation has been marked as outdated since it was last changed a year ago. https://github.com/fcrepo4-labs/fcrepo-fixity
If we are going to stop using Rubydora once we migrate to Fedora 4, we will need to update the call to version.dsChecksumValid
to call our new mechanism to kick off the Fixity check in Fedora 4. I am not sure whether this new mechanism will go into ActiveFedora, Sufia, or somewhere else.
We might still need to leave the Fixity checks as background jobs since, according to the Fedora documentation:
Checking fixity requires retrieving the content from the binary store and may take some time. https://wiki.duraspace.org/display/FF/RESTful+HTTP+API#RESTfulHTTPAPI-Fixity
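If we keep the checks in the background, the job might look roughly like the sketch below. Everything here is hypothetical: FixityCheckJob and its checker hook do not exist in Sufia or ActiveFedora, and the only thing carried over from the current code is the idea of a job object with a run method. The checker is kept at the class level so that the job instance holds nothing but the datastream path.

```ruby
# Hypothetical job class, loosely modeled on the shape of the existing AuditJob.
class FixityCheckJob
  class << self
    # Callable that takes a datastream path and returns true when the
    # Fedora 4 fixity check passes. Where this lives is still undecided.
    attr_accessor :checker
  end

  attr_reader :datastream_path

  def initialize(datastream_path)
    @datastream_path = datastream_path
  end

  # Called by the background worker; returns a small result hash that a
  # caller could persist, for example in something like ChecksumAuditLog.
  def run
    passing = self.class.checker.call(datastream_path)
    { path: datastream_path, pass: passing }
  end
end

# Stand-in checker for illustration only; a real one would call Fedora.
FixityCheckJob.checker = ->(path) { path.end_with?('good_datastream') }
p FixityCheckJob.new('rest/good_datastream').run
```

Such a job could presumably be queued with Sufia.queue.push in the same way AuditJob is today, although I have not verified that.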
In our current Fedora 3 data model we have datastreams for many things besides content (e.g. rights, FITS, Rels-Ext, desc_metadata), and I understand that we are currently performing checksum checks on them. If my understanding is correct and the plan is to migrate those other datastreams to Fedora 4 properties, we won't be able to perform Fixity checks on them, as Fixity checks are only available for datastreams. Will this be a problem for us? Do we need Fixity checks for those non-content datastreams?