Fedora4 Migration
Migrating from Fedora3 to Fedora4 will involve duplicating all content and description from the Fedora3 repository into Fedora4. During that process, the majority of descriptive information will be reorganized and rewritten as RDF triples instead of the present XML. The content itself, for example the files and images that are in the repository, will remain unchanged. Additionally, the versions of each asset currently preserved in ScholarSphere will be carried over to the new Fedora4 repository, so users will have access to previous versions of all their deposited assets.
According to Duraspace's documentation on Fedora4 migration, the process can happen in one of two ways:
- migrating content using filesystem projection
- migrating via Fedora3's API
Each has its pros and cons, and we will determine later which is the preferred method. The filesystem projection method uses Fedora4's ability to "project" or inspect a filesystem and ingest its contents. Here, Fedora4 would consume Fedora3's existing datastream and objectstore directories, then copy and transform the data into a new Fedora4 repository. The second method, using Fedora3's API, involves Fedora4 interfacing with Fedora3 and gathering and transforming the data that way.
At present, ScholarSphere's repository contains approximately 3900 Fedora objects and occupies 18 GB of disk space. These figures present no issues for system resources, and we expect no limitations when duplicating this amount of data and structure into a new Fedora4 repository.
Without assigning any dates and assuming everything is TBD, the overall to-do list would be:
- Update current Fedora3 to latest 3.x version
  - Deploy latest 3.x app
  - Stop both the old and new Fedora apps
  - Copy/move configs and data from old to new
  - Rebuild index and db using fedora-rebuild.sh
  - Turn on new Fedora app and test
  - Point SS instances to new Fedora app
- Upgrade to Java 7
  - Stop Tomcat
  - Update Tomcat to use Java 7
  - Update JAVA_OPTS to use G1 garbage collection
  - Restart Tomcat and test
- Deploy Fedora4
- Decide on RDF implementation for descriptive elements
- Begin testing migration and verification process
- Identify time period for migration
- Disable ScholarSphere and Fedora3 services
- Migrate and verify
- Reenable ScholarSphere; Fedora3 is read-only
- Re-evaluate and re-verify migration via user feedback
- Retire Fedora3
Things to which we need answers or about which we don't know enough.
Fedora3 keeps track of checksums on all datastreams, and ScholarSphere's AuditJob runs fixity checks on all the files in the repository. One of the new features in Fedora4 is that it can run fixity checks itself and return the status of a file. We could adjust ScholarSphere's behavior to make use of this new functionality in Fedora4.
More details on the Fixity process in Fedora 4 and how it compares to our approach in Sufia can be found here: Fixity in Sufia with Fedora 4
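As a rough illustration of what a fixity check amounts to, the sketch below recomputes a digest over a file's bytes and compares it with a previously recorded checksum. It is not AuditJob's or Fedora4's implementation; the function names are hypothetical, and SHA-1 is assumed as the checksum algorithm.

```python
import hashlib
import io


def compute_digest(stream, chunk_size=64 * 1024):
    """Return the SHA-1 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha1()  # assumption: checksums were recorded as SHA-1
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(stream, recorded_digest):
    """True when the recomputed digest matches the recorded one."""
    return compute_digest(stream) == recorded_digest.strip().lower()


if __name__ == "__main__":
    content = b"example file content"
    recorded = hashlib.sha1(content).hexdigest()
    print(fixity_ok(io.BytesIO(content), recorded))      # unchanged content
    print(fixity_ok(io.BytesIO(b"tampered"), recorded))  # altered content
```

Reading in chunks keeps memory use flat for large binaries, which matters when every file in the repository is audited.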
Fedora4 will track versions of objects, both through changes to their properties, i.e. the RDF triples that are used to describe them, and through the content objects associated with them, such as a binary file. It is unclear at present how some of these behaviors work. For example, a binary file that is part of a GenericFile object would be a datastream in Fedora3, but is now a related child node in Fedora4. If the content of that file changes, the version of the node changes, but it is unclear whether the parent node's version changes as well.
In migrating ScholarSphere to Fedora4, we will also need to preserve the versions of content datastreams as versioned nodes in Fedora4. It is unclear at present how to accomplish this.
Things that must be done prior to migrating.
Fedora3 allowed us to postpone batch creation until all the files had been uploaded. Fedora4 does not offer this: the batch must first be saved in an empty state, and the files are then uploaded and added afterwards. Because this creates the possibility of empty batches, we need to create a "sweeper" that searches for any empty batch objects and deletes them.
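The sweeper's selection logic might look something like the following sketch. It works on plain in-memory records rather than real repository objects; the `Batch` class and its fields are hypothetical, and the grace period is an added assumption so that a batch whose upload is still in progress is not deleted.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Batch:
    """Hypothetical stand-in for a batch object in the repository."""
    id: str
    created_at: datetime
    file_ids: list = field(default_factory=list)


def find_empty_batches(batches, now, grace_period=timedelta(hours=24)):
    """Return batches that have no files and are older than the grace period."""
    return [
        b for b in batches
        if not b.file_ids and now - b.created_at > grace_period
    ]


def sweep(batches, now, grace_period=timedelta(hours=24)):
    """Return the batches that remain after removing stale empty ones."""
    stale_ids = {b.id for b in find_empty_batches(batches, now, grace_period)}
    return [b for b in batches if b.id not in stale_ids]


if __name__ == "__main__":
    now = datetime(2014, 6, 1, 12, 0)
    batches = [
        Batch("old-empty", now - timedelta(days=3)),
        Batch("new-empty", now - timedelta(minutes=5)),
        Batch("old-full", now - timedelta(days=3), ["file1"]),
    ]
    print([b.id for b in sweep(batches, now)])
```

In a real job, the deletion step would call whatever the application uses to remove an object; only the filter is shown here.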
Fedora4 uses Modeshape as the underlying structure of the repository. Because Modeshape is a hierarchy, all objects must point to an originating parent or node, and the entire repository would have one node as its first point of origin. In Fedora3, this concept is absent and the structure is essentially flat. The simplest mapping of a Fedora3 repository to Fedora4 would stipulate that every object be a part of the root node, potentially creating a root node with many child nodes.
Because Modeshape's performance may suffer if there are too many nodes attached to one parent, intermediate nodes should be created to disperse the "width" of the hierarchy into a greater depth. These intermediate nodes are inconsequential to the repository and do not need to reflect any implied structure or meaning. The easiest way to build a deep enough hierarchy is to create pair trees using the Noid. For example, abcd1234 would be split into four levels, /ab/cd/12/34/, and the last node would contain the object.
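The pair-tree split can be sketched in a few lines. The function name is hypothetical, and the sketch simply cuts the identifier into two-character segments; an odd-length identifier would leave a shorter final segment.

```python
def pairtree_path(noid, width=2):
    """Split an identifier into fixed-width segments forming a nested path."""
    if not noid:
        raise ValueError("identifier must not be empty")
    segments = [noid[i:i + width] for i in range(0, len(noid), width)]
    return "/" + "/".join(segments) + "/"


if __name__ == "__main__":
    print(pairtree_path("abcd1234"))  # /ab/cd/12/34/
```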
Descriptive terms for an object are expressed as RDF triples and are called properties. ScholarSphere currently keeps descriptive metadata in different datastreams. For migration, some or all of the descriptive terms in each datastream will need to be mapped to RDF triples.
Currently we have descriptive terms stored in datastreams:
- descMetadata
- rightsMetadata
- properties
- characterization
The mapping process will either:
- convert all the terms to triples
- convert some terms and store the existing datastream as a new node in Fedora4
- convert no terms to triples and store the existing datastream as a new node in Fedora4
descMetadata: Since all the terms in this datastream are Dublin Core, the process is straightforward: convert all the terms in the datastream to RDF triples. No additional work needs to be done because Dublin Core terms are included in Fedora4.
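A minimal sketch of such a conversion follows, assuming the source is an XML document whose Dublin Core elements carry the standard DC namespaces. The sample layout, the subject URI, and the function names are hypothetical; the real datastream and the real Fedora4 resource URIs may differ. Each Dublin Core element becomes one triple whose predicate is the element's namespace plus its local name, serialized as N-Triples.

```python
import xml.etree.ElementTree as ET

DC_NAMESPACES = (
    "http://purl.org/dc/elements/1.1/",
    "http://purl.org/dc/terms/",
)


def dc_xml_to_triples(subject_uri, xml_text):
    """Return (subject, predicate, literal) tuples for each Dublin Core element."""
    triples = []
    for element in ET.fromstring(xml_text).iter():
        if not element.tag.startswith("{"):
            continue  # element has no namespace, so it is not Dublin Core
        namespace, local_name = element.tag[1:].split("}", 1)
        value = (element.text or "").strip()
        if namespace in DC_NAMESPACES and value:
            triples.append((subject_uri, namespace + local_name, value))
    return triples


def to_ntriples(triples):
    """Serialize triples with plain literal objects as N-Triples lines."""
    lines = []
    for subject, predicate, literal in triples:
        escaped = (
            literal.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        lines.append('<%s> <%s> "%s" .' % (subject, predicate, escaped))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical sample; not the actual descMetadata layout.
    sample = (
        '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
        "<dc:title>Sample title</dc:title>"
        "<dc:creator>Doe, Jane</dc:creator>"
        "</metadata>"
    )
    print(to_ntriples(dc_xml_to_triples("http://example.org/abcd1234", sample)))
```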
rightsMetadata: All terms need to be mapped to RDF triples; however, no RDF implementation exists yet. Since rightsMetadata is inherent to Hydra, the community should be able to collectively define one that could be used by any Hydra application running on Fedora4.
properties: The properties datastream in ScholarSphere is a "grab bag" of terms mostly used for internal management. No RDF implementation exists, so a custom one should be defined and all terms mapped from the existing datastream to the new object in Fedora4.
characterization: FITS creates this as an XML datastream and, as far as we know, there is no RDF implementation of it. Furthermore, creating one might be problematic. The current plan is to use the second option above: convert some terms, such as mime type, to RDF, then store the entire FITS XML as a new content node in Fedora4.
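To illustrate the "convert some, keep the rest" approach, the sketch below pulls a couple of values out of a FITS document and returns them alongside the untouched XML. The sample document is hypothetical and heavily trimmed, the namespace and the identification/identity layout are assumed to match FITS output, and the returned field names are placeholders rather than a proposed RDF vocabulary.

```python
import xml.etree.ElementTree as ET

# Assumption: FITS output uses this default namespace.
FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}


def split_characterization(fits_xml):
    """Extract selected terms and keep the full FITS XML for storage as-is."""
    root = ET.fromstring(fits_xml)
    identity = root.find("fits:identification/fits:identity", FITS_NS)
    properties = {}
    if identity is not None:
        if identity.get("mimetype"):
            properties["mime_type"] = identity.get("mimetype")
        if identity.get("format"):
            properties["format_label"] = identity.get("format")
    # The original XML is returned unchanged so it can become a content node.
    return properties, fits_xml


if __name__ == "__main__":
    # Hypothetical, trimmed-down FITS document.
    sample = (
        '<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output">'
        "<identification>"
        '<identity format="Portable Network Graphics" mimetype="image/png"/>'
        "</identification>"
        "</fits>"
    )
    properties, original_xml = split_characterization(sample)
    print(properties)
```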
Relationships between existing objects in Fedora3 are defined in the RELS-EXT datastream and will need to be preserved in Fedora4. These objects include:
- GenericFile
- BatchEdit
- Collection
Converting descriptive XML datastreams to RDF properties requires a number of additions to the ActiveFedora codebase, specifically the integration of FedoraLens into ActiveTriples, or its replacement by ActiveTriples. FedoraLens is a Ruby gem that provides RDF property management for Fedora4, and ActiveTriples is the component of ActiveFedora that allows one to assign these RDF properties as descriptive attributes in ActiveFedora models. Once ActiveFedora is fully integrated with Fedora4, we may begin to "translate" the XML-based object attributes of Fedora3 into the RDF-based attributes of Fedora4.
The current rights metadata datastream needs an RDF definition.
How do we maintain versions of existing content datastreams from Fedora3 to Fedora4?