Iqss/7493 batch archiving api call #7494

Closed
Changes from all commits (21 commits)
6eae5e4  implement batch processing of new versions to archive (qqmyers, Dec 21, 2020)
8313404  add listonly and limit options, count commandEx as failure (qqmyers, Dec 21, 2020)
70d923a  send list in response for listonly (qqmyers, Dec 21, 2020)
96d3723  fix query (qqmyers, Dec 21, 2020)
cb9f374  case sensitive in query (qqmyers, Dec 21, 2020)
76e2396  param to only archive latest version (qqmyers, Dec 21, 2020)
2e8d990  off by one in limit (qqmyers, Dec 21, 2020)
b796833  documentation (qqmyers, Dec 23, 2020)
006a4ba  Update doc/sphinx-guides/source/installation/config.rst (qqmyers, Jan 8, 2021)
bba8ba0  Update doc/sphinx-guides/source/installation/config.rst (qqmyers, Jan 8, 2021)
011c97a  Update doc/sphinx-guides/source/installation/config.rst (qqmyers, Jan 8, 2021)
1a1c28c  updates per review (qqmyers, Jan 8, 2021)
8a0ad71  Merge branch 'IQSS/7493-batch_archiving_API_call' of https://github.c… (qqmyers, Jan 8, 2021)
7b5aead  Merge remote-tracking branch 'IQSS/develop' into IQSS/7493-batch_arch… (qqmyers, Jan 29, 2021)
fd32dfd  Merge remote-tracking branch 'IQSS/develop' into IQSS/7493-batch_arch… (qqmyers, Feb 23, 2021)
805ff95  Merge remote-tracking branch 'IQSS/develop' into IQSS/7493-batch_arch… (qqmyers, Apr 7, 2021)
e1415f9  Merge remote-tracking branch 'IQSS/develop' into IQSS/7493-batch_arch… (qqmyers, Apr 13, 2021)
ef9a0b9  Merge remote-tracking branch 'IQSS/develop' into IQSS/7493-batch_arch… (qqmyers, Aug 12, 2021)
9443e04  Merge remote-tracking branch 'IQSS/develop' into IQSS/7493-batch_arch… (qqmyers, Feb 2, 2022)
242befa  Merge remote-tracking branch 'IQSS/develop' into IQSS/7493-batch_arch… (qqmyers, Feb 15, 2022)
7047d00  Merge remote-tracking branch 'IQSS/develop' into TDL/7493-batch_archi… (qqmyers, Feb 23, 2022)
18 changes: 15 additions & 3 deletions doc/sphinx-guides/source/installation/config.rst
@@ -937,10 +937,10 @@ For example:

``cp <your key file> /usr/local/payara5/glassfish/domains/domain1/files/googlecloudkey.json``

.. _Archiving API Call:
.. _Archiving API Calls:
Member: At first I thought this anchor was being used and references to it should be changed, but apparently it isn't. It sort of makes me wonder if we should delete it if it isn't being used. No strong preference.

Member Author: Yeah - it seems random as to whether anchors exist or not. They can be useful beyond just internal links (e.g. you can use them to point people directly to that section via URL in email, etc.), so having anchors might be a good default for any significant topics. That said, not a show stopper if it gets deleted.


API Call
++++++++
API Calls
+++++++++

Once this configuration is complete, you, as a user with the *PublishDataset* permission, should be able to use the API call to manually submit a DatasetVersion for processing:

@@ -952,6 +952,18 @@ where:

``{version}`` is the friendly version number, e.g. "1.2".

A batch API call is also available that will attempt to archive any currently unarchived dataset versions:

``curl -H "X-Dataverse-key: <key>" http://localhost:8080/api/admin/archiveAllUnarchivedDataVersions``

The call supports three optional query parameters that can be used in combination:

``listonly={true/false}`` (default: false). When true, the call returns the list of unarchived versions but does not attempt to archive any of them.

``latestonly={true/false}`` (default: false). When true, only the most recently published version of a given dataset is listed/processed (instead of all published versions).

``limit={n}`` (default: no limit, i.e. process all unarchived versions, subject to the other parameters). The maximum number of versions to attempt to archive in one invocation of the API call.
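
For example, to list (without archiving) at most five unarchived versions, considering only the latest published version of each dataset:

``curl -H "X-Dataverse-key: <key>" "http://localhost:8080/api/admin/archiveAllUnarchivedDataVersions?listonly=true&latestonly=true&limit=5"``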

The submitDataVersionToArchive API (and the workflow discussed below) attempts to archive the dataset version via an archive-specific method. For Chronopolis, a DuraCloud space named for the dataset (its DOI with ':' and '.' replaced with '-') is created and two files are uploaded to it: a version-specific datacite.xml metadata file and a BagIt bag containing the data and an OAI-ORE map file. (The datacite.xml file, stored outside the Bag as well as inside it, is intended to aid in discovery, while the ORE map file is 'complete', containing all user-entered metadata, and is intended as an archival record.)

In the Chronopolis case, since the transfer from the DuraCloud front-end to archival storage in Chronopolis can take significant time, it is currently up to the admin/curator to submit a 'snapshot' of the space within DuraCloud and to monitor its successful transfer. Once transfer is complete, the space should be deleted, at which point the Dataverse Software API call can be used to submit a Bag for other versions of the same Dataset. (The space is reused, so that archival copies of different Dataset versions correspond to different snapshots of the same DuraCloud space.)
@@ -1008,7 +1008,7 @@ public List<HashMap<String, Object>> getBasicDatasetVersionInfo(Dataset dataset)
} // end getBasicDatasetVersionInfo



//Not used?
public HashMap getFileMetadataHistory(DataFile df){

if (df == null){
@@ -1187,4 +1187,28 @@ private DatasetVersion getPreviousVersionWithUnf(DatasetVersion datasetVersion)
return null;
}

/**
 * Queries for all released dataset versions that do not yet have an archival
 * copy location recorded.
 *
 * @return the list of unarchived DatasetVersions, or null if the query fails
 */
public List<DatasetVersion> getUnarchivedDatasetVersions(){

String queryString = "SELECT OBJECT(o) FROM DatasetVersion AS o WHERE o.releaseTime IS NOT NULL and o.archivalCopyLocation IS NULL";

try {
TypedQuery<DatasetVersion> query = em.createQuery(queryString, DatasetVersion.class);
List<DatasetVersion> dsl = query.getResultList();
return dsl;

} catch (javax.persistence.NoResultException e) {
logger.log(Level.FINE, "No unarchived DatasetVersions found: {0}", queryString);
return null;
} catch (EJBException e) {
logger.log(Level.WARNING, "EJBException exception: {0}", e.getMessage());
return null;
}
} // end getUnarchivedDatasetVersions

} // end class
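
A minimal sketch of how a caller might use this new service method (the surrounding bean and field names here are illustrative assumptions, not code from this PR):

import java.util.List;
import javax.ejb.EJB;
import javax.ejb.Stateless;
import edu.harvard.iq.dataverse.DatasetVersion;
import edu.harvard.iq.dataverse.DatasetVersionServiceBean;

@Stateless
public class UnarchivedVersionReporter {

    @EJB
    DatasetVersionServiceBean datasetVersionService;

    // Print one line per released-but-unarchived version, mirroring what the
    // listonly=true branch of the new admin API does with this result.
    public void reportUnarchivedVersions() {
        List<DatasetVersion> unarchived = datasetVersionService.getUnarchivedDatasetVersions();
        if (unarchived == null) {
            return; // the service returns null if the query fails
        }
        for (DatasetVersion dv : unarchived) {
            System.out.println(dv.getDataset().getGlobalId().toString()
                    + ", v" + dv.getFriendlyVersionNumber());
        }
    }
}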
88 changes: 88 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/api/Admin.java
Expand Up @@ -45,6 +45,7 @@
import javax.json.JsonArrayBuilder;
import javax.json.JsonObjectBuilder;
import javax.ws.rs.DELETE;
import javax.ws.rs.DefaultValue;
import javax.ws.rs.GET;
import javax.ws.rs.POST;
import javax.ws.rs.PUT;
@@ -1773,6 +1774,93 @@ public void run() {
}
}


/**
 * Iteratively archives all unarchived dataset versions.
 *
 * @param listonly   don't archive, just list the unarchived versions
 * @param limit      the maximum number of versions to process
 * @param latestonly only list/archive the latest published version of each dataset
 * @return the list of versions for listonly=true; otherwise confirmation that archiving has started
 */
@GET
@Path("/archiveAllUnarchivedDataVersions")
public Response archiveAllUnarchivedDatasetVersions(@QueryParam("listonly") boolean listonly, @QueryParam("limit") Integer limit, @QueryParam("latestonly") boolean latestonly) {

try {
AuthenticatedUser au = findAuthenticatedUserOrDie();
// Note - the user is being set in the session so it becomes part of the
// DataverseRequest and is sent to the back-end command where it is used to get
// the API Token which is then used to retrieve files (e.g. via S3 direct
// downloads) to create the Bag
session.setUser(au);
List<DatasetVersion> dsl = datasetversionService.getUnarchivedDatasetVersions();
if (dsl != null) {
if (listonly) {
JsonArrayBuilder jab = Json.createArrayBuilder();
logger.info("Unarchived versions found: ");
int current = 0;
for (DatasetVersion dv : dsl) {
if (limit != null && current >= limit) {
break;
}
if (!latestonly || dv.equals(dv.getDataset().getLatestVersionForCopy())) {
jab.add(dv.getDataset().getGlobalId().toString() + ", v" + dv.getFriendlyVersionNumber());
logger.info(" " + dv.getDataset().getGlobalId().toString() + ", v" + dv.getFriendlyVersionNumber());
current++;
}
}
return ok(jab);
}
String className = settingsService.getValueForKey(SettingsServiceBean.Key.ArchiverClassName);
AbstractSubmitToArchiveCommand cmd = ArchiverUtil.createSubmitToArchiveCommand(className, dvRequestService.getDataverseRequest(), dsl.get(0));

if (cmd != null) {
new Thread(new Runnable() {
Member: I'm not personally very familiar with threads, but there do seem to be a couple of other places in the code (also added by Jim) that use this new Thread/new Runnable pattern and work fine. I think Jakarta EE offers various ways of handling threads, and I know we use @Asynchronous in certain places. I'm not suggesting a change, but perhaps threads or async in general could be a topic for a future tech hours so we're all clear on the various options.

Member Author: Good catch! FWIW: I don't know the details, but I think this is basically an EJB vs. plain Java difference. Using @Asynchronous starts a thread in a managed pool (and I think there's a ManagedExecutorService that would allow you to specify a separate pool that could be managed in the EJB config). It would probably be good to recommend @Asynchronous and think about updating these at some point. That said, since this call just starts one thread to serially chunk through things, I don't think it should cause problems as is (and having others who know more weigh in before making a change might be useful).

public void run() {
int total = dsl.size();
int successes = 0;
int failures = 0;
for (DatasetVersion dv : dsl) {
if (limit != null && (successes + failures) >= limit) {
break;
}
if (!latestonly || dv.equals(dv.getDataset().getLatestVersionForCopy())) {
try {
AbstractSubmitToArchiveCommand cmd = ArchiverUtil.createSubmitToArchiveCommand(className, dvRequestService.getDataverseRequest(), dv);

dv = commandEngine.submit(cmd);
if (dv.getArchivalCopyLocation() != null) {
successes++;
logger.info("DatasetVersion id=" + dv.getDataset().getGlobalId().toString() + " v" + dv.getFriendlyVersionNumber() + " submitted to Archive at: "
+ dv.getArchivalCopyLocation());
} else {
failures++;
logger.severe("Error submitting version due to conflict/error at Archive for " + dv.getDataset().getGlobalId().toString() + " v" + dv.getFriendlyVersionNumber());
}
} catch (CommandException ex) {
failures++;
logger.log(Level.SEVERE, "Unexpected Exception calling submit archive command", ex);
}
}
logger.fine((successes + failures) + " of " + total + " archive submissions complete");
}
logger.info("Archiving complete: " + successes + " Successes, " + failures + " Failures. See prior log messages for details.");
}
}).start();
return ok("Archiving all unarchived published dataset versions using " + cmd.getClass().getCanonicalName() + ". Processing can take significant time for large datasets or large numbers of dataset versions. View log and/or check archive for results.");
} else {
logger.log(Level.SEVERE, "Could not find Archiver class: " + className);
return error(Status.INTERNAL_SERVER_ERROR, "Could not find Archiver class: " + className);
}
} else {
return error(Status.BAD_REQUEST, "No unarchived published dataset versions found");
}
} catch (WrappedResponse e1) {
return error(Status.UNAUTHORIZED, "api key required");
}
}
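
As a sketch of the @Asynchronous alternative raised in the review thread above (the bean, method, and parameter names are assumptions for illustration, not code from this PR), the archiving loop could run on a container-managed thread instead of a hand-rolled one:

import java.util.List;
import javax.ejb.Asynchronous;
import javax.ejb.Stateless;
import edu.harvard.iq.dataverse.DatasetVersion;

@Stateless
public class BatchArchiveBean {

    // @Asynchronous makes the container run this method on a thread from its
    // managed pool, so the REST endpoint can return immediately, just as the
    // new Thread(new Runnable() {...}).start() pattern in this PR does.
    @Asynchronous
    public void archiveVersions(List<DatasetVersion> versions) {
        for (DatasetVersion dv : versions) {
            // submit a SubmitToArchiveCommand for dv here, as in the loop above
        }
    }
}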

@DELETE
@Path("/clearMetricsCache")
public Response clearMetricsCache() {