
TDL/7493 Batch Archiving #8610

Merged

Conversation

@qqmyers (Member) commented Apr 13, 2022

What this PR does / why we need it: This PR implements an API call that archives either all unarchived dataset versions or only the latest version of each unarchived dataset. It can be used in lieu of configuring a post-publish workflow, or as a catch-up mechanism to archive datasets that were published before an archiving workflow was configured. The API also has a view-only parameter that lists the dataset versions that would be archived without actually archiving them.

Which issue(s) this PR closes:

Closes #7493

Special notes for your reviewer: The code includes a ToDo to update the logging - once the changes to support Failure/Pending/Success status (#8696) are merged, this code should be updated to count Failure status as a failure and Pending status as a success (the processing is async, so Pending and Success both indicate a successful launch as far as this code is concerned).

Suggestions on how to test this: Set up your favorite archiver (e.g. the Local one), publish a few datasets/versions, and then run this API to see whether all versions, or only the latest versions, cause Bags to be generated and sent to your archive location. (You can also try the list-only option described in the docs to list which versions would be processed and then confirm that the same list is actually processed.)

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: planning one release note across PRs

Additional documentation:

@qqmyers qqmyers added the TDL of interest to the Texas Digital Library label Apr 13, 2022
@coveralls commented Apr 13, 2022

Coverage Status

Coverage decreased (-0.02%) to 19.712% when pulling e4a228d on TexasDigitalLibrary:TDL/7493-batch_archiving-only into af22d3f on IQSS:develop.

@qqmyers qqmyers added HDC Harvard Data Commons HDC: 3a Harvard Data Commons Obj. 3A labels Jul 15, 2022
@qqmyers (Member, Author) commented Jul 15, 2022

FWIW: tagging this as 3a since it includes changes to the archive API (POST vs. GET, name change) that we should have in place before trying to sync with 3A services again.

@qqmyers (Member, Author) commented Jul 20, 2022

@scolapasta - 3A could/should have the changes to the API naming that are in this PR. I can pull that out elsewhere if we don't want to handle the new batch call now.

@scolapasta scolapasta self-assigned this Jul 20, 2022
*/
public List<DatasetVersion> getUnarchivedDatasetVersions(){

String queryString = "SELECT OBJECT(o) FROM DatasetVersion AS o WHERE o.releaseTime IS NOT NULL and o.archivalCopyLocation IS NULL";
Contributor

Could we make this a named query?
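
For reference, a minimal sketch of what that could look like, assuming the query is declared as a JPA @NamedQuery on the DatasetVersion entity (the name used here matches the DatasetVersion.findUnarchivedReleasedVersion query that appears later in this PR; exact placement among the entity's existing named queries is an assumption):

// assumes javax.persistence.NamedQueries / NamedQuery are imported
@NamedQueries({
    // ... other named queries on the entity ...
    @NamedQuery(name = "DatasetVersion.findUnarchivedReleasedVersion",
                query = "SELECT OBJECT(o) FROM DatasetVersion AS o "
                      + "WHERE o.releaseTime IS NOT NULL AND o.archivalCopyLocation IS NULL")
})
@Entity
public class DatasetVersion implements Serializable {
    // ...
}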

// Note - the user is being set in the session so it becomes part of the
// DataverseRequest and is sent to the back-end command where it is used to get
// the API Token which is then used to retrieve files (e.g. via S3 direct
// downloads) to create the Bag
session.setUser(au); // TODO: Stop using session. Use createDataverseRequest instead.
Contributor

should we make this change while we're in this code?
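
A minimal sketch of the suggested direction, assuming the command can be given a DataverseRequest built directly from the authenticated user via AbstractApiBean's createDataverseRequest() instead of stashing the user in the session (the exact wiring into the archive command may differ):

AuthenticatedUser au = findAuthenticatedUserOrDie();
// Build the DataverseRequest explicitly; it carries the user (and hence the API
// token used for S3 direct downloads when creating the Bag).
DataverseRequest request = createDataverseRequest(au);
// ... pass 'request' into ArchiverUtil.createSubmitToArchiveCommand(...) as is
// already done below, and drop the session.setUser(au) call.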

// DataverseRequest and is sent to the back-end command where it is used to get
// the API Token which is then used to retrieve files (e.g. via S3 direct
// downloads) to create the Bag
session.setUser(au);
Contributor

same as above, if we do decide to make the change

String className = settingsService.getValueForKey(SettingsServiceBean.Key.ArchiverClassName);
AbstractSubmitToArchiveCommand cmd = ArchiverUtil.createSubmitToArchiveCommand(className, dvRequestService.getDataverseRequest(), dsl.get(0));
final DataverseRequest request = dvRequestService.getDataverseRequest();
if (cmd != null) {
Contributor

could you explain what this line is doing? (i.e. why is a command being created before this and checked for null)

Member Author

The create method is trying to read the property and instantiate the specified class (a subclass of AbstractSubmitToArchiveCommand) via reflection. If the class doesn't exist, it would fail/return null.
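
A minimal sketch of the kind of reflective creation being described, just to illustrate why a null return is possible (the actual ArchiverUtil code may differ in its details):

// assumes java.lang.reflect.Constructor is imported
public static AbstractSubmitToArchiveCommand createSubmitToArchiveCommand(
        String className, DataverseRequest request, DatasetVersion version) {
    if (className != null) {
        try {
            Class<?> clazz = Class.forName(className);
            Constructor<?> ctor = clazz.getConstructor(DataverseRequest.class, DatasetVersion.class);
            return (AbstractSubmitToArchiveCommand) ctor.newInstance(request, version);
        } catch (Exception e) {
            // :ArchiverClassName is unset, misspelled, or not a valid subclass - fall through
        }
    }
    // Callers must handle null, unlike the usual "new SomeCommand(...)" pattern.
    return null;
}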

Contributor

Ah, gotcha - it could actually return null, then. Maybe just a quick comment to make that clear, since it's different than when we normally create a command?

logger.info("Archiving complete: " + successes + " Successes, " + failures + " Failures. See prior log messages for details.");
}
}).start();
return ok("Archiving all unarchived published dataset versions using " + cmd.getClass().getCanonicalName() + ". Processing can take significant time for large datasets/ large numbers of dataset versions. View log and/or check archive for results.");
Contributor

Oh, is it to have the command here?

Member Author

Not really - I could return the className string as done in the null case in the else clause.


The archiveAllUnarchivedDatasetVersions call takes 3 optional configuration parameters.
* listonly=true will cause the API to list dataset versions that would be archived but will not take any action.
* limit=<n> will limit the number of dataset versions archived in one api call to <= <n>.
Contributor

I get how this works, but what's the reason to limit this way? (counting both successes and failures)

Member Author

listonly=true gives you a list, so with limit working this way you can make sure that only the things you listed will get processed when you drop listonly=true.
Overall, the concern is about load, particularly if/when something is misconfigured and everything will fail after all the work to create a bag.

Contributor

Are these guaranteed to work with the list in the same order each time (i.e. if something is added, it would be added at the end, so limit is guaranteed to get the things from the last list-only call)?

Member Author

Roughly, yes - it's the new named query that is getting the list, so unless something affects the return order from that (which I think will be id order by default), it wouldn't change.
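
To make the listonly/limit workflow concrete, a hypothetical sequence (assuming the batch call is the POST endpoint under the admin API; the exact path and host below are assumptions, not taken from this diff) would be to first run

curl -X POST "http://localhost:8080/api/admin/archiveAllUnarchivedDatasetVersions?listonly=true&limit=10"

to see which versions would be processed, and then repeat the same call without listonly=true to archive exactly that set, since limit counts against the same query-ordered list.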



try {
@SuppressWarnings("unchecked")
List<DatasetVersion> dsl = em.createNamedQuery("DatasetVersion.findUnarchivedReleasedVersion").getResultList();
Contributor

I think a named query should still be able to take a type, so you don't need the SuppressWarnings.
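
A minimal sketch of the typed variant being suggested - JPA's EntityManager.createNamedQuery(name, resultClass) returns a TypedQuery, so the unchecked warning and the @SuppressWarnings annotation go away:

List<DatasetVersion> dsl = em
        .createNamedQuery("DatasetVersion.findUnarchivedReleasedVersion", DatasetVersion.class)
        .getResultList();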


// the API Token which is then used to retrieve files (e.g. via S3 direct
// downloads) to create the Bag
session.setUser(au); // TODO: Stop using session. Use createDataverseRequest instead.
// Note - the user is being set in the session so it becomes part of the
Contributor

Just noticed: I think this comment can go, no?

try {
AuthenticatedUser au = findAuthenticatedUserOrDie();

// Note - the user is being set in the session so it becomes part of the
Contributor

and here.

Member Author

updated

@scolapasta scolapasta removed their assignment Jul 26, 2022
@kcondon kcondon self-assigned this Jul 29, 2022
@scolapasta scolapasta added this to the 5.12 milestone Aug 5, 2022
@kcondon kcondon assigned qqmyers and unassigned kcondon Aug 5, 2022
@kcondon (Contributor) commented Aug 5, 2022

Not seeing JSON output, the logger needs to be at fine rather than info, and I'm seeing a null pointer at the end of the log output.

A system exception occurred during an invocation on EJB Admin, method: public javax.ws.rs.core.Response edu.harvard.iq.dataverse.api.Admin.archiveAllUnarchivedDatasetVersions(boolean,java.lang.Integer,boolean)]]

Caused by: java.lang.NullPointerException
at edu.harvard.iq.dataverse.DatasetVersion.getFriendlyVersionNumber(DatasetVersion.java:539)
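
For illustration only, a hypothetical guard for the NPE above, assuming the failure is a null version/minor version number reaching getFriendlyVersionNumber on a non-draft version (the actual fix applied for this report may have been different):

public String getFriendlyVersionNumber() {
    if (isDraft()) {
        return "DRAFT";
    }
    if (getVersionNumber() == null || getMinorVersionNumber() == null) {
        // a version state the batch query didn't anticipate; avoid the NPE
        return "";
    }
    return getVersionNumber() + "." + getMinorVersionNumber();
}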

@qqmyers qqmyers removed their assignment Aug 8, 2022
@kcondon kcondon self-assigned this Aug 8, 2022
@kcondon kcondon merged commit 04bfd3d into IQSS:develop Aug 8, 2022
Labels: HDC (Harvard Data Commons), HDC: 3a (Harvard Data Commons Obj. 3A), TDL (of interest to the Texas Digital Library)
Development

Successfully merging this pull request may close these issues.

Support batch archiving
4 participants