Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create Dataset: add support for migrating (with existing DOIs) via native API #3083

Closed
bmckinney opened this issue Apr 21, 2016 · 75 comments
Closed

Comments

@bmckinney
Copy link
Contributor

bmckinney commented Apr 21, 2016

One of the SBGrid migration requirements is to move the existing datasets, along with their DOIs, into Dataverse. Currently, it appears that this can only be done using the DDI XML import.

@pdurbin pdurbin changed the title Create Dataset: add support for migrating/harvesting (with existing DOIs) via native API Create Dataset: add support for migrating/harvesting (with existing DOIs) via native API Sep 15, 2016
@pdurbin
Copy link
Member

pdurbin commented Sep 15, 2016

@bmckinney will create a new pull request to handle these two remaining items:

  • Dataverse theme
  • Dataset DOI

As of 4.5, one should be able to set the license via native API.

pdurbin added a commit that referenced this issue Sep 22, 2016
REST Assured tests have been cleaned up as well.
@pdurbin
Copy link
Member

pdurbin commented Sep 22, 2016

@bmckinney I just added a34239c to your "sbgrid-migrate-doi" branch and created pull request #3377. At https://waffle.io/IQSS/dataverse I put this issue in Code Review. Please let me know if we can advance it to QA. Thanks!

@pdurbin
Copy link
Member

pdurbin commented Sep 22, 2016

Also, I wanted to mention that setting a Dataverse theme was added by @sekmiller in 92e89d0 which already made it into 4.5. I'm taking @bmckinney 's word that it works properly. He seems satisfied with it at least.

@bmckinney again, when you have a minute, please let me know if a34239c looks ok and I'll send this to QA in Waffle.

@pdurbin
Copy link
Member

pdurbin commented Sep 23, 2016

This is one of the two issues (the other being #3371) that @djbrooke @bmckinney and I all agree we'd like to get into 4.6 so I'm adding that milestone.

@pdurbin pdurbin added this to the 4.6 - File Replace milestone Sep 23, 2016
@bmckinney
Copy link
Contributor Author

This looks great to me and ready for QA!

@kcondon
Copy link
Contributor

kcondon commented Jul 11, 2018

OK, basic regression testing works: create, update, publish datasets, harvest.
Tested basic import, both release and not release, handles and dois, including files and that works.
To effectively use this endpoint, some documentation and guidance is needed.
A bit more testing is needed around metadata blocks but generally completed.
Here are the issues and recommendations:

  • Provide importable sample json files, simple, complex.
    This would help shorten learning curve and help eliminate issues.

  • Storage identifiers.
    These need to be specified for any files being imported. Internally we use a system generated string that becomes a pointer to a file (ie. filename) on the storage system (local disk, S3, Swift). What is the expected format for this id? Are there any constraints or limitations? For instance, must be a valid filename or reference in the storage system being used, must be unique within a dataset.

  • Document guidance on importing files.
    Importing a dataset using json is only part of the process to import files. When using local file system storage, the storage identifier along with the dataset identifier, and the configured files directory, specify a file system location. The actual files need to be placed there for this to work. Also, care should be taken with respect to specifying actual file size, md5, file type for each file.

  • Document how to import a package file versus a single file.

  • File PID behavior may be unexpected so should be changed or documented.
    File PIDs are ignored in import json.
    File PIDs are automatically generated based on currently configured authority and protocol and scheme (dependent/independent) on dataset publish as they are now. This may not be desired or expected behavior. This should either be changed or documented.

  • Provide file system checks to ensure file integrity.
    To ensure file integrity on import, it was suggested that files be in place at time of import and that import would check whether the storage identifier could be used to access the file and the file size and possibly the md5 are accurate.

  • Document Imported Dataset PID limitations.
    Imported Dataset PIDs need to be resolvable by Dataverse (ie. published hdl or doi) at import or it will fail.
    If the imported PID namespace (authority) is different the one Dataverse is configured to use, it will import as long as it is resolvable by Dataverse but may have expected follow on behavior.

  • Imported PIDs that use a different protocol or authority than Dataverse are imported if resolvable but then do not update PID metadata on subsequent update and publish.

  • Documentation needs to give example of how to specify the PID in json in addition to command line. Sample file provides info but example uses outdated authority/identifier format before shoulder was moved from authority into identifier.

  • Import using PID in json file does not work, throws error.

@michbarsinai
Copy link
Member

Thanks, @kcondon.
We have some documentation, I'll see what can be improved (@pameyer - do you have anything?).

As for files, I think this is out-of-scope for #3083, which concentrates on packages. Files can be added later based on this work.

@pameyer
Copy link
Contributor

pameyer commented Jul 11, 2018

I do - @michbarsinai do you want first crack at native-api.rst, or should I do a first pass?

@michbarsinai
Copy link
Member

michbarsinai commented Jul 11, 2018 via email

@pameyer
Copy link
Contributor

pameyer commented Jul 11, 2018

Ok - I gave it a first pass.

@michbarsinai
Copy link
Member

Gave it another pass. Added PID data to the JSON etc.

We now return to @kcondon at IQSS Central.

@kcondon
Copy link
Contributor

kcondon commented Jul 12, 2018

OK, I've added checkboxes to my list. Doc looks good, will check on files issues, and one remaining issue is doc how to specify pid in json. I tried using the sample file that was added but got this error:
{"status":"ERROR","message":"Please provide a persistent identifier, either by including it in the JSON, or by using the pid query parameter."}

@dlmurphy dlmurphy self-assigned this Jul 12, 2018
dlmurphy pushed a commit that referenced this issue Jul 12, 2018
Some minor edits, after consultation w/ Pete, Kevin, and Leonid.
@dlmurphy dlmurphy removed their assignment Jul 12, 2018
@pdurbin pdurbin added this to the 4.9.2 - Stata Upgrades, etc. milestone Jul 13, 2018
@kcondon kcondon closed this as completed Jul 13, 2018
@michbarsinai
Copy link
Member

One summery note on this: the automated integration tests ("docker-aio") were a crucial part of making sure this rather wide refactoring works. Kudos to @pdurbin for creating and maintaining them, and to @djbrooke for deciding we make sure they all pass before proceeding to manual testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests