Add support for OAI-harvesting from DataCite #10909

Open
landreev opened this issue Oct 4, 2024 · 4 comments · May be fixed by #11011
Assignees
Labels
FY25 Sprint 8 FY25 Sprint 8 (2024-10-09 - 2024-10-23) FY25 Sprint 9 FY25 Sprint 9 (2024-10-23 - 2024-11-06) FY25 Sprint 11 FY25 Sprint 11 (2024-11-20 - 2024-12-04) GREI 3 Search and Browse NIH CAFE Issues related to and/or funded by the NIH CAFE project Size: 80 A percentage of a sprint. 56 hours.

Comments

@landreev
Contributor

landreev commented Oct 4, 2024

DataCite maintains an OAI server (https://oai.datacite.org/oai) serving records for every DOI they have registered. There is a lot of interest in being able to harvest from them (since these are all registered DOIs, they will be redirecting to the original archival location of the actual studies/datasets etc.)

There are a couple of issues that must be addressed before our OAI client implementation is able to do that.

  1. The oai_dc import code in Dataverse expects the metadata fragment to be self-contained and, most importantly, to have the main persistent identifier (the DOI in this case) present in the <dc:identifier> field. DataCite, however, does not include the main DOI in the oai_dc record: since they use these DOIs as the OAI identifiers as well, they assume that it is enough to include them in the OAI record header, in the <identifier> field, like this:
<record>
  <header>
    <identifier>doi:10.7910/dvn/tjclkp</identifier>
    <datestamp>2023-01-03T21:08:00Z</datestamp>
    <setSpec>HARVARDU</setSpec>
    <setSpec>GDCC.HARVARD-DV</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
      <dc:title>Open Source at Harvard</dc:title>
      <dc:creator>Durbin, Philip</dc:creator>
      <dc:publisher>Harvard Dataverse</dc:publisher>
      <dc:date>2017</dc:date>
      <dc:date>Issued: 2017</dc:date>
      <dc:description>The tabular file contains information ...</dc:description>
      <dc:contributor>Durbin, Philip</dc:contributor>
      <dc:type>Dataset</dc:type>
    </oai_dc:dc>
  </metadata>
</record>

Without the <dc:identifier>, our code in its current form fails to import the record above.
All we need to do is add some logic to use the identifier from the OAI header in situations like this. (We actually used to do that in one of the previous iterations of the harvester.)
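
To illustrate the fallback described in item 1, here is a minimal Python sketch (not the actual Dataverse import code, which is Java). It assumes the first non-empty <dc:identifier> is the preferred persistent identifier and falls back to the OAI header <identifier> otherwise. The function names and the trimmed sample record are hypothetical; the sample leaves out the xsi:schemaLocation attribute so that it parses as a standalone fragment.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace URIs from the OAI-PMH and Dublin Core specifications.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def header_identifier(record: ET.Element) -> Optional[str]:
    """Return the <identifier> from the OAI record header, if any.

    Both the namespaced form (as in a full OAI-PMH response) and the
    un-namespaced form (as in a pasted fragment) are tried.
    """
    paths = (
        f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier",
        "header/identifier",
    )
    for path in paths:
        text = record.findtext(path)
        if text and text.strip():
            return text.strip()
    return None


def resolve_persistent_id(record_xml: str) -> Optional[str]:
    """Prefer <dc:identifier>; fall back to the OAI header identifier."""
    record = ET.fromstring(record_xml)
    # Assumption: the first non-empty dc:identifier is the one to use.
    for element in record.iter(f"{{{DC_NS}}}identifier"):
        if element.text and element.text.strip():
            return element.text.strip()
    return header_identifier(record)


# Hypothetical, trimmed-down version of the DataCite record shown above.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>doi:10.7910/dvn/tjclkp</identifier>
  </header>
  <metadata>
    <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
               xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
      <dc:title>Open Source at Harvard</dc:title>
    </oai_dc:dc>
  </metadata>
</record>"""

if __name__ == "__main__":
    print(resolve_persistent_id(SAMPLE_RECORD))
```

Whether the fallback should apply to every harvesting client or only to ones explicitly configured for it is a design question this sketch does not settle.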

  2. DataCite's OAI implementation offers a very promising feature: it accepts arbitrary search queries as OAI set names (https://support.datacite.org/docs/datacite-oai-pmh#arbitrary-queries). This would make it possible to harvest individual records by their DOIs (something we've been asked for specifically) or any other subset of their offerings.
    Example:
echo "doi%3A10.7910/DVN/TJCLKP" | base64 
ZG9pJTNBMTAuNzkxMC9EVk4vVEpDTEtQCg==

Now you can harvest this "set" made up of one dataset above, as in
https://oai.datacite.org/oai?verb=ListRecords&metadataPrefix=oai_dc&set=~ZG9pJTNBMTAuNzkxMC9EVk4vVEpDTEtQCg==
Unfortunately, for whatever reason, the above notation only works in ListRecords, not in ListIdentifiers, which is what Dataverse actually uses. From talking to DataCite, they may be able to fix it eventually, but not in an instant, "oh yeah, we just had this one line commented out" way.
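
For reference, the set name and ListRecords URL from the example above can be reproduced programmatically. This is a rough Python sketch with hypothetical function names. It deliberately mimics the shell pipeline, including the trailing newline that echo appends; I have not checked whether DataCite requires or merely tolerates that newline. The query is assumed to be already percent-encoded, as in the example.

```python
import base64
from urllib.parse import urlencode

DATACITE_OAI = "https://oai.datacite.org/oai"


def query_to_set_spec(query: str) -> str:
    """Encode a DataCite search query as a '~'-prefixed OAI set name.

    Mirrors `echo "<query>" | base64`, so a trailing newline is
    included in the encoded bytes (an assumption carried over from
    the shell example, not a documented requirement).
    """
    encoded = base64.b64encode((query + "\n").encode("utf-8"))
    return "~" + encoded.decode("ascii")


def list_records_url(query: str, metadata_prefix: str = "oai_dc") -> str:
    """Build a ListRecords URL for the 'set' defined by the query."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": query_to_set_spec(query),
    }
    # Keep '~' and '=' literal so the URL matches the hand-built one.
    return DATACITE_OAI + "?" + urlencode(params, safe="~=")


if __name__ == "__main__":
    print(list_records_url("doi%3A10.7910/DVN/TJCLKP"))
```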
We should go ahead and implement support for harvesting using ListRecords. It should be faster, if nothing else: for various historical reasons we currently handle harvesting via ListIdentifiers followed by GetRecord, one record at a time. It may also come in handy in other situations to have both modes supported (and configurable, perhaps per client).
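
As a rough sketch of what a ListRecords-based harvest loop looks like (in Python rather than the Java the harvester is written in, and with hypothetical function names), the client issues one ListRecords request and then follows resumptionToken values until the server returns an empty or missing token. Per the OAI-PMH spec, resumptionToken is an exclusive argument, so follow-up requests carry only the verb and the token.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def http_fetch(url):
    """Fetch a URL and return the raw response body."""
    with urlopen(url, timeout=60) as response:
        return response.read()


def harvest_list_records(base_url, metadata_prefix, set_spec=None,
                         fetch=http_fetch):
    """Yield every <record> element from a paged ListRecords harvest.

    `fetch` is injectable so the loop can be exercised without
    network access.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        for record in root.iter(OAI + "record"):
            yield record
        token = root.findtext(f".//{OAI}resumptionToken")
        if not token or not token.strip():
            break  # an empty or absent token marks the last page
        # resumptionToken is exclusive: send it with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": token.strip()}
```

Compared with ListIdentifiers followed by one GetRecord call per identifier, this needs one request per page of records rather than one per record, which is where the expected speed-up comes from.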

Clearly, we don't want to touch the current, JSF-based harvesting clients UI. But making the changes above, in the import and harvesting back end code, and then making it possible to set up or configure a client via the /api/harvest/clients API to take advantage of these improvements should be both useful and sufficient.

@cmbz cmbz added GREI 3 Search and Browse NIH CAFE Issues related to and/or funded by the NIH CAFE project labels Oct 4, 2024
@DS-INRAE DS-INRAE moved this to ⚠️ Needed/Important in Recherche Data Gouv Oct 7, 2024
@DS-INRAE
Member

DS-INRAE commented Oct 8, 2024

Item 1. is also true for other repositories, and would greatly enhance Dataverse's harvesting capacity 😃

@scolapasta
Contributor

#2 has been split off as #10936

@scolapasta scolapasta moved this to This Sprint 🏃‍♀️ 🏃 in IQSS Dataverse Project Oct 18, 2024
@scolapasta scolapasta added the Size: 10 A percentage of a sprint. 7 hours. label Oct 18, 2024
@landreev landreev self-assigned this Oct 18, 2024
@landreev landreev moved this from This Sprint 🏃‍♀️ 🏃 to In Progress 💻 in IQSS Dataverse Project Oct 18, 2024
@gwendoux gwendoux moved this to Interested in Cirad Dataverse Oct 21, 2024
@landreev
Contributor Author

There is an extra issue @scolapasta pointed out in #10937 that I'm adding here as task 3: currently the set name is stored in the database as a varchar(255). It should be changed to an unlimited text field, since it will be used for arbitrary DataCite search queries. For example, in our immediate use case this is likely going to be a very long list of individual DOIs.

@fgassert

Hi Folks,
Glad to see this moving forward 🙇 !
Just a comment that might inform the implementation of this. If you end up changing the harvesting client behavior to hit only ListRecords, this could potentially also allow for harvesting any static XML document that mirrors the ListRecords response. This opens the door to other potential workarounds for harvesting other metadata.

Here's an example:
https://groups.google.com/g/dataverse-community/c/XrQsCTVZzAE/m/vVIFL6xeDwAJ

landreev added a commit that referenced this issue Nov 1, 2024
…arvested datasets. #10909. (that whole block of extra checks on the harvest "style" may be redundant by now - I'll think about it)
@landreev landreev moved this from In Progress 💻 to On Hold ⌛ in IQSS Dataverse Project Nov 7, 2024
landreev added a commit that referenced this issue Nov 8, 2024
@landreev landreev moved this from On Hold ⌛ to In Progress 💻 in IQSS Dataverse Project Nov 21, 2024
@cmbz cmbz added the FY25 Sprint 11 FY25 Sprint 11 (2024-11-20 - 2024-12-04) label Nov 21, 2024
landreev added a commit that referenced this issue Nov 21, 2024
…p, since it's already got a script with .2 in the name. #10909
@landreev landreev added this to the 6.5 milestone Nov 21, 2024
landreev added a commit that referenced this issue Nov 23, 2024
…arvested datasets. #10909. (that whole block of extra checks on the harvest "style" may be redundant by now - I'll think about it)
@landreev landreev removed this from the 6.5 milestone Nov 25, 2024
@landreev landreev moved this from In Progress 💻 to On Hold ⌛ in IQSS Dataverse Project Nov 25, 2024
Projects
Status: Interested
Status: On Hold ⌛
Status: ⚠️ Needed/Important

5 participants