As a data repository, I need to harvest additional metadata in OAI_DC records #4176

Open
solhm opened this issue Oct 5, 2017 · 8 comments
Labels
Feature: Harvesting, Feature: Metadata, Type: Feature, User Role: Sysadmin

Comments

@solhm
Contributor

solhm commented Oct 5, 2017

We migrated from 3.0 to 4.7, and when we retrieve OAI data using
oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=[identifierID] some elements are missing. As the screenshots show, the following elements, which were present in 3.0, are absent in the new version:
· dc:relation
· dc:description (citation)
· dc:coverage (time and geographic)
· dc:rights
Is there something we should configure to get the same response? We want to avoid affecting the external systems that harvest our metadata as it is currently exposed by our Dataverse 3.0.
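
For anyone wanting to reproduce the comparison, here is a minimal sketch of the request and of a check for missing elements. The base URL and identifier passed in are placeholders, and both helper names are hypothetical; only the `verb`, `metadataPrefix` and `identifier` parameters come from the request quoted above.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of the 15-element ("simple") Dublin Core used inside oai_dc records.
DC_NS = "http://purl.org/dc/elements/1.1/"


def get_record_url(base_url: str, identifier: str) -> str:
    """Build an OAI-PMH GetRecord URL that asks for the oai_dc format.

    base_url is the repository's OAI endpoint (a placeholder here).
    """
    query = urlencode({
        "verb": "GetRecord",
        "metadataPrefix": "oai_dc",
        "identifier": identifier,
    })
    return f"{base_url}?{query}"


def missing_dc_elements(record_xml: str, expected: list[str]) -> list[str]:
    """Return the expected dc:* element names that do not occur in the XML."""
    root = ET.fromstring(record_xml)
    prefix = "{" + DC_NS + "}"
    # ElementTree stores tags as "{namespace}localname"; keep the local name.
    present = {el.tag[len(prefix):] for el in root.iter() if el.tag.startswith(prefix)}
    return [name for name in expected if name not in present]
```

Running `missing_dc_elements` on the 3.0 and 4.7 responses with `["description", "relation", "coverage", "rights"]` would list the elements each version omits entirely. It does not detect the dc:description case, where the element is present but no longer carries the citation.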

v4.7
screenshot_4 7
v3.0
screenshot_3 0

@landreev
Contributor

landreev commented Oct 6, 2017

There appear to be several things going on:

The "dc:description" field:

It looks like in 3.0 we used to populate this field with BOTH the content of the "description" field from the dataset (study) metadata AND the citation, while in 4.* we are only exporting the description. Interestingly enough, you are the first person to notice this (this may be because most of our users use the DDI for harvesting?).

I don't know if it was dropped on purpose and if there was some specific reason for it (?). We'll have this reviewed by those on our team who normally handle all things metadata and exports (@jggautier ?). But technically, this would be a trivial fix, to put the citation text back there.

"dc:relation" field:

Similarly, it looks like back in 3.0 we were packing several different metadata elements into this field as well:

  • "Study related publications"
  • "Study related materials"
  • "Study related studies" (whatever these things were back in 3.0...)

In 4.* we are only exporting the contents of the "relatedDatasets" metadata field as "dc:relation".

(In your case, in the example above, what 3.0 metadata field did that text come from?).

Again, we'll need to review this, if it's just a matter of using this field to export some extra 4.* metadata fields...

"dc:rights" field:

OK, this is definitely a bug on our part; we simply dropped it from our OAI_DC exports, seemingly by mistake. Note that we are still exporting it as part of "DCTERMS", the "extended DC". In case this is already confusing: in 4.0 we export the metadata in TWO different DC formats: the original, 15-field Dublin Core that is used in OAI harvesting ("OAI_DC"), and "DCTERMS", the extended DC, with the 15 original + 40 (?) extra fields. DCTERMS is the format you get when you go to the "Metadata" tab on your dataset page, then click on "Export Metadata" and "Dublin Core". (Please try this with the dataset above; you should be getting a record with the "dcterms:rights" field in it.)
But, of course, "rights" IS one of the 15 original DC fields, so it should be included in the harvestable DC records as well. We'll fix this.

"dc:coverage" field:

This is also one of the 15 base DC fields. Yes, we were using this field for both the time and geo coverage in 3.0. In 4.* we appear to have switched to exporting these metadata values as "temporal" and "spatial" fields. However, both of these are extended, DCTERMS-only fields. So this is why they are missing in the harvestable, OAI_DC records.
But this also seems like an easy fix - we should just go back to packaging this information in the "dc:coverage" field, when cooking the harvestable DC records.
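
To make the distinction between the two formats concrete, here is a small sketch, not the actual Dataverse exporter. The 15 element names are the standard simple Dublin Core set; the fallback table and the function name are illustrative assumptions that only cover the two DCTERMS fields discussed above.

```python
# The 15 elements of simple Dublin Core, the only ones allowed in oai_dc records.
SIMPLE_DC_ELEMENTS = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})

# Illustrative fallback table (an assumption for this sketch): DCTERMS-only
# terms mapped to the simple element they refine.
DCTERMS_TO_SIMPLE_DC = {
    "spatial": "coverage",
    "temporal": "coverage",
}


def simple_dc_name(term: str):
    """Return the simple DC element to export a term under, or None."""
    if term in SIMPLE_DC_ELEMENTS:
        return term
    return DCTERMS_TO_SIMPLE_DC.get(term)
```

With this mapping, "rights" passes through unchanged, while "spatial" and "temporal" both end up as "coverage", which is the packaging that 3.0 used.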

@jggautier
Contributor

jggautier commented Oct 10, 2017

I'm still looking at other repositories, reading about best practices, and have emailed a few people about them, but right now I think that dc:rights, dc:coverage and dc:relation should be re-added in the ways described below.

Description for dataset citation
DCMI suggests (but doesn't recommend?) that bibliographic citations go in either an identifier element or description element (see the third section here).

<dc:description>Global Wheat Program; IWIN Collaborators; Ammar, Karim; Payne, Thomas, 2014, "46th International Durum Yield Nursery", http://hdl.handle.net/11529/10998 International Maize and Wheat Improvement Center, V2</dc:description>

Rights
Either the waiver (CC0) or metadata in each of the eight "Terms of Use" fields should be in its own dc:rights element, starting with the name of the field:

<dc:rights>Waiver: https://creativecommons.org/publicdomain/zero/1.0/</dc:rights>

or

<dc:rights>Terms of Use: Text</dc:rights>
<dc:rights>Confidentiality Declaration: Text</dc:rights>
<dc:rights>Special Permissions: Text</dc:rights>
...

Text in the Terms of Access, Availability Status and Contact for Access fields should also each go in their own dc:rights elements, starting with the name of the field:

<dc:rights>Terms of Access: Text</dc:rights>
<dc:rights>Availability Status: Text</dc:rights>
<dc:rights>Contact for Access: Text</dc:rights>

This is an addition to the Dataverse fields mapped to dc:rights in this Dataverse 3 crosswalk.
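
A rough sketch of the dc:rights rules described above, assuming the field values arrive as plain dictionaries keyed by field name. The function name, its parameters and the dictionary shape are hypothetical, not Dataverse's data model.

```python
from xml.sax.saxutils import escape

CC0_URL = "https://creativecommons.org/publicdomain/zero/1.0/"


def rights_elements(terms_of_use: dict, access_fields: dict, cc0_waiver: bool = False) -> list:
    """Return one <dc:rights> element per populated field.

    terms_of_use and access_fields map field names to text (hypothetical
    input shape). With a CC0 waiver, the waiver URL replaces the
    "Terms of Use" fields; access fields are always included.
    """
    values = []
    if cc0_waiver:
        values.append(f"Waiver: {CC0_URL}")
    else:
        values += [f"{name}: {text}" for name, text in terms_of_use.items() if text]
    values += [f"{name}: {text}" for name, text in access_fields.items() if text]
    # Escape &, < and > so free-text values cannot break the XML.
    return [f"<dc:rights>{escape(value)}</dc:rights>" for value in values]
```

Empty fields are skipped, so a dataset with a waiver and no access terms yields a single dc:rights element.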

Coverage for geospatial
In the example in @solhm's original post, it appears that if a dataset had multiple "Country/Nation" metadata, all were put in one dc:coverage element:

<dc:coverage>Country/Nation: ALGERIA, BOLIVIA, BURUNDI, CANADA, CYPRUS, EGYPT, ERITREA, INDIA, IRAN, ITALY, KAZAKHSTAN, MEXICO, MONGOLIA, MOROCCO, NEPAL, PAKISTAN, PORTUGAL, SERBIA, SUDAN, SWITZERLAND, TUNISIA, TURKEY
</dc:coverage>

Each geographic location should be in its own element.

<dc:coverage>BOLIVIA</dc:coverage>
<dc:coverage>ALGERIA</dc:coverage>
...

I removed the prepended "Country/Nation" text because a compound field may also include a city, state and/or "Other" value along with the country/nation. All fields in a compound field describe one geographic location and should be concatenated in one dc:coverage element, so starting that element with "Country/Nation" wouldn't make sense:

<dc:coverage>La Paz, Bolivia</dc:coverage>
<dc:coverage>Cambridge, MA, United States</dc:coverage>

From what I've seen, the fields should be ordered most specific to general, from left to right, so City, State, Country/Nation, Other.
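
As an illustration of that ordering, here is a minimal sketch that joins the subfields of one Geographic Coverage compound field. The function and parameter names are hypothetical; empty subfields are simply skipped.

```python
def geographic_coverage(city: str = "", state: str = "", country: str = "", other: str = "") -> str:
    """Join one compound field's subfields, most specific first.

    Order follows the suggestion above: City, State, Country/Nation, Other.
    """
    parts = [part.strip() for part in (city, state, country, other) if part and part.strip()]
    return ", ".join(parts)
```

Each compound field then produces exactly one dc:coverage value, whether it holds only a country or a full city, state and country.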

(The Dublin Core metadata you can export from a dataset page, which uses dcterms, puts the text in each field of one compound field in its own <dcterms:spatial> element, even though all fields in one compound field describe one location:

screen shot 2017-10-10

I think this should be changed so that all fields in the compound field are concatenated in one dcterms:spatial element.

<dcterms:spatial>El Tuma La Dalia, Nicaragua, Matagalpa</dcterms:spatial>)

Coverage for dates
Start and end date pairs in each Time Period Covered compound field should go in their own dc:coverage element:

<dc:coverage>2014</dc:coverage>
<dc:coverage>2000-2005</dc:coverage>
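
A small sketch of that rule, with hypothetical names. The example above joins the two dates with a hyphen, while the summary list later in this thread uses a forward slash, so the separator is left as a parameter rather than fixed.

```python
def time_period_coverage(start: str = "", end: str = "", separator: str = "-") -> str:
    """Return the dc:coverage value for one Time Period Covered pair.

    If only one of the two dates is filled in, that date is returned alone.
    """
    start, end = start.strip(), end.strip()
    if start and end:
        return f"{start}{separator}{end}"
    return start or end
```

Calling it with `separator="/"` produces the slash-separated form, for example for full ISO dates where a hyphen would be ambiguous.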

Relation
Looks like the three Dataverse 3.0 fields @landreev mentioned exist in Dataverse 4.x:

  • Related Publication = Study related publications
  • Related Material = Study related materials
  • Related Datasets = Study related studies

And in the above example OAI_DC record, the dc:relation text came from the "Other References" field in Dataverse 3 (of this dataset), which exists in Dataverse 4.

Related publication is a compound field in 4.x (judging by this Dataverse 3.x metadata crosswalk, it was one free text field in Dataverse 3.x). I think any fields within it should be in one dc:relation element:

<dc:relation>(1)Citation text, (2)ID Type: (3)ID Number, (4)URL</dc:relation>

screen shot 2017-10

<dc:relation>(1)Colin Allen, Hongliang Luo, Jaimie Murdock, Jianghuai Pu, 
Xiaohong Wang, Yanjie Zhai, Kun Zhao. (2017) Topic Modeling the Hàn diăn 
Ancient Classics (汉典古籍). Journal of Cultural Analytics. doi:10.22148/16.016 
(2)doi: (3)10.22148/16.016 (4)https://doi.org/10.22148/16.016</dc:relation>

Text in the other three fields should go in their own dc:relation elements.

<dc:relation>Other References: Elite Durum Yield Trial (EDYT)</dc:relation>
<dc:relation>Related Publication: Colin Allen, Hongliang Luo, Jaimie Murdock, Jianghuai Pu, 
Xiaohong Wang, Yanjie Zhai, Kun Zhao. (2017) Topic Modeling the Hàn diăn 
Ancient Classics (汉典古籍). Journal of Cultural Analytics. doi:10.22148/16.016 
doi: 10.22148/16.016 https://doi.org/10.22148/16.016</dc:relation>
<dc:relation>Related Material: Text</dc:relation>
<dc:relation>Related Datasets: Text</dc:relation>
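
The dc:relation values above could be assembled roughly like this. It is a sketch under the assumption that the Related Publication subfields are available as separate strings; the function names are made up for illustration.

```python
def related_publication(citation: str = "", id_type: str = "", id_number: str = "", url: str = "") -> str:
    """Concatenate one Related Publication compound field into one value.

    Subfields are joined with single spaces, and the ID is written as
    "type: number", matching the example above.
    """
    if id_number:
        id_part = f"{id_type}: {id_number}" if id_type else id_number
    else:
        id_part = ""
    parts = [part for part in (citation, id_part, url) if part]
    return "Related Publication: " + " ".join(parts)


def prefixed_relation(field_name: str, text: str) -> str:
    """Prefix a free-text relation field with its field name."""
    return f"{field_name}: {text}"
```

For instance, `prefixed_relation("Other References", "Elite Durum Yield Trial (EDYT)")` gives the text of the first dc:relation element shown above.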

@djbrooke
Contributor

@landreev @jggautier good details on this - thank you.

Can one of you leave a comment with a short list of the specific changes we should make so that we can get this estimated and into a sprint?

@jggautier
Contributor

jggautier commented Oct 13, 2017

Here's as short and as (hopefully not unnecessarily) specific as I can get it.

In the simplified Dublin Core export:

  • Include in <dc:description>: Dataset citation

<dc:description>Global Wheat Program; IWIN Collaborators; Ammar, Karim; Payne, Thomas, 2014, "46th International Durum Yield Nursery", http://hdl.handle.net/11529/10998 International Maize and Wheat Improvement Center, V2</dc:description>

  • Include in <dc:relation>: each set of Related Publication fields (concatenating Citation, ID Type, ID Number and URL, each separated with a space)

<dc:relation>Related Publication: Colin Allen, Hongliang Luo, Jaimie Murdock, Jianghuai Pu, 
Xiaohong Wang, Yanjie Zhai, Kun Zhao. (2017) Topic Modeling the Hàn diăn 
Ancient Classics (汉典古籍). Journal of Cultural Analytics. doi:10.22148/16.016 
doi: 10.22148/16.016 https://doi.org/10.22148/16.016</dc:relation>

  • Dataverse 4.x already maps Related Datasets to dc:relation. Prepend "Related Datasets: " to the value.

<dc:relation>Related Datasets: Text</dc:relation>

  • Include in <dc:relation>: Related Material, Other References (prepending the name of the field, a colon and a space)

<dc:relation>Related Material: Text</dc:relation>
<dc:relation>Other References: Text</dc:relation>

  • Include in <dc:rights>: Availability Status, Citation Requirements, Conditions, Confidentiality Declaration, Contact for Access, Data Access Place, Depositor Requirements, Disclaimer, Restrictions, Special Permissions, Terms of Access, Terms of Use (prepending the name of the field, a colon and a space) (exclude Original Archive, Size of Collection and Study Completion)

<dc:rights>Availability Status: Text</dc:rights>
<dc:rights>Citation Requirements: Text</dc:rights>
<dc:rights>Conditions: Text</dc:rights>
...

  • Include in <dc:rights>: Waiver (use the URL of CC0 instead of the text 'CC0')

<dc:rights>https://creativecommons.org/publicdomain/zero/1.0/</dc:rights>

  • Include in <dc:coverage>: each set of Time Period Covered fields (concatenating Start and End dates, separated with a forward slash)

<dc:coverage>2000</dc:coverage>
<dc:coverage>2000-10-15/2014-10-15</dc:coverage>

  • Include in <dc:coverage>: each set of Geographic Coverage fields (concatenating City, State/Province, Country/Nation, and Other, separated with a comma and a space)

<dc:coverage>El Tuma La Dalia, Matagalpa, Nicaragua</dc:coverage>
<dc:coverage>Cambridge, MA, United States</dc:coverage>
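
To show how the repeated elements in this list fit into one harvestable record, here is a sketch that serializes (element, value) pairs with Python's standard library. It is not the Dataverse exporter; the function name and input shape are assumptions, and the two namespace URIs are the standard OAI-PMH oai_dc and Dublin Core ones.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Make the serializer use the conventional prefixes instead of ns0, ns1.
ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)


def build_oai_dc(fields) -> str:
    """Serialize (element_name, value) pairs as an oai_dc:dc record.

    Element names may repeat, so each value gets its own element.
    """
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


record = build_oai_dc([
    ("rights", "Availability Status: Text"),
    ("coverage", "2000-10-15/2014-10-15"),
    ("coverage", "Cambridge, MA, United States"),
])
```

Passing a list of pairs rather than a dictionary keeps repeated elements such as the two dc:coverage values, which a dictionary keyed by element name would collapse into one.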

@jggautier jggautier changed the title OAI misses some metadatas in newer version As another data repository, I need additional metadata in OAI_DC records Oct 30, 2017
@jggautier jggautier changed the title As another data repository, I need additional metadata in OAI_DC records As another data repository, I need additional metadata in harvestable OAI_DC records Oct 30, 2017
@jggautier jggautier removed their assignment Oct 30, 2017
@jggautier jggautier changed the title As another data repository, I need additional metadata in harvestable OAI_DC records As a data repository, I need to harvest additional metadata in OAI_DC records Oct 30, 2017
@pdurbin pdurbin added Type: Feature a feature request User Role: Sysadmin Installs, upgrades, and configures the system, connects via ssh labels Oct 7, 2023
@DS-INRAE DS-INRAE moved this to ⚠️ Needed/Important in Recherche Data Gouv Jul 10, 2024
@pdurbin
Member

pdurbin commented Aug 1, 2024

I'm adding dc:rights in the following pull request:

Please feel free to leave a review to tell me if you think I'm adding it correctly! 😅

@jggautier
Contributor

jggautier commented Aug 15, 2024

Thanks @pdurbin. I'm spinning up an AWS instance of your branch so I can learn more about how dc:rights "is mapped (when available) to terms of use, restrictions, and license".

@cmbz

cmbz commented Sep 9, 2024

2024/09/09: Keeping open for now.

@jggautier
Contributor

The pull request at #10737 won't add dc:rights. @pdurbin and I figured it would be better to tackle dc:rights in other efforts. And it's related to #8129 and #5920.

In a comment in this GitHub issue back in 2017 I wrote about other information that we might consider adding to the OAI_DC records. Our next steps might be to:

  • Figure out how much of that information (specifically dc:description, dc:relation, and dc:coverage) has been added to OAI_DC since 2017, and update this GitHub issue if additional metadata is already being included in OAI_DC records.
  • Figure out if and how to also include "predefined licenses" and "custom dataset terms" in dc:rights.

Projects
Status: ⚠️ Needed/Important
Development

No branches or pull requests

6 participants