Adjust how new variable metadata is added to DDI XML exports so that the exports are valid against DDI schema #6554

Closed
jggautier opened this issue Jan 22, 2020 · 11 comments · Fixed by #6794

@jggautier
Contributor

The Data Curation Tool adds new metadata elements to the DDI XML that Dataverse produces. The variable-level elements universe and qstn (along with the sub-elements inside qstn for the literal question, interviewer instructions, and post-question text) appear in this example:

<var ID="v12560" name="var1" intrvl="discrete">
	<location fileid="f9842"/>
	<labl level="variable">var1</labl>
	<universe>This is the universe.</universe>
	<sumStat type="mean">5.5</sumStat>
	<varFormat type="numeric"/>
	<notes subject="Universal Numeric Fingerprint" level="variable" type="Dataverse:UNF">UNF:6:8OuUohmPouAgl2L3IrfN2A==</notes>
	<notes>
		<![CDATA[ These are the notes for var1, v12560. ]]>
	</notes>
	<qstn>
		<qstnLit>Is this a literal question?</qstnLit>
		<ivuInstr>These are instructions.</ivuInstr>
		<postQTxt>Is this the post question?</postQTxt>
	</qstn>
</var>

According to the DDI schema definition (view-source:https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd), the <qstn> element and its sub-elements are out of order here and should come before <universe>.

[Screenshot: Screen Shot 2020-01-21 at 9 10 03 PM]

So it should be:

<var ID="v12560" name="var1" intrvl="discrete">
	<location fileid="f9842"/>
	<labl level="variable">var1</labl>
	<qstn>
		<qstnLit>Is this a literal question?</qstnLit>
		<ivuInstr>These are instructions.</ivuInstr>
		<postQTxt>Is this the post question?</postQTxt>
	</qstn>
	<universe>This is the universe.</universe>
	<sumStat type="mean">5.5</sumStat>
	<varFormat type="numeric"/>
	<notes subject="Universal Numeric Fingerprint" level="variable" type="Dataverse:UNF">UNF:6:8OuUohmPouAgl2L3IrfN2A==</notes>
	<notes>
		<![CDATA[ These are the notes for var1, v12560. ]]>
	</notes>
</var>
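
The ordering rule described above can be spot-checked without a full schema validator. The following is only a minimal sketch, assuming the export can be parsed with Python's standard library; the function name vars_with_misplaced_qstn is hypothetical, and the check covers only the qstn-before-universe rule mentioned in this issue, not the rest of the schema.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def vars_with_misplaced_qstn(xml_text):
    """Return the IDs of <var> elements where <qstn> appears after <universe>.

    Hypothetical helper: it only checks the relative order of these two
    child elements and ignores every other schema constraint.
    """
    root = ET.fromstring(xml_text)
    bad = []
    for elem in root.iter():
        if local_name(elem.tag) != "var":
            continue
        children = [local_name(child.tag) for child in elem]
        if "qstn" in children and "universe" in children:
            if children.index("qstn") > children.index("universe"):
                bad.append(elem.get("ID"))
    return bad


# Trimmed version of the first example in this issue.
sample = """<var ID="v12560" name="var1" intrvl="discrete">
  <labl level="variable">var1</labl>
  <universe>This is the universe.</universe>
  <qstn><qstnLit>Is this a literal question?</qstnLit></qstn>
</var>"""
print(vars_with_misplaced_qstn(sample))
```

A check like this only flags the one ordering problem; validating against codebook.xsd is still needed to confirm the export is schema-valid.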

@lubitchv, do you think this validation problem could create interoperability issues with other systems that expect variable metadata that follows the DDI schema (for example, issues with importing or exporting metadata into or out of tools like Colectica)? As with other DDI export validation issues, I'm not sure how severe this validation error is. It's the first I've seen for the variable-level metadata. Please feel free to close this GitHub issue if it's minor. :)

@lubitchv
Contributor

@jggautier Thanks for spotting this validation problem. I do not know whether it will create issues with other tools, since I have not tested it with any. But this issue (the order of fields in the XML) should not be difficult to fix, and I will definitely look into it.

@djbrooke
Contributor

Thanks @lubitchv !

@lubitchv
Contributor

@jggautier What tools do you use for validating XML? I want to fix this issue.

@jggautier
Contributor Author

Hi @lubitchv. I use a website at https://www.freeformatter.com/xml-validator-xsd.html for validating XML against schemas. Not sure if it's the best tool, but it's free. Please let me know if you have any other questions.

@lubitchv
Contributor

Thanks @jggautier. I tried to see how it works on Dataverse DDI XML, so I created a dataset on demo: https://demo.dataverse.org/dataset.xhtml?persistentId=doi%3A10.70122%2FFK2%2F0QMKVB
I downloaded the DDI metadata and tried to validate it using that tool. I get the error "White Spaces Are Required Between PublicId And SystemId., Line '1', Column '50'." I am probably doing something wrong.

@jggautier
Contributor Author

jggautier commented Mar 30, 2020

That error is because the exported XML is still pointing to the wrong DDI schema URL, so you could change:

<codeBook xmlns="ddi:codebook:2_5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" version="2.5">

to

<codeBook xmlns="ddi:codebook:2_5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 https://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" version="2.5">

(This is fixed in #6553 for Dataverse 4.20.)
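
For exports produced before that fix, the schemaLocation change shown above could be applied to a downloaded file before validating it. This is only a rough sketch using Python's standard library, not how Dataverse itself makes the change; the function name fix_schema_location is hypothetical. Note that ElementTree re-serializes the document, so formatting may change and CDATA sections are written back as escaped text.

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
XSI_SCHEMA_LOCATION = "{%s}schemaLocation" % XSI_NS
OLD_PREFIX = "http://www.ddialliance.org/"
NEW_PREFIX = "https://www.ddialliance.org/"


def fix_schema_location(xml_text):
    """Return the export with the http:// schema URL swapped for https://.

    Hypothetical helper for illustration only.
    """
    # Register prefixes so the output keeps xmlns="ddi:codebook:2_5" and
    # xmlns:xsi instead of auto-generated ns0/ns1 prefixes. This changes
    # module-level state in ElementTree.
    ET.register_namespace("", "ddi:codebook:2_5")
    ET.register_namespace("xsi", XSI_NS)
    root = ET.fromstring(xml_text)
    location = root.get(XSI_SCHEMA_LOCATION)
    if location is not None:
        root.set(XSI_SCHEMA_LOCATION, location.replace(OLD_PREFIX, NEW_PREFIX))
    return ET.tostring(root, encoding="unicode")
```

The result can then be pasted into the validator in place of the original export.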

Once the right schema location is being referenced, you'll see a bunch of other errors as well, many of them also being fixed in 4.20 (#6650).

Would it be easier to use the unreleased 4.20 version of Dataverse to export the DDI and test with that, so it's easier to spot the validation errors regarding the variable metadata?

@lubitchv
Contributor

You are right, I should use the current develop branch that contains all these fixes. So I tried the new develop branch and got a bunch of errors, such as:

Cvc-enumeration-valid: Value 'DVN' Is Not Facet-valid With Respect To Enumeration '[archive, Producer]'. It Must Be A Value From The Enumeration., Line '1', Column '446'.
Cvc-attribute.3: The Value 'DVN' Of Attribute 'source' On Element 'verStmt' Is Not Valid With Respect To Its Type, '#AnonType_sourceGLOBALS'., Line '1', Column '446'.

I guess I should work through them to see what is going on.
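
One way to locate the elements behind an error like this is to scan the export for verStmt elements whose source attribute is outside the enumeration. The sketch below is an illustration only: the function name invalid_verstmt_sources is hypothetical, and the allowed values are read off the validator message, which appears to be title-cased, so treating them as lowercase is an assumption.

```python
import xml.etree.ElementTree as ET

# Assumption: values copied from the validator message and lowercased.
ALLOWED_SOURCES = {"archive", "producer"}


def invalid_verstmt_sources(xml_text):
    """Return 'source' values on <verStmt> elements that are not allowed.

    Hypothetical helper; it ignores namespaces and every other constraint.
    """
    root = ET.fromstring(xml_text)
    found = []
    for elem in root.iter():
        # Compare on the local name so namespaced exports also match.
        if elem.tag.rsplit("}", 1)[-1] != "verStmt":
            continue
        source = elem.get("source")
        if source is not None and source not in ALLOWED_SOURCES:
            found.append(source)
    return found
```

Run against an export, this lists each offending value (for example DVN) so the writer code that emits it can be tracked down.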

@jggautier
Contributor Author

jggautier commented Mar 30, 2020

Yes, that's one of the errors that wasn't fixed in #6650 because it didn't affect that issue's main goal of improving importing and exporting DDI XML. I called those errors "noise" for this issue because they have nothing to do with the variable-level metadata, so I hoped you could ignore them. If you'd really like to fix them, that would be great, but I think it would be difficult.

In another GitHub issue I found and suggested fixes for all of these errors: #3648 (comment). There's a Google Doc describing the trickier issues and example valid DDI XML files.

@jggautier
Contributor Author

Edit: I forgot to mention that the Google Doc doesn't include the fixes made in Dataverse 4.20. I planned to update it sometime after 4.20 is released on Demo Dataverse. But hopefully it and the example DDI XML files are helpful places to start...

@mheppler
Contributor

Related DDI XML improvements in the PR 3648 Change "DVN" to default value "producer" and only write the firs… #7094.

@mheppler
Contributor

PR #6794 was intended to close this issue, but there appears to be a misformatted closes comment, which left this issue open after it was merged on Apr 6, 2020 and released as part of 5.0.
