data.json harvesting fails after project-open-data.cio.gov is redirected #1895

Closed
FuhuXia opened this issue Jul 21, 2020 · 8 comments
Labels
bug Software defect or bug

Comments

@FuhuXia
Member

FuhuXia commented Jul 21, 2020

After https://project-open-data.cio.gov was redirected to https://resources.data.gov, data.json harvesting started to fail. The failure affects only some datasets in some data.json sources. Debugging shows that the jsonschema validator tries to fetch https://project-open-data.cio.gov/v1.1/schema/organization.json and raises an exception when the request is redirected to the resources.data.gov homepage.

How to reproduce

Harvest GSA's data.json source: https://open.gsa.gov/data.json

Expected behavior

The harvest should complete.

Actual behavior

The harvest gets stuck, and errors appear in the fetch log.

@FuhuXia
Member Author

FuhuXia commented Jul 21, 2020

Debugging shows that it only affects datasets whose publisher has three or more levels of organization, as in this one:

...
"license": "https://creativecommons.org/publicdomain/zero/1.0/",
"publisher": {
	"@type": "org:Organization",
	"name": "Data.gov",
	"subOrganizationOf": {
		"@type": "org:Organization",
		"name": "Technology Transformation Service",
		"subOrganizationOf": {
			"@type": "org:Organization",
			"name": "General Services Administration"
		}
	}
},
"accrualPeriodicity": "R/P1M",
...

At the third level of org:Organization, the validator goes out and reads https://project-open-data.cio.gov/v1.1/schema/organization.json.
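To find which datasets in a data.json source have a publisher chain this deep, a small helper can count the subOrganizationOf levels. This is only a sketch: the function names organization_depth and datasets_with_deep_publishers are hypothetical and not part of ckanext-datajson, and it assumes the catalog keeps its datasets in the top-level "dataset" array with an "identifier" on each entry, as in the POD v1.1 layout.

```python
def organization_depth(publisher):
    """Count the levels in a publisher chain linked by subOrganizationOf."""
    depth = 0
    node = publisher
    while isinstance(node, dict):
        depth += 1
        node = node.get("subOrganizationOf")
    return depth


def datasets_with_deep_publishers(catalog, min_depth=3):
    """Return identifiers of datasets whose publisher chain has at least min_depth levels."""
    hits = []
    for dataset in catalog.get("dataset", []):
        if organization_depth(dataset.get("publisher")) >= min_depth:
            hits.append(dataset.get("identifier"))
    return hits


# The publisher from the example above has three levels.
publisher = {
    "@type": "org:Organization",
    "name": "Data.gov",
    "subOrganizationOf": {
        "@type": "org:Organization",
        "name": "Technology Transformation Service",
        "subOrganizationOf": {
            "@type": "org:Organization",
            "name": "General Services Administration",
        },
    },
}
print(organization_depth(publisher))  # 3
```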

@FuhuXia
Member Author

FuhuXia commented Jul 21, 2020

POD schema files have been moved to new location at:

https://resources.data.gov/schemas/dcat-us/v1.1/schema/catalog.json
https://resources.data.gov/schemas/dcat-us/v1.1/schema/catalog.jsonld
https://resources.data.gov/schemas/dcat-us/v1.1/schema/dataset.json
https://resources.data.gov/schemas/dcat-us/v1.1/schema/distribution.json
https://resources.data.gov/schemas/dcat-us/v1.1/schema/organization.json
https://resources.data.gov/schemas/dcat-us/v1.1/schema/vcard.json
But in each JSON file, the id field still points to a URL on the old domain, project-open-data.cio.gov.

A mock harvesting environment shows that updating the id to the new location resolves the issue. But this means a mandatory schema content change or version update for anyone using this POD schema. I feel a more elegant solution is point-to-point redirecting rather than redirecting the whole domain. If we can keep URLs such as https://project-open-data.cio.gov/v1.1/schema/organization.json live and permanently redirected to https://resources.data.gov/schemas/dcat-us/v1.1/schema/organization.json, then both schema locations can co-exist.
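The point-to-point mapping proposed above can be sketched as a function from an old schema URL to its new location. This is an illustration of the mapping only, not the actual redirect configuration of either site; the names OLD_HOST, NEW_PREFIX, and redirect_target are hypothetical, and the sketch assumes every file under /v1.1/schema/ keeps its path beneath the new /schemas/dcat-us prefix, as the URLs listed above suggest.

```python
from urllib.parse import urlsplit

OLD_HOST = "project-open-data.cio.gov"
NEW_PREFIX = "https://resources.data.gov/schemas/dcat-us"


def redirect_target(old_url):
    """Map an old v1.1 schema URL to its new location, or return None if no rule applies."""
    parts = urlsplit(old_url)
    if parts.hostname != OLD_HOST:
        return None
    if not parts.path.startswith("/v1.1/schema/"):
        return None
    # Keep the original path so each schema file maps to exactly one target.
    return NEW_PREFIX + parts.path


print(redirect_target("https://project-open-data.cio.gov/v1.1/schema/organization.json"))
# https://resources.data.gov/schemas/dcat-us/v1.1/schema/organization.json
```

URLs outside /v1.1/schema/ return None here, so a whole-domain rule could still handle them separately.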

@avdata99
Contributor

@FuhuXia @adborden
This PR could help.
It's not easy to fix, because the new resources still refer to the old ones.
For example, https://resources.data.gov/schemas/dcat-us/v1.1/schema/organization.json still references the old URL.
We need to update these resources to point to the new links.

@adborden
Contributor

While we work out the longer-term fix with pod.c.g, I think a short-term fix is to create a complete jsonschema based on the v1.1 schema and use it locally within the CKAN extension.

Right now, we have a partial implementation of this. Our embedded schema in ckanext-datajson uses $ref to link to different entity definitions. Those references need to be fetched at runtime (probably during each harvest). Instead, we should be able to define the entire schema in a single file without references to p.c.g or r.d.g and pass that single schema to the jsonschema validator. Then we wouldn't have to worry about third-party systems (r.d.g or p.c.g) being online during harvesting.

Ultimately, we want to resolve the issue with p.c.g but we don't want to rush to an incomplete or short-sighted solution, so for this issue let's aim for a local fix.
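One way to check that a bundled schema really is self-contained is to walk it and list every "$ref" or "id" whose value is an absolute http(s) URL. This is a minimal sketch with a hypothetical function name, find_remote_urls; it is not part of ckanext-datajson or jsonschema. It only reports explicit URLs: a relative reference such as "#" can still be resolved against a remote "id" by the validator, which is why "id" values are reported as well.

```python
def find_remote_urls(schema, keys=("$ref", "id"), path="#"):
    """Return (json_path, url) pairs for every http(s) URL stored under the given keys."""
    found = []
    if isinstance(schema, dict):
        for key, value in schema.items():
            child = f"{path}/{key}"
            is_url = isinstance(value, str) and value.startswith(("http://", "https://"))
            if key in keys and is_url:
                found.append((child, value))
            else:
                # Recurse into nested subschemas, property maps, and so on.
                found.extend(find_remote_urls(value, keys, child))
    elif isinstance(schema, list):
        for index, item in enumerate(schema):
            found.extend(find_remote_urls(item, keys, f"{path}/{index}"))
    return found


# Hypothetical fragment shaped like the organization schema discussed in this issue.
organization_schema = {
    "id": "https://project-open-data.cio.gov/v1.1/schema/organization.json",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "subOrganizationOf": {"$ref": "#"},
    },
}
for location, url in find_remote_urls(organization_schema):
    print(location, url)
```

An empty result means the file contains no explicit remote URLs under those keys; it does not by itself prove that the validator will never go out to the network.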

@FuhuXia
Member Author

FuhuXia commented Jul 22, 2020

Good point about sticking to a localized schema file. As a matter of fact, we are already using a localized, all-in-one dataset schema file. This pod.c.g redirect issue made us aware of a bug in the ckanext-datajson/jsonschema module: the validator reads an external URL during harvesting. Here are my other findings:

  1. The catalog.json you mentioned, with external JSON $ref entries, is never used. The schema file jsonschema actually uses is dataset.json, in which every definition is defined locally.

  2. My test results suggest this is a bug in how jsonschema handles a recursive, self-referencing definition such as the one in subOrganizationOf.

  3. If we don't use a recursive $ref, the issue goes away. I spelled out the organization definition inside subOrganizationOf and tested locally; it works. But this means we have to manually embed the repeated definition several levels deep. The most complex organization I have seen so far is in the GSA data.json file, which has three levels (see the example above).

  4. If we use "$ref": "file:/path/to/organization.json" instead of "$ref": "#" in subOrganizationOf, the issue goes away too. But this means we may have to reveal the full path of the schema file to the public.
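The manual embedding described in finding 3 could be generated rather than written by hand. The sketch below is an assumption about how that might look, not the change that was tested: unroll_self_ref is a hypothetical helper that replaces a self-referencing property with inline copies of the schema to a fixed depth. It drops "id" and "$schema" from the nested copies so they do not introduce a new resolution scope, and it removes the property from the innermost copy, so anything nested deeper than the chosen depth is no longer validated against the organization definition.

```python
import copy


def unroll_self_ref(schema, prop, levels):
    """Return a copy of schema with properties[prop] inlined `levels` deep instead of {"$ref": "#"}."""
    def build(remaining):
        node = copy.deepcopy(schema)
        if remaining == 0:
            # Innermost copy: stop the recursion by dropping the property.
            node["properties"].pop(prop, None)
        else:
            child = build(remaining - 1)
            # Nested copies must not carry their own id or $schema.
            child.pop("id", None)
            child.pop("$schema", None)
            node["properties"][prop] = child
        return node

    return build(levels)


# Hypothetical, simplified organization schema with a recursive self reference.
organization = {
    "id": "https://project-open-data.cio.gov/v1.1/schema/organization.json",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "subOrganizationOf": {"$ref": "#"},
    },
}

# Two nested copies cover a three-level publisher such as the GSA example.
flat = unroll_self_ref(organization, "subOrganizationOf", levels=2)
```

The trade-off is the one noted in the finding: the depth has to be chosen in advance, and the schema grows with each level.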

@FuhuXia FuhuXia self-assigned this Jul 22, 2020
@FuhuXia
Member Author

FuhuXia commented Jul 22, 2020

A GitHub issue has been created on the jsonschema repo for this problem: the validator reads an external URL even though the definition is local.

@adborden
Contributor

project-open-data.cio.gov has been restored, so this is no longer an urgent issue, but still an issue.

@willson-chen

@adborden @FuhuXia Maybe this PR python-jsonschema/jsonschema#717 could solve the problem.

@mogul mogul added the bug Software defect or bug label Sep 10, 2020

6 participants