v0.3.0 deployment #57

Merged: 16 commits, Jan 31, 2024
2 changes: 1 addition & 1 deletion .github/workflows/vlmd_validation.yaml
@@ -26,7 +26,7 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install pytest jsonschema frictionless
pip install -r requirements.txt
- name: Test with pytest
run: |
pytest
2 changes: 1 addition & 1 deletion VERSIONS.json
@@ -1,4 +1,4 @@
{
"slmd":"1.0.0",
"vlmd":"0.2.0"
"vlmd":"0.3.0"
}
5 changes: 2 additions & 3 deletions requirements.txt
@@ -1,5 +1,4 @@
git+https://github.com/norc-heal/json-schema-for-humans.git@develop
frictionless
jsonschema
pytest
jinja2
jinja2
pandas
62 changes: 32 additions & 30 deletions variable-level-metadata-schema/README.md
@@ -7,28 +7,29 @@ This metadata directory contains the specifications for variable level metadata
❗ Look here for schema specifications.

### json data dictionary format specification
1. `schemas/jsonschema/data-dictionary.json`: The "json" json data dictionary schema (ie json template schema)

1. `schemas/data-dictionary.json`: The "json" json data dictionary schema (ie json template schema)
- Intended to specify the data dictionary representation of json objects available in the HEAL platform metadata-service.
- See here for the markdown rendered version --> [`docs/md-rendered-schemas/jsonschema-jsontemplate-data-dictionary.md`](docs/md-rendered-schemas/jsonschema-jsontemplate-data-dictionary.md)
- See here for the markdown rendered version --> [`docs/jsontemplate-data-dictionary.md`](docs/jsontemplate-data-dictionary.md)

### csv field format specifications
- See here for the markdown rendered version --> [`docs/md-rendered-schemas/jsonschema-jsontemplate-data-dictionary.md`](docs/md-rendered-schemas/jsonschema-csvtemplate-fields.md)

2. `schemas/csvtemplate/fields.json`: The "csv" json schema (ie csv template schema)

- See here for the markdown rendered version --> [`docs/csvtemplate-fields.md`](docs/csvtemplate-fields.md)


2. `schemas/frictionless/fields.json` Table schema (previously known as "frictionless") standard specification
- This json file is intended to represent csv data dictionary documents following the [Table Schema specification](https://specs.frictionlessdata.io/table-schema/).
- Csv version is intended to make data dictionary creation and discovery available in a more familiar/human readable format,
- The representation of data dictionary field values in a csv file. It's used to facilitate documentation of data dictionary csv
files in addition to input validation.
3. `schemas/jsontemplate/fields.json`: The "csv" json schema (ie csv template schema)
- :warning: The "csv" json schema is intended to be an intermediate specification used for documentation and in translation workflows to the json schema template. As fully specifying a tabular file (for example missing value specification) is out of scope here (see the table schema representation in (2))
- The csv version is intended to make data dictionary creation and discovery available in a more familiar, human readable format.
- It represents data dictionary field values in a csv file and is used to facilitate documentation of data dictionary csv files in addition to input validation.

- :warning: The "csv" json schema is intended to be an intermediate specification used for documentation and in translation workflows to the json data dictionary. Fully specifying a tabular file (for example, missing value specification) is out of scope here; see the table schema representation in (2).


## Document flow chart

```mermaid

%%{init: {"flowchart": {"defaultRenderer": "elk","htmlLabels": false}} }%%

flowchart TD

subgraph dictionary[Dictionary YAML files]
@@ -40,22 +41,21 @@ This metadata directory contains the specifications for variable level metadata

subgraph Schema specifications

jsonspec["schema/jsontemplate/data-dictionary.json"]
csvspec["schema/jsontemplate/csvtemplate/fields.json"]
csvtblspec["schema/frictionless/csvtemplate/fields.json"]
jsonspec["schema/data-dictionary.json"]
csvspec["schema/csvtemplate/fields.json"]
end

subgraph "Rendered schema documentation \n(html also available)"
subgraph "Rendered schema documentation"

csvmd["/docs/\nmd-rendered-schemas/\njsonschema-csvtemplate-fields.md"]
jsonmd["/docs/\nmd-rendered-schemas/\njsonschema-jsontemplate-data-dictionary.md"]
csvmd["/docs/csvtemplate-fields.md"]
jsonmd["/docs/jsontemplate-data-dictionary.md"]

end

defs --> fields --> dd
defs --> dd

fields --> csvspec --> csvtblspec
fields --> csvspec
dd --> jsonspec

csvspec --> csvmd
@@ -68,9 +68,8 @@ This metadata directory contains the specifications for variable level metadata
- `docs`:
See the rendered human readable schemas
in a markdown format and an interactive html format.
- `schemas/jsonschema`: contains the final and full specification for schemas following json schema.
- `schemas/frictionless`: contains schemas following the frictionless table schema specifications. See [here](https://specs.frictionlessdata.io/table-schema/) for the specification.
- `schemas/dictionary`: the yaml files used to generate json schemas and documentation with build.py.
- `schemas/*.json`: contains the final and full specification for schemas following json schema.
- `schemas/dictionary/*.yaml`: the yaml files used to generate json schemas and documentation with build.py.
- `templates`: empty templates in csv spreadsheet format and JSON format.
- `examples`: examples of filled out templates in csv spreadsheet format and JSON format.
- `build.py`: This script compiles the yaml files and generates associated schemas in addition to the human rendered schema
@@ -104,8 +103,8 @@ Given csv field values can only be scalar values with records separated by a new
- if type `object` in `items`: flattened to the children property or properties
- if type is a scalar (`string`,`integer`,`number`) in `items`,
translated to type `string` with pattern `^(?:[^|]+\||[^|]*)(?:[^|]*\|)*[^|]*$` to indicate a string containing a pipe delimiter (i.e., a stringified array with a pipe delimiter)
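For example, under the second rule above, a json-template property declared as an array of strings (the property name `categories` is purely illustrative, not an excerpt from the actual schemas):

```json
{ "..more props..":"...",
  "categories": {
    "type": "array",
    "items": {"type": "string"}
  }
}
```

becomes, in the csv template, a string constrained to the pipe-delimited pattern:

```json
{ "..more props..":"...",
  "categories": {
    "type": "string",
    "pattern": "^(?:[^|]+\\||[^|]*)(?:[^|]*\\|)*[^|]*$"
  }
}
```

so a csv cell value such as `red|green|blue` stands in for the json array `["red","green","blue"]`.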
### `property` name conversion rules
To facilitate the mapping of json spec property names to csv property names, the resulting flattened `property` names from the flattened properties should correspond to the [jsonpath](https://datatracker.ietf.org/doc/id/draft-goessner-dispatch-jsonpath-00.html) representation where:
### `property` name conversion rules (ie Representing nested arrays and objects in csv documents)
To facilitate the mapping of json spec property names to csv property names, the flattened `property` names should correspond to the [jsonpath](https://datatracker.ietf.org/doc/id/draft-goessner-dispatch-jsonpath-00.html) representation, expressed as a `patternProperty`:

1. type `object`

@@ -152,13 +151,15 @@ To facilitate the mapping of json spec property names to csv property names, th
}}}

```
translates to the csv stringified type array property:
translates to the csv stringified type array `patternProperty`:

```json
{ "..more props..":"...",
"standardsMappings[0].instrument.url": {
"type": "string",
"format": "uri"
"patternProperties":{
"^standardsMappings[\\d+].instrument.url$": {
"type": "string",
"format": "uri"
}
}
}
```
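In other words, the single `patternProperty` above is intended to cover however many indexed columns appear in a given csv: `standardsMappings[0].instrument.url`, `standardsMappings[1].instrument.url`, and so on, with one column per array entry.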
@@ -167,7 +168,7 @@

1. Currently, no complex types (`anyOf`,`oneOf`) are supported and the `type` MUST be specified. This is to ensure coverage for all csv to json translation use cases.
- Each json specification schema property type must be a scalar (e.g., `boolean`,`string`,`integer`,`number`), an `array`, or an `object`
- Each csv specification schema property type must be a scalar (e.g., `boolean`,`string`,`integer`,`number`)
- Each csv specification schema property type must be a scalar (e.g., `boolean`,`string`,`integer`,`number`) but see note on stringified arrays and objects.
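As an illustrative sketch of the first rule (property names are hypothetical, not from the HEAL schemas): a property such as `age` below, with an explicit scalar type, is supported, while a property such as `value`, declared only through `anyOf` and without a `type`, is not:

```json
{
  "age": { "type": "integer" },
  "value": { "anyOf": [ { "type": "integer" }, { "type": "string" } ] }
}
```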

### csv to json and json to csv translations

@@ -207,6 +208,7 @@ a core HEAL property. To allow these properties to be included, we list these pr

One consideration, however, is that `propertyNames` was introduced in json schema draft-6.
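As a minimal sketch of the idea (illustrative names only, not the actual HEAL schema), the subschema under `propertyNames` is applied to every property *name*, so additional, non-core columns can be admitted as long as their names match an allowed pattern:

```json
{
  "type": "object",
  "properties": {
    "name": { "type": "string" }
  },
  "propertyNames": {
    "pattern": "^(name|custom_.*)$"
  }
}
```

With this sketch, an instance containing only `name` and, say, `custom_notes` validates, while any other property name fails validation.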


## Considerations

Please use github issues for any additional considerations. See additional comments above.
Please use github issues for any additional considerations. See additional comments above.
142 changes: 43 additions & 99 deletions variable-level-metadata-schema/build.py
@@ -11,7 +11,6 @@
from collections.abc import MutableMapping, MutableSequence, MutableSet,Sequence
from functools import reduce
import jsonschema
from json_schema_for_humans.generate import generate_from_filename
import jinja2
import json

@@ -109,7 +108,7 @@ def to_csv_properties(schema,**additional_props):

return csv_schema

def flatten_properties(properties, parentkey="", sep=".",itemsep="[0]"):
def flatten_properties(properties, parentkey="", sep=".",itemsep="\[\d+\]"):
"""
flatten schema properties
"""
@@ -141,75 +140,32 @@ def flatten_properties(properties, parentkey="", sep=".",itemsep="\[\d+\]"):

def flatten_schema(schema):
schema_flattened = dict(schema)
properties = schema.get("properties")
if properties:
schema_flattened["properties"] = flatten_properties(properties)
return schema_flattened

def _to_frictionless_field(propname, prop, schema):
get_anyof = lambda propname: [
_prop.get(propname) for _prop in prop.get("oneOf", [])
]

# anyof is convenient way to reference multiple enum lists of same type
anyof = {
"type": [t for t in get_anyof("type") if t],
"enum": [val for enumlist in get_anyof("enum") for val in enumlist],
}
jsonfields = {
"name": propname,
"description": prop.get("description"),
"title": prop.get("title"),
"examples": prop.get("examples"),
"type": list(set(anyof.get("type", []) + [p for p in [prop.get("type")] if p])),
"enum": list(set(anyof.get("enum", []) + prop.get("enum", []))),
"pattern": prop.get("pattern"),
}
# add required
if propname in schema.get("required", []):
jsonfields["required"] = True

constraintfields = ["enum", "pattern", "required"]
targetfield = {}

for propname, prop in jsonfields.items():
if propname == "type":
targetfield[propname] = prop[0] if len(prop) == 1 else "any"
elif propname in constraintfields and prop:
if targetfield.get("constraints"):
targetfield["constraints"][propname] = prop
else:
targetfield["constraints"] = {propname: prop}
elif prop:
targetfield[propname] = prop

return targetfield


def to_frictionless(schema):
assert schema["type"] == "object"
assert "properties" in schema

frictionless_schema = {}

# schema level annotations
for propname in ["description", "title", "name", "examples"]:
if schema.get(propname):
frictionless_schema[propname] = schema[propname]

# get fields subschema
fields = schema["properties"]
frictionless_fields = []
for name, field in fields.items():
assert isinstance(field, MutableMapping), "all field properties must be jsons"
frictionless_fields.append(_to_frictionless_field(name, field, schema))

frictionless_schema["fields"] = frictionless_fields
frictionless_schema["missingValues"] = [
""
] # TODO: have a way to specify if anyOf is a missing val
return frictionless_schema
if "properties" in schema:
properties = schema_flattened.pop("properties")
item_sep = "\[\d+\]"
schema_flattened["properties"] = flatten_properties(properties,itemsep=item_sep)
schema_flattened["patternProperties"] = {}
for propname in list(schema_flattened["properties"].keys()):
if item_sep in propname:
var0 = propname.replace(item_sep,"[0]")
var1 = propname.replace(item_sep,"[1]")
var2 = propname.replace(item_sep,"[2]")
pattern_property_note = (
"\n\n"
"Specifying field names:\n\n"
"This field can have 1 or more columns using the digit index number in brackets (`[0]` --> `[1]` --> `[n]`)\n\n"
"For 1 value, you will have the field (column) names:\n"
"`{0}`\n\n"
# "\tFor 2 values, you will have the columns: "
# "`{0},`{1}`\n"
"For 3 values, you will have the field (column) names:\n"
"`{0}`\t`{1}`\t`{2}`\n\n"
).format(var0,var1,var2)
pattern_prop = schema_flattened["properties"].pop(propname)
pattern_prop["description"] = pattern_prop.get("description","") + pattern_property_note
schema_flattened["patternProperties"]["^"+propname+"$"] = pattern_prop

return schema_flattened

def run_pipeline_step(input, step):
"""function for input into the reduce functool
@@ -229,6 +185,7 @@ def run_pipeline_step(input, step):
raise Exception("Step must be at least of length 1")

def render_markdown(item,schema,templatefile):

env = jinja2.Environment(
loader=jinja2.FileSystemLoader("docs/assets/templates"),
trim_blocks=True,
@@ -242,6 +199,17 @@ def render_markdown(item,schema,templatefile):

def generate_template(schema):
template = {}
schema = dict(schema)
if 'patternProperties' in schema:
schema["properties"] = schema.get("properties",{})
for patternname,prop in schema["patternProperties"].items():
propname = (
patternname
.replace("^","")
.replace("$","")
.replace("\[\d+\]","[0]")
)
schema["properties"][propname] = prop
if 'properties' in schema:
for prop, prop_schema in schema['properties'].items():
if 'type' in prop_schema:
@@ -258,7 +226,7 @@
ref_schema = get_referenced_schema(prop_schema['$ref'])
template[prop] = generate_template(ref_schema)
return template

if __name__ == "__main__":
# compile frictionless schema fields
dictionary = load_all_yamls()
@@ -272,25 +240,9 @@ def generate_template(schema):
(lambda _schema: {"version":versions["vlmd"],**_schema},None)
]
json_data_dictionary = reduce(run_pipeline_step, json_pipeline, dictionary)
Path("schemas/jsonschema/data-dictionary.json").write_text(json.dumps(json_data_dictionary, indent=4))
Path("schemas/data-dictionary.json").write_text(json.dumps(json_data_dictionary, indent=4))

schema_version_prop = {"schemaVersion":json_data_dictionary["properties"]["schemaVersion"]}
csv_pipeline = [
# recursive fxn so need to grab items from overall dictionary for json paths
(resolve_refs, {"schema": dictionary}),
# no longer need the definitons as they have been resolved
(lambda _schema: _schema["fields"], None),
(flatten_schema, None),
(to_csv_properties,schema_version_prop),
(to_frictionless, None),
(lambda _schema: {"version":versions["vlmd"],**_schema},None)
]
frictionlessfields = reduce(run_pipeline_step, csv_pipeline, dictionary)
Path("schemas/frictionless/csvtemplate/fields.json").write_text(
json.dumps(frictionlessfields, indent=2)
)


# compile json schema fields
csv_pipeline = [
# recursive fxn so need to grab items from overall dictionary for json paths
@@ -302,15 +254,7 @@ def generate_template(schema):
(lambda _schema: {"version":versions["vlmd"],**_schema},None)
]
csvfields = reduce(run_pipeline_step, csv_pipeline, dictionary)
Path("schemas/jsonschema/csvtemplate/fields.json").write_text(json.dumps(csvfields, indent=4))

# generate json schema versions of field schemas for documentation

# generate html using the json-schema for human library
generate_from_filename("schemas/jsonschema/csvtemplate/fields.json",
"docs/html-rendered-schemas/jsonschema-csvtemplate-fields.html")
generate_from_filename("schemas/jsonschema/data-dictionary.json",
"docs/html-rendered-schemas/jsonschema-jsontemplate-data-dictionary.html")
Path("schemas/csvtemplate/fields.json").write_text(json.dumps(csvfields, indent=4))

# render and write markdown versions
csvfields_md = render_markdown(
@@ -322,8 +266,8 @@ def generate_template(schema):
schema=json_data_dictionary,
templatefile="jsontemplate.md"
)
Path("docs/md-rendered-schemas/jsonschema-csvtemplate-fields.md").write_text(csvfields_md)
Path("docs/md-rendered-schemas/jsonschema-jsontemplate-data-dictionary.md").write_text(json_dd_md)
Path("docs/csvtemplate-fields.md").write_text(csvfields_md)
Path("docs/jsontemplate-data-dictionary.md").write_text(json_dd_md)

# generate templates
Path("templates/template_submission.json").write_text(json.dumps([generate_template(json_data_dictionary)],indent=4))
@@ -16,8 +16,17 @@ The aim of this HEAL metadata piece is to track and provide basic information ab

{% for itemname,item in schema.properties.items() %}
{% include 'properties.md' %}

------

{% endfor %}

{% for itemname,item in schema.patternProperties.items() %}
{% set itemname = itemname.replace("^","").replace("$","").replace("\[\d+\]","[`number`]") %}
{% include 'properties.md' %}

------
{% endfor %}

## End of schema - Additional Property information

@@ -12,7 +12,11 @@ _version {{ schema.version }}_
### Properties for each `fields` record
{% set schema = item['items'] %}
{% for itemname,item in item['items']['properties'].items() %}

{% include 'properties.md' %}

------

{% endfor %}
{% endif %}
{% endfor %}