This relates to #2963, but I wanted to create a separate issue because it is a very different method of updating metadata in packages. I'm posting it here as an interesting option for other users and as something to consider for inclusion as a Quilt feature in a future release.
When creating packages, adding package-level metadata is usually straightforward. Adding metadata to individual objects is harder. In our case, we already store some metadata in the paths of our files, such as sample IDs and several other types of entity IDs, depending on the use case. Since Quilt already includes logic to validate individual entries in a package manifest, I found a way to use that same schema to infer metadata for objects from their paths.
When Quilt performs entry validation in a workflow it generates a list of Python dictionaries, with the keys logical_key, size, and meta:
    def get_pkg_entries_for_validation(pkg):
        # TODO: this should be validated without fully populating array.
        empty_dict = {}

        def reuse_empty_dict(meta):
            # Reuse the same empty dict for entries without meta
            # to reduce memory usage.
            return empty_dict if meta == {} else meta

        return [
            {
                'logical_key': lk,
                'size': e.size,
                "meta": reuse_empty_dict(e.meta),
            }
            for lk, e in pkg.walk()
        ]
The meta key refers to the user_meta subkey of the object's metadata. If you create a JSON schema that matches a logical_key using a regex pattern, it is possible to include named capture groups, e.g.:
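As an illustration only: the path layout, the pattern, and the group names (`study_id`, `sample_id`) below are hypothetical and not taken from a real Quilt workflow. An entry schema whose `logical_key` pattern carries named capture groups might look like this, written as a Python dict, with `re.search` showing what the groups capture for one example path:

```python
import re

# Hypothetical entry schema: the path layout and group names are assumptions.
ENTRY_SCHEMA = {
    "type": "object",
    "properties": {
        "logical_key": {
            "type": "string",
            # Python-style named groups: (?P<name>...)
            "pattern": r"^(?P<study_id>[^/]+)/(?P<sample_id>[^/]+)/[^/]+\.fastq\.gz$",
        },
        "meta": {
            "type": "object",
            # An entry whose meta lacks these keys fails this subschema.
            "required": ["study_id", "sample_id"],
        },
    },
    "required": ["logical_key"],
}

pattern = ENTRY_SCHEMA["properties"]["logical_key"]["pattern"]
match = re.search(pattern, "study42/S001/reads_R1.fastq.gz")

# groupdict() maps each named group to the text it captured.
captured = match.groupdict() if match else {}
print(captured)  # {'study_id': 'study42', 'sample_id': 'S001'}
```

The `required` list on `meta` is a deliberate choice in this sketch: the validator code in this issue only fills in `meta` when the existing `meta` fails its subschema, so an entry with empty metadata needs to be invalid before assignment for the captured values to be written.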
Normally, named captures have no effect during validation beyond serving as documentation. However, it is possible to extend a built-in jsonschema validator with additional logic. In our case, we have extended the object properties validator so that it assigns the captured values to the meta dictionary before proceeding with validation. This is the code used to do this:
    import re

    from jsonschema import Draft7Validator, validators


    def extend_with_meta_assignment(validator_class):
        validate_properties = validator_class.VALIDATORS["properties"]

        def set_meta_from_pattern(validator, properties, instance, schema):
            if not validator.is_type(instance, "object"):
                return

            if "logical_key" in properties and "meta" in properties:
                lkey_subschema = properties["logical_key"]
                meta_subschema = properties["meta"]
                if validator.is_valid(instance.get("logical_key"), lkey_subschema):
                    if not validator.is_valid(instance.get("meta"), meta_subschema):
                        meta = instance.setdefault("meta", {})
                        # Pattern has to match logical_key
                        m = re.search(lkey_subschema["pattern"], instance["logical_key"])
                        for prop, entity_id in m.groupdict().items():
                            meta[prop] = entity_id

            # Descend and process as normal
            for error in validate_properties(
                validator,
                properties,
                instance,
                schema,
            ):
                yield error

        return validators.extend(
            validator_class,
            {"properties": set_meta_from_pattern},
        )


    MetadataAssignmentValidator = extend_with_meta_assignment(Draft7Validator)
After validation with MetadataAssignmentValidator, the object that was passed in has updated meta fields based on the named captures in the pattern. This object can be used to update each PackageEntry before building/pushing the package.
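A minimal sketch of that last step, under stated assumptions: `apply_inferred_meta` is a hypothetical helper name, and it relies only on `pkg.walk()` yielding `(logical_key, entry)` pairs and on each entry exposing `set_meta(dict)`, which is how quilt3's `Package` and `PackageEntry` behave. It is not part of Quilt itself.

```python
def apply_inferred_meta(pkg, validated_entries):
    """Copy meta from validated entry dicts back onto the package entries.

    Hypothetical helper. `validated_entries` is the list of
    {'logical_key', 'size', 'meta'} dicts after validation has filled in meta.
    Returns the logical keys whose entries were updated.
    """
    meta_by_key = {e["logical_key"]: e.get("meta") for e in validated_entries}
    updated = []
    for logical_key, entry in pkg.walk():
        meta = meta_by_key.get(logical_key)
        if meta:  # skip entries for which nothing was inferred
            # set_meta replaces the entry's user metadata with this dict.
            entry.set_meta(dict(meta))
            updated.append(logical_key)
    return updated
```

With a real package, this would run between validation and the call that builds or pushes the package. Because `set_meta` replaces the user metadata rather than merging into it, the validated dicts should start from each entry's existing metadata so nothing is dropped.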
There are a couple of things to watch out for:
You want to be careful about matching multiple subschemas. The oneOf property is useful here:
    "type": "array",
    "items": {
        "oneOf": [ {...} ]
    }
Directly using the get_pkg_entries_for_validation function from the linked code above would be a mistake, because it saves memory by reusing a single empty dictionary for every package entry that has no metadata. Assigning captured values into that dictionary could make all fields appear on all items, since every such item's meta would be a reference to the same object.
This only works for Python-style regular expressions. Python writes named captures as (?P<name>...), while JavaScript writes them as (?<name>...), so if you want to maintain a single set of entry schemas for both validation and setting metadata, Quilt has to continue using a Python JSON schema implementation.
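To make the shared-dictionary caveat above concrete, here is a small sketch. The first part reproduces the aliasing with plain dicts. The second part, `entries_for_assignment`, is a hypothetical replacement (not a Quilt function) that gives every entry its own copy of its metadata, assuming the same `pkg.walk()`, `e.size`, and `e.meta` interface used by the linked code.

```python
# The shared-dict optimization in miniature: every entry without meta
# points at the same dict object.
shared_empty = {}
entries = [
    {"logical_key": "s1/a.txt", "meta": shared_empty},
    {"logical_key": "s2/b.txt", "meta": shared_empty},
]
entries[0]["meta"]["sample_id"] = "s1"
leaked = entries[1]["meta"]  # {'sample_id': 's1'}: the write leaked


def entries_for_assignment(pkg):
    """Hypothetical variant that gives each entry an independent meta dict."""
    return [
        {
            "logical_key": lk,
            "size": e.size,
            # Shallow copy: top-level keys can be added without touching
            # the entry's own metadata or any other entry's dict.
            "meta": dict(e.meta),
        }
        for lk, e in pkg.walk()
    ]
```

The copy is shallow, which is enough here because the validator only adds top-level keys. It gives up the memory saving of the original optimization, which may matter for very large packages.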
Linked code referenced above: quilt/api/python/quilt3/workflows/__init__.py, lines 264 to 280 in 7051b2b.