This feature requires the quilt3 API version 3.3 or higher.
A Quilt workflow is a quality gate that you set on a bucket to ensure the quality of your data and metadata before it becomes a Quilt package. You can create as many workflows as you like to accommodate all of your data creation patterns.
Under the hood, Quilt workflows use JSON Schema to check that package metadata have the right shape. Metadata shape determines which keys are defined, their values, and the types of the values.
Ensuring the quality of your data has long-lasting implications:
- Consistency — if labels and other metadata don't use a consistent, controlled vocabulary, reuse becomes difficult and trust in data declines
- Completeness — if your workflows do not require users to include files, documentation, labels, etc. then your data is on its way towards becoming mystery data and ultimately junk data that no one can use
- Context — data can only be reused if users know where it came from, what it means, who touched it, and what the related datasets are
From the standpoint of querying engines like Amazon Athena, metadata that lacks consistency and completeness is extremely difficult to query longitudinally and depreciates over time (as team members change, platforms change, and tribal knowledge is lost).
- Ensure that labels are correct and drawn from a controlled vocabulary (e.g. ensure that the only labels in a package of images are either "bird" or "not bird"; avoid data entry errors like "birb")
- Ensure that users provide a README.md for every new package
- Ensure that included files are non-empty
- Ensure that every new package (or dataset) has enough labels so that it can be reused (e.g. Date, Creator, Type, etc.)
To get started, create a configuration file in your Quilt S3 bucket at s3://BUCKET-NAME/.quilt/workflows/config.yml.
Here's a complex example:
version:
  base: "1"
  catalog: "1"
workflows:
  alpha:
    name: Search for aliens
    is_message_required: true
  beta:
    name: Studying superpowers
    metadata_schema: superheroes
  gamma:
    name: Nothing special
    description: TOP SECRET
    is_message_required: true
    metadata_schema: top-secret
    handle_pattern: ^(employee1|employee2)/(staging|production)$
    entries_schema: must-have-foo-bar-meta
    catalog:
      package_handle:
        files: <%= username %>/<%= directory %>
        packages: <%= username %>/production
schemas:
  superheroes:
    url: s3://quilt-dev-metadata/schemas/superheroes.schema.json
  top-secret:
    url: s3://quilt-dev-metadata/schemas/top-secret.schema.json
  must-have-foo-bar-meta:
    url: s3://quilt-dev-metadata/schemas/must-have-foo-bar-meta.schema.json
Once a configuration file is present, you usually must specify a workflow when pushing a package from quilt3 to that bucket. You can, however, specify a default_workflow value at the top level of your config.yml file:
version:
  base: "1"
  catalog: "1"
default_workflow: "experiment"
is_workflow_required: False
workflows:
  experiment:
    name: Experiment
    metadata_schema: experiment-universal
schemas:
  experiment-universal:
    url: s3://quilt-dev-metadata/.quilt/workflows/schemas/experiment-universal.json
This specifies which workflow will be used (experiment) if a workflow parameter is not provided in the Package.push() API call or the CLI.
Otherwise, all push calls must specify a workflow or the push will fail.
If you wish for your users to be able to skip workflow validation altogether, you can make workflow validation optional with is_workflow_required: False at the top level of your config.yml file:
version:
  base: "1"
  catalog: "1"
is_workflow_required: False
Now your users can specify workflow=None in the Python API (or --workflow '' in the CLI) when they push packages (or drop the workflow parameter altogether).
quilt3.Package().push(
    'test/package',
    registry='s3://quilt-dev-metadata',
    workflow=None)
# Package test/package@06b2815 pushed to s3://quilt-dev-metadata
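The rules above for default_workflow and is_workflow_required can be read as a small decision procedure. The following is only a sketch of that reading, not quilt3's implementation: the function name resolve_workflow, the NOT_PROVIDED sentinel, the use of ValueError, and the assumption that is_workflow_required defaults to true are all illustrative.

```python
# Sentinel meaning "the caller did not pass workflow= at all" (hypothetical name).
NOT_PROVIDED = object()


def resolve_workflow(config, workflow=NOT_PROVIDED):
    """Return the workflow id to validate against, or None to skip validation.

    `config` is the parsed config.yml as a dict. This mirrors the prose rules
    only; it is not how quilt3 itself is written.
    """
    # Assumption: validation is required unless the config says otherwise.
    required = config.get("is_workflow_required", True)

    if workflow is NOT_PROVIDED:
        # Nothing was passed, so fall back to default_workflow if there is one.
        workflow = config.get("default_workflow")

    if workflow is None:
        if required:
            raise ValueError("Workflow required, but none specified.")
        return None  # validation is skipped

    if workflow not in config.get("workflows", {}):
        raise ValueError(f"Unknown workflow: {workflow!r}")
    return workflow


optional = {"is_workflow_required": False, "workflows": {}}
with_default = {"default_workflow": "experiment", "workflows": {"experiment": {}}}
print(resolve_workflow(optional, None))
print(resolve_workflow(with_default))
```

Under this reading, a config with no default_workflow and no is_workflow_required: False rejects any push that omits the workflow.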
Otherwise, once you have a configuration you must specify a workflow when pushing a package:
import quilt3
quilt3.Package().push('test/package', registry='s3://quilt-dev-metadata')
# QuiltException: Workflow required, but none specified.
Let's retry, adding the workflow= parameter:
quilt3.Package().push('test/package', registry='s3://quilt-dev-metadata', workflow='alpha')
# QuiltException: Commit message is required by workflow, but none was provided.
The above QuiltException is caused by is_message_required: true.
Here's how we can pass a commit message along with the workflow:
quilt3.Package().push(
    'test/package',
    registry='s3://quilt-dev-metadata',
    message='added info about UFO',
    workflow='alpha')
# Package test/package@bc9a838 pushed to s3://quilt-dev-metadata
Now let's push with workflow='beta':
quilt3.Package().push(
    'test/package',
    registry='s3://quilt-dev-metadata',
    workflow='beta')
# QuiltException: Metadata failed validation: 'superhero' is a required property.
We encountered another exception because the beta workflow specifies metadata_schema: superheroes.
Therefore, the test/package metadata must validate against the JSON Schema located at s3://quilt-dev-metadata/schemas/superheroes.schema.json:
{
"$schema": "http://json-schema.org/draft-07/schema#",
"$id": "http://example.com/superheroes.schema.json",
"properties": {
"superhero": {
"enum": [
"Spider-Man",
"Superman",
"Batman"
]
}
},
"required": [
"superhero"
]
}
Note that superhero is a required property of the package metadata, which is specified using set_meta.
quilt3.Package().set_meta({'superhero': 'Batman'}).push(
    'test/package',
    registry='s3://quilt-dev-metadata',
    workflow='beta')
# Package test/package@c4691d8 pushed to s3://quilt-dev-metadata
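To see why the first push failed and the second passed, it can help to check the metadata locally before pushing. The sketch below hand-rolls only the two keywords used by the superheroes schema (required and enum); the function name check_metadata is hypothetical, and a full Draft 7 validator covers far more than this.

```python
# The same constraints as superheroes.schema.json, reduced to a Python dict.
superheroes_schema = {
    "properties": {
        "superhero": {"enum": ["Spider-Man", "Superman", "Batman"]},
    },
    "required": ["superhero"],
}


def check_metadata(meta, schema):
    """Return a list of error strings; an empty list means the metadata passes.

    Only `required` and `enum` are handled. This is an illustration, not a
    JSON Schema implementation.
    """
    errors = []
    for key in schema.get("required", []):
        if key not in meta:
            errors.append(f"'{key}' is a required property")
    for key, rules in schema.get("properties", {}).items():
        if key in meta and "enum" in rules and meta[key] not in rules["enum"]:
            errors.append(f"{meta[key]!r} is not one of {rules['enum']}")
    return errors


print(check_metadata({}, superheroes_schema))
print(check_metadata({"superhero": "Batman"}, superheroes_schema))
```

An empty dict fails on the required key, matching the exception shown earlier, while {'superhero': 'Batman'} produces no errors.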
For the gamma workflow, both is_message_required: true and metadata_schema are set, so both message and package metadata are validated:
quilt3.Package().push(
    'test/package',
    registry='s3://quilt-dev-metadata',
    workflow='gamma')
# QuiltException: Metadata failed validation: 'answer' is a required property.
quilt3.Package().set_meta({'answer': 42}).push(
    'test/package',
    registry='s3://quilt-dev-metadata',
    workflow='gamma')
# QuiltException: Commit message is required by workflow, but none was provided.
quilt3.Package().set_meta({'answer': 42}).push(
    'test/package',
    registry='s3://quilt-dev-metadata',
    message='at last all is set up',
    workflow='gamma')
# Package test/package@6331508 pushed to s3://quilt-dev-metadata
In addition to package-level metadata, Quilt workflows enable you to validate package names and basic file metadata.
You must include the following schema version at the root of your config.yml in order for any catalog-specific features to function:
version:
  base: "1"
  catalog: "1"
By default the Quilt catalog auto-fills the package handle prefix when creating or revising a package according to the following logic:
- Packages tab: username (everything before the @ in your sign-in email). Equivalent to
  catalog:
    package_handle:
      packages: <%= username %>
- Files tab: parent directory name. Equivalent to
  catalog:
    package_handle:
      files: <%= directory %>
You can customize the default prefix with the package_handle key in one or both of the following places:
- Set catalog.package_handle.(files|packages) at the root of config.yml to affect all workflows
- Set workflows.WORKFLOW.catalog.package_handle.(files|packages) to affect the tabs and workflow in question
catalog:
  # default for all workflows for Packages tab
  package_handle:
    packages: analysis/
workflows:
  my-workflow:
    name: My workflow
    catalog:
      # defaults for my-workflow, different for each tab
      package_handle:
        files: <%= username %>/<%= directory %>
        packages: <%= username %>/production
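The placeholders in these templates expand to the values described earlier: username is everything before the @ in the sign-in email, and directory is the parent directory name. The sketch below only illustrates that substitution. The function expand_handle and the sample email are hypothetical, and the catalog's own template engine is not shown here.

```python
import re


def expand_handle(template, email, directory):
    """Fill <%= username %> and <%= directory %> placeholders in a template."""
    values = {
        # Everything before the @ in the sign-in email.
        "username": email.split("@", 1)[0],
        "directory": directory,
    }
    # Replace each <%= name %> placeholder with its value.
    return re.sub(r"<%=\s*(\w+)\s*%>", lambda m: values[m.group(1)], template)


print(expand_handle("<%= username %>/<%= directory %>", "jane@example.com", "images"))
print(expand_handle("<%= username %>/production", "jane@example.com", "images"))
print(expand_handle("analysis/", "jane@example.com", "images"))
```

A template without placeholders, such as analysis/, is returned unchanged, which is how a fixed prefix behaves.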
You can validate package names with WORKFLOW.handle_pattern, which accepts a JavaScript regular expression.
By default, patterns are not anchored. You can explicitly add start (^) and end ($) markers as needed.
workflows:
  my-workflow:
    name: My workflow
    handle_pattern: ^(employee1|employee2)/(production|staging)$
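The effect of anchoring is easy to check before committing a pattern to config.yml. The sketch below uses Python's re module as a stand-in for the JavaScript engine, on the assumption that the two behave the same for a pattern this simple; the sample handles are made up.

```python
import re

# Same alternation, with and without the ^ and $ anchors.
unanchored = re.compile(r"(employee1|employee2)/(production|staging)")
anchored = re.compile(r"^(employee1|employee2)/(production|staging)$")

handles = ["employee1/staging", "x-employee1/staging-old", "employee3/production"]

for handle in handles:
    # search() finds a match anywhere in the string, like an unanchored test.
    print(
        handle,
        "unanchored:", unanchored.search(handle) is not None,
        "anchored:", anchored.search(handle) is not None,
    )
```

The unanchored pattern accepts x-employee1/staging-old because the allowed name appears somewhere inside it; the anchored pattern rejects it.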
You can validate the names, sizes and metadata of files in the package with WORKFLOW.entries_schema. The provided schema runs against an array of objects known as package entries. Each package entry defines a logical key (its relative path and name in the parent package), size (in bytes) and metadata.
workflows:
  myworkflow-1:
    name: 'My workflow #1'
    entries_schema: must-contain-readme
  myworkflow-2:
    name: 'My workflow #2'
    entries_schema: must-contain-readme-summarize-at-least-1byte
    description: Must contain non-empty README.md and quilt_summarize.json
      at package root; no more than 4 files
  myworkflow-3:
    name: 'My workflow #3'
    entries_schema: must-have-foo-bar-meta
    description: 'Must contain at least one file with { foo: bar } metadata object'
schemas:
  must-contain-readme:
    url: s3://bucket/must-contain-readme.json
  must-contain-readme-summarize-at-least-1byte:
    url: s3://bucket/must-contain-readme-summarize-at-least-1byte.json
  must-have-foo-bar-meta:
    url: s3://bucket/must-have-foo-bar-meta.json
Requires a README
{
"type": "array",
"contains": {
"type": "object",
"properties": {
"logical_key": {
"type": "string",
"pattern": "^README\\.md$"
}
}
}
}
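As a rough illustration of what this schema asks for, the sketch below builds a small array of package entries and checks whether any logical key matches the README pattern. The entry values and the function name contains_readme are made up for the example, and the check assumes every entry carries a logical_key, so it is slightly stricter than the schema itself.

```python
import re

# Hypothetical package entries: logical key, size in bytes, and metadata.
entries = [
    {"logical_key": "README.md", "size": 120, "meta": {}},
    {"logical_key": "data/birds.csv", "size": 2048, "meta": {}},
]


def contains_readme(package_entries):
    """Mimic "contains" with a pattern on logical_key (a sketch only)."""
    pattern = re.compile(r"^README\.md$")
    return any(
        isinstance(entry, dict)
        and isinstance(entry.get("logical_key"), str)
        and pattern.search(entry["logical_key"]) is not None
        for entry in package_entries
    )


print(contains_readme(entries))
print(contains_readme([{"logical_key": "docs/README.md", "size": 10, "meta": {}}]))
```

Because the pattern is anchored, a README.md inside a subdirectory such as docs/ does not satisfy it; only one at the package root does.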
{
"$schema": "http://json-schema.org/draft-07/schema#",
"allOf": [
{
"type": "array",
"items": {
"type": "object",
"properties": {
"size": {
"type": "number",
"minimum": 1,
"maximum": 100000
}
}
},
"minItems": 2,
"maxItems": 4
},
{
"type": "array",
"contains": {
"type": "object",
"properties": {
"logical_key": {
"type": "string",
"pattern": "^README\\.md$"
}
}
}
},
{
"type": "array",
"contains": {
"type": "object",
"properties": {
"logical_key": {
"type": "string",
"pattern": "^quilt_summarize\\.json$"
}
}
}
}
]
}
Requires { "foo": "bar" } object as user-specified metadata in README.md
{
"type": "array",
"contains": {
"type": "object",
"properties": {
"logical_key": {
"type": "string",
"pattern": "^README\\.md$"
},
"meta": {
"type": "object",
"properties": {
"foo": {
"type": "string",
"pattern": "^bar$"
}
},
"required": ["foo"]
}
}
}
}
In Quilt, S3 buckets are like git branches but for data. With quilt3 you can browse any package and then push it to any bucket that you choose.
As a rule, cross-bucket pushes or "merges" reflect a change in a package's lifecycle. For example, you might push a package from my-staging-bucket to my-production-bucket as it matures and becomes trusted.
The catalog's Push to bucket feature can be enabled by adding a successors property to the config.
A successor is a destination bucket.
version:
  base: "1"
workflows:
  dummy:
    name: Dummy
successors:
  s3://bucket1:
    title: Staging
    copy_data: false
  s3://bucket2:
    title: Production
If copy_data is true (the default), all package entries will be copied to the destination bucket. If copy_data is false, all entries will remain in their current locations.
- Quilt workflows support the Draft 7 JSON Schema.
- JSON schemas can be stored in any accessible Amazon S3 bucket. Provided the path to the file given in config.yml is accessible, the schema can be used to validate your package metadata shape.
Quilt supports the default keyword.
If you wish to pre-populate dates in the Quilt catalog, you can use the custom keyword dateformat in your schemas. For example:
{
"type": "string",
"format": "date",
"dateformat": "yyyy-MM-dd"
}
The dateformat template follows Unicode Technical Standard #35.
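For readers more used to strftime-style directives, the sketch below shows what a yyyy-MM-dd template produces. The mapping table and the function format_date are hypothetical and cover only the three fields used in the example above, not the full set of UTS #35 field symbols.

```python
from datetime import date

# Hypothetical mapping for just the fields in "yyyy-MM-dd".
UTS35_TO_STRFTIME = {"yyyy": "%Y", "MM": "%m", "dd": "%d"}


def format_date(value, pattern):
    """Format a date using a tiny subset of UTS #35 field symbols."""
    for field, directive in UTS35_TO_STRFTIME.items():
        pattern = pattern.replace(field, directive)
    return value.strftime(pattern)


print(format_date(date(2024, 3, 5), "yyyy-MM-dd"))
```

Note that field symbols are case-sensitive: MM is the month, whereas lowercase mm in UTS #35 denotes minutes.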
Quilt supports the array data type.
You can use array if you need to define a list of metadata values for a metadata key. These elements can be of any type.
If the order in the list is not significant, use "arrays" (using "items" and "anyOf"):
{
"type": "array",
"items": {
"anyOf": [
{
"type": "string"
},
{
"type": "number"
}
]
}
}
With this Schema you can create a list of metadata values such as: ["Any string A", 123, "Any string B"] or [123, "Any string", 456]
If the order in the list is important and the list is fixed in length, then use "tuples" (using "items", "minItems", and "maxItems"):
{
"type": "array",
"items": [
{
"type": "string"
},
{
"type": "number"
}
],
"minItems": 2,
"maxItems": 2
}
With this Schema you can create strictly ordered lists, such as ["Any string", 123].
A list in the wrong order, such as [123, "Any string"], fails validation.
Remember that you should define "minItems" and "maxItems", or "minItems" and "additionalItems": false, because "tuples" must have a fixed size.
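The difference between the two shapes can be sketched with plain Python checks. These functions only mirror the two example schemas above (string-or-number arrays, and a fixed string-then-number tuple). The names are hypothetical, and this is not a general JSON Schema validator.

```python
def is_number(value):
    # bool is a subclass of int in Python, but JSON true/false are not numbers.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def valid_any_order(value):
    """Array form: any length, each item is a string or a number."""
    return isinstance(value, list) and all(
        isinstance(item, str) or is_number(item) for item in value
    )


def valid_tuple(value):
    """Tuple form: exactly [string, number], in that order."""
    return (
        isinstance(value, list)
        and len(value) == 2
        and isinstance(value[0], str)
        and is_number(value[1])
    )


print(valid_any_order([123, "Any string", 456]))
print(valid_tuple(["Any string", 123]))
print(valid_tuple([123, "Any string"]))
```

The same list [123, "Any string"] is accepted by the array form and rejected by the tuple form, which is the whole point of choosing one over the other.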
Instead of letting users set any metadata value, you can define a list of available options with enum:
{
"type": "array",
"items": {
"type": "string",
"enum": ["Fixed 1", "Fixed 2"]
}
}
With this Schema you can create a list of any length with predefined values, such as ["Fixed 1", "Fixed 2", "Fixed 1"].
{
  "type": "array",
  "items": [
    {
      "type": "string",
      "enum": ["Fixed 1", "Fixed 2"]
    }
  ],
  "minItems": 1,
  "additionalItems": false
}
With this Schema users are allowed to create tuples like ["Fixed 1"] or ["Fixed 2"].
If you want to provide users with a list of predefined metadata values but additionally let them add any values outside of this list, you can use the anyOf keyword:
{
"type": "array",
"items": {
"anyOf": [
{
"type": "string",
"enum": ["Fixed 1", "Fixed 2"]
},
{
"type": "string"
}
]
}
}
Metadata lists such as ["Fixed 1", "Fixed 2"], ["Fixed 1", "Any string"] or ["Any string 1", "Any string 2"] are all valid.
In certain use cases you may want metadata lists whose leading items come from predefined values, while the remaining items can be any other values. To do this, create tuples with "additionalItems": true:
{
  "type": "array",
  "items": [
    {
      "type": "string",
      "enum": ["Fixed 1", "Fixed 2"]
    }
  ],
  "minItems": 1,
  "additionalItems": true
}
With this Schema, lists such as ["Fixed 1", "Any string", 123] are valid, but ["Any string", 123] is invalid.
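Read as a plain rule, this tuple says that the first item must come from the predefined list and anything may follow. A minimal sketch of that rule, with the hypothetical name valid_prefixed_list, might look like this:

```python
FIXED = ["Fixed 1", "Fixed 2"]


def valid_prefixed_list(value):
    """First item must be one of FIXED; any further items are unconstrained."""
    return isinstance(value, list) and len(value) >= 1 and value[0] in FIXED


print(valid_prefixed_list(["Fixed 1", "Any string", 123]))
print(valid_prefixed_list(["Any string", 123]))
```

An empty list is rejected as well, because "minItems": 1 requires the constrained first item to be present.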
The following examples show how you can specify complex properties such as object, array, and compound enum types.
{
"type": "array",
"items": {
"type": "object",
"properties": {
"id": {
"default": 123,
"type": "number"
},
"name": {
"default": "Optional default value",
"type": "string"
}
}
}
}
{
"type": "array",
"enum": [
[1, 2, 3],
[3, 4, 5],
[6, 7, 8]
]
}
{
"type": "object",
"enum": [
{"id": 1},
{"id": 2},
{"id": 3}
]
}
{
"type": "array",
"enum": [
["miles", {
"format": "12h"
}],
["kilometers", {
"format": "24h"
}],
{
"name": "unspecified"
}
]
}
This allows for flexible and extensible schema definition, and hence validation, of complex metadata schemas to any depth.
Quilt currently uses Draft 7 JSON Schema, where tuples are validated with items and not prefixItems. The prefixItems keyword was added in Draft 2020-12 and is not currently supported.
See workflows-config_catalog-1.0.0.json and workflows-config-1.1.0.json.
- Only Draft 7 JSON Schemas are supported
- If a workflow schema includes a non-supported keyword, the user interface displays an unknown keyword: <non-supported keyword> error
- Schemas with $ref are not supported
- Schemas must be in an S3 bucket for which the Quilt user has read permissions