A Data Product must already exist so that the new component can be attached to it, and that Data Product must already contain a Snowflake Storage component.
This section includes the basic information that any Component of Witboost must have:
- Name: Display name for the new Data Product Component.
- Fully Qualified Name: The fully qualified name of the Workload. This field is optional; if you leave it empty, the system generates it.
- Description: A short description to help others understand what this Workload is for.
- Domain: The Domain of the Data Product this Workload belongs to. Choose it carefully, as it is a fundamental part of the Workload and cannot be changed afterwards. It must be the same Domain as the Data Product's.
- Data Product: The Data Product this Workload belongs to. Be sure to choose the right one.
- Identifier: Unique ID for this new entity inside the domain. You don't need to fill in this field; it is filled in automatically.
- Development Group: Development group of this Data Product. You don't need to fill in this field; it is filled in automatically.
- Depends On: If you want your Workload to depend on other components of the Data Product, select them here.
For this component, add the Snowflake Storage component as a dependency. If you have created a DBT Component, include it here as well.
Example:
Field name | Example value |
---|---|
Name | Airbyte Vaccinations Ingestion |
Description | Ingestion of vaccinations data |
Domain | domain:healthcare |
Data Product | system:healthcare.vaccinations.0 |
Identifier | healthcare.vaccinations.0.airbyte-vaccinations-ingestion |
Development Group | group:datameshplatform |
Depends On | urn:dmb:cmp:healthcare:vaccinations:0:snowflake-vaccinations-storage |
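
The example values above suggest a naming pattern for the Identifier and for the Depends On reference. The following is a minimal sketch of that pattern, inferred only from the example table; the function names, the `slugify` rule, and the display name "Snowflake Vaccinations Storage" are assumptions for illustration, not Witboost's actual implementation.

```python
import re


def slugify(name: str) -> str:
    """Lowercase the name and join its alphanumeric words with dashes (assumed rule)."""
    return "-".join(re.findall(r"[a-z0-9]+", name.lower()))


def component_identifier(domain: str, data_product: str, major_version: int, component_name: str) -> str:
    """Build an identifier shaped like the Identifier example value."""
    return f"{domain}.{data_product}.{major_version}.{slugify(component_name)}"


def component_urn(domain: str, data_product: str, major_version: int, component_name: str) -> str:
    """Build a reference shaped like the Depends On example value."""
    return f"urn:dmb:cmp:{domain}:{data_product}:{major_version}:{slugify(component_name)}"


if __name__ == "__main__":
    print(component_identifier("healthcare", "vaccinations", 0, "Airbyte Vaccinations Ingestion"))
    # Hypothetical display name for the storage component.
    print(component_urn("healthcare", "vaccinations", 0, "Snowflake Vaccinations Storage"))
```

In practice the Identifier is filled in for you, so a helper like this is mainly useful for checking that a Depends On value points at the component you expect.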
This section tells Airbyte where to find the file and how to read it. Currently, only files read over HTTPS are supported. For more information on these fields, refer to the Airbyte Files Source Documentation.
- Source name: A name to identify the source in the Airbyte instance.
- URL: The URL of the file, available on the internet, to use as the source.
- File Format: The format of the file available at the specified URL. Some formats may be experimental; refer to the Airbyte documentation for details. The default value is CSV.
- Storage Provider: The storage provider or location of the file(s) to be replicated. Only HTTPS is supported at the moment.
- User-Agent: Whether to add a User-Agent (an HTTP header that lets servers and network peers identify information about the requesting user) to the HTTPS request.
- Dataset Name: The name of the normalized table that will be created in the Snowflake destination, using the file specified in the URL field as the source. It may contain only letters, numbers, dashes and underscores.
Example:
Field name | Example value |
---|---|
Source name | Vaccinations |
URL | https://storage.googleapis.com/covid19-open-data/v3/vaccinations.csv |
File Format | csv |
Storage Provider | HTTPS: Public Web |
User-Agent | ❌ |
Dataset Name | Vaccinations_raw |
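
Two of the constraints above (HTTPS-only URLs and the allowed characters in the Dataset Name) can be checked before submitting the form. The sketch below is an illustrative pre-check only; `validate_source` and its error messages are hypothetical and are not part of Witboost or Airbyte.

```python
import re
from typing import List
from urllib.parse import urlparse

# Letters, numbers, dashes and underscores only, as stated for Dataset Name.
DATASET_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def validate_source(url: str, dataset_name: str) -> List[str]:
    """Return a list of problems found in the source settings (empty if none)."""
    errors = []
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.netloc:
        errors.append("URL must be an absolute HTTPS URL")
    if not DATASET_NAME_PATTERN.fullmatch(dataset_name):
        errors.append("Dataset Name may contain only letters, numbers, dashes and underscores")
    return errors


if __name__ == "__main__":
    print(validate_source(
        "https://storage.googleapis.com/covid19-open-data/v3/vaccinations.csv",
        "Vaccinations_raw",
    ))
```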
This section sets up the Snowflake destination and tells Airbyte where to store the data. By default, Airbyte creates two tables in the Snowflake instance: a raw table with the data in JSON format, and a normalized table with the values in columns, ready to be used.
- Destination name: A name to identify the Destination in the Airbyte instance.
- Database: (Optional) Enter the name of the database you want to sync data into. If left empty, the Domain name is used as the default value.
- Schema: (Optional) Enter the name of the default schema. If left empty, the value `<DP_name>_<DP_majorversion>` is used as the default value.
Example:
Field name | Example value |
---|---|
Destination name | Snowflake |
Database | HEALTHCARE |
Schema | vaccinations |
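
The fallback rules for Database and Schema can be sketched as a small function. This is only an illustration of the defaults described above; the function and parameter names are hypothetical, and whether the platform changes the casing of the resulting names is not stated here, so the sketch leaves values untouched.

```python
from typing import Optional, Tuple


def resolve_destination(
    domain: str,
    dp_name: str,
    dp_major_version: int,
    database: Optional[str] = None,
    schema: Optional[str] = None,
) -> Tuple[str, str]:
    """Apply the documented defaults when Database or Schema is left empty."""
    # Database falls back to the Domain name.
    resolved_database = database or domain
    # Schema falls back to <DP_name>_<DP_majorversion>.
    resolved_schema = schema or f"{dp_name}_{dp_major_version}"
    return resolved_database, resolved_schema


if __name__ == "__main__":
    # Both fields left empty: the defaults apply.
    print(resolve_destination("healthcare", "vaccinations", 0))
    # Both fields provided, as in the example table.
    print(resolve_destination("healthcare", "vaccinations", 0, "HEALTHCARE", "vaccinations"))
```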
Lastly, give a custom name to the Airbyte connection between the File source and the Snowflake destination.
- Connection Name: A name to identify the connection in the Airbyte instance. A common format is "Source Name <> Destination Name".
Remember the connection name, as you will need it to connect to Airbyte in the MWAA script.
Example:
Field name | Example value |
---|---|
Connection Name | Vaccinations <> Snowflake |
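
As a rough illustration of why the connection name matters later, the sketch below builds a name in the common "Source Name <> Destination Name" format and looks it up in a list of connections. The list-of-dictionaries shape and the `name` and `connectionId` keys are assumptions made for this example; they are not a confirmed Airbyte or MWAA interface.

```python
from typing import Dict, List, Optional


def connection_name(source_name: str, destination_name: str) -> str:
    """Build a name in the 'Source Name <> Destination Name' format."""
    return f"{source_name} <> {destination_name}"


def find_connection_id(connections: List[Dict[str, str]], name: str) -> Optional[str]:
    """Return the ID of the first connection with the given name, or None."""
    for connection in connections:
        if connection.get("name") == name:
            return connection.get("connectionId")
    return None


if __name__ == "__main__":
    name = connection_name("Vaccinations", "Snowflake")
    # Hypothetical connection listing, for illustration only.
    listing = [{"name": name, "connectionId": "example-id"}]
    print(name, find_connection_id(listing, name))
```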