This repository has been archived by the owner on Jul 2, 2024. It is now read-only.

Commit

initial commit
Lucy Sheppard committed Jul 2, 2024
1 parent 53bf111 commit 9671094
Showing 16 changed files with 156,973 additions and 156,013 deletions.
66 changes: 61 additions & 5 deletions README.md
@@ -1,5 +1,61 @@
# 🚧 This repo has been archived
## 👇🏻 Please use one of the following options instead
- I want a [dbt-focused Jaffle Shop project](https://jaffle.sh/) that works with dbt Cloud or dbt Core with any adapter or setup.
- I want a [fork of the repo that was here](https://github.com/meltano/jaffle-shop-template) maintained by Meltano.
- I want a [community-maintained DuckDB + dbt + Evidence focused project](https://github.com/gwenwindflower/octocatalog) stewarded by the original author of this repo [@gwenwindflower](https://github.com/gwenwindflower).
# 🥪 The Jaffle Shop 🦘

This is a template for creating a fully functional dbt project for teaching, learning, writing, demoing, or any other scenarios where you need a basic project with a synthesized jaffle shop business.

## How to use

### 1. Click the big green 'Use this template' button and 'Create a new repository'.

![Click use template](.github/static/use-template.gif)

This will create a new repository exactly like this one, and navigate you there. Make sure to execute the next instructions in that repo.

### 2. Click 'Code', then 'Codespaces', then 'Create codespace on main'.

![Create codespace on main](.github/static/open-codespace.gif)

This will create a new `codespace`, a sandboxed devcontainer with everything you need for a dbt project. Once the codespace is finished setting up, you'll be ready to run a `dbt build`.

### 3. Make sure to wait until the codespace is finished setting up.

![Codespaces setup screen at postCreateCommand](.github/static/codespaces-setup-screen.png)

After the container is built and connected, VSCode will run a few cleanup commands and then a `postCreateCommand`, a set of commands that runs after the container is set up. This is where we install our dependencies (such as dbt, the DuckDB adapter, and other necessities) and run `dbt deps` to install the dbt packages we want to use. That screen will look something like the above; when it's completed, it will close and leave you at a fresh terminal prompt. From there you're ready to do some analytics engineering!
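The `postCreateCommand` itself lives in the repository's devcontainer configuration. Here is a minimal sketch of what such a file might look like; the image and package pins are illustrative assumptions, not the exact contents of this repo's `.devcontainer`:

```json
// .devcontainer/devcontainer.json (illustrative sketch, not this repo's actual file)
{
  "name": "jaffle-shop",
  "image": "mcr.microsoft.com/devcontainers/python:3.11",
  // Runs once after the container is created: install dbt and the DuckDB
  // adapter, then pull in the dbt packages declared in packages.yml.
  "postCreateCommand": "pip install dbt-core dbt-duckdb && dbt deps"
}
```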

## Using with Meltano

This project is preconfigured with a Meltano configuration file, `meltano.yml`. Meltano can be used as follows:

One-time workstation setup:

```console
> meltano install # Install the plugins declared by the project
```

Sample usage for end-to-end development:

```console
> meltano run el # Run the job titled 'el' to extract and load data
> meltano run t # Run the job titled 't' to transform data
> meltano run bi # Build and serve the Evidence BI reports
```
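Per the comments in `meltano.yml`, these jobs can also be chained or combined in a single invocation, for example:

```console
> meltano run el t  # Extract and load, then transform, in one invocation
> meltano run elt   # Or run the combined 'elt' job
```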

Dynamically build and serve the Evidence BI reports:

```console
> meltano invoke evidence:dev
```

Do a full end-to-end build on "prod":

```console
> meltano --environment=prod run elt evidence:build
```

## Contributing

We welcome issues and PRs requesting or adding new features. The package that generates the synthetic data, [`jafgen`](https://pypi.org/project/jafgen/), is also under active development and will add more types of source data to model as we go along. If you have tests, descriptions, new models, metrics, materialization types, or techniques you use this repo to demonstrate, and you feel they would make for a more expansive baseline experience, we encourage you to contribute them back so that this project becomes an even better collective tool for exploring and learning dbt over time.

## Anything else?

That's it! We jaff'd, we cried, we learned about life. If you have any questions or spot missing documentation, contributing back via an issue or PR is also super helpful.
7 changes: 0 additions & 7 deletions Taskfile.yml

This file was deleted.

1,881 changes: 942 additions & 939 deletions jaffle-data/raw_customers.csv

Large diffs are not rendered by default.

191,435 changes: 96,067 additions & 95,368 deletions jaffle-data/raw_items.csv

Large diffs are not rendered by default.

118,851 changes: 59,199 additions & 59,652 deletions jaffle-data/raw_orders.csv

Large diffs are not rendered by default.

10 changes: 5 additions & 5 deletions jaffle-data/raw_stores.csv
@@ -1,6 +1,6 @@
id,name,opened_at,tax_rate
7f790ed7-0fc4-4de2-a1b0-cce72e657fc4,Philadelphia,2016-09-01T00:00:00,0.06
08d44615-06d3-4086-a5d7-21395a1d975e,Brooklyn,2017-03-12T00:00:00,0.04
f6f2bd97-becb-4e1c-a611-20c7cf579841,Chicago,2018-04-29T00:00:00,0.0625
48b0172c-4490-4f05-b290-e69f418d0575,San Francisco,2018-05-09T00:00:00,0.075
ed2af26d-35a1-4a31-ac65-7aedcaa7b7a7,New Orleans,2019-03-10T00:00:00,0.04
74d66a05-2e08-41d6-b743-e5fe5ba754b4,Philadelphia,2016-09-01T00:00:00,0.06
de4cda82-d821-4a4f-85f1-d88a6ea32fc2,Brooklyn,2017-03-12T00:00:00,0.04
38e46ab3-e2c4-4453-81e6-fc4c4f4bfb11,Chicago,2018-04-29T00:00:00,0.0625
f62a04a4-0237-45bf-bdc2-ac78ff46c962,San Francisco,2018-05-09T00:00:00,0.075
73648058-5335-44b2-9438-65fb3533829d,New Orleans,2019-03-10T00:00:00,0.04
124 changes: 91 additions & 33 deletions meltano.yml
@@ -1,46 +1,104 @@
# Meltano Configuration File
#
# One-time workstation setup:
# > meltano install # Install the plugins declared by the project
#
# Sample usage:
# > meltano run tap-jaffle-shop target-duckdb
# Sample usage for end-to-end development:
# > meltano run el # Run the job titled 'el' to extract and load data
# > meltano run t # Run the job titled 't' to transform data
# > meltano run bi # Build and serve the Evidence BI reports
#
# Or equivalently:
# > meltano run el # Run the job named 'el' to extract and load data
# Repeat the same actions as above on "prod":
# > meltano --environment=prod run elt evidence:build

version: 1
project_id: Jaffle Shop Template Project

env:
JAFFLE_DB_PATH: ./reports/jaffle_shop.duckdb
JAFFLE_DB_NAME: jaffle_shop
JAFFLE_RAW_SCHEMA: jaffle_raw
jobs:
# Sample usage: `meltano run el`, `meltano run t`, `meltano run el t`, `meltano run elt`
- name: el # Extract and load the raw data
tasks:
- tap-jaffle-shop target-duckdb
- name: t # Transform the raw data
tasks:
- dbt-duckdb:run
- dbt-duckdb:test
- name: elt # Extract, Load, and Transform
tasks:
- tap-jaffle-shop target-duckdb
- dbt-duckdb:run
- dbt-duckdb:test
- name: bi # Launch the Evidence BI dev environment
tasks:
- evidence:dev
- name: bi-compile # Build BI reports and test for breakages
tasks:
- evidence:build-strict
- name: full-build # End-to-end build and test
tasks:
- tap-jaffle-shop target-duckdb
- dbt-duckdb:run
- dbt-duckdb:test
- evidence:build-strict

default_environment: dev
environments:
- name: dev
- name: dev
env:
JAFFLE_DB_PATH: ${MELTANO_PROJECT_ROOT}/reports/jaffle_shop.${MELTANO_ENVIRONMENT}-duckdb
JAFFLE_DB_NAME: jaffle_shop
JAFFLE_RAW_SCHEMA: tap_jaffle_shop
TAP_JAFFLE_SHOP_YEARS: '1'
- name: staging
env:
JAFFLE_DB_PATH: ${MELTANO_PROJECT_ROOT}/reports/jaffle_shop.${MELTANO_ENVIRONMENT}-duckdb
JAFFLE_DB_NAME: jaffle_shop
JAFFLE_RAW_SCHEMA: tap_jaffle_shop
TAP_JAFFLE_SHOP_YEARS: '3'
- name: prod
env:
JAFFLE_DB_PATH: ${MELTANO_PROJECT_ROOT}/reports/jaffle_shop.${MELTANO_ENVIRONMENT}-duckdb
JAFFLE_DB_NAME: jaffle_shop
JAFFLE_RAW_SCHEMA: tap_jaffle_shop
TAP_JAFFLE_SHOP_YEARS: '5'

plugins:
extractors:
- name: tap-jaffle-shop
namespace: tap_jaffle_shop
variant: meltanolabs
pip_url: git+https://github.com/MeltanoLabs/tap-jaffle-shop.git@v0.3.0
capabilities:
- catalog
- discover
config:
years: 2
stream_name_prefix: ${JAFFLE_RAW_SCHEMA}-raw_
- name: tap-jaffle-shop
namespace: tap_jaffle_shop
variant: meltanolabs
pip_url: git+https://github.com/MeltanoLabs/tap-jaffle-shop.git@v0.2.1
capabilities:
- catalog
- discover
config:
years: 1
stream_name_prefix: ${JAFFLE_RAW_SCHEMA}-raw_
loaders:
- name: target-duckdb
variant: jwills
pip_url: target-duckdb~=0.4
config:
filepath: ${JAFFLE_DB_PATH}
default_target_schema: $JAFFLE_RAW_SCHEMA

jobs:
# Sample usage: `meltano run el`
# Equivalent to: `meltano run tap-jaffle-shop target-duckdb`
- name: el # Extract and load the raw data
tasks:
- tap-jaffle-shop target-duckdb
- name: target-duckdb
variant: jwills
pip_url: target-duckdb~=0.4
config:
filepath: ${JAFFLE_DB_PATH}
- name: target-parquet
variant: estrategiahq
pip_url: git+https://github.com/estrategiahq/target-parquet.git
utilities:
- name: dbt-duckdb
variant: jwills
pip_url: dbt-core~=1.4.5 dbt-duckdb~=1.4.0 git+https://github.com/meltano/[email protected]
config:
project_dir: ${MELTANO_PROJECT_ROOT}
profiles_dir: ${MELTANO_PROJECT_ROOT}
path: ${JAFFLE_DB_PATH}
- name: evidence
variant: meltanolabs
pip_url: evidence-ext>=0.5
commands:
dev: dev
config:
home_dir: ${MELTANO_PROJECT_ROOT}/reports
settings:
duckdb:
# filename: ${MELTANO_PROJECT_ROOT}/reports/${JAFFLE_DB_NAME}.${MELTANO_ENVIRONMENT}.duckdb
filename: ${JAFFLE_DB_NAME}.${MELTANO_ENVIRONMENT}-duckdb
project_id: ff061732-bd27-4021-916f-e8f8b55fcf9d
5 changes: 4 additions & 1 deletion models/staging/__sources.yml
@@ -5,7 +5,10 @@ sources:
schema: "{{ env_var('JAFFLE_RAW_SCHEMA', 'jaffle_raw') }}"
description: E-commerce data
meta:
external_location: "read_csv_auto('./jaffle-data/{name}.csv', header=1)"
# If `$JAFFLE_RAW_SCHEMA` is specified, use the provided raw data. Otherwise, use the csv seed data from the repo.
external_location: >-
{{ '' if env_var('JAFFLE_RAW_SCHEMA', '') else 'read_csv_auto("./jaffle-data/{name}.csv", header=1)' }}
tables:
- name: raw_customers
description: One record per person who has purchased one or more items
2 changes: 1 addition & 1 deletion packages.yml
@@ -1,5 +1,5 @@
packages:
- package: dbt-labs/metrics
version: 1.5.0
version: 1.4.0
- package: dbt-labs/dbt_utils
version: 1.0.0
95 changes: 95 additions & 0 deletions plugins/loaders/target-duckdb--jwills.lock
@@ -0,0 +1,95 @@
{
"plugin_type": "loaders",
"name": "target-duckdb",
"namespace": "target_duckdb",
"variant": "jwills",
"label": "DuckDB",
"docs": "https://hub.meltano.com/loaders/target-duckdb--jwills",
"repo": "https://github.com/jwills/target-duckdb",
"pip_url": "target-duckdb~=0.4",
"description": "DuckDB loader",
"logo_url": "https://hub.meltano.com/assets/logos/loaders/duckdb.png",
"settings_group_validation": [
[
"filepath",
"default_target_schema"
]
],
"settings": [
{
"name": "filepath",
"kind": "string",
"label": "File Path",
"description": "Path to the local DuckDB file.",
"placeholder": "/path/to/local/duckdb.file"
},
{
"name": "batch_size_rows",
"kind": "integer",
"value": 100000,
"label": "Batch Size Rows",
"description": "Maximum number of rows in each batch. At the end of each batch, the rows in the batch are loaded into DuckDB."
},
{
"name": "flush_all_streams",
"kind": "boolean",
"value": false,
"label": "Flush All Streams",
"description": "Flush and load every stream into DuckDB when one batch is full. Warning - This may trigger the COPY command to use files with low number of records."
},
{
"name": "default_target_schema",
"kind": "string",
"value": "$MELTANO_EXTRACT__LOAD_SCHEMA",
"label": "Default Target Schema",
"description": "Name of the schema where the tables will be created. If schema_mapping is not defined then every stream sent by the tap is loaded into this schema."
},
{
"name": "schema_mapping",
"kind": "object",
"label": "schema_mapping",
"description": "Useful if you want to load multiple streams from one tap to multiple DuckDB schemas.\n\nIf the tap sends the stream_id in <schema_name>-<table_name> format then this option overwrites the default_target_schema value.\n"
},
{
"name": "add_metadata_columns",
"kind": "boolean",
"value": false,
"label": "Add Metadata Columns",
"description": "Metadata columns add extra row level information about data ingestions, (i.e. when was the row read in source, when was inserted or deleted in postgres etc.) Metadata columns are creating automatically by adding extra columns to the tables with a column prefix _SDC_. The column names are following the stitch naming conventions documented at https://www.stitchdata.com/docs/data-structure/integration-schemas#sdc-columns. Enabling metadata columns will flag the deleted rows by setting the _SDC_DELETED_AT metadata column. Without the add_metadata_columns option the deleted rows from singer taps will not be recognisable in DuckDB."
},
{
"name": "hard_delete",
"kind": "boolean",
"value": false,
"label": "Hard Delete",
"description": "When hard_delete option is true then DELETE SQL commands will be performed in DuckDB to delete rows in tables. It's achieved by continuously checking the _SDC_DELETED_AT metadata column sent by the singer tap. Due to deleting rows requires metadata columns, hard_delete option automatically enables the add_metadata_columns option as well."
},
{
"name": "data_flattening_max_level",
"kind": "integer",
"value": 0,
"label": "Data Flattening Max Level",
"description": "Object type RECORD items from taps can be transformed to flattened columns by creating columns automatically.\n\nWhen value is 0 (default) then flattening functionality is turned off.\n"
},
{
"name": "primary_key_required",
"kind": "boolean",
"value": true,
"label": "Primary Key Required",
"description": "Log based and Incremental replications on tables with no Primary Key cause duplicates when merging UPDATE events. When set to true, stop loading data if no Primary Key is defined."
},
{
"name": "validate_records",
"kind": "boolean",
"value": false,
"label": "Validate Records",
"description": "Validate every single record message to the corresponding JSON schema. This option is disabled by default and invalid RECORD messages will fail only at load time by DuckDB. Enabling this option will detect invalid records earlier but could cause performance degradation."
},
{
"name": "temp_dir",
"kind": "string",
"label": "Temporary Directory",
"description": "Directory of temporary CSV files with RECORD messages."
}
]
}
47 changes: 47 additions & 0 deletions plugins/loaders/target-parquet--estrategiahq.lock
@@ -0,0 +1,47 @@
{
"plugin_type": "loaders",
"name": "target-parquet",
"namespace": "target_parquet",
"variant": "estrategiahq",
"label": "Parquet",
"docs": "https://hub.meltano.com/loaders/target-parquet--estrategiahq",
"repo": "https://github.com/estrategiahq/target-parquet",
"pip_url": "git+https://github.com/estrategiahq/target-parquet.git",
"description": "Columnar Storage Format",
"logo_url": "https://hub.meltano.com/assets/logos/loaders/parquet.png",
"settings": [
{
"name": "disable_collection",
"kind": "boolean",
"label": "Disable Collection",
"description": "A boolean of whether to disable Singer anonymous tracking."
},
{
"name": "logging_level",
"label": "Logging Level",
"description": "(Default - INFO) The log level. Can also be set using environment variables."
},
{
"name": "destination_path",
"label": "Destination Path",
"description": "(Default - '.') The path to write files out to."
},
{
"name": "compression_method",
"label": "Compression Method",
"description": "Compression methods have to be supported by Pyarrow, and currently the compression modes available are - snappy (recommended), zstd, brotli and gzip."
},
{
"name": "streams_in_separate_folder",
"kind": "boolean",
"label": "Streams In Separate Folder",
"description": "(Default - False) The option to create each stream in a different folder, as these are expected to come in different schema."
},
{
"name": "file_size",
"kind": "integer",
"label": "File Size",
"description": "The number of rows to write per file. The default is to write to a single file."
}
]
}
