Initial dbt models to support GTFS guidelines checks #1712

Merged 6 commits on Aug 30, 2022
2 changes: 2 additions & 0 deletions warehouse/dbt_project.yml
@@ -65,3 +65,5 @@ models:
    mart:
      transit_database:
        schema: mart_transit_database
      gtfs_guidelines:
        schema: mart_gtfs_guidelines
25 changes: 25 additions & 0 deletions warehouse/macros/define_gtfs_guidelines.sql
@@ -0,0 +1,25 @@
-- declare checks
Contributor Author

Is there a specific reason you think that would be better? I thought macros were more accessible: I don't think we want the average dbt user editing dbt_project.yml often, and this table is going to go through a lot of iteration.

Contributor

IMO a macro for a string value is overkill, that's about it.

Contributor Author

Discussed offline. My summary: all the check values will be used in at least two places (in their actual staging check table and in the construction of the index). I wanted to make it easy to reference these hard-coded values if we do want to use them elsewhere, and to be able to update one location and be confident the change will propagate.

{% macro static_feed_downloaded_successfully() %}
"Static GTFS feed downloads successfully"
{% endmacro %}

{% macro no_validation_errors_in_last_30_days() %}
"No validation errors in last 30 days"
{% endmacro %}

-- declare features
{% macro compliant_on_the_map() %}
"Compliance"
{% endmacro %}


-- columns
{% macro gtfs_guidelines_columns() %}
date,
calitp_itp_id,
calitp_url_number,
calitp_agency_name,
check,
status,
feature
{% endmacro %}
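
For readers less familiar with dbt macros, the reasoning in the thread above (declare each check string once, reference it wherever it is needed) can be sketched in plain Python. This is an illustrative analogy only, not how dbt renders Jinja: the first two functions mirror the macros in this file, while `render_filter` and `render_index_row` are hypothetical helpers invented for the example.

```python
# Illustrative analogy only: dbt renders these values through Jinja macros, not Python.


def static_feed_downloaded_successfully() -> str:
    """Label for the download check, declared in exactly one place."""
    return '"Static GTFS feed downloads successfully"'


def no_validation_errors_in_last_30_days() -> str:
    """Label for the validation check, declared in exactly one place."""
    return '"No validation errors in last 30 days"'


def render_filter(check_label: str) -> str:
    """Hypothetical helper: the WHERE clause a staging check model would use."""
    return f"WHERE check = {check_label}"


def render_index_row(check_label: str, feature_label: str) -> str:
    """Hypothetical helper: one SELECT of the checks_implemented CTE."""
    return f"SELECT {check_label} AS check, {feature_label} AS feature"


if __name__ == "__main__":
    label = static_feed_downloaded_successfully()
    # Both consumers read the same declaration, so editing the label
    # in one place updates the filter and the index together.
    print(render_filter(label))
    print(render_index_row(label, '"Compliance"'))
```

Because both call sites read the same function, a wording change to a check cannot leave the index and the staging table out of sync, which is the property the author was after.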
48 changes: 48 additions & 0 deletions warehouse/models/mart/gtfs_guidelines/_gtfs_guidelines.yml
@@ -0,0 +1,48 @@
version: 2

models:
  - name: fact_daily_guideline_checks
    description: |
      Each row represents a date/guideline check/feed combination, with pass/fail information
      indicating whether that feed complied with that check on that date.

      Note that this table is only partially implemented; run "SELECT DISTINCT check"
      against it to list the checks that are currently evaluated.
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - date
            - calitp_itp_id
            - calitp_url_number
            - check
      - dbt_utils.equal_rowcount:
          compare_model: ref('stg_gtfs_guidelines__feed_guideline_index')
    columns:
      - name: date
        description: Date on which the check is being evaluated.
      - name: calitp_itp_id
        description: '{{ doc("column_calitp_itp_id") }}'
        meta:
          metabase.semantic_type: type/FK
      - name: calitp_url_number
        description: '{{ doc("column_calitp_url_number") }}'
        meta:
          metabase.semantic_type: type/FK
      - name: calitp_agency_name
        description: Human-readable agency name, provided for convenience.
        meta:
          metabase.semantic_type: type/Title
      - name: check
        description: |
          A string description of the GTFS guideline check being performed. For example,
          "Static GTFS feed downloads successfully".
      - name: status
        description: |
          Either "PASS" or "FAIL", indicating check status on the given date for the
          given feed.
        tests:
          - not_null
      - name: feature
        description: |
          A string label for the GTFS "feature" associated with the given check. For example,
          "Compliance".
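
The two model-level tests in this file assert that the date/feed/check key is unique and that the mart has as many rows as the index it is built from. A minimal Python sketch of what those assertions amount to, assuming rows are held as plain dictionaries (`duplicate_keys` and `equal_rowcount` are hypothetical names for illustration, not the dbt_utils implementation):

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]

# Same key columns as the unique_combination_of_columns test in the YAML.
KEY_COLUMNS: Tuple[str, ...] = ("date", "calitp_itp_id", "calitp_url_number", "check")


def duplicate_keys(rows: Sequence[Row], key_columns: Tuple[str, ...] = KEY_COLUMNS) -> List[tuple]:
    """Return every key tuple that occurs more than once; an empty list means the test passes."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def equal_rowcount(rows: Sequence[Row], compare_rows: Sequence[Row]) -> bool:
    """True when both tables have the same number of rows."""
    return len(rows) == len(compare_rows)


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    fact = [
        {"date": "2022-08-30", "calitp_itp_id": 1, "calitp_url_number": 0, "check": "a"},
        {"date": "2022-08-30", "calitp_itp_id": 1, "calitp_url_number": 0, "check": "b"},
    ]
    print(duplicate_keys(fact))        # no duplicates in this sample
    print(equal_rowcount(fact, fact))  # trivially equal
```

Taken together, the two checks guard against a staging model either dropping index rows or fanning them out through a join.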
@@ -0,0 +1,22 @@
{{ config(materialized='table') }}

-- start query
WITH stg_gtfs_guidelines__schedule_downloaded_successfully AS (
    SELECT * FROM {{ ref('stg_gtfs_guidelines__schedule_downloaded_successfully') }}
),

stg_gtfs_guidelines__no_validation_errors_in_last_30_days AS (
    SELECT * FROM {{ ref('stg_gtfs_guidelines__no_validation_errors_in_last_30_days') }}
),

fact_daily_guideline_checks AS (
    SELECT
        {{ gtfs_guidelines_columns() }}
    FROM stg_gtfs_guidelines__schedule_downloaded_successfully
    UNION ALL
    SELECT
        {{ gtfs_guidelines_columns() }}
    FROM stg_gtfs_guidelines__no_validation_errors_in_last_30_days
)

SELECT * FROM fact_daily_guideline_checks
@@ -0,0 +1,34 @@
{{ config(materialized='table') }}

WITH gtfs_schedule_fact_daily_feeds AS (
    SELECT * FROM {{ ref('gtfs_schedule_fact_daily_feeds') }}
),

gtfs_schedule_dim_feeds AS (
    SELECT * FROM {{ ref('gtfs_schedule_dim_feeds') }}
),

-- list all the checks that have been implemented
checks_implemented AS (
    SELECT {{ static_feed_downloaded_successfully() }} AS check, {{ compliant_on_the_map() }} AS feature
    UNION ALL
    SELECT {{ no_validation_errors_in_last_30_days() }}, {{ compliant_on_the_map() }}
),

-- create an index: all feed/date/check combinations
stg_gtfs_guidelines__feed_check_index AS (
    SELECT
        t2.calitp_itp_id,
        t2.calitp_url_number,
        t2.calitp_agency_name,
        t1.date,
        t1.feed_key,
        t3.check,
        t3.feature
    FROM gtfs_schedule_fact_daily_feeds AS t1
    LEFT JOIN gtfs_schedule_dim_feeds AS t2
        USING (feed_key)
    CROSS JOIN checks_implemented AS t3
)

SELECT * FROM stg_gtfs_guidelines__feed_check_index
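
The index model above pairs every feed/date row with every implemented check. A rough Python sketch of that cross join, assuming the inputs are small in-memory lists and dictionaries (`build_index` and the sample shapes are hypothetical, chosen only to mirror the column names in the query):

```python
from itertools import product
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]


def build_index(daily_feeds: List[Row], dim_feeds: Dict[str, Row], checks: List[Row]) -> List[Row]:
    """Pair each (feed_key, date) row with each check, like the CROSS JOIN in the model."""
    index: List[Row] = []
    for feed_day, check in product(daily_feeds, checks):
        # LEFT JOIN semantics: a feed_key missing from the dimension yields None values.
        dim = dim_feeds.get(feed_day["feed_key"], {})
        index.append(
            {
                "calitp_itp_id": dim.get("calitp_itp_id"),
                "calitp_url_number": dim.get("calitp_url_number"),
                "calitp_agency_name": dim.get("calitp_agency_name"),
                "date": feed_day["date"],
                "feed_key": feed_day["feed_key"],
                "check": check["check"],
                "feature": check["feature"],
            }
        )
    return index


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    daily = [{"feed_key": "f1", "date": "2022-08-29"}, {"feed_key": "f1", "date": "2022-08-30"}]
    dims = {"f1": {"calitp_itp_id": 1, "calitp_url_number": 0, "calitp_agency_name": "Example Transit"}}
    checks = [
        {"check": "Static GTFS feed downloads successfully", "feature": "Compliance"},
        {"check": "No validation errors in last 30 days", "feature": "Compliance"},
    ]
    for row in build_index(daily, dims, checks):
        print(row["date"], row["check"])
```

The row count of the result is always the number of feed/date rows times the number of checks, which is what lets each staging check model filter the index to a single check and join its status back on.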
@@ -0,0 +1,60 @@
WITH feed_guideline_index AS (
    SELECT * FROM {{ ref('stg_gtfs_guidelines__feed_guideline_index') }}
    WHERE check = {{ no_validation_errors_in_last_30_days() }}
),

validation_fact_daily_feed_codes AS (
    SELECT * FROM {{ ref('validation_fact_daily_feed_codes') }}
),

validation_dim_codes AS (
    SELECT * FROM {{ ref('validation_dim_codes') }}
),

validation_errors_by_day AS (
    SELECT
        feed_key,
        date,
        SUM(n_notices) AS validation_errors
    FROM validation_fact_daily_feed_codes
    LEFT JOIN validation_dim_codes USING (code)
    WHERE severity = "ERROR"
    GROUP BY feed_key, date
),

validation_errors_in_last_30_days_check AS (
    SELECT
        date,
        calitp_itp_id,
        calitp_url_number,
        calitp_agency_name,
        check,
        feature,
        -- Rolling sum over the current row plus the 30 rows before it (up to 31 rows);
        -- this matches a date range only if each feed has exactly one row per date.
        SUM(validation_errors)
            OVER (
                PARTITION BY
                    calitp_itp_id,
                    calitp_url_number
                ORDER BY date
                ROWS BETWEEN 30 PRECEDING AND CURRENT ROW
            ) AS errors_last_30_days
    FROM feed_guideline_index
    LEFT JOIN validation_errors_by_day USING (feed_key, date)
),

validation_errors_in_last_30_days_idx AS (
    SELECT
        date,
        calitp_itp_id,
        calitp_url_number,
        calitp_agency_name,
        check,
        -- status is NULL when errors_last_30_days is NULL,
        -- i.e. when no validation rows joined anywhere in the window
        CASE
            WHEN errors_last_30_days > 0 THEN "FAIL"
            WHEN errors_last_30_days = 0 THEN "PASS"
        END AS status,
        feature
    FROM validation_errors_in_last_30_days_check
)

SELECT * FROM validation_errors_in_last_30_days_idx
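
The rolling-window logic in this model can be traced with a small Python sketch. It assumes one entry per date for a single feed, already sorted by date, with `None` standing in for days where no validation rows joined; `rolling_error_sum` and `check_status` are hypothetical names, and the default `preceding=30` simply mirrors the frame clause in the query.

```python
from typing import List, Optional


def rolling_error_sum(daily_errors: List[Optional[int]], preceding: int = 30) -> List[Optional[int]]:
    """Sum each row with up to `preceding` rows before it, ignoring None like SQL SUM does."""
    totals: List[Optional[int]] = []
    for i in range(len(daily_errors)):
        window = daily_errors[max(0, i - preceding): i + 1]
        known = [value for value in window if value is not None]
        # SQL SUM over only NULLs returns NULL, so keep None when nothing is known.
        totals.append(sum(known) if known else None)
    return totals


def check_status(total: Optional[int]) -> Optional[str]:
    """Mirror the CASE expression: FAIL above zero, PASS at zero, otherwise no status."""
    if total is None:
        return None
    if total > 0:
        return "FAIL"
    if total == 0:
        return "PASS"
    return None


if __name__ == "__main__":
    # Made-up series: no data on day 0, two errors on day 2, clean afterwards.
    series: List[Optional[int]] = [None, 0, 2] + [0] * 31
    statuses = [check_status(total) for total in rolling_error_sum(series)]
    print(statuses[:3], statuses[-2:])
```

In this sample the errors on day 2 keep the feed failing until that day drops out of the frame, which makes the window length easy to verify by hand.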
@@ -0,0 +1,37 @@
WITH feed_guideline_index AS (
    SELECT * FROM {{ ref('stg_gtfs_guidelines__feed_guideline_index') }}
    WHERE check = {{ static_feed_downloaded_successfully() }}
),

gtfs_schedule_fact_daily_feeds AS (
    SELECT * FROM {{ ref('gtfs_schedule_fact_daily_feeds') }}
),

static_feed_downloaded_successfully_check AS (
    SELECT
        feed_key,
        date,
        CASE
            WHEN extraction_status = "success" THEN "PASS"
            WHEN extraction_status = "error" THEN "FAIL"
            ELSE NULL
        END AS status,
        {{ static_feed_downloaded_successfully() }} AS check
    FROM gtfs_schedule_fact_daily_feeds
),

static_feed_downloaded_successfully_check_idx AS (
    SELECT
        t1.date,
        t1.calitp_itp_id,
        t1.calitp_url_number,
        t1.calitp_agency_name,
        t1.check,
        t2.status,
        t1.feature
    FROM feed_guideline_index AS t1
    LEFT JOIN static_feed_downloaded_successfully_check AS t2
        USING (feed_key, date, check)
)

SELECT * FROM static_feed_downloaded_successfully_check_idx
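
As a rough illustration of this model, the status mapping and the join back onto the index might look like the following in Python. The sketch assumes the index has already been filtered to the download check, so joining on `(feed_key, date)` is equivalent to the query's `(feed_key, date, check)`; `download_status` and `attach_status` are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[object]]


def download_status(extraction_status: Optional[str]) -> Optional[str]:
    """Mirror the CASE expression: success -> PASS, error -> FAIL, anything else -> None."""
    return {"success": "PASS", "error": "FAIL"}.get(extraction_status)


def attach_status(index_rows: List[Row], daily_feeds: List[Row]) -> List[Row]:
    """LEFT JOIN the per-day status onto the index; unmatched index rows get status None."""
    status_by_key: Dict[Tuple[object, object], Optional[str]] = {
        (feed["feed_key"], feed["date"]): download_status(feed["extraction_status"])
        for feed in daily_feeds
    }
    return [
        {**row, "status": status_by_key.get((row["feed_key"], row["date"]))}
        for row in index_rows
    ]


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    index = [{"feed_key": "f1", "date": "2022-08-30"}, {"feed_key": "f2", "date": "2022-08-30"}]
    feeds = [{"feed_key": "f1", "date": "2022-08-30", "extraction_status": "success"}]
    for row in attach_status(index, feeds):
        print(row["feed_key"], row["status"])
```

Note that an unrecognized extraction status and a missing feed row both surface as a missing status, so the two cases cannot be told apart downstream.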
1 change: 1 addition & 0 deletions warehouse/scripts/run_and_upload.py
@@ -114,6 +114,7 @@ def get_command(*args) -> List[str]:
("views", "Data Marts (formerly Warehouse Views)"),
("gtfs_schedule", "GTFS Schedule Feeds Latest"),
("mart_transit_database", "Data Marts (formerly Warehouse Views)"),
("mart_gtfs_guidelines", "Data Marts (formerly Warehouse Views)"),
]:
subprocess.run(
[