Run CKAN weekly, with multipart uploads as needed #1710

Merged 12 commits on Aug 29, 2022
2 changes: 1 addition & 1 deletion airflow/dags/macros.py
Original file line number Diff line number Diff line change
Expand Up @@ -145,6 +145,6 @@ def prefix_bucket(bucket):
"sql_airtable_mapping": airtable_mapping_generate_sql,
"is_development": is_development_macro,
"image_tag": lambda: "development" if is_development() else "latest",
"env_var": lambda key: os.getenv(key),
"env_var": os.getenv,
"prefix_bucket": prefix_bucket,
}
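
The `env_var` change above swaps a wrapper lambda for a direct reference to `os.getenv`. A minimal sketch of why the two should behave the same for the single-argument calls these templates make; the variable names `EXAMPLE_BUCKET` and `EXAMPLE_UNSET_VAR` are made up for illustration and are not part of the PR:

```python
import os

# Hypothetical variables, set up here only so the example is self-contained.
os.environ["EXAMPLE_BUCKET"] = "gs://example-bucket"
os.environ.pop("EXAMPLE_UNSET_VAR", None)

wrapped = lambda key: os.getenv(key)  # shape of the old macro
direct = os.getenv                    # shape of the new macro

# Both return the same value for a set variable and None for an unset one.
same_when_set = wrapped("EXAMPLE_BUCKET") == direct("EXAMPLE_BUCKET")
same_when_unset = (
    wrapped("EXAMPLE_UNSET_VAR") is None and direct("EXAMPLE_UNSET_VAR") is None
)

# The direct reference also accepts os.getenv's optional default argument,
# which the one-parameter lambda could not.
with_default = direct("EXAMPLE_UNSET_VAR", "fallback")
```

One side effect of the direct reference is that a template could pass a default as a second argument, though none of the templates in this PR do so.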
19 changes: 19 additions & 0 deletions airflow/dags/publish_open_data/METADATA.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
description: "Publishes data to various open data portals"
schedule_interval: "0 0 * * 1"
tags:
- all_gusty_features
default_args:
owner: airflow
depends_on_past: False
start_date: !days_ago 1
email:
- "[email protected]"
- "[email protected]"
- "[email protected]"
email_on_failure: True
email_on_retry: False
retries: 1
retry_delay: !timedelta 'minutes: 2'
concurrency: 50
#sla: !timedelta 'hours: 2'
latest_only: True
51 changes: 51 additions & 0 deletions airflow/dags/publish_open_data/publish_california_open_data.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
operator: 'operators.PodOperator'
name: 'publish-california-open-data'
image: 'ghcr.io/cal-itp/data-infra/warehouse:{{ image_tag() }}'

cmds:
- python3
arguments:
- '/app/scripts/publish.py'
- 'publish-exposure'
- 'california_open_data'
- '{% if is_development() %}--no-publish{% else %}--publish{% endif %}'
- '--bucket'
- "{{ env_var('CALITP_BUCKET__PUBLISH') }}"
- '--manifest'
- "{{ env_var('CALITP_BUCKET__DBT_ARTIFACTS') }}/latest/manifest.json"

is_delete_operator_pod: true
get_logs: true
is_gke: true
pod_location: us-west1
cluster_name: data-infra-apps
namespace: airflow-jobs

env_vars:
GOOGLE_APPLICATION_CREDENTIALS: /secrets/jobs-data/service_account.json

secrets:
- deploy_type: volume
deploy_target: /secrets/jobs-data/
secret: jobs-data
key: service-account.json

resources:
request_memory: 2.0Gi
request_cpu: 1

tolerations:
- key: pod-role
operator: Equal
value: computetask
effect: NoSchedule

affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: pod-role
operator: In
values:
- computetask
Original file line number Diff line number Diff line change
Expand Up @@ -19,11 +19,13 @@ cluster_name: data-infra-apps
namespace: airflow-jobs

env_vars:
CALITP_BUCKET__DBT_ARTIFACTS: "{{ env_var('CALITP_BUCKET__DBT_ARTIFACTS') }}"
BIGQUERY_KEYFILE_LOCATION: /secrets/jobs-data/service_account.json
DBT_PROJECT_DIR: /app
DBT_PROFILE_DIR: /app
DBT_TARGET: prod_service_account
DBT_TARGET: "{{ env_var('DBT_TARGET') }}"
NETLIFY_SITE_ID: cal-itp-dbt-docs

secrets:
- deploy_type: volume
deploy_target: /secrets/jobs-data/
Expand All @@ -42,11 +44,16 @@ secrets:
secret: jobs-data
key: netlify-auth-token

resources:
request_memory: 2.0Gi
request_cpu: 1

tolerations:
- key: pod-role
operator: Equal
value: computetask
effect: NoSchedule

affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
Expand Down
8 changes: 7 additions & 1 deletion airflow/dags/transform_warehouse/dbt_test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -24,18 +24,24 @@ env_vars:
BIGQUERY_KEYFILE_LOCATION: /secrets/jobs-data/service_account.json
DBT_PROJECT_DIR: /app
DBT_PROFILE_DIR: /app
DBT_TARGET: prod_service_account
DBT_TARGET: "{{ env_var('DBT_TARGET') }}"

secrets:
- deploy_type: volume
deploy_target: /secrets/jobs-data/
secret: jobs-data
key: service-account.json

resources:
request_memory: 2.0Gi
request_cpu: 1

tolerations:
- key: pod-role
operator: Equal
value: computetask
effect: NoSchedule

affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ env_vars:
BIGQUERY_KEYFILE_LOCATION: /secrets/jobs-data/service_account.json
DBT_PROJECT_DIR: /app
DBT_PROFILE_DIR: /app
DBT_TARGET: prod_service_account
DBT_TARGET: "{{ env_var('DBT_TARGET') }}"
NETLIFY_SITE_ID: cal-itp-dbt-docs
secrets:
- deploy_type: volume
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ env_vars:
BIGQUERY_KEYFILE_LOCATION: /secrets/jobs-data/service_account.json
DBT_PROJECT_DIR: /app
DBT_PROFILE_DIR: /app
DBT_TARGET: prod_service_account
DBT_TARGET: "{{ env_var('DBT_TARGET') }}"
secrets:
- deploy_type: volume
deploy_target: /secrets/jobs-data/
Expand Down
4 changes: 4 additions & 0 deletions airflow/docker-compose.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -80,12 +80,16 @@ x-airflow-common:
GOOGLE_CLOUD_PROJECT: cal-itp-data-infra

CALITP_BUCKET__AIRTABLE: "gs://test-calitp-airtable"
CALITP_BUCKET__DBT_ARTIFACTS: "gs://test-calitp-dbt-artifacts"
CALITP_BUCKET__GTFS_RT_RAW: "gs://test-calitp-gtfs-rt-raw"
CALITP_BUCKET__GTFS_RT_PARSED: "gs://test-calitp-gtfs-rt-parsed"
CALITP_BUCKET__GTFS_RT_VALIDATION: "gs://test-calitp-gtfs-rt-validation"
CALITP_BUCKET__GTFS_SCHEDULE_RAW: "gs://test-calitp-gtfs-schedule-raw"
CALITP_BUCKET__GTFS_SCHEDULE_VALIDATION: "gs://test-calitp-gtfs-schedule-validation"
CALITP_BUCKET__GTFS_SCHEDULE_UNZIPPED: "gs://test-calitp-gtfs-schedule-unzipped"
CALITP_BUCKET__PUBLISH: "gs://test-calitp-publish"

DBT_TARGET: staging_service_account

# TODO: this can be removed once we've confirmed it's no longer in Airtable
GRAAS_SERVER_URL: $GRAAS_SERVER_URL
Expand Down
25 changes: 15 additions & 10 deletions docs/publishing/sections/8_ckan.md
Original file line number Diff line number Diff line change
Expand Up @@ -56,15 +56,20 @@ update the `meta` field to map the dbt models to the appropriate UUIDs.

An example from the latest-only GTFS data exposure.
```yaml
meta:
destinations:
- type: ckan
bucket: gs://calitp-publish
format: csv
url: https://data.ca.gov/api/3/action/resource_update
ids:
agency: e8f9d49e-2bb6-400b-b01f-28bc2e0e7df2
routes: c6bbb637-988f-431c-8444-aef7277297f8
meta:
methodology: |
Cal-ITP collects the GTFS feeds from a statewide list [link] every night and aggregates them into a statewide table
for analysis purposes only. Do not use for trip planner ingestion; the data is meant to be used for statewide
analytics and other use cases. Note: These data may or may not have passed GTFS-Validation.
coordinate_system_espg: "EPSG:4326"
destinations:
- type: ckan
bucket: gs://calitp-publish
format: csv
url: https://data.ca.gov
ids:
agency: e8f9d49e-2bb6-400b-b01f-28bc2e0e7df2
routes: c6bbb637-988f-431c-8444-aef7277297f8
```
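
As a rough illustration of how the `ids` mapping ties dbt models to CKAN resources, the sketch below walks a `meta` block like the one above. It is not the code in `scripts/publish.py`; the function name `ckan_resources` and the plain-dict representation are assumptions made for the example.

```python
from typing import Iterator, Tuple

# Mirrors the `meta` block above as a plain dict (trimmed to two ids).
meta = {
    "destinations": [
        {
            "type": "ckan",
            "bucket": "gs://calitp-publish",
            "format": "csv",
            "url": "https://data.ca.gov",
            "ids": {
                "agency": "e8f9d49e-2bb6-400b-b01f-28bc2e0e7df2",
                "routes": "c6bbb637-988f-431c-8444-aef7277297f8",
            },
        }
    ]
}


def ckan_resources(meta: dict) -> Iterator[Tuple[str, str, str]]:
    """Yield (model name, resource UUID, portal URL) for each CKAN destination.

    Hypothetical helper for illustration only.
    """
    for destination in meta.get("destinations", []):
        if destination.get("type") != "ckan":
            continue  # skip any non-CKAN destination types
        for model, resource_id in destination["ids"].items():
            yield model, resource_id, destination["url"]


resources = list(ckan_resources(meta))
for model, resource_id, url in resources:
    print(f"{model} -> {resource_id} at {url}")
```

Each dbt model named under `ids` needs a matching resource UUID on the portal; a model without an entry would simply not appear in the output of a loop like this.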

### Publish the data!
Expand All @@ -79,7 +84,7 @@ poetry run python scripts/publish.py publish-exposure california_open_data --dry

Example production deployment:
```bash
poetry run python scripts/publish.py publish-exposure california_open_data --project=cal-itp-data-infra --bucket="gs://calitp-publish" --deploy
poetry run python scripts/publish.py publish-exposure california_open_data --project=cal-itp-data-infra --bucket="gs://calitp-publish" --publish
```


Expand Down
5 changes: 5 additions & 0 deletions warehouse/.dockerignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
.user.yml

target/
dbt_packages/
logs/
Original file line number Diff line number Diff line change
Expand Up @@ -756,12 +756,11 @@ exposures:
- type: ckan
bucket: gs://calitp-publish
format: csv
url: https://data.ca.gov/api/3/action/resource_update
url: https://data.ca.gov
ids:
agency: e8f9d49e-2bb6-400b-b01f-28bc2e0e7df2
routes: c6bbb637-988f-431c-8444-aef7277297f8
# TODO: add stop_times back in once size limit lifted
# stop_times: d31eef2f-e223-4ca4-a86b-170acc6b2590
stop_times: d31eef2f-e223-4ca4-a86b-170acc6b2590
stops: 8c876204-e12b-48a2-8299-10f6ae3d4f2b
trips: 0e4da89e-9330-43f8-8de9-305cb7d4918f
attributions: 038b7354-06e8-4082-a4a1-40debd3110d5
Expand All @@ -773,7 +772,6 @@ exposures:
frequencies: 48542c8f-8ce1-43e3-a965-6c68771d6fe5
levels: 288a08cd-7929-479e-aa88-08b677a08510
pathways: a01484af-c460-40a4-ac8a-896b0196e8c2
# TODO: add shapes when the size limit has been lifted
# shapes: 2f5e7bdb-33e8-4633-b163-6bab42ad0951
shapes: 2f5e7bdb-33e8-4633-b163-6bab42ad0951
transfers: f8dcda5d-0c6d-4c70-b5f5-6716adcf6ffc
translations: 7abe9256-6cd2-4c1f-9b6a-72108022a382