limit schedule validation jobs with a pool #1700

Merged
atvaccaro merged 1 commit into main from schedule-pool
Aug 22, 2022

Conversation

atvaccaro
Contributor

Description

I forgot to add a pool to control how many schedule pipeline jobs can run at once during the backfill.
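
For readers unfamiliar with the idea, a pool caps how many jobs may hold a slot at the same time; extra jobs wait until a slot frees up. In Airflow this is usually done by assigning tasks to a named pool through the operator's `pool` argument. The sketch below is not the change made in this PR and does not use Airflow at all: it is a plain-Python illustration of the same concurrency limit, where `validate_schedule`, `POOL_SLOTS`, and the list of dates are hypothetical stand-ins.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SLOTS = 2  # hypothetical pool size, not the value used in this PR

pool = threading.BoundedSemaphore(POOL_SLOTS)  # plays the role of the pool
lock = threading.Lock()  # protects the two counters below
running = 0  # jobs currently holding a slot
peak = 0  # highest number of jobs seen running at once


def validate_schedule(day: str) -> str:
    """Hypothetical stand-in for one schedule validation job."""
    global running, peak
    with pool:  # blocks until one of the POOL_SLOTS slots is free
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.01)  # pretend to do some work
        with lock:
            running -= 1
    return day


# A backfill submits many jobs at once; the pool keeps most of them waiting.
days = [f"2022-08-{d:02d}" for d in range(1, 11)]
with ThreadPoolExecutor(max_workers=8) as executor:
    finished = list(executor.map(validate_schedule, days))
```

Even though eight worker threads are available, `peak` never exceeds `POOL_SLOTS`, which is the behavior a pool is meant to provide during a large backfill.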

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation
  • agencies.yml

How has this been tested?

Screenshots (optional)

image

atvaccaro self-assigned this Aug 22, 2022
atvaccaro requested a review from evansiroky as a code owner August 22, 2022 16:56
atvaccaro merged commit 3f05261 into main Aug 22, 2022
atvaccaro deleted the schedule-pool branch August 22, 2022 17:04
lottspot added a commit that referenced this pull request Sep 6, 2022
* airtable: start renaming int to base

* airtable: refactor staging tables to be historical; refactor get latest macro to enable daily extract selection

* airtable: convert staging to views rather than tables

* airtable: convert intermediate mapping tables to base

* always compile, but only check dbt run success after docs/metabase

* run tests even if run failed

* airtable: define key as metabase PK

* airtable: add equal row count tests for models with id mapping

* airtable: rename map to bridge

* update poetry.lock for dbt-metabase

* airtable: latest-only-ify bridge tables

* missed a couple

* airtable: make mart latest-only

* airtable: refactor dim service components

* airtable: specify metabase FK columns

* airtable: new fields & tables to address #1630

* airtable: make bridge tables date-aware and assorted small fixes

* get us going!

* airtable: address failing dbt tests -- minor tweaks

* airtable: more failing dbt tests

* airtable: refactor service components to handle duplicates

* airtable: fix legacy airtable source definition to reference views

* airtable: remove redundant metabase FK metadata

* airtable: fix test syntax

* airtable: use QUALIFY to simplify ranked queries

* fix: make airtable gcs operator use timestamps rather than time string

* fix(timestamp partitions): update calitp version to get schedule partition updates

* warehouse (payments): migrated payments_views_staging cleaned dags to models as well as validation tables to tests

* use new calitp version

* fix(timestamp partitions): explicitly use isoformat string

* style: rename CTEs to be more specific

* farm surrogate key macro: coalesce nulls in macro itself

* add notebook used to re-name a partition

* chore: remove pyup config file

no longer in use

* chore: remove pyup ignore statement

* airtable: use ts instead of time

* add airtable mart to list of things synced to metabase

* update metabase database names again

* warehouse(payments_views_staging): split yml files into staging and source, added documentation for cleaned files, deleted old validation tables

* warehouse(payments_views_staging): added generic tests, added composite unique tests from dbt_packages, added docs file with references, materialized staging tables as views

* warehouse(payments_views_staging): added configuration to persist singular tests as tables in the warehouse

* warehouse(payments_views): migrated airflow dags for payments views to its own model in dbt, added metadata and generic tests, added dbt references

* print message if deploy is not set

* round lat/lons, specify 4m accuracy, add new resources

* print the documentation file being written

* add coord system, disable shapes for now due to size limit

* fix(fact daily trips timeout): wip incremental table

* update to good stable version of sqlfluff

* fix: make fact daily trips incremental -- WIP

* pass and/or ignore new rules

* linter

* fact daily trips: remove dev incremental check

* docs: update airtable prod maintenance instructions

* docs: add new dags to dependency diagram

* docs: add spacing to help w line wrapping

* docs: more spaces for line wrapping...

* dbt-metabase: update version in poetry; comment out failing relationship tests

* warehouse(payments_views): got payments_rides working and migrated, added yml and metadata, added payments_views validation tests and persisted tables, added payments_views_refactored with intermediate tables and got that to work

* get new calitp version

* import gcs models from calitp-py!

* missed a couple

* get us going!

* fix: make airtable gcs operator use timestamps rather than time string

* fix(timestamp partitions): update calitp version to get schedule partition updates

* fix(timestamp partitions): explicitly use isoformat string

* use new calitp version

* start experimenting with task queue options and metrics

* get this working and test performance with greenlets

* couple more metrics

* wip testing with multiple consumers at high volume

* start optimizing for lots of small tasks; have to make redis interaction fast

* fix key str format

* couple more libs

* wip

* wip on discussed changes

* get the keys from environ for now

* use new calitp py

* print a bit more

* we are just gonna get stuff in the env

* commit this before I break anything

* fmt

* bump calitp-py

* lint

* rename v2 to v3 since 2.X tags already exist

* kinda make this runnable

* new node pool just dropped

* get running in docker compose to kick the tires

* start on RT v3 k8s

* get the consumer working mostly?

* label redis pod appropriately

* tell consumer about temp rt secrets

* that was dumb

* ticker k8s!

* set expire time on the huey instance

* point consumer at svc account json

* avoid pulling the stacktrace in

* scrape on 9102

* bump to 16 workers per consumer

* bump jupyterhub storage to 32gi

* add these back!

* add comment

* bring in new calitp and fix tick rounding

* improve metrics and labels

* warehouse(payments): removed payemnts_rides_refactor from yml file

* clean up labels

* get secrets from secret manager sdk before the consumer starts...

* missed this

* fix secrets volume and adjust affinities

* warehouse(payments): removed the airflow dags for the payments_views that were migrated, as well as the two test tables

* warehouse(payments): removed the old intermediate tables from the dbt project yaml file

* add content type header to bytes

* ugh whitespace

* warehouse: fixing linting error

* warehouse: fixing linting error again

* warehouse(dbt_project): added to-do comments in project config to remind where to move model schemas in the future

* fix: update Mountain Transit URL

* remove celery and gevent from pyproject deps

Co-authored-by: Mjumbe Poe <[email protected]>

* we might as well specify huey app name by env as well just in case we end up on the same redis in the future

* write to the prod bucket!

* create a preprod version and deploy it

* run fewer workers in preprod

* move pull policies to patches, and only run 1 dev consumer

* add redis considerations to readme

* docs(datasets and tables): revised information on dbt docs for views tables based on PR review

* docs(datasets and tables): revised for readability

* docs(datasets and tables): revised docs information for gtfs schedule based on PR review

* docs(datasets and tables): fixed readability

* docs(datasets and tables): added new formatting, added gtfs rt dbt docs instructions

* docs(datasets and tables): revamped the overview page for datasets and tables

* docs(datasets and tables): cleaned up readability

* bump version and start adding more logging context

* specifically log request errors that do not come from raise_for_status

* set v3 image versions separately

* bump to 8 workers and improve log formatting

* formatting

* fix string representation of exception type in logs

* bump prod to 3.1

* oops

* hotfix version

* bump to 30m

* warehouse(airflow): deleted the empty payments_views_staging dag directory

* warehouse(airflow): deleted dummy_staging airflow task, removed gusty dependencies from other tables that relied on that task

* docs(airflow): edited the production dags docs to reflect changes in payments staging views dags

* docs(airflow): revised docs based on Laurie's comment re only listing enforced dependencies

* Update new-team-member.md

Fixed added missing meetings, deleted old meetings. deleted auto-assign

* docs(datasets and tables): reconfigured some pages for readability

* docs(datasets and tables): re-reviewed and added clarity

* fix (open data): align column publish metadata with open data dictionary -- suppress calitp hash, synthetic keys, and extraction date, add calitp_itp_id and url_number

* docs(production maintenance): added n/a for dependencies for payments_views

* docs(datasets and tables): created new page with content on how to use dbt docs, added to toc

* docs(datasets and tables): removed information on how to navigate dbt docs in favor of the new page created, added info to warehouse schema sections, created dbt project directory sections

* (analyst_docs): update gcloud commands

* fix(open data): make test_metadata attribute optional to account for singular tests

* docs(datasets and tables): reformatted for readability and conciseness

* docs(datasets and tables): revisions based on Laurie's review

* docs(datasets and tables): revised PR to put gtfs views tables used by ckan under the views doc

* fix(open data): suppress publishing stop_times because of size limit issue

* agencies.yml: update FCRTA and add Escalon Transit

* agencies.yml: rename escalon transit to etrans

* fix(airflow/gtfs_loader): replace non-utf-8 characters

* feat(airtable): add new columns per request #1674

* fix(airtable data): address review comments PR #1677

* fix: add WeHo RT URLs

* fix(ckan publishing): only add columns to data dictionary if they don't have publish.ignore set

* update calitp py and change log

* make docker compose work

* specify buckets and bump version in dev

* now do prod

* change logging

* add weho key

* bump gtfs rt v3 version

* bump calitp py

* deploy new image to dev

* get dev and prod working with bucket env vars

* bump calitp py and expire cache every 5 minutes

* deploy new cache clearing to prod/dev

* make sure calitp is updated, load secrets in ticker too

* fix docker compose, use new flags, deploy new image to dev

* bump prod

* add airtable age metric, bump version, scrape ticker

* delete experimental fact_daily_trips_inc incremental table that was not functioning correctly (#1681)

* docs: correct Transit Technology Stacks title (#1565)

The Transit Technology Stacks header was not properly being linked to in the overview table. This fixes that.

* fix: update GRaaS URLs (#1690)

* New schedule pipeline validation job (#1648)

* wip on validation in new schedule pipeline

* bring in stuff from calitp storage, work on saving validations/outcomes

* wip getting this working

* use new calitp prerelease, fix filenames/content, remove break

* oops

* working!

* update lockfile

* unzip/validate schedule dag

* remove this

* bring in latest calitp-py

* extra print

* pass env vars into pod

* fix lint

* add readme

* bring in latest calitp

* fix print and formatting

* bring the outcome-only classes over, and use env var for bucket

* filter out nones for RT airtable records

* bring in latest calitp py

* get latest calitp

* use new env var and rename validation job results

* start updating airflow with new calitp py and using bucket env vars

* test schedule downloader with new calitp

* new calitp

* handle new calitp, better logging

* add env vars for new calitp

* put prefix_bucket back for parse_and_validate_rt and document env var configuration

* comments

* use new version of calitp py with good gcsfs (#1693)

* use new version of calitp py with good gcsfs

* use the regular release

* docs(agency): adding reference table for analysts to define agency, reference for pre-commit hooks (#1430)

* docs(agency): adding reference table for analysts to define agency in their research

* docs(agency): fixed table formatting error

* docs(agency): fixed table formatting error plus pre-commit hooks

* docs(pre-commit hooks): added information for using and troubleshooting pre-commit hooks

* docs: formatting errors, added missing capitalization

* docs: formatting table with list

* docs: formatting table with no line break - attempt 1

* docs: clarified language and spacing in table

* docs: clarified language in table

* docs: removing extra information from agency table

* docs: removing extra information from agency table pt 2

* docs: removing extra information from agency table pt 3

* docs: reworked table to include gtfs-provider-service relationships

* docs: added space for the gtfs provider's services section

* docs: added space for the gtfs provider's services section syntax corrections

* docs: added space for the gtfs provider's services section syntax corrections again

* docs: clarified information around gtfs provider relationships

* docs: clarified information around gtfs provider relationships and intro content

* docs: agency table revisions based on call with E

* docs(agency reference): incorporated E's feedback in the copy, added warehouse table instead of airtable table

* docs(agency reference): reformatted table

* docs(warehouse): added new table information for analyst agency reference now that the airtable migration is complete and the table was created. added css styling to prevent table scrolling

* docs: renamed python library file h1 to be more intuitive

* docs(conf): added comments explaining the added css preventing horizontal scroll in markdown tables

* docs(add to what_is_agency)

* docs(warehouse): fixed some typos, errors, and formatting issues

Co-authored-by: natam1 <[email protected]>
Co-authored-by: Charles Costanzo <[email protected]>

* we also have to pin a specific fsspec version directly in the requirements (#1694)

* Create SFTP ingest component for Elavon data (#1692)

* kubernetes: sftp-ingest-elavon: add server component

* kubernetes: sftp-server: add sshd configuration

This enables functionality like chroot'd logins and disabling of shell
logins.

* kubernetes: sftp-server: add readinessProbe

Since the container is essentially built at startup, there is a sizeable
time delta between container startup and ssh server startup. This
addition helps the operator easily detect when installation is complete
and the service is running.

* kubernetes: sftp-server: add cluster service

This enables cluster workloads to log in using a DNS name.

* kubernetes: sftp-server: refactor bootstrap script for better DRY

* kubernetes: prod-sftp-ingest-elavon: create production localization

* kubernetes: prod-sftp-ingest-elavon: add internet-service.yaml

This exposes the SFTP port for inbound connections from the vendor.

* ci: prod-sftp-ingest-elavon.env: enable prod deployment

* Fix typo in `what is agency` (#1698)

it's --> it's

* limit schedule validation jobs with a pool (#1700)

* Created new row-level access policy macro and applied it to payments_rides (#1697)

* created new row-level access policy and applied it to payments rides with newly generated service accounts

* ran pre-commit hooks to fix failing actions

Co-authored-by: Charles Costanzo <[email protected]>

* deploy voila fix (#1702)

* disable autodetect if schema is specified (#1704)

* Create v2 RT parsing and validation jobs in Airflow and creates external tables (#1691)

* start on new parsing job

* comment and fmt

* wip getting parsing working

* fmt

* get parsing working!

* save outcomes file properly

* remove old validator and dupe log

* this is only jsonl right now, so this workaround is bad

* wip on validation

* wip

* get parsing working, start simplifying

* get validation working with schedules referenced by airtable!

* missed this

* get the actual rt v2 airflow jobs mostly working

* missed this

* run v2 RT jobs at :15 instead of :30

* convert metadata field names to bq-safe

* fix being able to template bucket and test out rt_service_alerts_v2 external table

* add outcomes external table to test

* wip trying to get a debugger to test pydantic custom serialization

* fix rt outcome serialization to be bq safe

* create rest of rt v2 external tables

* couple small fixes

* start addressing PR comments

* address PR comment

* add ci/cd action to build gtfs-rt-parser-v2 image

* Fix: skip amplitude_benefits DAG if 404 (#1705)

* fix(amplitude): mark skip when 404 is encountered

* chore(amplitude): add some logging statements around API call

* Gtfs schedule unzip v2 (#1696)

* gtfs loader v2 wip

* gtfs unzipper v2: semi-working WIP -- can unzip at least one zipfile

* address initial review comments

* bump calitp version

* gtfs unzipper v2: working version with required functionality

* update calitp and make the downloader run with it

* gtfs unzipper v2: get working in airflow; use logging

* rename to distinguish zipfile from extracted files within zipfile

* resolve reviewer comments

* gtfs unzipper v2: refactor to raise exceptions on unparseable zips

* gtfs unzipper: further simplify exception handling

* final tweaks -- refactor of checking for invalid zip structure, tighten up processing of valid files

* comment typos/clarifications

Co-authored-by: Andrew Vaccaro <[email protected]>

* warehouse: added fare_systems transit database mart table (#1701)

* warehouse: added fare_systems transit database mart table

* warehouse: fixed duplicate doc macro issue for fare_systems

* explicitly declared schema

* removed columns no longer relevant

* warehouse: added bridge table for fare_systems x services

* warehouse: added bridge table for fare_systems x services to yaml

* Clean up RT outcomes (#1709)

* remove unnecessary json_encoders

* just save the extract path

* add a dockerignore

* grant access to payments_rides for non-agency users (#1714)

* grant access to payments_rides for non-agency users

* just use calitp domain and add a couple other users

* Run CKAN weekly, with multipart uploads as needed (#1710)

* wip getting multipart upload to ckan working

* remove before I forget

* mirror the example script... we get 500s with too many chunks

* commit this while it is working

* add this back

* allow env var to control target and bucket

* create weekly task to run publish california_open_data

* allow manifest to be in gcs

* get this actually working...

* dockerignore

* clean up names, add resource requests, make work in pod operator

* address PR comments

* load this from a secret (#1717)

* Initial dbt models to support GTFS guidelines checks (#1712)

* initial work towards #1688

* gtfs guidelines initial implementation: tweaks & improvements

* gtfs guidelines: add metabase semantic type for calitp agency name

* sync new dataset to metabase

* gtfs guidelines: rename table, formatting updates

* rename compliance gtfs feature per PR review

* Add RT VP vs Sched Table (#1708)

* add table

* add table

* add operator

* fix sql syntax

* fix failing indentations

* add unique test

* fix .yml test

* Create local Dockerfile and bash script for dbt development (#1711)

* start on local dev dockerfile

* handle local profiles dir

* make dbt docker work with local google credentials

* add build-essentials per recommendation

* update poetry install method and add libgdal-dev

* poetry changed its bin location

* Improvements to dbt artifacts and publish workflow (#1726)

* add ts partition to publish artifacts

* also save artifacts with timestamps vs just latest

* start simplifying publish script, proper dry runs, reading manifest from gcs

* fix publish assert, use env vars, simplify logging

* allow resource descriptions in publishing, allow direct remote writing

* ugh

* need to be utc

* bring in simplified descriptions

* missed bucket

* upload metadata/dictionary to gcs for ckan; also fix bug

* update ckan docs to reflect publishing changes

* actually these should always get written

* env vars not templating

* fix timestamped artifact names

* pretty print

* address pr comments

* update ckan publishing docs

* actually set ckan precision fields and use them

* uppercase field types and allow specifying a model to publish

* bad dict key

* these are length 7

* lats are only 6 digits

* warehouse documentation: add calitp_itp_id and calitp_url_number metadata to several dimensional columns (#1733)

* airtable organizations: define external table schemas (#1734)

* Upgrade schedule validator and save version as metadata (#1729)

* update to v3 validator, fix dockerfile

* finally deploy the schedule validator image through github actions

* bring in latest calitp

* use new calitp, simplify metadata, add version to notice rows, couple qol improvements

* change flag per v3

* use poetry export install here too

* lock

* export install here too

* add verbose, just copy jar instead of download

* use environ directly

* Set RT validator version as metadata and fix a bug (#1732)

* set rt validator version as metadata

* add validator version in metadata and put extract under a key

* fix schedule data exception string representation and assert after outcomes upload

* fix poetry in docker, lock

* use export and install

* update typer

* fix schedule downloading... also add url filter to cli

* get latest validator from github just in case, and keep name

* rename this here too

* address PR comments

* add pool for airtable (#1743)

* deprecate airtable v1 extracts (#1699)

* deprecate airtable v1 extracts

* delete v1 airtable operator

* Change column name to fix run error (#1730)

* change date col name

* fix service_date col

* chore: remove evansiroky from most CODEOWNERS items (#1735)

* kubernetes: prod-sftp-ingest-elavon: add elavon ssh public key (#1742)

Co-authored-by: Laurie Merrell <[email protected]>
Co-authored-by: Andrew Vaccaro <[email protected]>
Co-authored-by: Andrew Vaccaro <[email protected]>
Co-authored-by: Charlie Costanzo <[email protected]>
Co-authored-by: Kegan Maher <[email protected]>
Co-authored-by: Laurie <[email protected]>
Co-authored-by: evansiroky <[email protected]>
Co-authored-by: Mjumbe Poe <[email protected]>
Co-authored-by: tiffanychu90 <[email protected]>
Co-authored-by: tiffanychu90 <[email protected]>
Co-authored-by: natam1 <[email protected]>
Co-authored-by: Charles Costanzo <[email protected]>
Co-authored-by: Angela Tran <[email protected]>
Co-authored-by: natam1 <[email protected]>
Co-authored-by: Github Action build-release-candidate <runner@fv-az173-876>
lottspot added a commit that referenced this pull request Sep 7, 2022
* airtable: convert intermediate mapping tables to base

* always compile, but only check dbt run success after docs/metabase

* run tests even if run failed

* airtable: define key as metabase PK

* airtable: add equal row count tests for models with id mapping

* airtable: rename map to bridge

* update poetry.lock for dbt-metabase

* airtable: latest-only-ify bridge tables

* missed a couple

* airtable: make mart latest-only

* airtable: refactor dim service components

* airtable: specify metabase FK columns

* airtable: new fields & tables to address #1630

* airtable: make bridge tables date-aware and assorted small fixes

* get us going!

* airtable: address failing dbt tests -- minor tweaks

* airtable: more failing dbt tests

* airtable: refactor service components to handle duplicates

* airtable: fix legacy airtable source definition to reference views

* airtable: remove redundant metabase FK metadata

* airtable: fix test syntax

* airtable: use QUALIFY to simplify ranked queries

* fix: make airtable gcs operator use timestamps rather than time string

* fix(timestamp partitions): update calitp version to get schedule partition updates

* warehouse (payments): migrated payments_views_staging cleaned dags to models as well as validation tables to tests

* use new calitp version

* fix(timestamp partitions): explicitly use isoformat string

* style: rename CTEs to be more specific

* farm surrogate key macro: coalesce nulls in macro itself

* add notebook used to re-name a partition

* chore: remove pyup config file

no longer in use

* chore: remove pyup ignore statement

* airtable: use ts instead of time

* add airtable mart to list of things synced to metabase

* update metabase database names again

* warehouse(payments_views_staging): split yml files into staging and source, added documentation for cleaned files, deleted old validation tables

* warehouse(payments_views_staging): added generic tests, added composite unique tests from dbt_packages, added docs file with references, materialized staging tables as views

* warehouse(payments_views_staging): added configuration to persist singular tests as tables in the warehouse

* warehouse(payments_views): migrated airflow dags for payments views to its own model in dbt, added metadata and generic tests, added dbt references

* print message if deploy is not set

* round lat/lons, specify 4m accuracy, add new resources

* print the documentation file being written

* add coord system, disable shapes for now due to size limit

* fix(fact daily trips timeout): wip incremental table

* update to good stable version of sqlfluff

* fix: make fact daily trips incremental -- WIP

* pass and/or ignore new rules

* linter

* fact daily trips: remove dev incremental check

* docs: update airtable prod maintenance instructions

* docs: add new dags to dependency diagram

* docs: add spacing to help w line wrapping

* docs: more spaces for line wrapping...

* dbt-metabase: update version in poetry; comment out failing relationship tests

* warehouse(payments_views): got payments_rides working and migrated, added yml and metadata,  added payments_views validation tests and persisted tables, added payments_views_refactored with intermedite tables and got that to work

* get new calitp version

* import gcs models from calitp-py!

* missed a couple

* get us going!

* fix: make airtable gcs operator use timestamps rather than time string

* fix(timestamp partitions): update calitp version to get schedule partition updates

* fix(timestamp partitions): explicitly use isoformat string

* use new calitp version

* start experimenting with task queue options and metrics

* get this working and test performance with greenlets

* couple more metrics

* wip testing with multiple consumers at high volume

* start optimizing for lots of small tasks; have to make redis interaction fast

* fix key str format

* couple more libs

* wip

* wip on discussed changes

* get the keys from environ for now

* use new calitp py

* print a bit more

* we are just gonna get stuff in the env

* commit this before I break anything

* fmt

* bump calitp-py

* lint

* rename v2 to v3 since 2.X tags already exist

* kinda make this runnable

* new node pool just dropped

* get running in docker compose to kick the tires

* start on RT v3 k8s

* get the consumer working mostly?

* label redis pod appropriately

* tell consumer about temp rt secrets

* that was dumb

* ticker k8s!

* set expire time on the huey instance

* point consumer at svc account json

* avoid pulling the stacktrace in

* scrape on 9102

* bump to 16 workers per consumer

* bump jupyterhub storage to 32gi

* add these back!

* add comment

* bring in new calitp and fix tick rounding

* improve metrics and labels

* warehouse(payments): removed payemnts_rides_refactor from yml file

* clean up labels

* get secrets from secret manager sdk before the consumer starts...

* missed this

* fix secrets volume and adjust affinities

* warehouse(payments): removed the airflow dags for the payments_views that were migrated, as well as the two test tables

* warehouse(payments): removed the old intermediate tables from the dbt project yaml file

* add content type header to bytes

* ugh whitespace

* warehouse: fixing linting error

* warehouse: fixing linting error again

* warehouse(dbt_project): added to-do comments in project config to remind where to move model schemas in the future

* fix: update Mountain Transit URL

* remove celery and gevent from pyproject deps

Co-authored-by: Mjumbe Poe <[email protected]>

* we might as well specify huey app name by env as well just in case we end up on the same redis in the future

* write to the prod bucket!

* create a preprod version and deploy it

* run fewer workers in preprod

* move pull policies to patches, and only run 1 dev consumer

* add redis considerations to readme

* docs(datasets and tables): revised informationon dbt docs for views tables based on PR review

* docs(datasets and tables): revised for readability

* docs(datasets and tables): revised docs information for gtfs schedule based on PR review

* docs(datasets and tables): fixed readability

* docs(datasets and tables): added new formatting, added gtfs rt dbt docs instructions

* docs(datasets and tables): revamped the overview page for datasets and tables

* docs(datasets and tables): cleaned up readability

* bump version and start adding more logging context

* specifically log request errors that do not come from raise_for_status

* set v3 image versions separately

* bump to 8 workers and improve log formatting

* formatting

* fix string representation of exception type in logs

* bump prod to 3.1

* oops

* hotfix version

* bump to 30m

* warehouse(airflow): deleted the empty payments_views_staging dag directory

* warehouse(airflow): deleted dummy_staging airflow task, removed gusty dependencies from other tables that relied on that task

* docs(airflow): edited the production dags docs to reflect changes in payments staging views dags

* docs(airflow): revised docs based on lauries comment re only listing enfoorced dependencies

* Update new-team-member.md

Fixed added missing meetings, deleted old meetings. deleted auto-assign

* docs(datasets ans tables): reconfigured some pages for readability

* docs(datasets and tables): re-reviewed and added clarity

* fix (open data): align column publish metadata with open data dictionary -- suppress calitp hash, synthetic keys, and extraction date, add calitp_itp_id and url_number

* docs(production maintenance): added n/a for dependencies for payments_views

* docs(datasets and tables): created new page with content on how to use dbt docs, added to toc

* docs(datasets and tables): removed information on how to navigate dbt docs in favor of the new page created, added info to warehouse schema sections, created dbt project cirectory sections

* (analyst_docs): update gcloud commands

* fix(open data): make test_metadata attribute optional to account for singular tests

* docs(datasets and tables): reformatted for readability and conciseness

* docs(datasets and tables): revisions based on Laurie's review

* docs(datasets and tables): revised PR to put gtfs views tables used by ckan under the views doc

* fix(open data): suppress publishing stop_times because of size limit issue

* agencies.yml: update FCRTA and add Escalon Transit

* agencies.yml: rename escalon transit to etrans

* fix(airflow/gtfs_loader): replace non-utf-8 characters

* feat(airtable): add new columns per request #1674

* fix(airtable data): address review comments PR #1677

* fix: add WeHo RT URLs

* fix(ckan publishing): only add columns to data dictionary if they don't have publish.ignore set

* update calitp py and change log

* make docker compose work

* specify buckets and bump version in dev

* now do prod

* change logging

* add weho key

* bump gtfs rt v3 version

* bump calitp py

* deploy new image to dev

* get dev and prod working with bucket env vars

* bump calitp py and expire cache every 5 minutes

* deploy new cache clearing to prod/dev

* make sure calitp is updated, load secrets in ticker too

* fix docker compose, use new flags, deploy new image to dev

* bump prod

* add airtable age metric, bump version, scrape ticker

* delete experimental fact_daily_trips_inc incremental table that was not functioning correctly (#1681)

* docs: correct Transit Technology Stacks title (#1565)

The Transit Technology Stacks header was not properly being linked to in the overview table. This fixes that.

* fix: update GRaaS URLs (#1690)

* New schedule pipeline validation job (#1648)

* wip on validation in new schedule pipeline

* bring in stuff from calitp storage, work on saving validations/outcomes

* wip getting this working

* use new calitp prerelease, fix filenames/content, remove break

* oops

* working!

* update lockfile

* unzip/validate schedule dag

* remove this

* bring in latest calitp-py

* extra print

* pass env vars into pod

* fix lint

* add readme

* bring in latest calitp

* fix print and formatting

* bring the outcome-only classes over, and use env var for bucket

* filter out nones for RT airtable records

* bring in latest calitp py

* get latest calitp

* use new env var and rename validation job results

* start updating airflow with new calitp py and using bucket env vars

* test schedule downloader with new calitp

* new calitp

* handle new calitp, better logging

* add env vars for new calitp

* put prefix_bucket back for parse_and_validate_rt and document env var configuration

* comments

* use new version of calitp py with good gcsfs (#1693)

* use new version of calitp py with good gcsfs

* use the regular release

* docs(agency): adding reference table for analysts to define agency, reference for pre-commit hooks (#1430)

* docs(agency): adding reference table for analysts to define agency in their research

* docs(agency): fixed table formatting error

* docs(agency): fixed table formatting error plus pre-commit hooks

* docs(pre-commit hooks): added information for using and troubleshooting pre-commit hooks

* docs: formatting errors, added missing capitalization

* docs: formatting table with list

* docs: formatting table with no line break - attempt 1

* docs: clarified language and spacing in table

* docs: clarified language in table

* docs: removing extra information from agency table

* docs: removing extra information from agency table pt 2

* docs: removing extra information from agency table pt 3

* docs: reworked table to include gtfs-provider-service relationships

* docs: added space for the gtfs provider's services section

* docs: added space for the gtfs provider's services section syntax corrections

* docs: added space for the gtfs provider's services section syntax corrections again

* docs: clarified information around gtfs provider relationships

* docs: clarified information around gtfs provider relationships and intro content

* docs: agency table revisions based on call with E

* docs(agency reference): incorporated E's feedback in the copy, added warehouse table instead of airtable table

* docs(agency reference): reformatted table

* docs(warehouse): added new table information for analyst agency reference now that the airtable migration is complete and the table was created. added css styling to prevent table scrolling

* docs: renamed python library file h1 to be more intuitive

* docs(conf): added comments explaining the added css preventing horizontal scroll in markdown tables

* docs(add to what_is_agency)

* docs(warehouse): fixed some typos, errors, and formatting issues

Co-authored-by: natam1 <[email protected]>
Co-authored-by: Charles Costanzo <[email protected]>

* we also have to pin a specific fsspec version directly in the requirements (#1694)

* Create SFTP ingest component for Elavon data (#1692)

* kubernetes: sftp-ingest-elavon: add server component

* kubernetes: sftp-server: add sshd configuration

This enables functionality like chroot'd logins and disabling of shell
logins.

* kubernetes: sftp-server: add readinessProbe

Since the container is essentially built at startup, there is a sizeable
time delta between container startup and ssh server startup. This
addition helps the operator easily detect when installation is complete
and the service is running.

* kubernetes: sftp-server: add cluster service

This enables cluster workloads to log in using a DNS name.

* kubernetes: sftp-server: refactor bootstrap script for better DRY

* kubernetes: prod-sftp-ingest-elavon: create production localization

* kubernetes: prod-sftp-ingest-elavon: add internet-service.yaml

This exposes the SFTP port for inbound connections from the vendor.

* ci: prod-sftp-ingest-elavon.env: enable prod deployment

* Fix typo in `what is agency` (#1698)

it's --> it's

* limit schedule validation jobs with a pool (#1700)

* Created new row-level access policy macro and applied it to payments_rides (#1697)

* created new row-level access policy and applied it to payments rides with newly generated service accounts

* ran pre-commit hooks to fix failing actions

Co-authored-by: Charles Costanzo <[email protected]>

* deploy voila fix (#1702)

* disable autodetect if schema is specified (#1704)

* Create v2 RT parsing and validation jobs in Airflow and creates external tables (#1691)

* start on new parsing job

* comment and fmt

* wip getting parsing working

* fmt

* get parsing working!

* save outcomes file properly

* remove old validator and dupe log

* this is only jsonl right now, so this workaround is bad

* wip on validation

* wip

* get parsing working, start simplifying

* get validation working with schedules referenced by airtable!

* missed this

* get the actual rt v2 airflow jobs mostly working

* missed this

* run v2 RT jobs at :15 instead of :30

* convert metadata field names to bq-safe

* fix being able to template bucket and test out rt_service_alerts_v2 external table

* add outcomes external table to test

* wip trying to get a debugger to test pydantic custom serialization

* fix rt outcome serialization to be bq safe

* create rest of rt v2 external tables

* couple small fixes

* start addressing PR comments

* address PR comment

* add ci/cd action to build gtfs-rt-parser-v2 image

* Fix: skip amplitude_benefits DAG if 404 (#1705)

* fix(amplitude): mark skip when 404 is encountered

* chore(amplitude): add some logging statements around API call

* Gtfs schedule unzip v2 (#1696)

* gtfs loader v2 wip

* gtfs unzipper v2: semi-working WIP -- can unzip at least one zipfile

* address initial review comments

* bump calitp version

* gtfs unzipper v2: working version with required functionality

* update calitp and make the downloader run with it

* gtfs unzipper v2: get working in airflow; use logging

* rename to distinguish zipfile from extracted files within zipfile

* resolve reviewer comments

* gtfs unzipper v2: refactor to raise exceptions on unparseable zips

* gtfs unzipper: further simplify exception handling

* final tweaks -- refactor of checking for invalid zip structure, tighten up processing of valid files

* comment typos/clarifications

Co-authored-by: Andrew Vaccaro <[email protected]>

* warehouse: added fare_systems transit database mart table (#1701)

* warehouse: added fare_systems transit database mart table

* warehouse: fixed duplicate doc macro issue for fare_systems

* explicitly declared schema

* removed columns no longer relevant

* warehouse: added bridge table for fare_systems x services

* warehouse: added bridge table for fare_systems x services to yaml

* Clean up RT outcomes (#1709)

* remove unnecessary json_encoders

* just save the extract path

* add a dockerignore

* grant access to payments_rides for non-agency users (#1714)

* grant access to payments_rides for non-agency users

* just use calitp domain and add a couple other users

* Run CKAN weekly, with multipart uploads as needed (#1710)

* wip getting multipart upload to ckan working

* remove before I forget

* mirror the example script... we get 500s with too many chunks

* commit this while it is working

* add this back

* allow env var to control target and bucket

* create weekly task to run publish california_open_data

* allow manifest to be in gcs

* get this actually working...

* dockerignore

* clean up names, add resource requests, make work in pod operator

* address PR comments

* load this from a secret (#1717)

* Initial dbt models to support GTFS guidelines checks (#1712)

* initial work towards #1688

* gtfs guidelines initial implementation: tweaks & improvements

* gtfs guidelines: add metabase semantic type for calitp agency name

* sync new dataset to metabase

* gtfs guidelines: rename table, formatting updates

* rename compliance gtfs feature per PR review

* Add RT VP vs Sched Table (#1708)

* add table

* add table

* add operator

* fix sql syntax

* fix failing indentations

* add unique test

* fix .yml test

* Create local Dockerfile and bash script for dbt development (#1711)

* start on local dev dockerfile

* handle local profiles dir

* make dbt docker work with local google credentials

* add build-essentials per recommendation

* update poetry install method and add libgdal-dev

* poetry changed its bin location

* Improvements to dbt artifacts and publish workflow (#1726)

* add ts partition to publish artifacts

* also save artifacts with timestamps vs just latest

* start simplifying publish script, proper dry runs, reading manifest from gcs

* fix publish assert, use env vars, simplify logging

* allow resource descriptions in publishing, allow direct remote writing

* ugh

* need to be utc

* bring in simplified descriptions

* missed bucket

* upload metadata/dictionary to gcs for ckan; also fix bug

* update ckan docs to reflect publishing changes

* actually these should always get written

* env vars not templating

* fix timestamped artifact names

* pretty print

* address pr comments

* update ckan publishing docs

* actually set ckan precision fields and use them

* uppercase field types and allow specifying a model to publish

* bad dict key

* these are length 7

* lats are only 6 digits

* warehouse documentation: add calitp_itp_id and calitp_url_number metadata to several dimensional columns (#1733)

* airtable organizations: define external table schemas (#1734)

* Upgrade schedule validator and save version as metadata (#1729)

* update to v3 validator, fix dockerfile

* finally deploy the schedule validator image through github actions

* bring in latest calitp

* use new calitp, simplify metadata, add version to notice rows, couple qol improvements

* change flag per v3

* use poetry export install here too

* lock

* export install here too

* add verbose, just copy jar instead of download

* use environ directly

* Set RT validator version as metadata and fix a bug (#1732)

* set rt validator version as metadata

* add validator version in metadata and put extract under a key

* fix schedule data exception string representation and assert after outcomes upload

* fix poetry in docker, lock

* use export and install

* update typer

* fix schedule downloading... also add url filter to cli

* get latest validator from github just in case, and keep name

* rename this here too

* address PR comments

* add pool for airtable (#1743)

* deprecate airtable v1 extracts (#1699)

* deprecate airtable v1 extracts

* delete v1 airtable operator

* Change column name to fix run error (#1730)

* change date col name

* fix service_date col

* chore: remove evansiroky from most CODEOWNERS items (#1735)

* kubernetes: prod-sftp-ingest-elavon: add elavon ssh public key (#1742)

* Add GTFS guideline check for wheelchair fields in trips.txt & stops.txt (#1739)

* initial non-working draft

* Lightly testing working version

* Add check staging table to main fact table

* Switch to using gtfs_schedule_index_feed_trip_stops

* revert poetry.lock back to main

* implement laurie's suggested simplification

* add variable to group by

* fix: delete broken airtable import, cleanup from PR #1699 (#1752)

* Fix authorized_keys format for elavon SFTP ingest (#1755)

* kubernetes: dev-sftp-ingest-elavon: fix authorized_keys newline

* kubernetes: prod-sftp-ingest-elavon: fix authorized_keys newline

Co-authored-by: Laurie Merrell <[email protected]>
Co-authored-by: Andrew Vaccaro <[email protected]>
Co-authored-by: Andrew Vaccaro <[email protected]>
Co-authored-by: Charlie Costanzo <[email protected]>
Co-authored-by: Kegan Maher <[email protected]>
Co-authored-by: Laurie <[email protected]>
Co-authored-by: evansiroky <[email protected]>
Co-authored-by: Mjumbe Poe <[email protected]>
Co-authored-by: tiffanychu90 <[email protected]>
Co-authored-by: tiffanychu90 <[email protected]>
Co-authored-by: natam1 <[email protected]>
Co-authored-by: Charles Costanzo <[email protected]>
Co-authored-by: Angela Tran <[email protected]>
Co-authored-by: natam1 <[email protected]>
Co-authored-by: Scott Owades <[email protected]>
Co-authored-by: Github Action build-release-candidate <runner@fv-az75-326>
atvaccaro added a commit that referenced this pull request Jun 15, 2023
* ↥ initialized release-candidate

* Deploy Elavon SFTP ingest server into production (#1695)

* docs(datasets and tables): added and revamped RT dataset section and docs with links out to dbt docs

* docs(datasets and tables): added highlighting to important areas in dataset docs

* (portfolio docs): add notebook tips

* (portfolio docs): change order of sections

* (portfolio docs): more description about decimals and rounding

* (portfolio docs): fix formatting

* (portfolio docs): fix formatting

* start on gtfsrt-v2

* (portfolio_docs): fix typo

* docs: testing broken action

* switch rt node pool to c2 instances

* get new calitp version

* import gcs models from calitp-py!

* airtable: add macro TODO

* airtable: gtfs datasets mart & staging updates

* airtable: reenable airtable warehouse resources

* airtable: use correct incoming id names in staging tables

* airtable: get tts actually working with prior updates

* airtable: gtfs service data

* airtable: add provider/gtfs table -- fixes #1487

* airtable: clean up some ctes

* airtable: clean up references based on some schema changes

* airtable: add relationship tests for foreign keys in mart

* airtable: start renaming int to base

* airtable: refactor staging tables to be historical; refactor get latest macro to enable daily extract selection

* airtable: convert staging to views rather than tables

* airtable: convert intermediate mapping tables to base

* always compile, but only check dbt run success after docs/metabase

* run tests even if run failed

* airtable: define key as metabase PK

* airtable: add equal row count tests for models with id mapping

* airtable: rename map to bridge

* update poetry.lock for dbt-metabase

* airtable: latest-only-ify bridge tables

* missed a couple

* airtable: make mart latest-only

* airtable: refactor dim service components

* airtable: specify metabase FK columns

* airtable: new fields & tables to address #1630

* airtable: make bridge tables date-aware and assorted small fixes

* get us going!

* airtable: address failing dbt tests -- minor tweaks

* airtable: more failing dbt tests

* airtable: refactor service components to handle duplicates

* airtable: fix legacy airtable source definition to reference views

* airtable: remove redundant metabase FK metadata

* airtable: fix test syntax

* airtable: use QUALIFY to simplify ranked queries

* fix: make airtable gcs operator use timestamps rather than time string

* fix(timestamp partitions): update calitp version to get schedule partition updates

* warehouse (payments): migrated payments_views_staging cleaned dags to models as well as validation tables to tests

* use new calitp version

* fix(timestamp partitions): explicitly use isoformat string

* style: rename CTEs to be more specific

* farm surrogate key macro: coalesce nulls in macro itself

* add notebook used to re-name a partition

* chore: remove pyup config file

no longer in use

* chore: remove pyup ignore statement

* airtable: use ts instead of time

* add airtable mart to list of things synced to metabase

* update metabase database names again

* warehouse(payments_views_staging): split yml files into staging and source, added documentation for cleaned files, deleted old validation tables

* warehouse(payments_views_staging): added generic tests, added composite unique tests from dbt_packages, added docs file with references, materialized staging tables as views

* warehouse(payments_views_staging): added configuration to persist singular tests as tables in the warehouse

* warehouse(payments_views): migrated airflow dags for payments views to its own model in dbt, added metadata and generic tests, added dbt references

* print message if deploy is not set

* round lat/lons, specify 4m accuracy, add new resources

* print the documentation file being written

* add coord system, disable shapes for now due to size limit

* fix(fact daily trips timeout): wip incremental table

* update to good stable version of sqlfluff

* fix: make fact daily trips incremental -- WIP

* pass and/or ignore new rules

* linter

* fact daily trips: remove dev incremental check

* docs: update airtable prod maintenance instructions

* docs: add new dags to dependency diagram

* docs: add spacing to help w line wrapping

* docs: more spaces for line wrapping...

* dbt-metabase: update version in poetry; comment out failing relationship tests

* warehouse(payments_views): got payments_rides working and migrated, added yml and metadata, added payments_views validation tests and persisted tables, added payments_views_refactored with intermediate tables and got that to work

* get new calitp version

* import gcs models from calitp-py!

* missed a couple

* get us going!

* fix: make airtable gcs operator use timestamps rather than time string

* fix(timestamp partitions): update calitp version to get schedule partition updates

* fix(timestamp partitions): explicitly use isoformat string

* use new calitp version

* start experimenting with task queue options and metrics

* get this working and test performance with greenlets

* couple more metrics

* wip testing with multiple consumers at high volume

* start optimizing for lots of small tasks; have to make redis interaction fast

* fix key str format

* couple more libs

* wip

* wip on discussed changes

* get the keys from environ for now

* use new calitp py

* print a bit more

* we are just gonna get stuff in the env

* commit this before I break anything

* fmt

* bump calitp-py

* lint

* rename v2 to v3 since 2.X tags already exist

* kinda make this runnable

* new node pool just dropped

* get running in docker compose to kick the tires

* start on RT v3 k8s

* get the consumer working mostly?

* label redis pod appropriately

* tell consumer about temp rt secrets

* that was dumb

* ticker k8s!

* set expire time on the huey instance

* point consumer at svc account json

* avoid pulling the stacktrace in

* scrape on 9102

* bump to 16 workers per consumer

* bump jupyterhub storage to 32gi

* add these back!

* add comment

* bring in new calitp and fix tick rounding

* improve metrics and labels

* warehouse(payments): removed payemnts_rides_refactor from yml file

* clean up labels

* get secrets from secret manager sdk before the consumer starts...

* missed this

* fix secrets volume and adjust affinities

* warehouse(payments): removed the airflow dags for the payments_views that were migrated, as well as the two test tables

* warehouse(payments): removed the old intermediate tables from the dbt project yaml file

* add content type header to bytes

* ugh whitespace

* warehouse: fixing linting error

* warehouse: fixing linting error again

* warehouse(dbt_project): added to-do comments in project config to remind where to move model schemas in the future

* fix: update Mountain Transit URL

* remove celery and gevent from pyproject deps

Co-authored-by: Mjumbe Poe <[email protected]>

* we might as well specify huey app name by env as well just in case we end up on the same redis in the future

* write to the prod bucket!

* create a preprod version and deploy it

* run fewer workers in preprod

* move pull policies to patches, and only run 1 dev consumer

* add redis considerations to readme

* docs(datasets and tables): revised information on dbt docs for views tables based on PR review

* docs(datasets and tables): revised for readability

* docs(datasets and tables): revised docs information for gtfs schedule based on PR review

* docs(datasets and tables): fixed readability

* docs(datasets and tables): added new formatting, added gtfs rt dbt docs instructions

* docs(datasets and tables): revamped the overview page for datasets and tables

* docs(datasets and tables): cleaned up readability

* bump version and start adding more logging context

* specifically log request errors that do not come from raise_for_status

* set v3 image versions separately

* bump to 8 workers and improve log formatting

* formatting

* fix string representation of exception type in logs

* bump prod to 3.1

* oops

* hotfix version

* bump to 30m

* warehouse(airflow): deleted the empty payments_views_staging dag directory

* warehouse(airflow): deleted dummy_staging airflow task, removed gusty dependencies from other tables that relied on that task

* docs(airflow): edited the production dags docs to reflect changes in payments staging views dags

* docs(airflow): revised docs based on Laurie's comment re only listing enforced dependencies

* Update new-team-member.md

Added missing meetings, deleted old meetings, deleted auto-assign

* docs(datasets and tables): reconfigured some pages for readability

* docs(datasets and tables): re-reviewed and added clarity

* fix (open data): align column publish metadata with open data dictionary -- suppress calitp hash, synthetic keys, and extraction date, add calitp_itp_id and url_number

* docs(production maintenance): added n/a for dependencies for payments_views

* docs(datasets and tables): created new page with content on how to use dbt docs, added to toc

* docs(datasets and tables): removed information on how to navigate dbt docs in favor of the new page created, added info to warehouse schema sections, created dbt project directory sections

* (analyst_docs): update gcloud commands

* fix(open data): make test_metadata attribute optional to account for singular tests

* docs(datasets and tables): reformatted for readability and conciseness

* docs(datasets and tables): revisions based on Laurie's review

* docs(datasets and tables): revised PR to put gtfs views tables used by ckan under the views doc

* fix(open data): suppress publishing stop_times because of size limit issue

* agencies.yml: update FCRTA and add Escalon Transit

* agencies.yml: rename escalon transit to etrans

* fix(airflow/gtfs_loader): replace non-utf-8 characters

* feat(airtable): add new columns per request #1674

* fix(airtable data): address review comments PR #1677

* fix: add WeHo RT URLs

* fix(ckan publishing): only add columns to data dictionary if they don't have publish.ignore set

* update calitp py and change log

* make docker compose work

* specify buckets and bump version in dev

* now do prod

* change logging

* add weho key

* bump gtfs rt v3 version

* bump calitp py

* deploy new image to dev

* get dev and prod working with bucket env vars

* bump calitp py and expire cache every 5 minutes

* deploy new cache clearing to prod/dev

* make sure calitp is updated, load secrets in ticker too

* fix docker compose, use new flags, deploy new image to dev

* bump prod

* add airtable age metric, bump version, scrape ticker

* delete experimental fact_daily_trips_inc incremental table that was not functioning correctly (#1681)

* docs: correct Transit Technology Stacks title (#1565)

The Transit Technology Stacks header was not properly being linked to in the overview table. This fixes that.

* fix: update GRaaS URLs (#1690)

* New schedule pipeline validation job (#1648)

* wip on validation in new schedule pipeline

* bring in stuff from calitp storage, work on saving validations/outcomes

* wip getting this working

* use new calitp prerelease, fix filenames/content, remove break

* oops

* working!

* update lockfile

* unzip/validate schedule dag

* remove this

* bring in latest calitp-py

* extra print

* pass env vars into pod

* fix lint

* add readme

* bring in latest calitp

* fix print and formatting

* bring the outcome-only classes over, and use env var for bucket

* filter out nones for RT airtable records

* bring in latest calitp py

* get latest calitp

* use new env var and rename validation job results

* start updating airflow with new calitp py and using bucket env vars

* test schedule downloader with new calitp

* new calitp

* handle new calitp, better logging

* add env vars for new calitp

* put prefix_bucket back for parse_and_validate_rt and document env var configuration

* comments

* use new version of calitp py with good gcsfs (#1693)

* use new version of calitp py with good gcsfs

* use the regular release

* docs(agency): adding reference table for analysts to define agency, reference for pre-commit hooks (#1430)

* docs(agency): adding reference table for analysts to define agency in their research

* docs(agency): fixed table formatting error

* docs(agency): fixed table formatting error plus pre-commit hooks

* docs(pre-commit hooks): added information for using and troubleshooting pre-commit hooks

* docs: formatting errors, added missing capitalization

* docs: formatting table with list

* docs: formatting table with no line break - attempt 1

* docs: clarified language and spacing in table

* docs: clarified language in table

* docs: removing extra information from agency table

* docs: removing extra information from agency table pt 2

* docs: removing extra information from agency table pt 3

* docs: reworked table to include gtfs-provider-service relationships

* docs: added space for the gtfs provider's services section

* docs: added space for the gtfs provider's services section syntax corrections

* docs: added space for the gtfs provider's services section syntax corrections again

* docs: clarified information around gtfs provider relationships

* docs: clarified information around gtfs provider relationships and intro content

* docs: agency table revisions based on call with E

* docs(agency reference): incorporated E's feedback in the copy, added warehouse table instead of airtable table

* docs(agency reference): reformatted table

* docs(warehouse): added new table information for analyst agency reference now that the airtable migration is complete and the table was created. added css styling to prevent table scrolling

* docs: renamed python library file h1 to be more intuitive

* docs(conf): added comments explaining the added css preventing horizontal scroll in markdown tables

* docs(add to what_is_agency)

* docs(warehouse): fixed some typos, errors, and formatting issues

Co-authored-by: natam1 <[email protected]>
Co-authored-by: Charles Costanzo <[email protected]>

* we also have to pin a specific fsspec version directly in the requirements (#1694)

* Create SFTP ingest component for Elavon data (#1692)

* kubernetes: sftp-ingest-elavon: add server component

* kubernetes: sftp-server: add sshd configuration

This enables functionality like chroot'd logins and disabling of shell
logins.

* kubernetes: sftp-server: add readinessProbe

Since the container is essentially built at startup, there is a sizeable
time delta between container startup and ssh server startup. This
addition helps the operator easily detect when installation is complete
and the service is running.

* kubernetes: sftp-server: add cluster service

This enables cluster workloads to log in using a DNS name.

* kubernetes: sftp-server: refactor bootstrap script for better DRY

* kubernetes: prod-sftp-ingest-elavon: create production localization

* kubernetes: prod-sftp-ingest-elavon: add internet-service.yaml

This exposes the SFTP port for inbound connections from the vendor.

* ci: prod-sftp-ingest-elavon.env: enable prod deployment

Co-authored-by: Charlie Costanzo <[email protected]>
Co-authored-by: tiffanychu90 <[email protected]>
Co-authored-by: Andrew Vaccaro <[email protected]>
Co-authored-by: Andrew Vaccaro <[email protected]>
Co-authored-by: Laurie Merrell <[email protected]>
Co-authored-by: Kegan Maher <[email protected]>
Co-authored-by: Laurie <[email protected]>
Co-authored-by: evansiroky <[email protected]>
Co-authored-by: Mjumbe Poe <[email protected]>
Co-authored-by: tiffanychu90 <[email protected]>
Co-authored-by: natam1 <[email protected]>
Co-authored-by: Charles Costanzo <[email protected]>
Co-authored-by: Github Action build-release-candidate <runner@fv-az123-804>

* Deploy elavon SFTP credentials into production (#1749)

* airtable: start renaming int to base

* airtable: refactor staging tables to be historical; refactor get latest macro to enable daily extract selection

* airtable: convert staging to views rather than tables

* airtable: convert intermediate mapping tables to base

* always compile, but only check dbt run success after docs/metabase

* run tests even if run failed

* airtable: define key as metabase PK

* airtable: add equal row count tests for models with id mapping

* airtable: rename map to bridge

* update poetry.lock for dbt-metabase

* airtable: latest-only-ify bridge tables

* missed a couple

* airtable: make mart latest-only

* airtable: refactor dim service components

* airtable: specify metabase FK columns

* airtable: new fields & tables to address #1630

* airtable: make bridge tables date-aware and assorted small fixes

* get us going!

* airtable: address failing dbt tests -- minor tweaks

* airtable: more failing dbt tests

* airtable: refactor service components to handle duplicates

* airtable: fix legacy airtable source definition to reference views

* airtable: remove redundant metabase FK metadata

* airtable: fix test syntax

* airtable: use QUALIFY to simplify ranked queries

* fix: make airtable gcs operator use timestamps rather than time string

* fix(timestamp partitions): update calitp version to get schedule partition updates

* warehouse (payments): migrated payments_views_staging cleaned dags to models as well as validation tables to tests

* use new calitp version

* fix(timestamp partitions): explicitly use isoformat string

* style: rename CTEs to be more specific

* farm surrogate key macro: coalesce nulls in macro itself

* add notebook used to re-name a partition

* chore: remove pyup config file

no longer in use

* chore: remove pyup ignore statement

* airtable: use ts instead of time

* add airtable mart to list of things synced to metabase

* update metabase database names again

* warehouse(payments_views_staging): split yml files into staging and source, added documentation for cleaned files, deleted old validation tables

* warehouse(payments_views_staging): added generic tests, added composite unique tests from dbt_packages, added docs file with references, materialized staging tables as views

* warehouse(payments_views_staging): added configuration to persist singular tests as tables in the warehouse

* warehouse(payments_views): migrated airflow dags for payments views to its own model in dbt, added metadata and generic tests, added dbt references

* print message if deploy is not set

* round lat/lons, specify 4m accuracy, add new resources

* print the documentation file being written

* add coord system, disable shapes for now due to size limit

* fix(fact daily trips timeout): wip incremental table

* update to good stable version of sqlfluff

* fix: make fact daily trips incremental -- WIP

* pass and/or ignore new rules

* linter

* fact daily trips: remove dev incremental check

* docs: update airtable prod maintenance instructions

* docs: add new dags to dependency diagram

* docs: add spacing to help w line wrapping

* docs: more spaces for line wrapping...

* dbt-metabase: update version in poetry; comment out failing relationship tests

* warehouse(payments_views): got payments_rides working and migrated, added yml and metadata, added payments_views validation tests and persisted tables, added payments_views_refactored with intermediate tables and got that to work

* get new calitp version

* import gcs models from calitp-py!

* missed a couple

* get us going!

* fix: make airtable gcs operator use timestamps rather than time string

* fix(timestamp partitions): update calitp version to get schedule partition updates

* fix(timestamp partitions): explicitly use isoformat string

* use new calitp version

* start experimenting with task queue options and metrics

* get this working and test performance with greenlets

* couple more metrics

* wip testing with multiple consumers at high volume

* start optimizing for lots of small tasks; have to make redis interaction fast

* fix key str format

* couple more libs

* wip

* wip on discussed changes

* get the keys from environ for now

* use new calitp py

* print a bit more

* we are just gonna get stuff in the env

* commit this before I break anything

* fmt

* bump calitp-py

* lint

* rename v2 to v3 since 2.X tags already exist

* kinda make this runnable

* new node pool just dropped

* get running in docker compose to kick the tires

* start on RT v3 k8s

* get the consumer working mostly?

* label redis pod appropriately

* tell consumer about temp rt secrets

* that was dumb

* ticker k8s!

* set expire time on the huey instance

* point consumer at svc account json

* avoid pulling the stacktrace in

* scrape on 9102

* bump to 16 workers per consumer

* bump jupyterhub storage to 32gi

* add these back!

* add comment

* bring in new calitp and fix tick rounding

* improve metrics and labels

* warehouse(payments): removed payemnts_rides_refactor from yml file

* clean up labels

* get secrets from secret manager sdk before the consumer starts...

* missed this

* fix secrets volume and adjust affinities

* warehouse(payments): removed the airflow dags for the payments_views that were migrated, as well as the two test tables

* warehouse(payments): removed the old intermediate tables from the dbt project yaml file

* add content type header to bytes

* ugh whitespace

* warehouse: fixing linting error

* warehouse: fixing linting error again

* warehouse(dbt_project): added to-do comments in project config to remind where to move model schemas in the future

* fix: update Mountain Transit URL

* remove celery and gevent from pyproject deps

Co-authored-by: Mjumbe Poe <[email protected]>

* we might as well specify huey app name by env as well just in case we end up on the same redis in the future

* write to the prod bucket!

* create a preprod version and deploy it

* run fewer workers in preprod

* move pull policies to patches, and only run 1 dev consumer

* add redis considerations to readme

* docs(datasets and tables): revised information on dbt docs for views tables based on PR review

* docs(datasets and tables): revised for readability

* docs(datasets and tables): revised docs information for gtfs schedule based on PR review

* docs(datasets and tables): fixed readability

* docs(datasets and tables): added new formatting, added gtfs rt dbt docs instructions

* docs(datasets and tables): revamped the overview page for datasets and tables

* docs(datasets and tables): cleaned up readability

* bump version and start adding more logging context

* specifically log request errors that do not come from raise_for_status

* set v3 image versions separately

* bump to 8 workers and improve log formatting

* formatting

* fix string representation of exception type in logs

* bump prod to 3.1

* oops

* hotfix version

* bump to 30m

* warehouse(airflow): deleted the empty payments_views_staging dag directory

* warehouse(airflow): deleted dummy_staging airflow task, removed gusty dependencies from other tables that relied on that task

* docs(airflow): edited the production dags docs to reflect changes in payments staging views dags

* docs(airflow): revised docs based on Laurie's comment re only listing enforced dependencies

* Update new-team-member.md

Added missing meetings, deleted old meetings, deleted auto-assign

* docs(datasets and tables): reconfigured some pages for readability

* docs(datasets and tables): re-reviewed and added clarity

* fix (open data): align column publish metadata with open data dictionary -- suppress calitp hash, synthetic keys, and extraction date, add calitp_itp_id and url_number

* docs(production maintenance): added n/a for dependencies for payments_views

* docs(datasets and tables): created new page with content on how to use dbt docs, added to toc

* docs(datasets and tables): removed information on how to navigate dbt docs in favor of the new page created, added info to warehouse schema sections, created dbt project directory sections

* (analyst_docs): update gcloud commands

* fix(open data): make test_metadata attribute optional to account for singular tests

* docs(datasets and tables): reformatted for readability and conciseness

* docs(datasets and tables): revisions based on Laurie's review

* docs(datasets and tables): revised PR to put gtfs views tables used by ckan under the views doc

* fix(open data): suppress publishing stop_times because of size limit issue

* agencies.yml: update FCRTA and add Escalon Transit

* agencies.yml: rename escalon transit to etrans

* fix(airflow/gtfs_loader): replace non-utf-8 characters

* feat(airtable): add new columns per request #1674

* fix(airtable data): address review comments PR #1677

* fix: add WeHo RT URLs

* fix(ckan publishing): only add columns to data dictionary if they don't have publish.ignore set

* update calitp py and change log

* make docker compose work

* specify buckets and bump version in dev

* now do prod

* change logging

* add weho key

* bump gtfs rt v3 version

* bump calitp py

* deploy new image to dev

* get dev and prod working with bucket env vars

* bump calitp py and expire cache every 5 minutes

* deploy new cache clearing to prod/dev

* make sure calitp is updated, load secrets in ticker too

* fix docker compose, use new flags, deploy new image to dev

* bump prod

* add airtable age metric, bump version, scrape ticker

* delete experimental fact_daily_trips_inc incremental table that was not functioning correctly (#1681)

* docs: correct Transit Technology Stacks title (#1565)

The Transit Technology Stacks header was not properly being linked to in the overview table. This fixes that.

* fix: update GRaaS URLs (#1690)

* New schedule pipeline validation job (#1648)

* wip on validation in new schedule pipeline

* bring in stuff from calitp storage, work on saving validations/outcomes

* wip getting this working

* use new calitp prerelease, fix filenames/content, remove break

* oops

* working!

* update lockfile

* unzip/validate schedule dag

* remove this

* bring in latest calitp-py

* extra print

* pass env vars into pod

* fix lint

* add readme

* bring in latest calitp

* fix print and formatting

* bring the outcome-only classes over, and use env var for bucket

* filter out nones for RT airtable records

* bring in latest calitp py

* get latest calitp

* use new env var and rename validation job results

* start updating airflow with new calitp py and using bucket env vars

* test schedule downloader with new calitp

* new calitp

* handle new calitp, better logging

* add env vars for new calitp

* put prefix_bucket back for parse_and_validate_rt and document env var configuration

* comments

* use new version of calitp py with good gcsfs (#1693)

* use new version of calitp py with good gcsfs

* use the regular release

* docs(agency): adding reference table for analysts to define agency, reference for pre-commit hooks (#1430)

* docs(agency): adding reference table for analysts to define agency in their research

* docs(agency): fixed table formatting error

* docs(agency): fixed table formatting error plus pre-commit hooks

* docs(pre-commit hooks): added information for using and troubleshooting pre-commit hooks

* docs: formatting errors, added missing capitalization

* docs: formatting table with list

* docs: formatting table with no line break - attempt 1

* docs: clarified language and spacing in table

* docs: clarified language in table

* docs: removing extra information from agency table

* docs: removing extra information from agency table pt 2

* docs: removing extra information from agency table pt 3

* docs: reworked table to include gtfs-provider-service relationships

* docs: added space for the gtfs provider's services section

* docs: added space for the gtfs provider's services section syntax corrections

* docs: added space for the gtfs provider's services section syntax corrections again

* docs: clarified information around gtfs provider relationships

* docs: clarified information around gtfs provider relationships and intro content

* docs: agency table revisions based on call with E

* docs(agency reference): incorporated E's feedback in the copy, added warehouse table instead of airtable table

* docs(agency reference): reformatted table

* docs(warehouse): added new table information for analyst agency reference now that the airtable migration is complete and the table was created. added css styling to prevent table scrolling

* docs: renamed python library file h1 to be more intuitive

* docs(conf): added comments explaining the added css preventing horizontal scroll in markdown tables

* docs(add to what_is_agency)

* docs(warehouse): fixed some typos, errors, and formatting issues

Co-authored-by: natam1 <[email protected]>
Co-authored-by: Charles Costanzo <[email protected]>

* we also have to pin a specific fsspec version directly in the requirements (#1694)

* Create SFTP ingest component for Elavon data (#1692)

* kubernetes: sftp-ingest-elavon: add server component

* kubernetes: sftp-server: add sshd configuration

This enables functionality like chroot'd logins and disabling of shell
logins.

* kubernetes: sftp-server: add readinessProbe

Since the container is essentially built at startup, there is a sizeable
time delta between container startup and ssh server startup. This
addition helps the operator easily detect when installation is complete
and the service is running.

* kubernetes: sftp-server: add cluster service

This enables cluster workloads to log in using a DNS name.

* kubernetes: sftp-server: refactor bootstrap script for better DRY

* kubernetes: prod-sftp-ingest-elavon: create production localization

* kubernetes: prod-sftp-ingest-elavon: add internet-service.yaml

This exposes the SFTP port for inbound connections from the vendor.

* ci: prod-sftp-ingest-elavon.env: enable prod deployment

* Fix typo in `what is agency` (#1698)

it's --> it's

* limit schedule validation jobs with a pool (#1700)

* Created new row-level access policy macro and applied it to payments_rides (#1697)

* created new row-level access policy and applied it to payments rides with newly generated service accounts

* ran pre-commit hooks to fix failing actions

Co-authored-by: Charles Costanzo <[email protected]>

* deploy voila fix (#1702)

* disable autodetect if schema is specified (#1704)

* Create v2 RT parsing and validation jobs in Airflow and create external tables (#1691)

* start on new parsing job

* comment and fmt

* wip getting parsing working

* fmt

* get parsing working!

* save outcomes file properly

* remove old validator and dupe log

* this is only jsonl right now, so this workaround is bad

* wip on validation

* wip

* get parsing working, start simplifying

* get validation working with schedules referenced by airtable!

* missed this

* get the actual rt v2 airflow jobs mostly working

* missed this

* run v2 RT jobs at :15 instead of :30

* convert metadata field names to bq-safe

* fix being able to template bucket and test out rt_service_alerts_v2 external table

* add outcomes external table to test

* wip trying to get a debugger to test pydantic custom serialization

* fix rt outcome serialization to be bq safe

* create rest of rt v2 external tables

* couple small fixes

* start addressing PR comments

* address PR comment

* add ci/cd action to build gtfs-rt-parser-v2 image

* Fix: skip amplitude_benefits DAG if 404 (#1705)

* fix(amplitude): mark skip when 404 is encountered

* chore(amplitude): add some logging statements around API call

* Gtfs schedule unzip v2 (#1696)

* gtfs loader v2 wip

* gtfs unzipper v2: semi-working WIP -- can unzip at least one zipfile

* address initial review comments

* bump calitp version

* gtfs unzipper v2: working version with required functionality

* update calitp and make the downloader run with it

* gtfs unzipper v2: get working in airflow; use logging

* rename to distinguish zipfile from extracted files within zipfile

* resolve reviewer comments

* gtfs unzipper v2: refactor to raise exceptions on unparseable zips

* gtfs unzipper: further simplify exception handling

* final tweaks -- refactor of checking for invalid zip structure, tighten up processing of valid files

* comment typos/clarifications

Co-authored-by: Andrew Vaccaro <[email protected]>

* warehouse: added fare_systems transit database mart table (#1701)

* warehouse: added fare_systems transit database mart table

* warehouse: fixed duplicate doc macro issue for fare_systems

* explicitly declared schema

* removed columns no longer relevant

* warehouse: added bridge table for fare_systems x services

* warehouse: added bridge table for fare_systems x services to yaml

* Clean up RT outcomes (#1709)

* remove unnecessary json_encoders

* just save the extract path

* add a dockerignore

* grant access to payments_rides for non-agency users (#1714)

* grant access to payments_rides for non-agency users

* just use calitp domain and add a couple other users

* Run CKAN weekly, with multipart uploads as needed (#1710)

* wip getting multipart upload to ckan working

* remove before I forget

* mirror the example script... we get 500s with too many chunks

* commit this while it is working

* add this back

* allow env var to control target and bucket

* create weekly task to run publish california_open_data

* allow manifest to be in gcs

* get this actually working...

* dockerignore

* clean up names, add resource requests, make work in pod operator

* address PR comments

* load this from a secret (#1717)

* Initial dbt models to support GTFS guidelines checks (#1712)

* initial work towards #1688

* gtfs guidelines initial implementation: tweaks & improvements

* gtfs guidelines: add metabase semantic type for calitp agency name

* sync new dataset to metabase

* gtfs guidelines: rename table, formatting updates

* rename compliance gtfs feature per PR review

* Add RT VP vs Sched Table (#1708)

* add table

* add table

* add operator

* fix sql syntax

* fix failing indentations

* add unique test

* fix .yml test

* Create local Dockerfile and bash script for dbt development (#1711)

* start on local dev dockerfile

* handle local profiles dir

* make dbt docker work with local google credentials

* add build-essentials per recommendation

* update poetry install method and add libgdal-dev

* poetry changed its bin location

* Improvements to dbt artifacts and publish workflow (#1726)

* add ts partition to publish artifacts

* also save artifacts with timestamps vs just latest

* start simplifying publish script, proper dry runs, reading manifest from gcs

* fix publish assert, use env vars, simplify logging

* allow resource descriptions in publishing, allow direct remote writing

* ugh

* need to be utc

* bring in simplified descriptions

* missed bucket

* upload metadata/dictionary to gcs for ckan; also fix bug

* update ckan docs to reflect publishing changes

* actually these should always get written

* env vars not templating

* fix timestamped artifact names

* pretty print

* address pr comments

* update ckan publishing docs

* actually set ckan precision fields and use them

* uppercase field types and allow specifying a model to publish

* bad dict key

* these are length 7

* lats are only 6 digits

* warehouse documentation: add calitp_itp_id and calitp_url_number metadata to several dimensional columns (#1733)

* airtable organizations: define external table schemas (#1734)

* Upgrade schedule validator and save version as metadata (#1729)

* update to v3 validator, fix dockerfile

* finally deploy the schedule validator image through github actions

* bring in latest calitp

* use new calitp, simplify metadata, add version to notice rows, couple qol improvements

* change flag per v3

* use poetry export install here too

* lock

* export install here too

* add verbose, just copy jar instead of download

* use environ directly

* Set RT validator version as metadata and fix a bug (#1732)

* set rt validator version as metadata

* add validator version in metadata and put extract under a key

* fix schedule data exception string representation and assert after outcomes upload

* fix poetry in docker, lock

* use export and install

* update typer

* fix schedule downloading... also add url filter to cli

* get latest validator from github just in case, and keep name

* rename this here too

* address PR comments

* add pool for airtable (#1743)

* deprecate airtable v1 extracts (#1699)

* deprecate airtable v1 extracts

* delete v1 airtable operator

* Change column name to fix run error (#1730)

* change date col name

* fix service_date col

* chore: remove evansiroky from most CODEOWNERS items (#1735)

* kubernetes: prod-sftp-ingest-elavon: add elavon ssh public key (#1742)

Co-authored-by: Laurie Merrell <[email protected]>
Co-authored-by: Andrew Vaccaro <[email protected]>
Co-authored-by: Andrew Vaccaro <[email protected]>
Co-authored-by: Charlie Costanzo <[email protected]>
Co-authored-by: Kegan Maher <[email protected]>
Co-authored-by: Laurie <[email protected]>
Co-authored-by: evansiroky <[email protected]>
Co-authored-by: Mjumbe Poe <[email protected]>
Co-authored-by: tiffanychu90 <[email protected]>
Co-authored-by: tiffanychu90 <[email protected]>
Co-authored-by: natam1 <[email protected]>
Co-authored-by: Charles Costanzo <[email protected]>
Co-authored-by: Angela Tran <[email protected]>
Co-authored-by: natam1 <[email protected]>
Co-authored-by: Github Action build-release-candidate <runner@fv-az173-876>

* Deploy authorized_keys format fix into production (#1756)

* airtable: convert intermediate mapping tables to base

* always compile, but only check dbt run success after docs/metabase

* run tests even if run failed

* airtable: define key as metabase PK

* airtable: add equal row count tests for models with id mapping

* airtable: rename map to bridge

* update poetry.lock for dbt-metabase

* airtable: latest-only-ify bridge tables

* missed a couple

* airtable: make mart latest-only

* airtable: refactor dim service components

* airtable: specify metabase FK columns

* airtable: new fields & tables to address #1630

* airtable: make bridge tables date-aware and assorted small fixes

* get us going!

* airtable: address failing dbt tests -- minor tweaks

* airtable: more failing dbt tests

* airtable: refactor service components to handle duplicates

* airtable: fix legacy airtable source definition to reference views

* airtable: remove redundant metabase FK metadata

* airtable: fix test syntax

* airtable: use QUALIFY to simplify ranked queries

* fix: make airtable gcs operator use timestamps rather than time string

* fix(timestamp partitions): update calitp version to get schedule partition updates

* warehouse (payments): migrated payments_views_staging cleaned dags to models as well as validation tables to tests

* use new calitp version

* fix(timestamp partitions): explicitly use isoformat string

* style: rename CTEs to be more specific

* farm surrogate key macro: coalesce nulls in macro itself

* add notebook used to re-name a partition

* chore: remove pyup config file

no longer in use

* chore: remove pyup ignore statement

* airtable: use ts instead of time

* add airtable mart to list of things synced to metabase

* update metabase database names again

* warehouse(payments_views_staging): split yml files into staging and source, added documentation for cleaned files, deleted old validation tables

* warehouse(payments_views_staging): added generic tests, added composite unique tests from dbt_packages, added docs file with references, materialized staging tables as views

* warehouse(payments_views_staging): added configuration to persist singular tests as tables in the warehouse

* warehouse(payments_views): migrated airflow dags for payments views to its own model in dbt, added metadata and generic tests, added dbt references

* print message if deploy is not set

* round lat/lons, specify 4m accuracy, add new resources

* print the documentation file being written

* add coord system, disable shapes for now due to size limit

* fix(fact daily trips timeout): wip incremental table

* update to good stable version of sqlfluff

* fix: make fact daily trips incremental -- WIP

* pass and/or ignore new rules

* linter

* fact daily trips: remove dev incremental check

* docs: update airtable prod maintenance instructions

* docs: add new dags to dependency diagram

* docs: add spacing to help w line wrapping

* docs: more spaces for line wrapping...

* dbt-metabase: update version in poetry; comment out failing relationship tests

* warehouse(payments_views): got payments_rides working and migrated, added yml and metadata, added payments_views validation tests and persisted tables, added payments_views_refactored with intermediate tables and got that to work

* get new calitp version

* import gcs models from calitp-py!

* missed a couple

* get us going!

* fix: make airtable gcs operator use timestamps rather than time string

* fix(timestamp partitions): update calitp version to get schedule partition updates

* fix(timestamp partitions): explicitly use isoformat string

* use new calitp version

* start experimenting with task queue options and metrics

* get this working and test performance with greenlets

* couple more metrics

* wip testing with multiple consumers at high volume

* start optimizing for lots of small tasks; have to make redis interaction fast

* fix key str format

* couple more libs

* wip

* wip on discussed changes

* get the keys from environ for now

* use new calitp py

* print a bit more

* we are just gonna get stuff in the env

* commit this before I break anything

* fmt

* bump calitp-py

* lint

* rename v2 to v3 since 2.X tags already exist

* kinda make this runnable

* new node pool just dropped

* get running in docker compose to kick the tires

* start on RT v3 k8s

* get the consumer working mostly?

* label redis pod appropriately

* tell consumer about temp rt secrets

* that was dumb

* ticker k8s!

* set expire time on the huey instance

* point consumer at svc account json

* avoid pulling the stacktrace in

* scrape on 9102

* bump to 16 workers per consumer

* bump jupyterhub storage to 32gi

* add these back!

* add comment

* bring in new calitp and fix tick rounding

* improve metrics and labels

* warehouse(payments): removed payemnts_rides_refactor from yml file

* clean up labels

* get secrets from secret manager sdk before the consumer starts...

* missed this

* fix secrets volume and adjust affinities

* warehouse(payments): removed the airflow dags for the payments_views that were migrated, as well as the two test tables

* warehouse(payments): removed the old intermediate tables from the dbt project yaml file

* add content type header to bytes

* ugh whitespace

* warehouse: fixing linting error

* warehouse: fixing linting error again

* warehouse(dbt_project): added to-do comments in project config to remind where to move model schemas in the future

* fix: update Mountain Transit URL

* remove celery and gevent from pyproject deps

Co-authored-by: Mjumbe Poe <[email protected]>

* we might as well specify huey app name by env as well just in case we end up on the same redis in the future

* write to the prod bucket!

* create a preprod version and deploy it

* run fewer workers in preprod

* move pull policies to patches, and only run 1 dev consumer

* add redis considerations to readme

* docs(datasets and tables): revised information on dbt docs for views tables based on PR review

* docs(datasets and tables): revised for readability

* docs(datasets and tables): revised docs information for gtfs schedule based on PR review

* docs(datasets and tables): fixed readability

* docs(datasets and tables): added new formatting, added gtfs rt dbt docs instructions

* docs(datasets and tables): revamped the overview page for datasets and tables

* docs(datasets and tables): cleaned up readability

* bump version and start adding more logging context

* specifically log request errors that do not come from raise_for_status

* set v3 image versions separately

* bump to 8 workers and improve log formatting

* formatting

* fix string representation of exception type in logs

* bump prod to 3.1

* oops

* hotfix version

* bump to 30m

* warehouse(airflow): deleted the empty payments_views_staging dag directory

* warehouse(airflow): deleted dummy_staging airflow task, removed gusty dependencies from other tables that relied on that task

* docs(airflow): edited the production dags docs to reflect changes in payments staging views dags

* docs(airflow): revised docs based on Laurie's comment re only listing enforced dependencies

* Update new-team-member.md

Added missing meetings, deleted old meetings, deleted auto-assign

* docs(datasets and tables): reconfigured some pages for readability

* docs(datasets and tables): re-reviewed and added clarity

* fix (open data): align column publish metadata with open data dictionary -- suppress calitp hash, synthetic keys, and extraction date, add calitp_itp_id and url_number

* docs(production maintenance): added n/a for dependencies for payments_views

* docs(datasets and tables): created new page with content on how to use dbt docs, added to toc

* docs(datasets and tables): removed information on how to navigate dbt docs in favor of the new page created, added info to warehouse schema sections, created dbt project directory sections

* (analyst_docs): update gcloud commands

* fix(open data): make test_metadata attribute optional to account for singular tests

* docs(datasets and tables): reformatted for readability and conciseness

* docs(datasets and tables): revisions based on Laurie's review

* docs(datasets and tables): revised PR to put gtfs views tables used by ckan under the views doc

* fix(open data): suppress publishing stop_times because of size limit issue

* agencies.yml: update FCRTA and add Escalon Transit

* agencies.yml: rename escalon transit to etrans

* fix(airflow/gtfs_loader): replace non-utf-8 characters

* feat(airtable): add new columns per request #1674

* fix(airtable data): address review comments PR #1677

* fix: add WeHo RT URLs

* fix(ckan publishing): only add columns to data dictionary if they don't have publish.ignore set

* update calitp py and change log

* make docker compose work

* specify buckets and bump version in dev

* now do prod

* change logging

* add weho key

* bump gtfs rt v3 version

* bump calitp py

* deploy new image to dev

* get dev and prod working with bucket env vars

* bump calitp py and expire cache every 5 minutes

* deploy new cache clearing to prod/dev

* make sure calitp is updated, load secrets in ticker too

* fix docker compose, use new flags, deploy new image to dev

* bump prod

* add airtable age metric, bump version, scrape ticker

* delete experimental fact_daily_trips_inc incremental table that was not functioning correctly (#1681)

* docs: correct Transit Technology Stacks title (#1565)

The Transit Technology Stacks header was not properly being linked to in the overview table. This fixes that.

* fix: update GRaaS URLs (#1690)

* New schedule pipeline validation job (#1648)

* wip on validation in new schedule pipeline

* bring in stuff from calitp storage, work on saving validations/outcomes

* wip getting this working

* use new calitp prerelease, fix filenames/content, remove break

* oops

* working!

* update lockfile

* unzip/validate schedule dag

* remove this

* bring in latest calitp-py

* extra print

* pass env vars into pod

* fix lint

* add readme

* bring in latest calitp

* fix print and formatting

* bring the outcome-only classes over, and use env var for bucket

* filter out nones for RT airtable records

* bring in latest calitp py

* get latest calitp

* use new env var and rename validation job results

* start updating airflow with new calitp py and using bucket env vars

* test schedule downloader with new calitp

* new calitp

* handle new calitp, better logging

* add env vars for new calitp

* put prefix_bucket back for parse_and_validate_rt and document env var configuration

* comments

* use new version of calitp py with good gcsfs (#1693)

* use new version of calitp py with good gcsfs

* use the regular release

* docs(agency): adding reference table for analysts to define agency, reference for pre-commit hooks (#1430)

* docs(agency): adding reference table for analysts to define agency in their research

* docs(agency): fixed table formatting error

* docs(agency): fixed table formatting error plus pre-commit hooks

* docs(pre-commit hooks): added information for using and troubleshooting pre-commit hooks

* docs: formatting errors, added missing capitalization

* docs: formatting table with list

* docs: formatting table with no line break - attempt 1

* docs: clarified language and spacing in table

* docs: clarified language in table

* docs: removing extra information from agency table

* docs: removing extra information from agency table pt 2

* docs: removing extra information from agency table pt 3

* docs: reworked table to include gtfs-provider-service relationships

* docs: added space for the gtfs provider's services section

* docs: added space for the gtfs provider's services section syntax corrections

* docs: added space for the gtfs provider's services section syntax corrections again

* docs: clarified information around gtfs provider relationships

* docs: clarified information around gtfs provider relationships and intro content

* docs: agency table revisions based on call with E

* docs(agency reference): incorporated E's feedback in the copy, added warehouse table instead of airtable table

* docs(agency reference): reformatted table

* docs(warehouse): added new table information for analyst agency reference now that the airtable migration is complete and the table was created. added css styling to prevent table scrolling

* docs: renamed python library file h1 to be more intuitive

* docs(conf): added comments explaining the added css preventing horizontal scroll in markdown tables

* docs(add to what_is_agency)

* docs(warehouse): fixed some typos, errors, and formatting issues

Co-authored-by: natam1 <[email protected]>
Co-authored-by: Charles Costanzo <[email protected]>

* we also have to pin a specific fsspec version directly in the requirements (#1694)

* Create SFTP ingest component for Elavon data (#1692)

* kubernetes: sftp-ingest-elavon: add server component

* kubernetes: sftp-server: add sshd configuration

This enables functionality like chroot'd logins and disabling of shell
logins.

* kubernetes: sftp-server: add readinessProbe

Since the container is essentially built at startup, there is a sizeable
time delta between container startup and ssh server startup. This
addition helps the operator easily detect when installation is complete
and the service is running.

* kubernetes: sftp-server: add cluster service

This enables cluster workloads to log in using a DNS name.

* kubernetes: sftp-server: refactor bootstrap script for better DRY

* kubernetes: prod-sftp-ingest-elavon: create production localization

* kubernetes: prod-sftp-ingest-elavon: add internet-service.yaml

This exposes the SFTP port for inbound connections from the vendor.

* ci: prod-sftp-ingest-elavon.env: enable prod deployment

* Fix typo in `what is agency` (#1698)

it's --> it's

* limit schedule validation jobs with a pool (#1700)

* Created new row-level access policy macro and applied it to payments_rides (#1697)

* created new row-level access policy and applied it to payments rides with newly generated service accounts

* ran pre-commit hooks to fix failing actions

Co-authored-by: Charles Costanzo <[email protected]>

* deploy voila fix (#1702)

* disable autodetect if schema is specified (#1704)

* Create v2 RT parsing and validation jobs in Airflow and creates external tables (#1691)

* start on new parsing job

* comment and fmt

* wip getting parsing working

* fmt

* get parsing working!

* save outcomes file properly

* remove old validator and dupe log

* this is only jsonl right now, so this workaround is bad

* wip on validation

* wip

* get parsing working, start simplifying

* get validation working with schedules referenced by airtable!

* missed this

* get the actual rt v2 airflow jobs mostly working

* missed this

* run v2 RT jobs at :15 instead of :30

* convert metadata field names to bq-safe

* fix being able to template bucket and test out rt_service_alerts_v2 external table

* add outcomes external table to test

* wip trying to get a debugger to test pydantic custom serialization

* fix rt outcome serialization to be bq safe

* create rest of rt v2 external tables

* couple small fixes

* start addressing PR comments

* address PR comment

* add ci/cd action to build gtfs-rt-parser-v2 image

* Fix: skip amplitude_benefits DAG if 404 (#1705)

* fix(amplitude): mark skip when 404 is encountered

* chore(amplitude): add some logging statements around API call

* Gtfs schedule unzip v2 (#1696)

* gtfs loader v2 wip

* gtfs unzipper v2: semi-working WIP -- can unzip at least one zipfile

* address initial review comments

* bump calitp version

* gtfs unzipper v2: working version with required functionality

* update calitp and make the downloader run with it

* gtfs unzipper v2: get working in airflow; use logging

* rename to distinguish zipfile from extracted files within zipfile

* resolve reviewer comments

* gtfs unzipper v2: refactor to raise exceptions on unparseable zips

* gtfs unzipper: further simplify exception handling

* final tweaks -- refactor of checking for invalid zip structure, tighten up processing of valid files

* comment typos/clarifications

Co-authored-by: Andrew Vaccaro <[email protected]>

* warehouse: added fare_systems transit database mart table (#1701)

* warehouse: added fare_systems transit database mart table

* warehouse: fixed duplicate doc macro issue for fare_systems

* explicitly declared schema

* removed columns no longer relevant

* warehouse: added bridge table for fare_systems x services

* warehouse: added bridge table for fare_systems x services to yaml

* Clean up RT outcomes (#1709)

* remove unnecessary json_encoders

* just save the extract path

* add a dockerignore

* grant access to payments_rides for non-agency users (#1714)

* grant access to payments_rides for non-agency users

* just use calitp domain and add a couple other users

* Run CKAN weekly, with multipart uploads as needed (#1710)

* wip getting multipart upload to ckan working

* remove before I forget

* mirror the example script... we get 500s with too many chunks

* commit this while it is working

* add this back

* allow env var to control target and bucket

* create weekly task to run publish california_open_data

* allow manifest to be in gcs

* get this actually working...

* dockerignore

* clean up names, add resource requests, make work in pod operator

* address PR comments

* load this from a secret (#1717)

* Initial dbt models to support GTFS guidelines checks (#1712)

* initial work towards #1688

* gtfs guidelines initial implementation: tweaks & improvements

* gtfs guidelines: add metabase semantic type for calitp agency name

* sync new dataset to metabase

* gtfs guidelines: rename table, formatting updates

* rename compliance gtfs feature per PR review

* Add RT VP vs Sched Table (#1708)

* add table

* add table

* add operator

* fix sql syntax

* fix failing indentations

* add unique test

* fix .yml test

* Create local Dockerfile and bash script for dbt development (#1711)

* start on local dev dockerfile

* handle local profiles dir

* make dbt docker work with local google credentials

* add build-essentials per recommendation

* update poetry install method and add libgdal-dev

* poetry changed its bin location

* Improvements to dbt artifacts and publish workflow (#1726)

* add ts partition to publish artifacts

* also save artifacts with timestamps vs just latest

* start simplifying publish script, proper dry runs, reading manifest from gcs

* fix publish assert, use env vars, simplify logging

* allow resource descriptions in publishing, allow direct remote writing

* ugh

* need to be utc

* bring in simplified descriptions

* missed bucket

* upload metadata/dictionary to gcs for ckan; also fix bug

* update ckan docs to reflect publishing changes

* actually these should always get written

* env vars not templating

* fix timestamped artifact names

* pretty print

* address pr comments

* update ckan publishing docs

* actually set ckan precision fields and use them

* uppercase field types and allow specifying a model to publish

* bad dict key

* these are length 7

* lats are only 6 digits

* warehouse documentation: add calitp_itp_id and calitp_url_number metadata to several dimensional columns (#1733)

* airtable organizations: define external table schemas (#1734)

* Upgrade schedule validator and save version as metadata (#1729)

* update to v3 validator, fix dockerfile

* finally deploy the schedule validator image through github actions

* bring in latest calitp

* use new calitp, simplify metadata, add version to notice rows, couple qol improvements

* change flag per v3

* use poetry export install here too

* lock

* export install here too

* add verbose, just copy jar instead of download

* use environ directly

* Set RT validator version as metadata and fix a bug (#1732)

* set rt validator version as metadata

* add validator version in metadata and put extract under a key

* fix schedule data exception string representation and assert after outcomes upload

* fix poetry in docker, lock

* use export and install

* update typer

* fix schedule downloading... also add url filter to cli

* get latest validator from github just in case, and keep name

* rename this here too

* address PR comments

* add pool for airtable (#1743)

* deprecate airtable v1 extracts (#1699)

* deprecate airtable v1 extracts

* delete v1 airtable operator

* Change column name to fix run error (#1730)

* change date col name

* fix service_date col

* chore: remove evansiroky from most CODEOWNERS items (#1735)

* kubernetes: prod-sftp-ingest-elavon: add elavon ssh public key (#1742)

* Add GTFS guideline check for wheelchair fields in trips.txt & stops.txt (#1739)

* initial non-working draft

* Lightly testing working version

* Add check staging table to main fact table

* Switch to using gtfs_schedule_index_feed_trip_stops

* revert poetry.lock back to main

* implement laurie's suggested simplification

* add variable to group by

* fix: delete broken airtable import, cleanup from PR #1699 (#1752)

* Fix authorized_keys format for elavon SFTP ingest (#1755)

* kubernetes: dev-sftp-ingest-elavon: fix authorized_keys newline

* kubernetes: prod-sftp-ingest-elavon: fix authorized_keys newline

Co-authored-by: Laurie Merrell <[email protected]>
Co-authored-by: Andrew Vaccaro <[email protected]>
Co-authored-by: Andrew Vaccaro <[email protected]>
Co-authored-by: Charlie Costanzo <[email protected]>
Co-authored-by: Kegan Maher <[email protected]>
Co-authored-by: Laurie <[email protected]>
Co-authored-by: evansiroky <[email protected]>
Co-authored-by: Mjumbe Poe <[email protected]>
Co-authored-by: tiffanychu90 <[email protected]>
Co-authored-by: tiffanychu90 <[email protected]>
Co-authored-by: natam1 <[email protected]>
Co-authored-by: Charles Costanzo <[email protected]>
Co-authored-by: Angela Tran <[email protected]>
Co-authored-by: natam1 <[email protected]>
Co-authored-by: Scott Owades <[email protected]>
Co-authored-by: Github Action build-release-candidate <runner@fv-az75-326>

* clean up legacy ci scripts

* add test archiver env

---------

Co-authored-by: Github Action build-release-candidate <runner@fv-az178-772>
Co-authored-by: Github Action build-release-candidate <runner@fv-az377-222>
Co-authored-by: Github Action build-release-candidate <runner@fv-az246-231>
Co-authored-by: Github Action build-release-candidate <runner@fv-az90-299>
Co-authored-by: Github Action build-release-candidate <runner@fv-az178-704>
Co-authored-by: James Lott <[email protected]>
Co-authored-by: Github Action build-release-candidate <runner@fv-az375-32>
Co-authored-by: Github Action build-release-candidate <runner@fv-az163-886>
Co-authored-by: Github Action build-release-candidate <runner@fv-az243-75>
Co-authored-by: Github Action build-release-candidate <runner@fv-az90-455>
Co-authored-by: Github Action build-release-candidate <runner@fv-az246-133>
Co-authored-by: Github Action build-release-candidate <runner@fv-az252-216>
Co-authored-by: Github Action build-release-candidate <runner@fv-az133-360>
Co-authored-by: Github Action build-release-candidate <runner@fv-az180-926>
Co-authored-by: Github Action build-release-candidate <runner@fv-az90-628>
Co-authored-by: Github Action build-release-candidate <runner@fv-az108-244>
Co-authored-by: Github Action build-release-candidate <runner@fv-az131-257>
Co-authored-by: Github Action build-release-candidate <runner@fv-az462-391>
Co-authored-by: Laurie <[email protected]>
Co-authored-by: Github Action build-release-candidate <runner@fv-az292-582>
Co-authored-by: Github Action build-release-candidate <runner@fv-az435-93>
Co-authored-by: Github Action build-release-candidate <runner@fv-az201-924>
Co-authored-by: Charlie Costanzo <[email protected]>
Co-authored-by: tiffanychu90 <[email protected]>
Co-authored-by: Laurie Merrell <[email protected]>
Co-authored-by: Kegan Maher <[email protected]>
Co-authored-by: evansiroky <[email protected]>
Co-authored-by: Mjumbe Poe <[email protected]>
Co-authored-by: tiffanychu90 <[email protected]>
Co-authored-by: natam1 <[email protected]>
Co-authored-by: Charles Costanzo <[email protected]>
Co-authored-by: Github Action build-release-candidate <runner@fv-az123-804>
Co-authored-by: Angela Tran <[email protected]>
Co-authored-by: natam1 <[email protected]>
Co-authored-by: …