Fix: filter Benefits bot events #3547

Merged 1 commit into main on Nov 13, 2024

@thekaveman (Member) commented Nov 13, 2024

Description

TL;DR: about 26.5 million of the roughly 27 million records in the raw Amplitude data aren't needed in the final warehouse fact table / model, so we can filter them out.

Full details in Slack thread: https://cal-itp.slack.com/archives/C037Y3UE71P/p1731533304019569

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

How has this been tested?

Before this change

$ poetry run dbt run -s +fct_benefits_events
22:04:44  Running with dbt=1.5.1
22:04:46  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.calitp_warehouse.mart.ad_hoc
22:04:46  Found 422 models, 963 tests, 0 snapshots, 0 analyses, 852 macros, 0 operations, 12 seed files, 174 sources, 4 exposures, 0 metrics, 0 groups
22:04:46  
22:04:49  Concurrency: 8 threads (target='dev')
22:04:49  
22:04:49  1 of 2 START sql view model kegan_staging.stg_amplitude__benefits_events ....... [RUN]
22:04:51  1 of 2 OK created sql view model kegan_staging.stg_amplitude__benefits_events .. [CREATE VIEW (0 processed) in 1.14s]
22:04:51  2 of 2 START sql table model kegan_mart_benefits.fct_benefits_events ........... [RUN]
22:05:05  2 of 2 OK created sql table model kegan_mart_benefits.fct_benefits_events ...... [CREATE TABLE (26.9m rows, 73.1 GiB processed) in 14.09s]
22:05:05  
22:05:05  Finished running 1 view model, 1 table model in 0 hours 0 minutes and 18.59 seconds (18.59s).
22:05:05  
22:05:05  Completed successfully
22:05:05  
22:05:05  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

Note: CREATE TABLE (26.9m rows, 73.1 GiB processed) in 14.09s

With this change

$ poetry run dbt run -s +fct_benefits_events
22:06:50  Running with dbt=1.5.1
22:06:52  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.calitp_warehouse.mart.ad_hoc
22:06:52  Found 422 models, 963 tests, 0 snapshots, 0 analyses, 852 macros, 0 operations, 12 seed files, 174 sources, 4 exposures, 0 metrics, 0 groups
22:06:52  
22:06:56  Concurrency: 8 threads (target='dev')
22:06:56  
22:06:56  1 of 2 START sql view model kegan_staging.stg_amplitude__benefits_events ....... [RUN]
22:06:57  1 of 2 OK created sql view model kegan_staging.stg_amplitude__benefits_events .. [CREATE VIEW (0 processed) in 1.13s]
22:06:57  2 of 2 START sql table model kegan_mart_benefits.fct_benefits_events ........... [RUN]
22:07:13  2 of 2 OK created sql table model kegan_mart_benefits.fct_benefits_events ...... [CREATE TABLE (402.1k rows, 73.1 GiB processed) in 15.43s]
22:07:13  
22:07:13  Finished running 1 view model, 1 table model in 0 hours 0 minutes and 20.37 seconds (20.37s).
22:07:13  
22:07:13  Completed successfully
22:07:13  
22:07:13  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

Note: CREATE TABLE (402.1k rows, 73.1 GiB processed) in 15.43s
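
For a rough sense of scale, the row counts from the two runs above can be compared directly. This is only a back-of-the-envelope sketch: the counts are the rounded figures dbt printed (26.9m and 402.1k), so the results are approximate, and the variable names are just for illustration.

```python
# Rounded row counts as printed in the dbt logs above (approximate).
rows_before = 26_900_000  # "26.9m rows" before this change
rows_after = 402_100      # "402.1k rows" with this change

rows_removed = rows_before - rows_after
pct_removed = 100 * rows_removed / rows_before

print(f"removed ~{rows_removed / 1e6:.1f}m rows ({pct_removed:.1f}% of the table)")
```

That works out to about 26.5 million rows, or roughly 98.5% of the table. Both runs report 73.1 GiB processed, so the difference shows up in the row count of the resulting table rather than in the bytes scanned to build it.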

Post-merge follow-ups

Document any actions that must be taken post-merge to deploy or otherwise implement the changes in this PR (for example, running a full refresh of some incremental model in dbt). If these actions will take more than a few hours after the merge or if they will be completed by someone other than the PR author, please create a dedicated follow-up issue and link it here to track resolution.

  • No action required
  • Actions required (specified below)

Between Aug 2022 and Feb 2023 we unnecessarily captured events from health probes,
resulting in a massive spike of nearly 26 million useless events.

This change filters events generated by the health probes out of the final
model table.
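
A minimal sketch of the kind of filter described above, written in plain Python rather than the dbt SQL that this PR actually changes. The `user_agent` field, the `PROBE_MARKER` string, the sample event types, and the helper names are all hypothetical; the real condition lives in the model SQL and may look quite different.

```python
from typing import Iterable, Iterator

# Hypothetical marker: assumes health-probe traffic can be recognized by a
# substring in a user agent field. The real model may use another column.
PROBE_MARKER = "HealthCheck"


def is_health_probe(event: dict) -> bool:
    """Return True if the event looks like it came from a health probe."""
    user_agent = event.get("user_agent") or ""
    return PROBE_MARKER.lower() in user_agent.lower()


def drop_health_probes(events: Iterable[dict]) -> Iterator[dict]:
    """Yield only the events that were not generated by a health probe."""
    return (e for e in events if not is_health_probe(e))


# Made-up sample events, for illustration only.
sample = [
    {"event_type": "started eligibility", "user_agent": "Mozilla/5.0"},
    {"event_type": "viewed page", "user_agent": "ExampleHealthCheck/1.0"},
    {"event_type": "viewed page", "user_agent": None},
]
kept = list(drop_health_probes(sample))
```

Events with a missing user agent are kept here, on the assumption that it is safer to keep an unknown event than to drop a real one.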
@thekaveman marked this pull request as ready for review November 13, 2024 22:15
@thekaveman self-assigned this Nov 13, 2024

Warehouse report 📦

DAG

Legend (in order of precedence)

| Resource type | Indicator | Resolution |
| --- | --- | --- |
| Large table-materialized model | Orange | Make the model incremental |
| Large model without partitioning or clustering | Orange | Add partitioning and/or clustering |
| View with more than one child | Yellow | Materialize as a table or incremental |
| Incremental | Light green | |
| Table | Green | |
| View | White | |

@angela-tran (Member)

@thekaveman Just curious, is it expected that the storage size of fct_benefits_events is still 7.8 GB? (at least according to #3547 (comment)...)

@angela-tran (Member) left a comment

The code change makes sense to me

@thekaveman (Member, Author)

> @thekaveman Just curious, is it expected that the storage size of fct_benefits_events is still 7.8 GB? (at least according to #3547 (comment)...)

Yeah I have no idea what that means / represents. I guess I thought it would go down too... but 🤷

@thekaveman merged commit 6e506c8 into main Nov 13, 2024
4 checks passed
@thekaveman deleted the fix/benefits-bot-events branch November 13, 2024 23:26
@thekaveman linked an issue Nov 13, 2024 that may be closed by this pull request
17 tasks
Development

Successfully merging this pull request may close these issues.

Further analytics updates for Metabase pipeline
2 participants