
Bug: dbt run_results from cadet are still inaccurate with regard to indicating when data were last updated #386

Closed
LavMatt opened this issue Nov 29, 2024 · 8 comments
Labels: bug Something isn't working


LavMatt commented Nov 29, 2024

We recently added a step to our CaDeT ingestion workflow to populate the run_result_paths list in the ingestion recipe with a more comprehensive set of run results files fetched from S3, giving fuller coverage of the run results produced; see #373.

However, it now appears that models which have not run for a long time are being falsely reported as having run very recently.

A quick investigation traced this to the run results file produced by the deploy dbt docs workflow: https://github.com/moj-analytical-services/create-a-derived-table/actions/workflows/deploy-docs.yml

It would be worth documenting a fuller review of the run results files to establish whether this is the only cause of the error.

Once the cause is properly established, we'll need an approach to work around the issue so that we can use the last execution date from the run results file in Find MoJ data.
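For context, the kind of extraction being discussed can be sketched as follows. This is hypothetical illustration code, not the actual ingestion step; it assumes the convention of taking the latest `execute` `completed_at` across a file's results as the last run time, and the sample payload is made up.

```python
import json
from datetime import datetime

# Illustrative sample payload (NOT real data from the bucket).
sample = """
{
  "results": [
    {"unique_id": "model.example.a",
     "timing": [{"name": "execute", "completed_at": "2024-11-28T01:00:00Z"}]},
    {"unique_id": "model.example.b",
     "timing": [{"name": "execute", "completed_at": "2024-11-28T02:30:00Z"}]}
  ]
}
"""

def last_execution(run_results: dict) -> datetime:
    """Latest 'execute' completed_at across all results in one file."""
    stamps = [
        datetime.fromisoformat(t["completed_at"].replace("Z", "+00:00"))
        for r in run_results["results"]
        for t in r.get("timing", [])
        if t["name"] == "execute"
    ]
    return max(stamps)

print(last_execution(json.loads(sample)).isoformat())
```

With this convention, any run results file that lists a model contributes a candidate "last run" date for it, which is why a docs run covering every model can poison the dates.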

@github-project-automation github-project-automation bot moved this to Todo 📝 in Data Catalogue Nov 29, 2024
@LavMatt LavMatt changed the title dbt run_results from cadet are still inaccurate with regard to indicating when data were last updated Bug: dbt run_results from cadet are still inaccurate with regard to indicating when data were last updated Nov 29, 2024
@LavMatt LavMatt added the bug Something isn't working label Nov 29, 2024
@MatMoore MatMoore self-assigned this Dec 12, 2024

MatMoore commented Dec 12, 2024

What's in the S3 bucket

  • there are 37 artefacts with prefix prod/run_artefacts/run_time=2024-11-28
  • there are 30 artefacts with prefix prod/run_artefacts/run_time=2024-12-11

Example run results for 2024-12-11

run_time=2024-12-11T01:02:54

Results look like this, with timings for compile and execute:

        {
            "status": "pass",
            "timing": [
                {
                    "name": "compile",
                    "started_at": "2024-12-11T00:47:25.772206Z",
                    "completed_at": "2024-12-11T00:47:30.413884Z"
                },
                {
                    "name": "execute",
                    "started_at": "2024-12-11T00:47:30.546430Z",
                    "completed_at": "2024-12-11T00:49:02.331640Z"
                }
            ],
            "thread_id": "Thread-60",
            "execution_time": 96.93309283256531,
            "adapter_response": {
                "_message": "OK -1",
                "code": "OK",
                "rows_affected": -1,
                "data_scanned_in_bytes": 138933058
            },
            "message": null,
            "failures": 0,
            "unique_id": "test.mojap_derived_tables.unique_avature_stg__stg_job_workflow_actions_job_workflow_action_id.a14e7af046",
            "compiled": true,
        },
> jq '.results[].unique_id' run_results\ \(1\).json |wc -l
     285
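When several artefact files cover overlapping models, the per-model date has to be merged across files. A minimal sketch of that merge, assuming each file is a dbt run_results payload (the two inline payloads here are invented for illustration):

```python
import json

# Two hypothetical run_results payloads mentioning the same model.
files = [
    '{"results": [{"unique_id": "model.a", "timing": '
    '[{"name": "execute", "completed_at": "2024-12-11T00:49:02Z"}]}]}',
    '{"results": [{"unique_id": "model.a", "timing": '
    '[{"name": "execute", "completed_at": "2024-12-03T07:09:00Z"}]}]}',
]

last_run: dict[str, str] = {}
for raw in files:
    for result in json.loads(raw)["results"]:
        for timing in result.get("timing", []):
            if timing["name"] == "execute":
                uid = result["unique_id"]
                # ISO-8601 with a fixed Z suffix compares correctly as a string
                last_run[uid] = max(last_run.get(uid, ""), timing["completed_at"])

print(last_run)
```

The max-per-model merge means a single stale-but-recent file (like the docs run) wins over every genuinely older run.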

@MatMoore

Run results files can be produced by: build, compile, docs generate, run, seed, snapshot, test, run-operation.

https://docs.getdbt.com/reference/artifacts/run-results-json

@MatMoore

deploy docs: prod/run_artefacts/run_time=2024-12-03T07:10:08

> jq '.results[].unique_id' run_results\ \(2\).json |wc -l
   18454

It does look like this is overwriting the compile/execute dates for everything. Could CaDeT upload these run results with a different prefix so we can filter them out? There doesn't seem to be an obvious way to distinguish dbt build run results files from dbt docs run results files within the bucket.
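One possible in-file heuristic, separate from any bucket-key convention: recent dbt versions include a top-level args object in run_results.json whose which field records the invoked subcommand. This is a sketch under that assumption (the payloads are invented, and a real artefact should be inspected before relying on the field being present):

```python
import json

# Invented payloads showing only the fields this check depends on.
docs_payload = '{"args": {"which": "generate"}, "results": []}'
build_payload = '{"args": {"which": "build"}, "results": []}'

def is_docs_run(raw: str) -> bool:
    """True if the file was produced by `dbt docs generate` (assumes args.which exists)."""
    return json.loads(raw).get("args", {}).get("which") == "generate"

print(is_docs_run(docs_payload), is_docs_run(build_payload))
```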

@MatMoore

Deploy dbt project (experimental) run_time=2024-12-12T05:18:19

This one has an empty results list

{
    "metadata": {
        "dbt_schema_version": "https://schemas.getdbt.com/dbt/run-results/v6.json",
        "dbt_version": "1.8.3",
        "generated_at": "2024-12-12T05:17:58.401986Z",
        "invocation_id": "947c2b31-729d-4d59-a8b8-d8b340f02c93",
        "env": {}
    },
    "results": [],
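Files like this carry no timing information at all, so a simple guard can skip them before any timestamp extraction. A sketch (illustrative code, not the actual ingestion script):

```python
import json

# Payload mirroring the empty-results file above, trimmed to the relevant fields.
payload = json.loads('{"metadata": {"dbt_version": "1.8.3"}, "results": []}')

def has_timings(run_results: dict) -> bool:
    """True only if the file has at least one result to take timings from."""
    return bool(run_results.get("results"))

print(has_timings(payload))
```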

@MatMoore

List of workflows that write to prod/run_artefacts:

  • deploy-avature-hourly.yml
  • deploy-basm-daily.yml
  • deploy-bold-daily-prod.yml
  • deploy-case-info-dashboard_daily.yml
  • deploy-caseman.yml
  • deploy-curated-daily.yml
  • deploy-curated-prod.yml
  • deploy-daily.yml
  • deploy-docs.yml
  • deploy-em-silver-prod.yml
  • deploy-experimental.yml
  • deploy-monthly.yml
  • deploy-nomis-daily.yml
  • deploy-opg.yml
  • deploy-prison-population.yaml
  • deploy-prod-digital-prisons-reporting.yml
  • deploy-seeds.yml
  • deploy-weekly.yml
  • deploy-xhibit.yml

@MatMoore MatMoore moved this from Todo 📝 to In Progress 🚀 in Data Catalogue Dec 17, 2024
@MatMoore

Suggested approach:

  1. In CaDeT, add a prefix to run result uploads based on job name (they are already divided by environment)
  2. In our ingestion script, exclude run results from deploy-docs
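Step 2 could then be a straightforward key filter. A sketch, assuming (hypothetically) that keys end up shaped like prod/run_artefacts/&lt;job_name&gt;/run_time=... once step 1 lands; the exact layout would need to match what CaDeT actually uploads:

```python
# Example keys under the assumed post-change layout (invented for illustration).
keys = [
    "prod/run_artefacts/deploy-daily/run_time=2024-12-17T01:00:00",
    "prod/run_artefacts/deploy-docs/run_time=2024-12-17T07:10:08",
    "prod/run_artefacts/deploy-nomis-daily/run_time=2024-12-17T02:00:00",
]

EXCLUDED_JOBS = {"deploy-docs"}

run_result_paths = [
    key for key in keys
    # third path segment is the job name under the assumed layout
    if key.split("/")[2] not in EXCLUDED_JOBS
]

print(run_result_paths)
```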

@MatMoore MatMoore moved this from In Progress 🚀 to Review 🛂 in Data Catalogue Dec 17, 2024
@MatMoore MatMoore moved this from Review 🛂 to Blocked 🚫 in Data Catalogue Dec 18, 2024
@YvanMOJdigital YvanMOJdigital moved this from Blocked 🚫 to Review 🛂 in Data Catalogue Jan 23, 2025

MatMoore commented Jan 23, 2025

CaDeT is now storing the run artefacts with the new prefix, but we still have old artefacts remaining from runs between 8 and 14 January.

I don't think this matters though because our script only looks at artefacts added within the last day.

However, despite filtering the run results, I initially saw last_datajob_run_date being set to 09 Jan 2025, which didn't align with the create-a-derived-table repo. (Update: this turned out to be me checking the wrong environment.)

@MatMoore MatMoore moved this from Review 🛂 to In Progress 🚀 in Data Catalogue Jan 23, 2025

MatMoore commented Jan 23, 2025

There is one more issue: until the 14th we were falsely recording runs in DataHub, and these haven't been removed automatically. So we probably want to wipe run information recorded before 14 January.

The actual data recorded in DataHub is shown in the runs tab, e.g. https://datahub-catalogue-prod.apps.live.cloud-platform.service.justice.gov.uk/dataset/urn:li:dataset:(urn:li:dataPlatform:dbt,cadet.awsdatacatalog.xhibit.case_on_list,PROD)/Runs?is_lineage_mode=false

I initially thought there was a separate issue with the runs not being ordered properly, but this is actually working as expected. The code that does this is in the EntityRunsResolver.

@MatMoore MatMoore moved this from In Progress 🚀 to Review 🛂 in Data Catalogue Jan 24, 2025
@MatMoore MatMoore moved this from Review 🛂 to Done ✅ in Data Catalogue Jan 28, 2025
@MatMoore MatMoore closed this as completed by moving to Done ✅ in Data Catalogue Jan 28, 2025