
Bug: dbt run_results from cadet are still inaccurate with regard to indicating when data were last updated #386

Closed
LavMatt opened this issue Nov 29, 2024 · 8 comments
Labels: bug Something isn't working


LavMatt commented Nov 29, 2024

We recently added a step to our CaDeT ingestion workflow to populate the run_result_paths list in the ingestion recipe with a more comprehensive set of run results files fetched from S3, giving fuller coverage of the run results produced; see #373.

However, it now appears that models which have not run for a long time are being falsely reported as having run very recently.

A quick investigation traced this to the run results file produced by the deploy dbt docs workflow: https://github.com/moj-analytical-services/create-a-derived-table/actions/workflows/deploy-docs.yml

It would be worth documenting a fuller review of the run results files to establish whether this is the only cause of the error.

Once the cause is properly established, we'll need an approach to work around the issue so that we can use the last execution date from the run results file in Find MoJ data.
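For context, the kind of extraction being discussed can be sketched as follows. This is hypothetical illustration code, not the actual ingestion step; it assumes the convention of taking the latest `execute` `completed_at` across a file's results as the last run time, and the sample payload is made up.

```python
import json
from datetime import datetime

# Illustrative sample payload (NOT real data from the bucket).
sample = """
{
  "results": [
    {"unique_id": "model.example.a",
     "timing": [{"name": "execute", "completed_at": "2024-11-28T01:00:00Z"}]},
    {"unique_id": "model.example.b",
     "timing": [{"name": "execute", "completed_at": "2024-11-28T02:30:00Z"}]}
  ]
}
"""

def last_execution(run_results: dict) -> datetime:
    """Latest 'execute' completed_at across all results in one file."""
    stamps = [
        datetime.fromisoformat(t["completed_at"].replace("Z", "+00:00"))
        for r in run_results["results"]
        for t in r.get("timing", [])
        if t["name"] == "execute"
    ]
    return max(stamps)

print(last_execution(json.loads(sample)).isoformat())
```

With this convention, any run results file that lists a model contributes a candidate "last run" date for it, which is why a docs run covering every model can poison the dates.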

@github-project-automation github-project-automation bot moved this to Todo 📝 in Data Catalogue Nov 29, 2024
@LavMatt LavMatt changed the title dbt run_results from cadet are still inaccurate with regard to indicating when data were last updated Bug: dbt run_results from cadet are still inaccurate with regard to indicating when data were last updated Nov 29, 2024
@LavMatt LavMatt added the bug Something isn't working label Nov 29, 2024
@MatMoore MatMoore self-assigned this Dec 12, 2024

MatMoore commented Dec 12, 2024

What's in the S3 bucket

  • there are 37 artefacts with prefix prod/run_artefacts/run_time=2024-11-28
  • there are 30 artefacts with prefix prod/run_artefacts/run_time=2024-12-11

Example run results for 2024-12-11

run_time=2024-12-11T01:02:54

Results look like this, with timings for compile and execute:

        {
            "status": "pass",
            "timing": [
                {
                    "name": "compile",
                    "started_at": "2024-12-11T00:47:25.772206Z",
                    "completed_at": "2024-12-11T00:47:30.413884Z"
                },
                {
                    "name": "execute",
                    "started_at": "2024-12-11T00:47:30.546430Z",
                    "completed_at": "2024-12-11T00:49:02.331640Z"
                }
            ],
            "thread_id": "Thread-60",
            "execution_time": 96.93309283256531,
            "adapter_response": {
                "_message": "OK -1",
                "code": "OK",
                "rows_affected": -1,
                "data_scanned_in_bytes": 138933058
            },
            "message": null,
            "failures": 0,
            "unique_id": "test.mojap_derived_tables.unique_avature_stg__stg_job_workflow_actions_job_workflow_action_id.a14e7af046",
            "compiled": true,
        },
> jq '.results[].unique_id' run_results\ \(1\).json |wc -l
     285
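When several artefact files cover overlapping models, the per-model date has to be merged across files. A minimal sketch of that merge, assuming each file is a dbt run_results payload (the two inline payloads here are invented for illustration):

```python
import json

# Two hypothetical run_results payloads mentioning the same model.
files = [
    '{"results": [{"unique_id": "model.a", "timing": '
    '[{"name": "execute", "completed_at": "2024-12-11T00:49:02Z"}]}]}',
    '{"results": [{"unique_id": "model.a", "timing": '
    '[{"name": "execute", "completed_at": "2024-12-03T07:09:00Z"}]}]}',
]

last_run: dict[str, str] = {}
for raw in files:
    for result in json.loads(raw)["results"]:
        for timing in result.get("timing", []):
            if timing["name"] == "execute":
                uid = result["unique_id"]
                # ISO-8601 with a fixed Z suffix compares correctly as a string
                last_run[uid] = max(last_run.get(uid, ""), timing["completed_at"])

print(last_run)
```

The max-per-model merge means a single stale-but-recent file (like the docs run) wins over every genuinely older run.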

@MatMoore

Run results files can be produced by: build, compile, docs generate, run, seed, snapshot, test, run-operation.

https://docs.getdbt.com/reference/artifacts/run-results-json

@MatMoore

deploy docs: prod/run_artefacts/run_time=2024-12-03T07:10:08

> jq '.results[].unique_id' run_results\ \(2\).json |wc -l
   18454

It does look like this is overwriting the compile/execute dates for everything. Could CaDeT upload these run results with a different prefix so we can filter them out? There doesn't seem to be an obvious way to distinguish dbt build run results files from dbt docs run results files within the bucket.
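One possible in-file heuristic, separate from any bucket-key convention: recent dbt versions include a top-level args object in run_results.json whose which field records the invoked subcommand. This is a sketch under that assumption (the payloads are invented, and a real artefact should be inspected before relying on the field being present):

```python
import json

# Invented payloads showing only the fields this check depends on.
docs_payload = '{"args": {"which": "generate"}, "results": []}'
build_payload = '{"args": {"which": "build"}, "results": []}'

def is_docs_run(raw: str) -> bool:
    """True if the file was produced by `dbt docs generate` (assumes args.which exists)."""
    return json.loads(raw).get("args", {}).get("which") == "generate"

print(is_docs_run(docs_payload), is_docs_run(build_payload))
```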

@MatMoore

Deploy dbt project (experimental) run_time=2024-12-12T05:18:19

This one has an empty results list

{
    "metadata": {
        "dbt_schema_version": "https://schemas.getdbt.com/dbt/run-results/v6.json",
        "dbt_version": "1.8.3",
        "generated_at": "2024-12-12T05:17:58.401986Z",
        "invocation_id": "947c2b31-729d-4d59-a8b8-d8b340f02c93",
        "env": {}
    },
    "results": [],
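Files like this carry no timing information at all, so a simple guard can skip them before any timestamp extraction. A sketch (illustrative code, not the actual ingestion script):

```python
import json

# Payload mirroring the empty-results file above, trimmed to the relevant fields.
payload = json.loads('{"metadata": {"dbt_version": "1.8.3"}, "results": []}')

def has_timings(run_results: dict) -> bool:
    """True only if the file has at least one result to take timings from."""
    return bool(run_results.get("results"))

print(has_timings(payload))
```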

@MatMoore

List of workflows that write to prod/run_artefacts:

  • deploy-avature-hourly.yml
  • deploy-basm-daily.yml
  • deploy-bold-daily-prod.yml
  • deploy-case-info-dashboard_daily.yml
  • deploy-caseman.yml
  • deploy-curated-daily.yml
  • deploy-curated-prod.yml
  • deploy-daily.yml
  • deploy-docs.yml
  • deploy-em-silver-prod.yml
  • deploy-experimental.yml
  • deploy-monthly.yml
  • deploy-nomis-daily.yml
  • deploy-opg.yml
  • deploy-prison-population.yaml
  • deploy-prod-digital-prisons-reporting.yml
  • deploy-seeds.yml
  • deploy-weekly.yml
  • deploy-xhibit.yml

@MatMoore MatMoore moved this from Todo 📝 to In Progress 🚀 in Data Catalogue Dec 17, 2024
@MatMoore

Suggested approach:

  1. In CaDeT, add a prefix to run result uploads based on job name (they are already divided by environment)
  2. In our ingestion script, exclude run results from deploy-docs
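Step 2 could then be a straightforward key filter. A sketch, assuming (hypothetically) that keys end up shaped like prod/run_artefacts/&lt;job_name&gt;/run_time=... once step 1 lands; the exact layout would need to match what CaDeT actually uploads:

```python
# Example keys under the assumed post-change layout (invented for illustration).
keys = [
    "prod/run_artefacts/deploy-daily/run_time=2024-12-17T01:00:00",
    "prod/run_artefacts/deploy-docs/run_time=2024-12-17T07:10:08",
    "prod/run_artefacts/deploy-nomis-daily/run_time=2024-12-17T02:00:00",
]

EXCLUDED_JOBS = {"deploy-docs"}

run_result_paths = [
    key for key in keys
    # third path segment is the job name under the assumed layout
    if key.split("/")[2] not in EXCLUDED_JOBS
]

print(run_result_paths)
```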

@MatMoore MatMoore moved this from In Progress 🚀 to Review 🛂 in Data Catalogue Dec 17, 2024
@MatMoore MatMoore moved this from Review 🛂 to Blocked 🚫 in Data Catalogue Dec 18, 2024
@YvanMOJdigital YvanMOJdigital moved this from Blocked 🚫 to Review 🛂 in Data Catalogue Jan 23, 2025

MatMoore commented Jan 23, 2025

CaDeT is now storing the run artefacts with the new prefix, but we still have old artefacts remaining from runs between 8 and 14 January.

I don't think this matters though because our script only looks at artefacts added within the last day.

However, despite filtering the run results, I initially saw last_datajob_run_date being set to 09 Jan 2025, which didn't align with the create-a-derived-table repo. (Update: this turned out to be me checking the wrong environment.)

@MatMoore MatMoore moved this from Review 🛂 to In Progress 🚀 in Data Catalogue Jan 23, 2025

MatMoore commented Jan 23, 2025

There is one more issue: until the 14th we were falsely recording runs in DataHub, and these haven't been removed automatically. So we probably want to wipe run information recorded before 14 January.

The actual data recorded in DataHub is shown in the runs tab, e.g. https://datahub-catalogue-prod.apps.live.cloud-platform.service.justice.gov.uk/dataset/urn:li:dataset:(urn:li:dataPlatform:dbt,cadet.awsdatacatalog.xhibit.case_on_list,PROD)/Runs?is_lineage_mode=false

I initially thought there was a separate issue with the runs not being ordered properly, but this is actually working as expected. The code that does this is in the EntityRunsResolver.

@MatMoore MatMoore moved this from In Progress 🚀 to Review 🛂 in Data Catalogue Jan 24, 2025
@MatMoore MatMoore moved this from Review 🛂 to Done ✅ in Data Catalogue Jan 28, 2025
@MatMoore MatMoore closed this as completed by moving to Done ✅ in Data Catalogue Jan 28, 2025