Bug: dbt run_results from cadet are still inaccurate with regard to indicating when data were last updated #386
Comments
What's in the S3 bucket
Example run results for 2024-12-11: run_time=2024-12-11T01:02:54
Results look like this, with timings for compile and execute:
{
"status": "pass",
"timing": [
{
"name": "compile",
"started_at": "2024-12-11T00:47:25.772206Z",
"completed_at": "2024-12-11T00:47:30.413884Z"
},
{
"name": "execute",
"started_at": "2024-12-11T00:47:30.546430Z",
"completed_at": "2024-12-11T00:49:02.331640Z"
}
],
"thread_id": "Thread-60",
"execution_time": 96.93309283256531,
"adapter_response": {
"_message": "OK -1",
"code": "OK",
"rows_affected": -1,
"data_scanned_in_bytes": 138933058
},
"message": null,
"failures": 0,
"unique_id": "test.mojap_derived_tables.unique_avature_stg__stg_job_workflow_actions_job_workflow_action_id.a14e7af046",
"compiled": true,
},
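A minimal sketch of how the `execute` timing in a result like the one above could be read to get a "last executed" time per node. The function names are hypothetical and this is not the ingestion code itself; it only assumes the file has a top-level `results` list shaped like the example.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    # The timestamps above end in "Z" (UTC). Older Python versions'
    # fromisoformat() rejects "Z", so swap it for an explicit offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def execute_completed_at(result: dict):
    """Return when the "execute" step finished for one result, or None."""
    for step in result.get("timing", []):
        if step.get("name") == "execute" and step.get("completed_at"):
            return parse_ts(step["completed_at"])
    return None


def last_executed(run_results: dict) -> dict:
    """Map unique_id -> latest execute completion time in one file."""
    latest = {}
    for result in run_results.get("results", []):
        finished = execute_completed_at(result)
        if finished is None:
            continue  # no execute timing recorded for this node
        uid = result["unique_id"]
        if uid not in latest or finished > latest[uid]:
            latest[uid] = finished
    return latest


# Sample data copied from the result shown in this thread.
UID = "test.mojap_derived_tables.unique_avature_stg__stg_job_workflow_actions_job_workflow_action_id.a14e7af046"
sample = {
    "results": [
        {
            "status": "pass",
            "timing": [
                {
                    "name": "compile",
                    "started_at": "2024-12-11T00:47:25.772206Z",
                    "completed_at": "2024-12-11T00:47:30.413884Z",
                },
                {
                    "name": "execute",
                    "started_at": "2024-12-11T00:47:30.546430Z",
                    "completed_at": "2024-12-11T00:49:02.331640Z",
                },
            ],
            "unique_id": UID,
        }
    ]
}

print(last_executed(sample)[UID].isoformat())
```

A file whose `results` list is empty simply yields an empty mapping, so it contributes no timestamps.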
|
Run results files can be produced by any of these dbt commands: build, compile, docs generate, run, seed, snapshot, test, run-operation. See https://docs.getdbt.com/reference/artifacts/run-results-json |
deploy docs: prod/run_artefacts/run_time=2024-12-03T07:10:08
It does look like this is overwriting the compile/execute dates for everything. Could CaDeT upload these run results with a different prefix so we can filter them out? There doesn't seem to be an obvious way to distinguish them otherwise. |
Deploy dbt project (experimental): run_time=2024-12-12T05:18:19
This one has an empty results list.
|
List of workflows that write to prod/run_artefacts:
|
Suggested approach:
|
CaDeT is now storing the run artefacts with the prefix, but we have old artefacts remaining from runs between the 8th and 14th of Jan. I don't think this matters though because our script only looks at artefacts added within the last day.
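A rough sketch of the kind of filter described here: keep only artefacts under a chosen prefix that were added within the last day. The prefix value and function name are hypothetical (the actual prefix is not stated in this thread), and the input is assumed to be `(key, last_modified)` pairs such as those from an S3 listing.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical prefix for illustration only; not the real CaDeT prefix.
RUN_PREFIX = "prod/run_artefacts/example_prefix/"


def recent_run_artefacts(objects, prefix=RUN_PREFIX, now=None,
                         max_age=timedelta(days=1)):
    """Return keys under `prefix` modified within `max_age` of `now`.

    `objects` is an iterable of (key, last_modified) pairs, where
    last_modified is a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - max_age
    return [
        key
        for key, modified in objects
        if key.startswith(prefix) and modified >= cutoff
    ]
```

With a filter like this, older artefacts left behind under the same location are ignored by age, and artefacts written outside the prefix are ignored by key.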
|
There is one more issue: until the 14th we were falsely recording runs in DataHub, and these haven't been removed automatically. So we probably want to wipe run information recorded before the 14th of January.
The actual data recorded in DataHub is shown in the Runs tab, e.g. https://datahub-catalogue-prod.apps.live.cloud-platform.service.justice.gov.uk/dataset/urn:li:dataset:(urn:li:dataPlatform:dbt,cadet.awsdatacatalog.xhibit.case_on_list,PROD)/Runs?is_lineage_mode=false
I initially thought there was a separate issue with the runs not being ordered properly, but this is actually working as expected. The code that does this is in the EntityRunsResolver. |
We recently added a step to our cadet ingestion workflow that populates the run_result_paths list in the ingestion recipe with a more comprehensive set of run results files from S3, giving fuller coverage of the run results produced; see #373.
However, it now appears that models which have not run for a long time are being falsely recorded as having run very recently.
A quick investigation showed the cause to be the run results file produced by the deploy dbt docs workflow: https://github.com/moj-analytical-services/create-a-derived-table/actions/workflows/deploy-docs.yml
It would be worth documenting a fuller review of the run results files to establish if this is the only cause of the error.
Then, once the cause is properly established, we'll need to think of an approach to work around the issue so we can use the last execution date from the run results file in find moj data.