Support BigQuery Job Tags and Labels #2483
This is interesting! Thanks for the detailed writeup @boxysean. More so than labels on tables and columns, this idea feels like a close relative of query comments, which we introduced in v0.15. Parsing out and aggregating query comments is the best way to calculate usage on Postgres/Redshift.

Clarifying questions

You mention that you'd want to have control of the label when invoking dbt from, say, Airflow or a cron script. Is this something you expect as a CLI arg? Or could it be version-controlled code that varies depending on situational variables?

My proposal

Here's what makes sense to me:
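To make the launch-time option concrete, here's a minimal sketch of parsing a comma-separated `key=value` string into a labels dict, as a hypothetical CLI arg or environment variable might supply it. The argument format is invented for illustration, not an existing dbt flag:

```python
def parse_job_labels(arg: str) -> dict:
    """Parse a comma-separated key=value string into a labels dict."""
    labels = {}
    for pair in arg.split(","):
        if not pair.strip():
            continue  # ignore empty segments
        key, _, value = pair.partition("=")
        labels[key.strip()] = value.strip()
    return labels
```

An orchestrator could then pass something like `team=data-eng,airflow_task=daily_models` at invocation time without touching version-controlled project code.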
The proposal is reasonable and I experimented with it. However, I found that GCP enforces a maximum length of 128 bytes for a label value, and the default query comment exceeded it. I'm not sure it makes sense to reuse the same query-comment functionality. Alternative:
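As a sketch of working around that limit, a helper could clamp each label value to a byte budget before attaching it to the job. The 128-byte figure is taken from the comment above; GCP also restricts which characters labels may contain, which this sketch does not enforce:

```python
def clamp_label_value(value: str, max_bytes: int = 128) -> str:
    """Truncate a label value so its UTF-8 encoding fits within max_bytes."""
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value
    # Cut at the byte budget, dropping any trailing partial multi-byte character.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```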
The code changes in dbt are tiny -- I'm putting together a pull request for it.
@mescanne Thanks for checking out this issue, and for opening up the PR. I'll take a look there in a moment. I had envisioned splitting out the arguments currently passed in the query comment so that each becomes its own label. I quite like your suggestion, too.
Hi @jtcohen6, coming back to this. I think it's helpful to add as much information as possible to the query jobs, as in your proposal. I would still prefer to specify the BigQuery job label at dbt run-time, as you suggested.
I would also like to upvote this feature request. In our use case, we would like to pass other environment variables to the BQ job labels for ease of monitoring and benchmarking. For example, we would like to pass the environment (prod, stage), the pipeline release version, and our orchestration workflow run ID (Prefect run ID) as BQ job labels. Currently we set this info in environment variables, and it's quite easy to add them to the query-comment dict. I think @jtcohen6's proposal would make this work for us if implemented.
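A rough sketch of that setup, mapping environment variables to job labels while skipping any that are unset. The variable names (`DBT_ENV`, `RELEASE_VERSION`, `PREFECT_RUN_ID`) are placeholders for whatever your orchestrator exports, not real dbt settings:

```python
import os

# Placeholder env var names: substitute whatever your orchestrator exports.
LABEL_ENV_VARS = {
    "environment": "DBT_ENV",
    "release_version": "RELEASE_VERSION",
    "prefect_run_id": "PREFECT_RUN_ID",
}

def labels_from_env(mapping=LABEL_ENV_VARS) -> dict:
    """Build a job-labels dict from environment variables, skipping unset ones."""
    return {
        label: os.environ[var]
        for label, var in mapping.items()
        if var in os.environ
    }
```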
We are in exactly the same situation as described by @boxysean and @hui-zheng: we'd like to stamp our dbt BigQuery jobs with labels/tags so that we can get better visibility into how we spend our BQ budget/quota. This looks to be supported and achievable through BigQuery's job labeling feature, and it is also documented in the JobConfiguration API doc.
@hui-zheng, could you describe how you do it?
My team was looking for exactly this feature. Are you planning to work on it, or would you review a PR that adds it?
This isn't something we're prioritizing right now. FYI, #2809 did add related functionality. I agree that dbt should be able to pass more information than a single value, though.

I do think the best version of this would make the full query comment context available as per-node job labels. The string version of this comment (the default value, the string passed to the config, or the value returned by the custom macro) is available to the connection manager. So here's what I'm thinking about:
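A minimal sketch of the mapping step described here, assuming the rendered query comment is either a JSON object (as dbt's default comment is) or an arbitrary string returned by a custom macro:

```python
import json

def comment_to_labels(comment: str) -> dict:
    """Map a rendered query comment to BigQuery job labels.

    If the comment is a JSON object, each key/value pair becomes a label;
    otherwise the whole string becomes a single catch-all label.
    """
    try:
        parsed = json.loads(comment)
    except ValueError:
        parsed = None
    if isinstance(parsed, dict):
        return {str(k): str(v) for k, v in parsed.items()}
    return {"query_comment": comment}
```

Note that real label keys and values would still need to be sanitized and length-limited to satisfy GCP's constraints, which this sketch skips.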
Having written all that out, and acknowledging that there are a few tricky pieces, I do think the requisite changes would be relatively self-contained in the codebase. Would anyone be interested in giving it a go?
I'm not very familiar with the dbt internals, so it would probably take me some time to figure out, but I'd be happy to give this a try if nobody picks it up first.
Happy to help along the way @jmcarp! I think the methods I linked above would be the right places to start. In particular, set:

    query_comment = self._add_query_comment('')

Then, try to parse the resulting comment string into labels. There would be some additional work to make this an optional config, and to turn it on/off based on the user's configuration.
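A sketch of that parsing step, assuming the comment comes back wrapped in the `/* ... */` delimiters that dbt uses when prepending comments to SQL (the helper name is mine, not an existing dbt method):

```python
import re

def extract_comment(rendered: str) -> str:
    """Recover the comment text from the '/* ... */' wrapper around an empty query."""
    match = re.match(r"\s*/\*\s*(.*?)\s*\*/", rendered, re.DOTALL)
    return match.group(1) if match else rendered.strip()
```

The extracted string could then be fed into the JSON-or-fallback label parsing discussed earlier in the thread.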
Thanks for the detailed explanations @jtcohen6, I can give it a try too 🖐️
Describe the feature
I would like to be able to control tagging and labeling of BigQuery Jobs as I run dbt on BigQuery.
A similar (but not the same) issue is #1947, for labeling BigQuery Tables and Datasets. This issue focuses on BigQuery Jobs (such as Insert Jobs or Query Jobs).
Describe alternatives you've considered
It's not possible to label or tag jobs after they have started, per the docs.
Additional context
The main reason why one would tag and label their BigQuery Job is to analyze BigQuery spend. For example, if one were able to link a BigQuery Job to a certain Airflow operator run (or similar -- in my case, a Python script run by cron! :-D), then a real dollar value could be put on running that operator over time.
I think it's important to give the developer control over what tags and labels can be added, so that the feature supports their data processing setup. And so I think tags and labels should be settable at launch-time. (In my case, I run a Python script that calls `dbt run` -- I would want my Python script to be able to set the BigQuery Job tags and labels, while the Jobs are ultimately launched by `dbt run`.)

Who will this benefit?
Folks who are responsible for their BigQuery spend should benefit by using relevant Job tags and labels.
Thanks!