Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support BigQuery Job Tags and Labels #2483

Closed
boxysean opened this issue May 22, 2020 · 12 comments · Fixed by #3145
Closed

Support BigQuery Job Tags and Labels #2483

boxysean opened this issue May 22, 2020 · 12 comments · Fixed by #3145
Labels
bigquery enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors!
Milestone

Comments

@boxysean
Copy link
Contributor

Describe the feature

I would like to be able to control tagging and labeling of BigQuery Jobs as I run dbt on BigQuery.

A similar (but not the same) issue is #1947, for labeling BigQuery Tables and Datasets. This issue focuses on BigQuery Jobs (such as Insert Jobs or Query Jobs).

Describe alternatives you've considered

It's not possible to label or tag jobs after they have started. From the docs

You cannot add labels to or update labels on pending, running, or completed jobs.

Additional context

The main reason why one would tag and label their BigQuery Job is to analyze BigQuery spend. For example, if one were able to link a BigQuery Job to a certain Airflow operator run (or similar -- in my case a python script run by a cron! :-D) then a real dollar value can be put on running that operator over time.

I think it's important to give the developer control on what tags and labels can be added, so it supports their data processing setup. And so I think tags and labels should be able to be set at launch-time. (In my case, I run a python script that calls dbt run -- I would want my python script to be able to set the BigQuery Job tags and labels, while the Jobs are ultimately launched by dbt run.)

Who will this benefit?

Folks who are responsible for their BigQuery spend should benefit by using relevant Job tags and labels.

Thanks!

@boxysean boxysean added enhancement New feature or request triage labels May 22, 2020
@jtcohen6
Copy link
Contributor

This is interesting! Thanks for the detailed writeup @boxysean.

More so than labels on tables and columns, this idea feels like a close relative of query comments, which we introduced in v0.15. Parsing out and aggregating query comments is the best way to calculate usage on Postgres/Redshift.

Clarifying questions

A single dbt run may execute dozens or hundreds query jobs. Is your thought that all query jobs executed by the same dbt command should get the same label(s)? Should each query job label include information about the specific node (model/seed/snapshot/test/etc) it's running?

You mention that you'd want to have control of the label when invoking dbt from, say, Airflow or a cron script. Is this something you expect as a CLI arg, e.g.

dbt run --job-label '{"name": "nightly_run"}'

Or could it be version-controlled code that, depending on situational variables (target, var, env_var) outputs a different label?

My proposal

Here's what makes sense to me:

  • We could add a query-comment attribute, job-label: true. On BigQuery, in addition to prepending the query comment as a SQL comment (/* */), dbt would add the comment's key-value components as labels to each query jobs it runs.
  • Using the existing query-comment functionality, you could write a macro that returns a different comment (or comment dict attributes) depending on the values of target/var/env_var

@mescanne
Copy link
Contributor

mescanne commented Oct 3, 2020

The proposal is reasonable and I experimented with it. However, I found there is a maximum length of a label value in GCP of 128 bytes, and the default comment exceeded it. I'm not sure it makes sense to use the same query-comment functionality.

Alternative:

  • On every DBT BigQuery job it passes in the label dbt_invocation_id matching with the invocation_id in DBT.
  • For cases where more detailed profiling is needed, a DBT step can be added to insert into a tracking table an association between the invocation_id and whatever information is needed. This can include whatever macros you need/want (eg coming from Airflow perhaps, or Argo).
  • The INFORMATION_SCHEMA job information can be used to extract performance stats and left join with the tracking table for more detailed information/analysis.

The code changes in DBT are tiny -- I'm putting together a pullrequest for it.

@jtcohen6
Copy link
Contributor

jtcohen6 commented Oct 6, 2020

@mescanne Thanks for checking out this issue, and for opening up the PR. I'll take a look there in a moment.

I had envisioned splitting out the arguments currently passed in the query comment to each be their own label, so one for each of app, dbt_version, profile_name, target_name, and node_id. That would enable us to avoid the 128 byte limit. A few of those values would require more plumbing than what's currently available to raw_execute, however, so the work required is a bit more involved.

I quite like your suggestion to query INFORMATION_SCHEMA.JOBS_BY_* and filter / join on the invocation_id label as a way to get metadata about a given run, e.g. total spend.

@boxysean
Copy link
Contributor Author

boxysean commented Oct 7, 2020

Hi @jtcohen6 coming back to this. I think it's helpful to add as much information as possible to the query jobs, so your query-comment suggestion makes sense.

I would still prefer to specify the BigQuery Job label at dbt run-time, such as you suggested with dbt run --job-label '{"name": "nightly_run"}'. That way it would be controlled by the executor (e.g., Airflow, cron) and I could customize how I slice & dice the executions, which indeed INFORMATION_SCHEMA.JOBS_BY_* could be used for!

@hui-zheng
Copy link

I would also like to upvote this feature request.

In our use case, we would like to pass other environment variables to the bq-job label for ease of monitoring and benchmarking. For example, we would like to pass the environment (prod, stage), pipeline release version, and our orchestration workflow run id (prefect run id) to BQ job label. Currently we set these info in environment variable, and it's quite easy to add them into the query-comment dict.

I think @jtcohen6 's proposal below would make it work for us if implemented.

I had envisioned splitting out the arguments currently passed in the query comment to each be their own label, so one for each of app, dbt_version, profile_name, target_name, and node_id. That would enable us to avoid the 128 byte limit. A few of those values would require more plumbing than what's currently available to raw_execute, however, so the work required is a bit more involved.

@ncolomer
Copy link

ncolomer commented Feb 22, 2021

We are in the exact same situation than those described by @boxysean and @hui-zheng: we'd like to stamp our dbt's big query jobs with labels/tags so that we can get better visibility on how we spend our BQ budget/quota.

Although per dbt run custom labels/tags would be enough in our case (eg. customer, environment, etc), having additional (generic?) context per query jobs would be great too (eg. dbt model/node the query job is running for as suggested by @jtcohen6, or model/node's own dbt tags).

This looks to be supported and achievable through BigQuery's adding job label feature. It is also documented in the JobConfiguration API doc.

@ncolomer
Copy link

Currently we set these info in environment variable, and it's quite easy to add them into the query-comment dict.

@hui-zheng, could you describe how you do it?

@jmcarp
Copy link
Contributor

jmcarp commented Feb 27, 2021

My team was looking for exactly this feature. Are you planning to work on it, or would you review a PR that adds a --job-label flag to dbt run?

@jtcohen6
Copy link
Contributor

jtcohen6 commented Mar 1, 2021

This isn't something we're prioritizing now. FYI #2809 did add invocation_id as a label to all dbt-bigquery jobs, starting in v0.19.0. That invocation_id can be used to query INFORMATION_SCHEMA.JOBS_BY_* and calculate total time/spend per invocation; it can also be used to associate BigQuery query history with dbt run artifacts (docs), namely run_results.json. Those run artifacts will contain lots of useful metadata, including (e.g.) any environment variables prefixed with DBT_ENV_CUSTOM_ENV_.

I agree that dbt should be able to pass more information than just the invocation_id, though I think that's a strong start. A --job-label flag that allows the user / orchestration tool to set one value for all nodes in an invocation should be straightforward to implement. The more I think about it, though, I find it functionally limiting but also one-off as an implementation, not well integrated with existing dbt constructs that seek to accomplish the same goal.

I do think the best version of this would make available the full query comment context as per-node job labels. That context, available to the query_comment macro, is defined in query_headers.py. I agree with @hui-zheng, it's quite easy to pass environment variables or --vars into the query-comment config or query_comment macro, so this approach would solve for both use cases we've been discussing.

The string version of this comment—the default value, the string passed to the config, or the value returned by the custom macro—is available to the connection manager, via set_query_header and _add_query_comment. The execute method already calls _add_query_comment to prepend the comment to SQL before execution:

https://github.com/fishtown-analytics/dbt/blob/344a14416d22f0cfbeb56b9904092c8a4f38b1fc/plugins/bigquery/dbt/adapters/bigquery/connections.py#L333-L336

So here's what I'm thinking about:

  • How would we enable dbt-bigquery users to turn this on? I'm thinking an additional option, nested under query_comment in dbt_project.yml, called job_label: true | false.
  • If query_comment.job_label is turned on, and the query comment config/macro returns a dict / JSON string (such as the advanced usage example in the docs), should dbt try to parse the returned value into a python dict, and pass each key-value pair as a separate label? I think yes; this should even work for the default query comment value.
  • If query_comment.job_label is turned on, and the query-comment returns an unstructured string, should dbt still try to pass the first 128 bytes (truncated if needed) as the value to a single label called query_comment? I still think yes, but I'm open to your thoughts on this point (and every point above).

Having written all that out, acknowledging that there are a few tricky pieces, I do think the requisite changes would be relatively self-contained in the codebase. Would anyone be interested in giving it a go?

@jtcohen6 jtcohen6 added the good_first_issue Straightforward + self-contained changes, good for new contributors! label Mar 1, 2021
@jmcarp
Copy link
Contributor

jmcarp commented Mar 3, 2021

I'm not very familiar with the dbt internals, so it would probably take me some time to figure out, but I'd be happy to give this a try if nobody picks it up first.

@jtcohen6
Copy link
Contributor

jtcohen6 commented Mar 3, 2021

Happy to help along the way @jmcarp! I think the methods I linked above would be the right places to start.

In particular, raw_execute already passes job labels here:

https://github.com/fishtown-analytics/dbt/blob/344a14416d22f0cfbeb56b9904092c8a4f38b1fc/plugins/bigquery/dbt/adapters/bigquery/connections.py#L302-L313

And _add_query_comment is already a method available to the Connection Manager. So this may be as simple as adding within raw_execute:

set query_comment = self._add_query_comment('')

Then, try to parse query_comment is a dict. If it's a dict, pass each key/value pair as a separate item to job_params['labels']. If it's a non-empty string, pass the entire query_comment string (truncated to 128 characters) as one item. If it's empty, do nothing.

There would be some additional work to make this an optional config, and to turn it on/off based on query_comment.job_params, but that feels eminently doable. Probably it looks like creating a special dbt-bigquery subclass of MacroQueryStringSetter and having the BigQueryConnectionManager override set_query_header, unless we can think of comparable functionality on other adapters. (Snowflake has query tags, but those are accessible in SQL, so there's no need to wedge them into the connection.)

@ncolomer
Copy link

ncolomer commented Mar 3, 2021

Thanks for the detailed explanations @jtcohen6, I can give it a try too 🖐️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bigquery enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors!
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants