Add data processed info into dbt run logs for all statement types #2526
Comments
Very cool idea! The code for printing out bytes billed is fairly straightforward: I think it would just be a matter of reporting the …

@alepuccetti Is this something you'd be interested in working on? I think this could be a great opportunity for a first contribution :)
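Something along these lines, assuming the google-cloud-bigquery Python client (the statement and table names here are only illustrative, and this isn't the exact code path dbt uses):

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default project/credentials

# Run an arbitrary statement; query() returns a QueryJob.
job = client.query("CREATE TABLE my_dataset.my_table AS SELECT 1 AS id")
job.result()  # block until the job finishes

# Once the job is done, its statistics include the scan/billing info
# that could be surfaced in the run logs.
print(f"bytes processed: {job.total_bytes_processed}")
print(f"bytes billed:    {job.total_bytes_billed}")
```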
@jtcohen6 Yes, I would be interested. You can assign it to me if that is the procedure.
It's all yours! Happy to help if you want more direction or get stuck.
Hi guys,

While bytes processed is an interesting KPI for BigQuery customers on on-demand pricing, the average slot usage of a query is interesting for customers on a slot commitment. It is similarly easy to calculate from the query job object you get when launching a query: you have to calculate the duration of the query in milliseconds using …

Could we add this info in the same PR, or should we create a new issue for this?
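Roughly like this, again assuming the google-cloud-bigquery Python client (the query is only illustrative, and the exact reporting format is up for discussion):

```python
from google.cloud import bigquery

client = bigquery.Client()
job = client.query("SELECT * FROM my_dataset.my_table")  # illustrative query
job.result()

# QueryJob exposes total slot-milliseconds plus start/end timestamps,
# so average slot usage ~= slot_millis / query duration in milliseconds.
duration_ms = (job.ended - job.started).total_seconds() * 1000
if duration_ms:
    avg_slots = job.slot_millis / duration_ms
    print(f"average slots used: {avg_slots:.1f}")
```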
@bodschut Thanks for the suggestion! We're packing a lot of info into limited horizontal space right now. Your point has crystallized for me that we should give the user more control. Here's what makes sense to me: a set of additional configuration parameters in config:

  bytes_processed: true | false
  slot_duration: true | false

@drewbanin @alepuccetti What do you think here? I know this throws one more wrench into #2530, though I feel a lot better knowing that we'd be: …
@jtcohen6 Having the user configure it is a good idea, I second that. I am up for implementing the configuration parameters; it might take a while, since I am doing this in my spare time.

About the average slot usage: I had a discussion with Google engineers on how to measure slot utilisation, to estimate whether it is worth moving to flat-rate pricing and how to evaluate the performance implications. The answer was that there is no formula; you just try your workload and see how it performs. I am in favour of having a slots metric, but I think what it should be and how to interpret it needs more discussion in a separate issue. For instance, I think that even just …
Hi @alepuccetti, you are right to say that slot usage is not the only parameter to consider for a move to a slot commitment. But it is an interesting indicator when you are on a slot commitment and need to optimize your slot reservations for certain projects. At that point, knowing which queries consume how many slots on average can help you optimize them, decide on changing a certain slot reservation, or even upgrade your commitment with another package of 500 slots.

You could indeed include the slot millis like that, without calculating the query duration, but the average slot usage as calculated above is the best indicator for the 'slot load' of the query. I don't know if the model duration that dbt logs by default is the actual query duration or whether it includes the setup time. If it is the actual query duration, you could indeed calculate the average slots used yourself.
Huh, I think that this is a string templating issue, and I think Jinja could be a good tool to solve that kind of problem :) I think we should keep the focus narrow for this particular PR, but in the future, I would like to: …

What do y'all think?
@drewbanin This is a great idea. I am not sure I will be able to implement something like that with my limited knowledge of dbt internals.
@bodschut I like the idea of having a slot usage metric. I just think it needs more discussion, because it is not as straightforward as number of rows or bytes processed.
Agreed, let's move it to another discussion. But with the suggestion of @drewbanin it could all be included: if you have a job object in your variable scope, you can output the slot usage using macros, for example.
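Purely as a hypothetical sketch of that idea (none of these names are actual dbt internals), the adapter could gather the finished job's statistics into a plain dict and hand that to whatever macro or template renders the status line:

```python
from google.cloud import bigquery

def collect_job_stats(job: bigquery.QueryJob) -> dict:
    """Gather the stats a macro/template might want to render (hypothetical helper)."""
    duration_ms = (job.ended - job.started).total_seconds() * 1000
    return {
        "bytes_processed": job.total_bytes_processed,
        "bytes_billed": job.total_bytes_billed,
        "slot_millis": job.slot_millis,
        "avg_slots": job.slot_millis / duration_ms if duration_ms else None,
    }
```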
Hi there, … Thanks again.
@meurant-naude I opened a new issue (#2747) to add the bytes_processed to other dbt commands; it might be better to take this question there.
Agreed, I will answer there.
Thanks all. Much appreciated.
Describe the feature
When using BigQuery, dbt run logs should show the data scan value for create table statements, not just for scripts.

Describe alternatives you've considered
No alternative
Additional context
This specific improvement is about BigQuery. However, this might apply to other databases too; I do not have the experience to give specific examples.
Who will this benefit?
Everyone using dbt, by making usage information easier to access.