Replace information_schema queries with faster alternatives on Snowflake #1877
+1 to this issue as it's a huge scalability concern for us as well. If we can modify this model metadata scan process to use the thread count in our profiles.yml file, it would be a great improvement.
Hey @pedromachados - I'm not sure that we'll want to start with parallelizing these queries - I'd be much more in favor of using the `show` commands instead. I think we discussed this on Slack, but there are some real challenges we'd need to account for in using them. For one, the `show` commands return at most 10,000 rows. For two, they behave differently when database or schema names require quoting. I'm super keen to make this change - going to queue it up for a future release.
@drewbanin Thanks for looking into this. What if you run `show objects` instead? If you give me some pointers on where to go in the code, I could take a stab at this and create a PR over the next couple of weeks.
Thanks @pedromachados! I didn't know about `show objects`! The thing I like about `show objects` is that a single command covers both tables and views.
Hi @drewbanin, here is what Snowflake told me: the `show` commands are served by the metadata layer and return quickly, but they are capped at 10,000 returned rows.

Based on this, when inspecting schemas, how do you feel about running one of the fast commands first, and if you get 10k rows back, falling back to a query against the `information_schema`?
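For concreteness, a minimal sketch of that check in Snowflake SQL, assuming a hypothetical `analytics.core` schema:

```sql
-- Fast metadata command (served by the metadata layer, no warehouse scan);
-- the analytics.core schema name is hypothetical.
show terse objects in schema analytics.core limit 10000;

-- If the listing came back with exactly 10,000 rows it may be truncated,
-- so fall back to the slower information_schema query in that case.
select count(*) from table(result_scan(last_query_id()));
```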
@drewbanin is it possible to use an equality predicate instead of `ilike`?
@nehiljain I did some timing work on this front a while back and I actually didn't notice any appreciable timing difference between `ilike` and `=`.
When you execute `dbt run`, doesn't it run the same information_schema query for every model?
These queries are taking minutes for us. Why not just pull information_schema.columns into memory once and filter it using Python? Or even batch it if it's a very large table. Alternatively, what if dbt built a model at the beginning of the DAG that selected the information schema columns, like so: `select * from information_schema.columns`? If dbt gave us an option in project.yml to specify the source table of the information schema, we could create a table model pointing to it, and then dbt could run using that.
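A sketch of what that startup model could look like (an illustration of this commenter's proposal, not existing dbt behavior; all object names are hypothetical):

```sql
-- Materialize information_schema.columns once at the start of the DAG,
-- then point downstream metadata lookups at this table instead of
-- re-querying the information schema per model.
create or replace table analytics.meta.information_schema_columns as
select table_catalog, table_schema, table_name, column_name, data_type
from analytics.information_schema.columns;
```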
@drewbanin I just had a call with Snowflake engineers and they told me that the metadata engine does not respect the SQL query order of execution, so the usual predicate optimizations may not apply.
Thanks for the info @nehiljain. Are you able to run some benchmarking queries? I'd be curious how the performance of the information_schema queries that dbt runs changes if you replace `ilike` with `=`. This is an important thing to check, as Snowflake have made similar claims in the past, but I didn't notice any huge performance difference between `ilike` and `=` when I last tested.
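Something like the following pair would exercise that comparison; the database and schema names here are hypothetical:

```sql
-- Case-insensitive pattern match, as discussed above:
select table_name, table_type
from analytics.information_schema.tables
where table_schema ilike 'core';

-- Candidate replacement with an equality predicate. Unquoted identifiers
-- fold to upper case in Snowflake, so the literal must be upper-cased:
select table_name, table_type
from analytics.information_schema.tables
where table_schema = 'CORE';
```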
hey @jtalmi - dbt fetches information about tables and views on a schema basis. If dbt needs to know whether a single table exists, it lists the relations for that whole schema and caches the results.

For incremental models, dbt needs to know the columns present in the destination incremental table. dbt uses this information to build an insert statement that maps the columns in the model SQL query to the columns in the physical table in the database. We could certainly cache these results in memory as well, but in practice, this doesn't work so well. When we've tried to do things like this in the past, we've run into opaque Snowflake errors indicating that the query has returned too much data (see: #1127). So, Snowflake really puts us in a pretty rough spot here. If we try to fetch data more aggressively and cache the results, we might see those errors more frequently.

I like your idea of copying the information schema at startup. I just fear that this will fail for some Snowflake users in opaque ways with basically no recourse. I'll also say: sometimes dbt needs to run SQL to build tables, then check the types of the columns in that new table. Some sort of global cache that's populated at startup would unfortunately not help us there.

I really do appreciate your input here! If you have any other thoughts/questions/comments, I'd love to hear them!
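To make the incremental case concrete, the column metadata feeds an insert statement along these lines (a sketch; all table and column names are hypothetical):

```sql
-- dbt must enumerate the destination table's columns so it can map the
-- model query's columns onto the physical table when inserting:
insert into analytics.core.events (event_id, user_id, occurred_at)
select event_id, user_id, occurred_at
from analytics.core.events__dbt_tmp;
```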
@drewbanin do the `show` commands run into the same "too much data" errors? If this is the only obstacle, do you think a workaround could be to run the `information_schema` query only when a `show` command returns its 10,000-row maximum?
Hey y'all, are any of you able to check out this (very developmental) branch? #1999 I gave this a spin on our own Snowflake instance and the results weren't super inspiring (it shaved 2s off a 12s runtime, which isn't nothing!) but I'm also not seeing multiple-minute delays for the information schema queries here either. Please check out the description in the PR for information about the caveats of this branch. It will not work if your databases or schema names require quotes, or if you have > 10,000 tables in a single schema. If you're able to run this locally for your dbt projects, please tell me about any performance benefit you observe in the PR comments! Let me know if you have any questions, or if you need a hand with running dbt from this branch.
Some updates: I'm going to close this issue out in favor of more discrete, actionable issues, but I want to thank everyone for contributing on this thread! Happy to continue the discussion below if anyone has further thoughts :)
@jtcohen6 I noticed that dbt still appears to run one of these metadata queries per relation in some cases - is that expected?
@matt-winkler We can hit the information schema once per schema/database, rather than running a separate query for each relation. Longer-term, I want to see us getting away from massive batch queries run at the start of each invocation.
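A sketch of the "once per database" shape, with hypothetical database and schema names:

```sql
-- One metadata query per database covering every schema touched by the
-- run, instead of a separate query per relation:
select table_schema, table_name, table_type
from analytics.information_schema.tables
where table_schema in ('CORE', 'STAGING', 'MARTS');
```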
Describe the feature
When dbt starts, it runs a query against the `information_schema` for every schema in the project. This happens even if the run involves a single model (a single schema). Each of these queries takes anywhere from 4-20 seconds, presumably depending on how much load the overall Snowflake system has across accounts.

These queries seem to run on the main thread and are therefore sequential. We have a project with 9 schemas and a time-to-first-model of close to 90 seconds. As you can imagine, this is a huge productivity drag.

We are contacting Snowflake about speeding up `information_schema` queries, but this could also be improved if dbt ran these queries in multiple threads and only ran queries for the schemas involved in the given run.

Also, I believe the `show tables` or `show views` commands could be used in this particular case (these take on the order of 100-200 ms) instead of queries to the information schema.

Below is one of these queries, which took over 12 seconds:
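The query itself wasn't preserved in this copy of the issue; the following is a representative sketch of the kind of per-schema metadata query dbt runs at startup, with hypothetical database and schema names:

```sql
-- Per-schema relation listing against the information schema:
select
    table_catalog as "database",
    table_name as "name",
    table_schema as "schema",
    case when table_type = 'BASE TABLE' then 'table'
         when table_type = 'VIEW' then 'view'
         else table_type
    end as "table_type"
from analytics.information_schema.tables
where table_schema ilike 'core'
  and table_catalog ilike 'analytics';
```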
Describe alternatives you've considered
I inquired whether a macro could be used to override the information schema queries but was told it's not possible.
Additional context
Snowflake
Who will this benefit?
This will speed up time-to-first-model for Snowflake projects with multiple schemas.