-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add option to skip relation cache population #7307
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works well in local testing! This makes a difference of several seconds for interactive compile on non-local DWHs.
import time
from dbt.cli.main import dbtRunner
dbt = dbtRunner()
for populate_cache in [True, False]:
start = time.perf_counter()
results, success = dbt.invoke(['compile', '--select', 'my_model'], populate_cache=populate_cache)
end = time.perf_counter() - start
print(f"With populate_cache: {populate_cache}, elapsed: {end}")
print(results[0].node.compiled_code)
e.g. on Snowflake:
With populate_cache: True, elapsed: 2.3116801250000094
select 1::text as id
With populate_cache: False, elapsed: 0.7475685420000104
select 1::text as id
if not self.args.populate_cache: | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense to me!
out of curiosity - is there no real difference between self.args.populate_cache
and get_flags().POPULATE_CACHE
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no real difference, no. self.args
is Flags
which we should be using much more where we can. I think the places we useget_flags
instead of self.args
was just to get the click feature branch over the line to be merged.
core/dbt/adapters/base/impl.py
Outdated
# Jeremy: what was the intent behind this inner loop? | ||
# cache_update: Set[Tuple[Optional[str], Optional[str]]] = set() | ||
# for relation in cache_schemas: | ||
# cache_update.add((database, schema)) | ||
|
||
self.cache.update_schemas(set((database, schema))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtcohen6 While pairing with @ChenyuLInx we weren't sure what this inner loop was doing, especially since cache_schemas
isn't defined anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stu-k The cache is keyed on database
+ schema
. When dbt does a cache lookup, it asks: Have I already cached this database
+ schema
combo? If the combo isn't present as a key in the cache, it's a cache miss, and dbt needs to go run a query. If those keys are present, and no relations are found, then the assumption is that the schema is empty (= missing from the database), rather than just missing from the cache.
If there are no relations
returned by the query, we still want to record that, by inserting an empty set for each database
+ schema
pair. That way, the next time dbt wants to know if any relations are in that database.schema
, it doesn't need to run the same query over again — the cache says, I already know there aren't any relations there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that, but I do not understand the intent of your inner loop I've commented on here.
cache_update: Set[Tuple[Optional[str], Optional[str]]] = set()
for relation in cache_schemas:
cache_update.add((database, schema))
self.cache.update_schemas(cache_update)
What is this code trying to accomplish? It is difficult to determine because cache_schemas
isn't defined in your possible implementation on the original bug ticket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I'm not sure :) can't remember what I was thinking when I wrote that code several months ago.
Given that this list_relations
method is only ever being called for one database
+ schema
pair at a time, it makes sense to just add that [(database, schema)]
pair once!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay cool, I think I have it in a good state right now.
f17bbac
to
b621230
Compare
@@ -718,6 +718,18 @@ def list_relations(self, database: Optional[str], schema: str) -> List[BaseRelat | |||
# we can't build the relations cache because we don't have a | |||
# manifest so we can't run any operations. | |||
relations = self.list_relations_without_caching(schema_relation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtcohen6 this actually means we are still retrieving all relations under a schema even if we are only running one models. Just at a later time.
It is better than before since compile now doesn't cache all relations. I am happy that this part doesn't change in this PR, but this is probably something we should consider another approach sometime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ChenyuLInx Good callout, and worth rethinking in the future. For now, this schema-level behavior is baked into how the cache works:
- We run one caching query per
database.schema
, and up to a certain size, the slow bit is running that query in the DWH versus loading more information than strictly necessary into memory - We organize the cache on the basis of
database.schema
, and do all our lookups on that basis. If we trimmed down the relations we were looking for, we'd risk a false negative, where we think a relation isn't present in the schema but it actually is.
I think something like this is definitely worth doing as a shorter-term win for "catalog" queries run during docs generate
:
cf15815
to
eb53272
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, also tested with dbt-server
compile endpoint
resolves #6526
Description
Add a
--populate-cache
flag to optionally skip relation cache population, defaults toTrue
.Checklist
changie new
to create a changelog entry