Add option to skip relation cache population #7307

Merged
merged 8 commits into from
Apr 11, 2023
6 changes: 6 additions & 0 deletions .changes/unreleased/Features-20230410-115538.yaml
@@ -0,0 +1,6 @@
kind: Features
body: Add --no-populate-cache to optionally skip relation cache population
time: 2023-04-10T11:55:38.360997-05:00
custom:
Author: stu-k
Issue: "1751"
12 changes: 12 additions & 0 deletions core/dbt/adapters/base/impl.py
@@ -718,6 +718,18 @@ def list_relations(self, database: Optional[str], schema: str) -> List[BaseRelat
# we can't build the relations cache because we don't have a
# manifest so we can't run any operations.
relations = self.list_relations_without_caching(schema_relation)
Contributor


@jtcohen6 this actually means we still retrieve all relations under a schema even if we are only running one model, just at a later time.

It is better than before, since compile no longer caches all relations. I am happy that this part doesn't change in this PR, but we should probably consider another approach here at some point.

Contributor


@ChenyuLInx Good callout, and worth rethinking in the future. For now, this schema-level behavior is baked into how the cache works:

  • We run one caching query per database.schema. Up to a certain size, the slow part is running that query in the DWH, not loading more information than strictly necessary into memory.
  • We organize the cache on the basis of database.schema, and do all our lookups on that basis. If we trimmed down the relations we were looking for, we'd risk a false negative, where we think a relation isn't present in the schema but it actually is.
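
To make the false-negative risk in the second bullet concrete, here is a minimal sketch. It is a toy model, not dbt's cache: `ToyRelationsCache`, `add_schema`, `exists`, and the schema and relation names are all hypothetical. It assumes only what the bullets say, namely that lookups are keyed by (database, schema) and that a cached schema is treated as complete.

```python
from typing import Dict, Set, Tuple

SchemaKey = Tuple[str, str]  # (database, schema)


class ToyRelationsCache:
    """Hypothetical stand-in for a relation cache keyed by database.schema."""

    def __init__(self) -> None:
        self.schemas: Set[SchemaKey] = set()
        self.relations: Dict[SchemaKey, Set[str]] = {}

    def add_schema(self, key: SchemaKey, names: Set[str]) -> None:
        # One "caching query" per schema: record whatever it returned.
        self.schemas.add(key)
        self.relations[key] = set(names)

    def exists(self, key: SchemaKey, name: str) -> bool:
        # A schema that has been cached is trusted to be complete.
        if key not in self.schemas:
            raise KeyError(f"{key} has not been cached")
        return name in self.relations[key]


warehouse = {"orders", "customers", "payments"}  # what the schema really holds
key = ("analytics", "dbt_prod")

full = ToyRelationsCache()
full.add_schema(key, warehouse)

trimmed = ToyRelationsCache()
trimmed.add_schema(key, {"orders"})  # only the selected relation was cached

print(full.exists(key, "customers"))     # True
print(trimmed.exists(key, "customers"))  # False, although the relation exists
```

In this toy model the trimmed cache reports that `customers` is absent even though the warehouse has it, which is the failure mode the second bullet describes.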

I think something like this is definitely worth doing as a shorter-term win for "catalog" queries run during docs generate:


# if the cache is already populated, add this schema in
# otherwise, skip updating the cache and just ignore
if self.cache:
    for relation in relations:
        self.cache.add(relation)
    if not relations:
        # it's possible that there were no relations in some schemas. We want
        # to insert the schemas we query into the cache's `.schemas` attribute
        # so we can check it later
        self.cache.update_schemas([(database, schema)])

fire_event(
    ListRelations(
        database=cast_to_str(database),
1 change: 1 addition & 0 deletions core/dbt/cli/main.py
@@ -147,6 +147,7 @@ def invoke(self, args: List[str], **kwargs) -> dbtRunnerResult:
@p.log_path
@p.macro_debugging
@p.partial_parse
@p.populate_cache
@p.print
@p.printer_width
@p.quiet
7 changes: 7 additions & 0 deletions core/dbt/cli/params.py
@@ -238,6 +238,13 @@
default=True,
)

populate_cache = click.option(
    "--populate-cache/--no-populate-cache",
    envvar="DBT_POPULATE_CACHE",
    help="Populate the relation cache at the start of a run. Use --no-populate-cache to skip relation cache population.",
    default=True,
)
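
For readers who want to try the on/off flag pattern without installing click, the sketch below reproduces the same shape with the standard library. This is a swapped-in technique, not the PR's code: it uses `argparse.BooleanOptionalAction` (Python 3.9+) in place of `click.option`, and `env_default` and `parse` are hypothetical helpers. The environment-variable parsing is deliberately simplified and may not match click's rules.

```python
import argparse
import os
from typing import List, Optional


def env_default(name: str, default: bool) -> bool:
    # Simplified truthiness parsing (assumption; click's handling may differ).
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "t", "yes", "y", "on"}


def parse(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="toy")
    parser.add_argument(
        "--populate-cache",
        # BooleanOptionalAction also registers --no-populate-cache.
        action=argparse.BooleanOptionalAction,
        default=env_default("DBT_POPULATE_CACHE", True),
        help="Populate the relation cache at the start of a run.",
    )
    return parser.parse_args(argv)


args = parse(["--no-populate-cache"])
print(args.populate_cache)  # False
```

An explicit flag on the command line wins over the environment variable, and the environment variable wins over the built-in default of `True`.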

port = click.option(
"--port",
envvar=None,
29 changes: 15 additions & 14 deletions core/dbt/contracts/project.py
@@ -243,25 +243,26 @@ def validate(cls, data):

@dataclass
class UserConfig(ExtensibleDbtClassMixin, Replaceable, UserConfigContract):
-    send_anonymous_usage_stats: bool = DEFAULT_SEND_ANONYMOUS_USAGE_STATS
-    use_colors: Optional[bool] = None
-    use_colors_file: Optional[bool] = None
-    partial_parse: Optional[bool] = None
-    printer_width: Optional[int] = None
-    write_json: Optional[bool] = None
-    warn_error: Optional[bool] = None
-    warn_error_options: Optional[Dict[str, Union[str, List[str]]]] = None
+    cache_selected_only: Optional[bool] = None
+    debug: Optional[bool] = None
+    fail_fast: Optional[bool] = None
+    indirect_selection: Optional[str] = None
     log_format: Optional[str] = None
     log_format_file: Optional[str] = None
     log_level: Optional[str] = None
     log_level_file: Optional[str] = None
-    debug: Optional[bool] = None
-    version_check: Optional[bool] = None
-    fail_fast: Optional[bool] = None
-    use_experimental_parser: Optional[bool] = None
+    partial_parse: Optional[bool] = None
+    populate_cache: Optional[bool] = None
+    printer_width: Optional[int] = None
+    send_anonymous_usage_stats: bool = DEFAULT_SEND_ANONYMOUS_USAGE_STATS
     static_parser: Optional[bool] = None
-    indirect_selection: Optional[str] = None
-    cache_selected_only: Optional[bool] = None
+    use_colors: Optional[bool] = None
+    use_colors_file: Optional[bool] = None
+    use_experimental_parser: Optional[bool] = None
+    version_check: Optional[bool] = None
+    warn_error: Optional[bool] = None
+    warn_error_options: Optional[Dict[str, Union[str, List[str]]]] = None
+    write_json: Optional[bool] = None


@dataclass
3 changes: 3 additions & 0 deletions core/dbt/task/runnable.py
@@ -371,6 +371,9 @@ def _mark_dependent_errors(self, node_id, result, cause):
self._skipped_children[dep_node_id] = cause

def populate_adapter_cache(self, adapter, required_schemas: Set[BaseRelation] = None):
    if not self.args.populate_cache:
        return
Comment on lines +374 to +375
Contributor


makes sense to me!

out of curiosity - is there no real difference between self.args.populate_cache and get_flags().POPULATE_CACHE?

Contributor Author


There is no real difference, no. self.args is Flags, which we should be using much more where we can. I think the places where we use get_flags instead of self.args were just to get the click feature branch over the line to be merged.


    start_populate_cache = time.perf_counter()
    if get_flags().CACHE_SELECTED_ONLY is True:
        adapter.set_relations_cache(self.manifest, required_schemas=required_schemas)
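
The early return added in this hunk can be exercised in isolation. The following is a minimal sketch under stated assumptions: `ToyAdapter` and `ToyTask` are hypothetical stand-ins, the `CACHE_SELECTED_ONLY` branch is left out for brevity, and returning the elapsed time is an illustration choice rather than something the PR does.

```python
import time
from types import SimpleNamespace


class ToyAdapter:
    """Hypothetical adapter that only counts cache-building calls."""

    def __init__(self) -> None:
        self.cache_builds = 0

    def set_relations_cache(self, manifest, required_schemas=None) -> None:
        self.cache_builds += 1


class ToyTask:
    """Hypothetical task holding parsed flags in `args`, as in the hunk above."""

    def __init__(self, args, manifest=None) -> None:
        self.args = args
        self.manifest = manifest

    def populate_adapter_cache(self, adapter, required_schemas=None):
        if not self.args.populate_cache:
            return None  # flag disabled: no upfront caching queries are issued
        start = time.perf_counter()
        adapter.set_relations_cache(self.manifest, required_schemas=required_schemas)
        return time.perf_counter() - start


adapter = ToyAdapter()

ToyTask(SimpleNamespace(populate_cache=False)).populate_adapter_cache(adapter)
skipped = adapter.cache_builds  # still 0: the guard returned early

ToyTask(SimpleNamespace(populate_cache=True)).populate_adapter_cache(adapter)
built = adapter.cache_builds  # 1: the cache was built once
```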
14 changes: 14 additions & 0 deletions tests/adapter/dbt/tests/adapter/caching/test_caching.py
@@ -91,6 +91,20 @@ def test_cache(self, project):
self.run_and_inspect_cache(project, run_args)


class TestNoPopulateCache(BaseCachingTest):
    @pytest.fixture(scope="class")
    def models(self):
        return {
            "model.sql": model_sql,
        }

    def test_cache(self, project):
        # --no-populate-cache still allows the cache to populate all relations
        # under a schema, so the behavior here remains the same as other tests
        run_args = ["--no-populate-cache", "run"]
        self.run_and_inspect_cache(project, run_args)
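
The comment in `test_cache` says the cache still ends up holding every relation in the schema. One way to picture that is a cache that fills lazily, one schema at a time, on first lookup. The sketch below is a toy illustration of that idea only: `LazyToyAdapter`, its `warehouse` dict, and the relation names are hypothetical, and it does not reproduce how dbt's adapters decide when to add to the cache.

```python
from typing import Dict, List, Set, Tuple

SchemaKey = Tuple[str, str]  # (database, schema)


class LazyToyAdapter:
    """Hypothetical adapter: the cache starts empty and fills on lookup."""

    def __init__(self, warehouse: Dict[SchemaKey, List[str]]) -> None:
        self.warehouse = warehouse  # pretend database contents
        self.cache: Dict[SchemaKey, Set[str]] = {}
        self.queries = 0

    def list_relations(self, database: str, schema: str) -> List[str]:
        key = (database, schema)
        if key in self.cache:
            return sorted(self.cache[key])
        self.queries += 1  # one schema-wide query on a cache miss
        names = self.warehouse.get(key, [])
        self.cache[key] = set(names)  # empty schemas are recorded too
        return sorted(names)


adapter = LazyToyAdapter({("db", "analytics"): ["model", "other_view"]})
first = adapter.list_relations("db", "analytics")   # queries the "warehouse"
second = adapter.list_relations("db", "analytics")  # served from the cache
print(first, adapter.queries)
```

Even with no upfront population, the first lookup pulls in the whole schema, so an inspection of the cache afterwards sees the same relations as in the other tests.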


class TestCachingLowerCaseModel(BaseCachingLowercaseModel):
    pass
