-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature: implement the dbt-meshify connect
command
#104
Conversation
…bprojectCreator class
…any tests at the moment" This reverts commit 68b897d.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dave-connors-3 Here's my first round of reviews. I've left some comments around typing, the use of forward and backward dependency checks, and other marginalia.
There are many moving parts in this one, so I'm going to do another pass later today.
dbt_meshify/linker.py
Outdated
backward_relations = self._find_relation_dependencies( | ||
source_relations={ | ||
model.relation_name | ||
for model in other_project.models.values() | ||
if model.relation_name is not None | ||
}, | ||
target_relations={ | ||
model.relation_name | ||
for model in project.models.values() | ||
if model.relation_name is not None | ||
}, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not confident that we need this backwards_relations
calculation. Given that we're working with models for both the sources and the targets, and given that intersection
is undirected, this should yield the same results as the relations
call on line 141.
Is my understanding accurate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes! that's absolutely correct -- we can perform the reverse operation on the same set of relations in relations
dbt_meshify/main.py
Outdated
dbt_project_combinations = [combo for combo in combinations(dbt_projects, 2)] | ||
for dbt_project_combo in dbt_project_combinations: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will certainly work for now, but will resulting in a combinatorial explosion if we have more than a few projects. In a future issue, I'd recommend we update some of our linker code to be agnostic to the number of projects being evaluated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an update, this implementation has two rough edge case.
Suppose we are handling three projects, A
, B
, and C
, such that there exists dependencies A -> B
and A -> C
. Under this logic this script will check for any dependencies between B
and C
(which is a good thing!), but will find that both B
and C
have a shared dependency ... their common model from project A!
Furthermore, this also shows that the same upstream resource from Project A can be identified as a dependency in B
and C
multiple times. This suggests to me that project dependencies should be pooled into a set (for deduplication) prior to evaluation.
By way of an example:
dbt-meshify-py3.11) > $ dbt-meshify connect --project-paths test-projects/source-hack/src_proj_a test-projects/source-hack/src_proj_b test-projects/source-hack/dest_proj_a [±connect-command ●●]
15:52:52 | INFO | Executing dbt parse...
15:52:53 | INFO | Generating catalog with dbt docs generate...
15:52:53 | INFO | Executing dbt parse...
15:52:53 | INFO | Generating catalog with dbt docs generate...
15:52:53 | INFO | Executing dbt parse...
15:52:53 | INFO | Generating catalog with dbt docs generate...
15:52:54 | INFO | Found dependency between src_proj_a and src_proj_b: ProjectDependency(upstream_resource='model.src_proj_a.shared_model', upstream_project_name='src_proj_a', downstream_resource='source.src_proj_b.src_proj_a.shared_model', downstream_project_name='src_proj_b', type=<ProjectDependencyType.Source: 'source'>)
15:52:54 | INFO | Adding public access to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully update model access : model.src_proj_a.shared_model is now public
15:52:54 | INFO | Adding contract to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully added contract to model: model.src_proj_a.shared_model
15:52:54 | SUCCESS | Successfully replaced source function with ref to upstream resource: source.src_proj_b.src_proj_a.shared_model now calls model.src_proj_a.shared_model directly
15:52:54 | SUCCESS | Successfully deleted unnecessary source: source.src_proj_b.src_proj_a.shared_model
15:52:54 | SUCCESS | Successfully added src_proj_a to src_proj_b's dependencies.yml
15:52:54 | INFO | Found dependency between src_proj_a and dest_proj_a: ProjectDependency(upstream_resource='model.src_proj_a.shared_model', upstream_project_name='src_proj_a', downstream_resource='model.dest_proj_a.downstream_model', downstream_project_name='dest_proj_a', type=<ProjectDependencyType.Package: 'package'>)
15:52:54 | INFO | Adding public access to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully update model access : model.src_proj_a.shared_model is now public
15:52:54 | INFO | Adding contract to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully added contract to model: model.src_proj_a.shared_model
15:52:54 | SUCCESS | Successfully updated model refs: model.dest_proj_a.downstream_model now references model.src_proj_a.shared_model
15:52:54 | SUCCESS | Successfully added src_proj_a to dest_proj_a's dependencies.yml
15:52:54 | INFO | Found dependency between src_proj_b and dest_proj_a: ProjectDependency(upstream_resource='model.src_proj_a.shared_model', upstream_project_name='dest_proj_a', downstream_resource='source.src_proj_b.src_proj_a.shared_model', downstream_project_name='src_proj_b', type=<ProjectDependencyType.Source: 'source'>)
15:52:54 | INFO | Adding public access to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully update model access : model.src_proj_a.shared_model is now public
15:52:54 | INFO | Adding contract to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully added contract to model: model.src_proj_a.shared_model
15:52:54 | SUCCESS | Successfully replaced source function with ref to upstream resource: source.src_proj_b.src_proj_a.shared_model now calls model.src_proj_a.shared_model directly
Traceback (most recent call last):
File "/Users/nicholas/projects/nicholasyager/dbt-meshify-2/dbt_meshify/main.py", line 112, in connect
linker.resolve_dependency(
File "/Users/nicholas/projects/nicholasyager/dbt-meshify-2/dbt_meshify/linker.py", line 281, in resolve_dependency
downstream_editor.update_resource_yml_entry(
File "/Users/nicholas/projects/nicholasyager/dbt-meshify-2/dbt_meshify/storage/dbt_project_editors.py", line 60, in update_resource_yml_entry
full_yml_entry = self.file_manager.read_file(current_yml_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/nicholas/projects/nicholasyager/dbt-meshify-2/dbt_meshify/storage/file_manager.py", line 63, in read_file
return yaml.load(full_path.read_text())
^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/pathlib.py", line 1058, in read_text
with self.open(mode='r', encoding=encoding, errors=errors) as f:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/pathlib.py", line 1044, in open
return io.open(self, mode, buffering, encoding, errors, newline)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'test-projects/source-hack/src_proj_b/models/_sources.yml'
15:52:54 | ERROR | Error resolving dependency : ProjectDependency(upstream_resource='model.src_proj_a.shared_model', upstream_project_name='dest_proj_a', downstream_resource='source.src_proj_b.src_proj_a.shared_model', downstream_project_name='src_proj_b', type=<ProjectDependencyType.Source: 'source'>) [Errno 2] No such file or directory: 'test-projects/source-hack/src_proj_b/models/_sources.yml'
In this log file, we are seeing an error pop up since the script noticed an extra opportunity to rewrite the source ref in src_proj_b despite 1) it already having been resolved, and 2) it being identified via a non-dependent project!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really appreciate the deep dive here! the solution that you mentioned worked perfectly given the _hash
method you implemented -- pooling the dependecies into a set for this command ends up with 2 instead of 3 dependecies, which properly resolve:
dbt-meshify connect --project-paths test-projects/source-hack/src_proj_a test-projects/source-hack/src_proj_b test-projects/source-hack/dest_proj_a
16:45:31 | INFO | Executing dbt parse...
16:45:31 | INFO | Generating catalog with dbt docs generate...
16:45:32 | INFO | Executing dbt parse...
16:45:32 | INFO | Generating catalog with dbt docs generate...
16:45:32 | INFO | Executing dbt parse...
16:45:32 | INFO | Generating catalog with dbt docs generate...
16:45:33 | INFO | Found 1 dependencies between src_proj_a and src_proj_b
16:45:33 | INFO | Found 1 dependencies between src_proj_a and dest_proj_a
16:45:33 | INFO | Found 1 dependencies between src_proj_b and dest_proj_a
16:45:33 | INFO | Found 2 unique dependencies between all projects.
16:45:33 | INFO | Resolving dependency between model.src_proj_a.shared_model and source.src_proj_b.src_proj_a.shared_model
16:45:33 | INFO | Adding public access to shared_model at models/_models.yml
16:45:33 | SUCCESS | Successfully update model access : model.src_proj_a.shared_model is now public
16:45:33 | INFO | Adding contract to shared_model at models/_models.yml
16:45:33 | SUCCESS | Successfully added contract to model: model.src_proj_a.shared_model
16:45:33 | SUCCESS | Successfully replaced source function with ref to upstream resource: source.src_proj_b.src_proj_a.shared_model now calls model.src_proj_a.shared_model directly
16:45:33 | SUCCESS | Successfully deleted unnecessary source: source.src_proj_b.src_proj_a.shared_model
16:45:33 | SUCCESS | Successfully added src_proj_a to src_proj_b's dependencies.yml
16:45:33 | INFO | Resolving dependency between model.src_proj_a.shared_model and model.dest_proj_a.downstream_model
16:45:33 | INFO | Adding public access to shared_model at models/_models.yml
16:45:33 | SUCCESS | Successfully update model access : model.src_proj_a.shared_model is now public
16:45:33 | INFO | Adding contract to shared_model at models/_models.yml
16:45:33 | SUCCESS | Successfully added contract to model: model.src_proj_a.shared_model
16:45:33 | SUCCESS | Successfully updated model refs: model.dest_proj_a.downstream_model now references model.src_proj_a.shared_model
16:45:33 | SUCCESS | Successfully added src_proj_a to dest_proj_a's dependencies.yml
dbt_meshify/main.py
Outdated
logger.info( | ||
f"No dependencies found between {dbt_project_combo[0].name} and {dbt_project_combo[1].name}" | ||
) | ||
for dependency in linker.dependencies(dbt_project_combo[0], dbt_project_combo[1]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No reason to compute this twice.
for dependency in linker.dependencies(dbt_project_combo[0], dbt_project_combo[1]): | |
for dependency in dependencies: |
class YMLOperationType(str, Enum): | ||
"""ProjectDependencyTypes define how the dependency relationship was defined.""" | ||
|
||
Move = "move" | ||
Delete = "delete" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting a jump on ChangeSet work, eh 😝 Happy to see this pattern getting some reps in before the refactoring work.
if not model_node: | ||
raise KeyError(f"Resource {model} not found in manifest") | ||
meshify_constructor = DbtMeshConstructor( | ||
project_path=self.subproject.parent_project.path, node=model_node, catalog=None | ||
project_path=self.project.parent_project.path, node=model_node, catalog=None # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's going on with the typing for DbtMeshConstructors
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i went pretty wild with type ignoring 😃
this one looks like a miss, but was having trouble convinving mypy that the DbtSubprojectCreator
could only get a DbtSubproject
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i thought the code on 102 would be sufficient to suppress the type hints, but alas!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dave-connors-3 I've review the code more thoroughly, and tried out a few test cases. Alas, I was able to find a logical defect in the current way project dependencies are handled, in that we can find shared dependencies between non-dependent projects, and we can try to evaluate the same operation/dependency twice. See the relevant comment below for details and a way to replicate!
Once that has been resolved, this should be good to go, IMHO.
def resolve_dependency( | ||
self, | ||
dependency: ProjectDependency, | ||
upstream_project: DbtProject, | ||
downstream_project: DbtProject, | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dave-connors-3: We previously discussed the appropriate place for this type of logic, and after some thinking I believe that this approach is the right one. Specifically, having the Linker
class present a resolve_dependency
method that uses ProjectDependency
s to identify file updates. In a ChangeSet world, this method will return a ChangeSet that describes what file operations need to be made for each Ref update, which is beautiful to work with. ✅
dbt_meshify/main.py
Outdated
dbt_project_combinations = [combo for combo in combinations(dbt_projects, 2)] | ||
for dbt_project_combo in dbt_project_combinations: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an update, this implementation has two rough edge case.
Suppose we are handling three projects, A
, B
, and C
, such that there exists dependencies A -> B
and A -> C
. Under this logic this script will check for any dependencies between B
and C
(which is a good thing!), but will find that both B
and C
have a shared dependency ... their common model from project A!
Furthermore, this also shows that the same upstream resource from Project A can be identified as a dependency in B
and C
multiple times. This suggests to me that project dependencies should be pooled into a set (for deduplication) prior to evaluation.
By way of an example:
dbt-meshify-py3.11) > $ dbt-meshify connect --project-paths test-projects/source-hack/src_proj_a test-projects/source-hack/src_proj_b test-projects/source-hack/dest_proj_a [±connect-command ●●]
15:52:52 | INFO | Executing dbt parse...
15:52:53 | INFO | Generating catalog with dbt docs generate...
15:52:53 | INFO | Executing dbt parse...
15:52:53 | INFO | Generating catalog with dbt docs generate...
15:52:53 | INFO | Executing dbt parse...
15:52:53 | INFO | Generating catalog with dbt docs generate...
15:52:54 | INFO | Found dependency between src_proj_a and src_proj_b: ProjectDependency(upstream_resource='model.src_proj_a.shared_model', upstream_project_name='src_proj_a', downstream_resource='source.src_proj_b.src_proj_a.shared_model', downstream_project_name='src_proj_b', type=<ProjectDependencyType.Source: 'source'>)
15:52:54 | INFO | Adding public access to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully update model access : model.src_proj_a.shared_model is now public
15:52:54 | INFO | Adding contract to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully added contract to model: model.src_proj_a.shared_model
15:52:54 | SUCCESS | Successfully replaced source function with ref to upstream resource: source.src_proj_b.src_proj_a.shared_model now calls model.src_proj_a.shared_model directly
15:52:54 | SUCCESS | Successfully deleted unnecessary source: source.src_proj_b.src_proj_a.shared_model
15:52:54 | SUCCESS | Successfully added src_proj_a to src_proj_b's dependencies.yml
15:52:54 | INFO | Found dependency between src_proj_a and dest_proj_a: ProjectDependency(upstream_resource='model.src_proj_a.shared_model', upstream_project_name='src_proj_a', downstream_resource='model.dest_proj_a.downstream_model', downstream_project_name='dest_proj_a', type=<ProjectDependencyType.Package: 'package'>)
15:52:54 | INFO | Adding public access to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully update model access : model.src_proj_a.shared_model is now public
15:52:54 | INFO | Adding contract to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully added contract to model: model.src_proj_a.shared_model
15:52:54 | SUCCESS | Successfully updated model refs: model.dest_proj_a.downstream_model now references model.src_proj_a.shared_model
15:52:54 | SUCCESS | Successfully added src_proj_a to dest_proj_a's dependencies.yml
15:52:54 | INFO | Found dependency between src_proj_b and dest_proj_a: ProjectDependency(upstream_resource='model.src_proj_a.shared_model', upstream_project_name='dest_proj_a', downstream_resource='source.src_proj_b.src_proj_a.shared_model', downstream_project_name='src_proj_b', type=<ProjectDependencyType.Source: 'source'>)
15:52:54 | INFO | Adding public access to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully update model access : model.src_proj_a.shared_model is now public
15:52:54 | INFO | Adding contract to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully added contract to model: model.src_proj_a.shared_model
15:52:54 | SUCCESS | Successfully replaced source function with ref to upstream resource: source.src_proj_b.src_proj_a.shared_model now calls model.src_proj_a.shared_model directly
Traceback (most recent call last):
File "/Users/nicholas/projects/nicholasyager/dbt-meshify-2/dbt_meshify/main.py", line 112, in connect
linker.resolve_dependency(
File "/Users/nicholas/projects/nicholasyager/dbt-meshify-2/dbt_meshify/linker.py", line 281, in resolve_dependency
downstream_editor.update_resource_yml_entry(
File "/Users/nicholas/projects/nicholasyager/dbt-meshify-2/dbt_meshify/storage/dbt_project_editors.py", line 60, in update_resource_yml_entry
full_yml_entry = self.file_manager.read_file(current_yml_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/nicholas/projects/nicholasyager/dbt-meshify-2/dbt_meshify/storage/file_manager.py", line 63, in read_file
return yaml.load(full_path.read_text())
^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/pathlib.py", line 1058, in read_text
with self.open(mode='r', encoding=encoding, errors=errors) as f:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/pathlib.py", line 1044, in open
return io.open(self, mode, buffering, encoding, errors, newline)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'test-projects/source-hack/src_proj_b/models/_sources.yml'
15:52:54 | ERROR | Error resolving dependency : ProjectDependency(upstream_resource='model.src_proj_a.shared_model', upstream_project_name='dest_proj_a', downstream_resource='source.src_proj_b.src_proj_a.shared_model', downstream_project_name='src_proj_b', type=<ProjectDependencyType.Source: 'source'>) [Errno 2] No such file or directory: 'test-projects/source-hack/src_proj_b/models/_sources.yml'
In this log file, we are seeing an error pop up since the script noticed an extra opportunity to rewrite the source ref in src_proj_b despite 1) it already having been resolved, and 2) it being identified via a non-dependent project!
@nicholasyager thanks for the thorough review! i believe i've resolved your comments for now 🚀 ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dave-connors-3 Thanks for making the fixes! This PR is shaping up really well. I've added two non-blocking recommendations on the theme of "returning early", but they're more for readability than performance. Great work tying all of these pieces together! ✅
Co-authored-by: Nicholas A. Yager <[email protected]>
Co-authored-by: Nicholas A. Yager <[email protected]>
This PR implements the
dbt-meshify connect
command to stitch together multiple projects that either install each other as packages, or have sources declared in a downstream project that are models of an upstream project.Usage
I added a few flags specific for the
connect
command:--project-paths
-- accepts multuple args of paths to dbt projects to connect -- this is only one character off of the existing--project-path
arg for other single-project commands, so open to naming feedback here--projects-dir
-- accepts a path to a directory containing a set of dbt projects to connect--exclude-projects
-- accepts mutuple args of project names to exclude from the set returned by--projects-dir
Changes
A few major changes worth calling out!
linker.py
child_map
of the shared resource to properly understand where the upstream resource was being used.linker.dependencies()
changed the output of the method.ProjectDependency
class to make it a bit easier to interact with projectslinker
class -- this chunk of code felt possible misplaced. I would not argue with the fact that we might want this on a class similar to theDbtSubprojectCreator
class that operated on the contents of DbtProjects and keepLinker
specific to detecting dependencies.dbt_project_editors.py
(formerly dbt_project_creator)Added a base class (
DbtProjectEditor
) to contain some common project editing methods that are then extended inDbtSubprojectCreator
. Some of these methods are used in linker.resolve_dependency. Like I said above, may make more sense to make aDbtProjectConnector(DbtProjectEditor)
class that houses thelinker.resolve_dependency()
method.