Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: implement the dbt-meshify connect command #104

Merged
merged 19 commits into from
Aug 8, 2023

Conversation

dave-connors-3
Copy link
Collaborator

@dave-connors-3 dave-connors-3 commented Jul 25, 2023

This PR implements the dbt-meshify connect command to stitch together multiple projects that either install each other as packages, or have sources declared in a downstream project that are models of an upstream project.

Usage

I added a few flags specific for the connect command:

  • --project-paths -- accepts multuple args of paths to dbt projects to connect -- this is only one character off of the existing --project-path arg for other single-project commands, so open to naming feedback here
  • --projects-dir -- accepts a path to a directory containing a set of dbt projects to connect
  • --exclude-projects -- accepts mutuple args of project names to exclude from the set returned by --projects-dir
# connect two projects
dbt-meshify connect --project-paths path/to/project_a  path/to/project_b 

# connect as many as you have
dbt-meshify connect --projects-dir path/to/several/projects  --exclude-projects bad_project

Changes

A few major changes worth calling out!

linker.py

  1. There was a bug in the package dependency detection! It was returning the same resource as the upstream and downstream resource. I added logic to parse through the child_map of the shared resource to properly understand where the upstream resource was being used.
  2. A second issue with the package dependency detection was that it lacked the backwards and forwards logic that the source project had, meaning that the order of the project arguments to linker.dependencies() changed the output of the method.
  3. I added project names to the ProjectDependency class to make it a bit easier to interact with projects
  4. I added the logic to edit the dbt Projects to the linker class -- this chunk of code felt possible misplaced. I would not argue with the fact that we might want this on a class similar to the DbtSubprojectCreator class that operated on the contents of DbtProjects and keep Linker specific to detecting dependencies.

dbt_project_editors.py (formerly dbt_project_creator)

Added a base class (DbtProjectEditor) to contain some common project editing methods that are then extended in DbtSubprojectCreator. Some of these methods are used in linker.resolve_dependency. Like I said above, may make more sense to make a DbtProjectConnector(DbtProjectEditor) class that houses the linker.resolve_dependency() method.

Copy link
Collaborator

@nicholasyager nicholasyager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dave-connors-3 Here's my first round of reviews. I've left some comments around typing, the use of forward and backward dependency checks, and other marginalia.

There are many moving parts in this one, so I'm going to do another pass later today.

Comment on lines 176 to 187
backward_relations = self._find_relation_dependencies(
source_relations={
model.relation_name
for model in other_project.models.values()
if model.relation_name is not None
},
target_relations={
model.relation_name
for model in project.models.values()
if model.relation_name is not None
},
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not confident that we need this backwards_relations calculation. Given that we're working with models for both the sources and the targets, and given that intersection is undirected, this should yield the same results as the relations call on line 141.

Is my understanding accurate?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes! that's absolutely correct -- we can perform the reverse operation on the same set of relations in relations

dbt_meshify/linker.py Outdated Show resolved Hide resolved
Comment on lines 98 to 99
dbt_project_combinations = [combo for combo in combinations(dbt_projects, 2)]
for dbt_project_combo in dbt_project_combinations:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will certainly work for now, but will resulting in a combinatorial explosion if we have more than a few projects. In a future issue, I'd recommend we update some of our linker code to be agnostic to the number of projects being evaluated.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an update, this implementation has two rough edge case.

Suppose we are handling three projects, A, B, and C, such that there exists dependencies A -> B and A -> C. Under this logic this script will check for any dependencies between B and C (which is a good thing!), but will find that both B and C have a shared dependency ... their common model from project A!

Furthermore, this also shows that the same upstream resource from Project A can be identified as a dependency in B and C multiple times. This suggests to me that project dependencies should be pooled into a set (for deduplication) prior to evaluation.

By way of an example:

dbt-meshify-py3.11) > $ dbt-meshify connect --project-paths test-projects/source-hack/src_proj_a test-projects/source-hack/src_proj_b test-projects/source-hack/dest_proj_a                                              [±connect-command ●●]
15:52:52 | INFO | Executing dbt parse...
15:52:53 | INFO | Generating catalog with dbt docs generate...
15:52:53 | INFO | Executing dbt parse...
15:52:53 | INFO | Generating catalog with dbt docs generate...
15:52:53 | INFO | Executing dbt parse...
15:52:53 | INFO | Generating catalog with dbt docs generate...
15:52:54 | INFO | Found dependency between src_proj_a and src_proj_b: ProjectDependency(upstream_resource='model.src_proj_a.shared_model', upstream_project_name='src_proj_a', downstream_resource='source.src_proj_b.src_proj_a.shared_model', downstream_project_name='src_proj_b', type=<ProjectDependencyType.Source: 'source'>)
15:52:54 | INFO | Adding public access to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully update model access : model.src_proj_a.shared_model is now public
15:52:54 | INFO | Adding contract to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully added contract to model: model.src_proj_a.shared_model
15:52:54 | SUCCESS | Successfully replaced source function with ref to upstream resource: source.src_proj_b.src_proj_a.shared_model now calls model.src_proj_a.shared_model directly
15:52:54 | SUCCESS | Successfully deleted unnecessary source: source.src_proj_b.src_proj_a.shared_model
15:52:54 | SUCCESS | Successfully added src_proj_a to src_proj_b's dependencies.yml
15:52:54 | INFO | Found dependency between src_proj_a and dest_proj_a: ProjectDependency(upstream_resource='model.src_proj_a.shared_model', upstream_project_name='src_proj_a', downstream_resource='model.dest_proj_a.downstream_model', downstream_project_name='dest_proj_a', type=<ProjectDependencyType.Package: 'package'>)
15:52:54 | INFO | Adding public access to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully update model access : model.src_proj_a.shared_model is now public
15:52:54 | INFO | Adding contract to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully added contract to model: model.src_proj_a.shared_model
15:52:54 | SUCCESS | Successfully updated model refs: model.dest_proj_a.downstream_model now references model.src_proj_a.shared_model
15:52:54 | SUCCESS | Successfully added src_proj_a to dest_proj_a's dependencies.yml
15:52:54 | INFO | Found dependency between src_proj_b and dest_proj_a: ProjectDependency(upstream_resource='model.src_proj_a.shared_model', upstream_project_name='dest_proj_a', downstream_resource='source.src_proj_b.src_proj_a.shared_model', downstream_project_name='src_proj_b', type=<ProjectDependencyType.Source: 'source'>)
15:52:54 | INFO | Adding public access to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully update model access : model.src_proj_a.shared_model is now public
15:52:54 | INFO | Adding contract to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully added contract to model: model.src_proj_a.shared_model
15:52:54 | SUCCESS | Successfully replaced source function with ref to upstream resource: source.src_proj_b.src_proj_a.shared_model now calls model.src_proj_a.shared_model directly
Traceback (most recent call last):
  File "/Users/nicholas/projects/nicholasyager/dbt-meshify-2/dbt_meshify/main.py", line 112, in connect
    linker.resolve_dependency(
  File "/Users/nicholas/projects/nicholasyager/dbt-meshify-2/dbt_meshify/linker.py", line 281, in resolve_dependency
    downstream_editor.update_resource_yml_entry(
  File "/Users/nicholas/projects/nicholasyager/dbt-meshify-2/dbt_meshify/storage/dbt_project_editors.py", line 60, in update_resource_yml_entry
    full_yml_entry = self.file_manager.read_file(current_yml_path)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nicholas/projects/nicholasyager/dbt-meshify-2/dbt_meshify/storage/file_manager.py", line 63, in read_file
    return yaml.load(full_path.read_text())
                     ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/pathlib.py", line 1058, in read_text
    with self.open(mode='r', encoding=encoding, errors=errors) as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/pathlib.py", line 1044, in open
    return io.open(self, mode, buffering, encoding, errors, newline)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'test-projects/source-hack/src_proj_b/models/_sources.yml'

15:52:54 | ERROR | Error resolving dependency : ProjectDependency(upstream_resource='model.src_proj_a.shared_model', upstream_project_name='dest_proj_a', downstream_resource='source.src_proj_b.src_proj_a.shared_model', downstream_project_name='src_proj_b', type=<ProjectDependencyType.Source: 'source'>) [Errno 2] No such file or directory: 'test-projects/source-hack/src_proj_b/models/_sources.yml'

In this log file, we are seeing an error pop up since the script noticed an extra opportunity to rewrite the source ref in src_proj_b despite 1) it already having been resolved, and 2) it being identified via a non-dependent project!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really appreciate the deep dive here! the solution that you mentioned worked perfectly given the _hash method you implemented -- pooling the dependecies into a set for this command ends up with 2 instead of 3 dependecies, which properly resolve:

dbt-meshify connect --project-paths test-projects/source-hack/src_proj_a test-projects/source-hack/src_proj_b test-projects/source-hack/dest_proj_a
16:45:31 | INFO | Executing dbt parse...
16:45:31 | INFO | Generating catalog with dbt docs generate...
16:45:32 | INFO | Executing dbt parse...
16:45:32 | INFO | Generating catalog with dbt docs generate...
16:45:32 | INFO | Executing dbt parse...
16:45:32 | INFO | Generating catalog with dbt docs generate...
16:45:33 | INFO | Found 1 dependencies between src_proj_a and src_proj_b
16:45:33 | INFO | Found 1 dependencies between src_proj_a and dest_proj_a
16:45:33 | INFO | Found 1 dependencies between src_proj_b and dest_proj_a
16:45:33 | INFO | Found 2 unique dependencies between all projects.
16:45:33 | INFO | Resolving dependency between model.src_proj_a.shared_model and source.src_proj_b.src_proj_a.shared_model
16:45:33 | INFO | Adding public access to shared_model at models/_models.yml
16:45:33 | SUCCESS | Successfully update model access : model.src_proj_a.shared_model is now public
16:45:33 | INFO | Adding contract to shared_model at models/_models.yml
16:45:33 | SUCCESS | Successfully added contract to model: model.src_proj_a.shared_model
16:45:33 | SUCCESS | Successfully replaced source function with ref to upstream resource: source.src_proj_b.src_proj_a.shared_model now calls model.src_proj_a.shared_model directly
16:45:33 | SUCCESS | Successfully deleted unnecessary source: source.src_proj_b.src_proj_a.shared_model
16:45:33 | SUCCESS | Successfully added src_proj_a to src_proj_b's dependencies.yml
16:45:33 | INFO | Resolving dependency between model.src_proj_a.shared_model and model.dest_proj_a.downstream_model
16:45:33 | INFO | Adding public access to shared_model at models/_models.yml
16:45:33 | SUCCESS | Successfully update model access : model.src_proj_a.shared_model is now public
16:45:33 | INFO | Adding contract to shared_model at models/_models.yml
16:45:33 | SUCCESS | Successfully added contract to model: model.src_proj_a.shared_model
16:45:33 | SUCCESS | Successfully updated model refs: model.dest_proj_a.downstream_model now references model.src_proj_a.shared_model
16:45:33 | SUCCESS | Successfully added src_proj_a to dest_proj_a's dependencies.yml

logger.info(
f"No dependencies found between {dbt_project_combo[0].name} and {dbt_project_combo[1].name}"
)
for dependency in linker.dependencies(dbt_project_combo[0], dbt_project_combo[1]):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason to compute this twice.

Suggested change
for dependency in linker.dependencies(dbt_project_combo[0], dbt_project_combo[1]):
for dependency in dependencies:

Comment on lines +19 to +23
class YMLOperationType(str, Enum):
"""ProjectDependencyTypes define how the dependency relationship was defined."""

Move = "move"
Delete = "delete"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting a jump on ChangeSet work, eh 😝 Happy to see this pattern getting some reps in before the refactoring work.

if not model_node:
raise KeyError(f"Resource {model} not found in manifest")
meshify_constructor = DbtMeshConstructor(
project_path=self.subproject.parent_project.path, node=model_node, catalog=None
project_path=self.project.parent_project.path, node=model_node, catalog=None # type: ignore
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's going on with the typing for DbtMeshConstructors?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i went pretty wild with type ignoring 😃

this one looks like a miss, but was having trouble convinving mypy that the DbtSubprojectCreator could only get a DbtSubproject

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought the code on 102 would be sufficient to suppress the type hints, but alas!

Copy link
Collaborator

@nicholasyager nicholasyager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dave-connors-3 I've review the code more thoroughly, and tried out a few test cases. Alas, I was able to find a logical defect in the current way project dependencies are handled, in that we can find shared dependencies between non-dependent projects, and we can try to evaluate the same operation/dependency twice. See the relevant comment below for details and a way to replicate!

Once that has been resolved, this should be good to go, IMHO.

tests/integration/test_connect_command.py Outdated Show resolved Hide resolved
tests/integration/test_connect_command.py Outdated Show resolved Hide resolved
Comment on lines +217 to +222
def resolve_dependency(
self,
dependency: ProjectDependency,
upstream_project: DbtProject,
downstream_project: DbtProject,
):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dave-connors-3: We previously discussed the appropriate place for this type of logic, and after some thinking I believe that this approach is the right one. Specifically, having the Linker class present a resolve_dependency method that uses ProjectDependencys to identify file updates. In a ChangeSet world, this method will return a ChangeSet that describes what file operations need to be made for each Ref update, which is beautiful to work with. ✅

dbt_meshify/main.py Show resolved Hide resolved
Comment on lines 98 to 99
dbt_project_combinations = [combo for combo in combinations(dbt_projects, 2)]
for dbt_project_combo in dbt_project_combinations:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an update, this implementation has two rough edge case.

Suppose we are handling three projects, A, B, and C, such that there exists dependencies A -> B and A -> C. Under this logic this script will check for any dependencies between B and C (which is a good thing!), but will find that both B and C have a shared dependency ... their common model from project A!

Furthermore, this also shows that the same upstream resource from Project A can be identified as a dependency in B and C multiple times. This suggests to me that project dependencies should be pooled into a set (for deduplication) prior to evaluation.

By way of an example:

dbt-meshify-py3.11) > $ dbt-meshify connect --project-paths test-projects/source-hack/src_proj_a test-projects/source-hack/src_proj_b test-projects/source-hack/dest_proj_a                                              [±connect-command ●●]
15:52:52 | INFO | Executing dbt parse...
15:52:53 | INFO | Generating catalog with dbt docs generate...
15:52:53 | INFO | Executing dbt parse...
15:52:53 | INFO | Generating catalog with dbt docs generate...
15:52:53 | INFO | Executing dbt parse...
15:52:53 | INFO | Generating catalog with dbt docs generate...
15:52:54 | INFO | Found dependency between src_proj_a and src_proj_b: ProjectDependency(upstream_resource='model.src_proj_a.shared_model', upstream_project_name='src_proj_a', downstream_resource='source.src_proj_b.src_proj_a.shared_model', downstream_project_name='src_proj_b', type=<ProjectDependencyType.Source: 'source'>)
15:52:54 | INFO | Adding public access to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully update model access : model.src_proj_a.shared_model is now public
15:52:54 | INFO | Adding contract to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully added contract to model: model.src_proj_a.shared_model
15:52:54 | SUCCESS | Successfully replaced source function with ref to upstream resource: source.src_proj_b.src_proj_a.shared_model now calls model.src_proj_a.shared_model directly
15:52:54 | SUCCESS | Successfully deleted unnecessary source: source.src_proj_b.src_proj_a.shared_model
15:52:54 | SUCCESS | Successfully added src_proj_a to src_proj_b's dependencies.yml
15:52:54 | INFO | Found dependency between src_proj_a and dest_proj_a: ProjectDependency(upstream_resource='model.src_proj_a.shared_model', upstream_project_name='src_proj_a', downstream_resource='model.dest_proj_a.downstream_model', downstream_project_name='dest_proj_a', type=<ProjectDependencyType.Package: 'package'>)
15:52:54 | INFO | Adding public access to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully update model access : model.src_proj_a.shared_model is now public
15:52:54 | INFO | Adding contract to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully added contract to model: model.src_proj_a.shared_model
15:52:54 | SUCCESS | Successfully updated model refs: model.dest_proj_a.downstream_model now references model.src_proj_a.shared_model
15:52:54 | SUCCESS | Successfully added src_proj_a to dest_proj_a's dependencies.yml
15:52:54 | INFO | Found dependency between src_proj_b and dest_proj_a: ProjectDependency(upstream_resource='model.src_proj_a.shared_model', upstream_project_name='dest_proj_a', downstream_resource='source.src_proj_b.src_proj_a.shared_model', downstream_project_name='src_proj_b', type=<ProjectDependencyType.Source: 'source'>)
15:52:54 | INFO | Adding public access to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully update model access : model.src_proj_a.shared_model is now public
15:52:54 | INFO | Adding contract to shared_model at models/_models.yml
15:52:54 | SUCCESS | Successfully added contract to model: model.src_proj_a.shared_model
15:52:54 | SUCCESS | Successfully replaced source function with ref to upstream resource: source.src_proj_b.src_proj_a.shared_model now calls model.src_proj_a.shared_model directly
Traceback (most recent call last):
  File "/Users/nicholas/projects/nicholasyager/dbt-meshify-2/dbt_meshify/main.py", line 112, in connect
    linker.resolve_dependency(
  File "/Users/nicholas/projects/nicholasyager/dbt-meshify-2/dbt_meshify/linker.py", line 281, in resolve_dependency
    downstream_editor.update_resource_yml_entry(
  File "/Users/nicholas/projects/nicholasyager/dbt-meshify-2/dbt_meshify/storage/dbt_project_editors.py", line 60, in update_resource_yml_entry
    full_yml_entry = self.file_manager.read_file(current_yml_path)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nicholas/projects/nicholasyager/dbt-meshify-2/dbt_meshify/storage/file_manager.py", line 63, in read_file
    return yaml.load(full_path.read_text())
                     ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/pathlib.py", line 1058, in read_text
    with self.open(mode='r', encoding=encoding, errors=errors) as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/pathlib.py", line 1044, in open
    return io.open(self, mode, buffering, encoding, errors, newline)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'test-projects/source-hack/src_proj_b/models/_sources.yml'

15:52:54 | ERROR | Error resolving dependency : ProjectDependency(upstream_resource='model.src_proj_a.shared_model', upstream_project_name='dest_proj_a', downstream_resource='source.src_proj_b.src_proj_a.shared_model', downstream_project_name='src_proj_b', type=<ProjectDependencyType.Source: 'source'>) [Errno 2] No such file or directory: 'test-projects/source-hack/src_proj_b/models/_sources.yml'

In this log file, we are seeing an error pop up since the script noticed an extra opportunity to rewrite the source ref in src_proj_b despite 1) it already having been resolved, and 2) it being identified via a non-dependent project!

@dave-connors-3
Copy link
Collaborator Author

@nicholasyager thanks for the thorough review! i believe i've resolved your comments for now 🚀 !

Copy link
Collaborator

@nicholasyager nicholasyager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dave-connors-3 Thanks for making the fixes! This PR is shaping up really well. I've added two non-blocking recommendations on the theme of "returning early", but they're more for readability than performance. Great work tying all of these pieces together! ✅

dbt_meshify/main.py Outdated Show resolved Hide resolved
dbt_meshify/main.py Outdated Show resolved Hide resolved
dave-connors-3 and others added 2 commits August 8, 2023 08:17
Co-authored-by: Nicholas A. Yager <[email protected]>
Co-authored-by: Nicholas A. Yager <[email protected]>
@dave-connors-3 dave-connors-3 merged commit d66e4fc into main Aug 8, 2023
@nicholasyager nicholasyager deleted the connect-command branch September 1, 2023 12:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants