Split command #63

dave-connors-3 · 2023-06-13T22:08:33Z

Opening up a PR to collaborate on the split command!

Right now, this command:

- accepts the standard args for select/exclude/selector to identify which resources should be moved
- registers those resources as a subproject of the original project
- calls the subproject.initialize() method.

The initialize() method:

- iterates over the resources in the subproject
  - for resources represented by a .sql, .py, or .csv file, move that file into the subdirectory, then move the resource's yml entry, if necessary
  - for .yml defined resources, copy over only the resource yml entry.
- writes the subproject's dbt_project.yml
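
As a rough illustration of the two branches above (not the code in this PR), a resource can be classified by the suffix of the file that defines it. The `plan_moves` helper and its plain-path inputs are hypothetical; the real command works from manifest nodes rather than bare paths.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# Suffixes of resources that live in their own file (SQL models, Python models, seeds).
FILE_BACKED_SUFFIXES = {".sql", ".py", ".csv"}


def plan_moves(resource_paths: Iterable[str]) -> Dict[str, List[Path]]:
    """Split resource paths into files to move and yml-only entries to copy."""
    plan: Dict[str, List[Path]] = {"move_file": [], "copy_yml_entry": []}
    for raw_path in resource_paths:
        path = Path(raw_path)
        if path.suffix in FILE_BACKED_SUFFIXES:
            # The file itself moves; its yml entry follows if it has one.
            plan["move_file"].append(path)
        elif path.suffix in {".yml", ".yaml"}:
            # Resources defined only in yml: copy just the entry.
            plan["copy_yml_entry"].append(path)
    return plan
```

Separating the planning step from the filesystem step would also make it easier to log what the command is about to do.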

These changes require some new DbtYmlEditor methods, especially to handle the nuance of source files and make our methods more generic to other resource types that follow similar patterns. Additionally, there's a lot of variable renaming for accuracy's sake, so apologies in advance for lots of edits.

Things this command does not do, but will need to before shipping:

- declare model access and groups for boundary nodes / leaf nodes of selected resources (question: do we need groups at all?)
- edit children of boundary nodes / leaf nodes to use the multi-argument ref function
- write a dependencies.yml file to the parent project
- copy necessary macros from parent project to subproject
- tests!
- anything else we can think of!
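
For the multi-argument ref item, here is a minimal sketch of what the rewrite could look like for SQL files. The `update_refs` name and the regex approach are assumptions for illustration, not necessarily how the editor in this PR ends up doing it, and the pattern deliberately ignores refs that carry extra arguments such as a version.

```python
import re


def update_refs(sql: str, model_name: str, project_name: str) -> str:
    """Rewrite {{ ref('model') }} to {{ ref('project', 'model') }} for one model."""
    # Match single- or double-quoted one-argument refs to exactly this model.
    pattern = re.compile(
        r"\{\{\s*ref\(\s*(['\"])" + re.escape(model_name) + r"\1\s*\)\s*\}\}"
    )
    replacement = "{{ ref('" + project_name + "', '" + model_name + "') }}"
    # A callable replacement avoids backslash interpretation in the new text.
    return pattern.sub(lambda _match: replacement, sql)
```

Usage: `update_refs(child_sql, "stg_orders", "my_new_project")` would leave refs to other models untouched.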

I would really love some input here on syntax, logic, and design. This is not at all complete, but

dave-connors-3 · 2023-06-15T16:35:02Z

@nicholasyager -- this now also moves relevant custom macros and blindly copies the packages.yml to the new subdirectory. There may be a small bug in one of the yml entry moving methods that I am trying to track down, but if you use the split_proj in the test-projects directory and run:

> dbt-meshify split my_project --select +orders
> cd my_project
> dbt deps && dbt compile

you should get into a runnable state 🤞 let me know if that is the case! then we can work to actually make this code good which it currently is not

dave-connors-3 · 2023-06-16T13:41:33Z

dbt_meshify/dbt_projects.py

+        # this one appears in the project yml, but i don't think it should be written
+        contents.pop("query-comment")
+        contents = filter_empty_dict_items(contents)
+        self.file_manager.write_file(Path("dbt_project.yml"), contents)


may not want the file manager in this class -- to date, it's only in the DbtMeshConstructor class that's operating on resource files.

Yeah, this feels like an odd level of abstraction. My mind goes to some other class that operates on DbtProjects to write files. Open to other approaches, though!

I think i'm with you here -- would it make sense to just have a DbtSubprojectWriter class or something to that effect?

from dbt_meshify.storage.file_manager import DbtFileManager


class DbtSubprojectWriter:
    def __init__(self, subproject: DbtSubproject):
        self.subproject = subproject
        # read from the parent project, write into the new subproject
        self.file_manager = DbtFileManager(
            read_project_path=subproject.parent.path,
            write_project_path=subproject.path,
        )

    def initialize(self):
        ...

    def write_project(self):
        pass

    def write_package_yml(self):
        pass

    def write_package_directory(self):
        pass

pseudocode written with the knowledge that the filemanager stuff is wonky as hell

nicholasyager · 2023-06-17T23:43:30Z

@dave-connors-3 I plan on looking at the code tomorrow evening, but in the meantime I did have a chance to noodle with the command as it's currently implemented.

My specific invocation, for reference:

poetry run dbt-meshify split --select +orders revenue

Here are some thoughts in no particular order:

- Since we're copying packages.yml, we could also copy over the dbt_packages directory. This would save the user from needing to run dbt deps before their first build of the subproject.
- We may want to have some logic in place that redistributes/partitions groups into the subproject. Currently, if a model is part of a group and that model is moved to a new project, it retains its old group and compilation fails.
- As a user, it would be useful to have information logged during execution communicating what has been done. After running the split command for the first time, I was left thinking, "huh. I wonder if that worked." 😝

nicholasyager · 2023-06-19T21:25:07Z

dbt_meshify/dbt_projects.py

+        # this one appears in the project yml, but i don't think it should be written
+        contents.pop("query-comment")
+        contents = filter_empty_dict_items(contents)
+        self.file_manager.write_file(Path("dbt_project.yml"), contents)


Yeah, this feels like an odd level of abstraction. My mind goes to some other class that operates on DbtProjects to write files. Open to other approaches, though!

nicholasyager · 2023-06-19T21:31:12Z

dbt_meshify/storage/yaml_editors.py

+    def move_resource(self):
+        """
+        move a resource file from one project to another
+
+        """
+        current_path = self.get_resource_path()
+        new_path = self.subdirectory / current_path
+        new_path.parent.mkdir(parents=True, exist_ok=True)
+        current_path.rename(new_path)


My gut reaction is that this is a little odd. I suspect that it would be more ergonomic/useful to pass the destination path in as an argument. That said, I'm only ~45% confident in this opinion.

i don't disagree! i think it's mostly dependent on how much flexibility we want to afford users on where resources land in the new project
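
To make the suggestion concrete, here is a sketch of the same move with the destination passed in. The free-function form and the parameter names are hypothetical; in the PR this lives on the editor class and derives the path itself.

```python
from pathlib import Path


def move_resource(current_path: Path, destination: Path) -> Path:
    """Move a resource file to an explicit destination, creating parent directories."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    current_path.rename(destination)
    return destination
```

The caller can still default to `subdirectory / current_path`, so the current behavior is preserved while leaving room for other layouts later.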

dave-connors-3 · 2023-06-20T13:37:15Z

@nicholasyager agree on all fronts! made the small adjustments you recommended, and will spend some time today refactoring the initialization into a separate class that operates on dbt projects rather than have dbt projects have knowledge of the filesystem.

re: logging -- couldn't agree more, I'll open up a new issue so we can tackle it separately for all the commands we have! would love a dbt-esque logging experience in here too

nicholasyager

Here are some light comments. I'm going to review the file editors more in depth soon™ once I have some more bandwidth.

nicholasyager · 2023-07-03T11:53:28Z

dbt_meshify/storage/file_content_editors.py

+
+def process_model_yml(model_yml: Dict[str, Any]):
+    """Processes the yml contents to be written back to a file"""
+    model_ordered_dict = OrderedDict.fromkeys(


I wonder if it would be possible to use the schema.json for manifests from dbt-core to track these keys automatically.

b-per · 2023-07-03T12:21:06Z

dbt_meshify/dbt_projects.py

@@ -152,7 +159,19 @@ def get_catalog_entry(self, unique_id: str) -> Optional[CatalogTable]:

    def get_manifest_node(self, unique_id: str) -> Optional[ManifestNode]:
        """Returns the catalog entry for a model in the dbt project's catalog"""
-        return self.manifest.nodes.get(unique_id)
+        if unique_id.split(".")[0] in [


What about exposure and metric/measure? Are those not listed on purpose?

this logic was to get nodes vs the resources that are also other top-level keys in the manifest. Looking with fresh eyes, I would guess there's a dbt-core class that represents this type that we could leverage instead of a list of strings!

b-per · 2023-07-03T12:33:05Z

test-projects/split/split_proj/models/marts/__models.yml

@@ -2,7 +2,8 @@ version: 2

 models:
  - name: customers
-    description: Customer overview data mart, offering key details for each unique customer. One row per customer.
+    description: Customer overview data mart, offering key details for each unique
+      customer. One row per customer.


All the YAML files have some odd newlines showing up now.

b-per · 2023-07-03T12:38:20Z

tests/integration/test_split_command.py

+        x_proj_ref = "{{ ref('my_new_project', 'stg_orders') }}"
+        child_sql = (Path(dest_project_path) / "models" / "marts" / "orders.sql").read_text()
+        assert x_proj_ref in child_sql
+        teardown_test_project()


Would a teardown after an assert work as expected? I'd think that it would stop at the assert but it might just be my lack of pytest knowledge.

it seems to work locally!
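
One note on this: a failing assert raises AssertionError, so statements after it in the same function do not run. The teardown works locally while the test passes, but it would be skipped on a failure. Below is a small sketch of one way to guarantee cleanup using try/finally; `run_with_cleanup` is a hypothetical helper, and a pytest fixture that yields and then tears down would be the more idiomatic equivalent.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_with_cleanup(check: Callable[[Path], None]) -> None:
    """Run a check against a scratch project directory and always clean up."""
    project_dir = Path(tempfile.mkdtemp())
    try:
        check(project_dir)
    finally:
        # Runs whether or not the check raised AssertionError.
        shutil.rmtree(project_dir, ignore_errors=True)
```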

b-per · 2023-07-03T12:42:02Z

tests/unit/test_update_ref_functions.py

+    return yaml.safe_load(yml_str)
+
+
+class TestRemoveResourceYml:


We should have some tests when using model versions as well

b-per · 2023-07-03T12:44:25Z

tests/integration/test_subproject_creator.py

+    with open("test/profiles.yml", "w") as f:
+        f.write(yaml.dump(test_project_profile))
+    if write_packages_yml:
+        with open("test/packages.yml", "w") as f:
+            f.write(yaml.dump(test_package_yml))


We should use pathlib like in the other parts of the code rather than open.
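
A sketch of what that could look like, assuming the YAML text has already been produced (for example by `yaml.dump`). The `write_project_files` name and its parameters are made up for this example.

```python
from pathlib import Path


def write_project_files(project_dir: Path, profiles_text: str, packages_text: str = "") -> None:
    """Write profiles.yml, and packages.yml when content is given, using pathlib."""
    project_dir.mkdir(parents=True, exist_ok=True)
    (project_dir / "profiles.yml").write_text(profiles_text)
    if packages_text:
        (project_dir / "packages.yml").write_text(packages_text)
```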

b-per · 2023-07-03T12:50:02Z

tests/integration/test_subproject_creator.py

+
+
+def teardown_new_project():
+    os.system("rm -rf test-projects/test")


😉 let's use shutil.rmtree instead of os.system.

More generally, it is now recommended to use subprocess.run rather than os.system when running commands from Python
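
For reference, a minimal version of the teardown with shutil. The default path is taken from the snippet above; making it a parameter is an assumption added so the helper can be pointed at a scratch directory.

```python
import shutil
from pathlib import Path


def teardown_new_project(project_dir: Path = Path("test-projects/test")) -> None:
    """Remove the generated test project without shelling out."""
    # ignore_errors=True mirrors `rm -rf`, which does not fail on a missing path.
    shutil.rmtree(project_dir, ignore_errors=True)
```

This also works on platforms where `rm` is not available.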

nicholasyager

Added some more feedback here!

My tl;dr of the file manager is that it is sufficient, but there are a few opportunities for simplification. I also found a couple areas where defects can trickle in (raising exceptions instead of returning None, using specific exceptions, etc). One major benefit of having this functionality written is that it makes clear to me where we need cleaner APIs -- specifically around manipulating resources. I wonder if a pre-v1 step will be to create first-class classes for our resources and manage SerDe. Definitely a task for later!

Omissions:

I've not dug into the yml files generated yet.
I'm going to trust the updated tests.

Open questions:

How is mypy holding up? I suspect there are a few typing issues around None values and iterables.

nicholasyager · 2023-07-03T15:15:53Z

dbt_meshify/storage/file_content_editors.py

+            # find yml path for resoruces that are not defined
+            yml_path = Path(self.node.patch_path.split("://")[1]) if self.node.patch_path else None
+        else:
+            yml_path = Path(self.node.original_file_path)


Perhaps this should leverage get_resource_path()

nicholasyager · 2023-07-03T15:17:41Z

dbt_meshify/storage/file_content_editors.py

+
+            if resource_path is None:
+                # If this happens, then the model doesn't have a model file, either, which is cause for alarm.
+                raise Exception(f"Unable to locate the file defining {self.node.name}. Aborting")


I'd love for this to raise a specific exception. Maybe a ModelMissingError, since we cannot recover from this type of issue.

would that require a new ModelMissingError class that extends Exception?

Yep! Likely a good candidate for a future refactor.
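
For whenever that refactor happens, a sketch of the shape it might take. `ModelMissingError` is the name suggested above; the `require_resource_path` helper is hypothetical.

```python
from pathlib import Path
from typing import Optional


class ModelMissingError(Exception):
    """Raised when the file defining a model cannot be located."""


def require_resource_path(resource_path: Optional[Path], model_name: str) -> Path:
    """Return the path, or raise a specific error instead of a bare Exception."""
    if resource_path is None:
        raise ModelMissingError(f"Unable to locate the file defining {model_name}. Aborting.")
    return resource_path
```

Callers can then catch `ModelMissingError` specifically, and the return type narrows from `Optional[Path]` to `Path`, which should also help with mypy.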

nicholasyager · 2023-07-03T15:18:52Z

dbt_meshify/storage/file_content_editors.py

+        if isinstance(models_yml, str):
+            raise Exception(f"Unexpected string values in dumped model data in {yml_path}.")


This seems a little odd. Is there a more systematic way to ensure that read_file returns the appropriate type?

nicholasyager · 2023-07-03T15:20:32Z

dbt_meshify/storage/file_content_editors.py

+        if model_path is None:
+            raise Exception(f"Unable to find path to model {self.node.name}. Aborting.")


Should there ever be a time when get_resource_path returns None, or do we want this to error out? My gut says erroring out is the preferred approach, that way we don't introduce surface area for defects.

nicholasyager · 2023-07-03T15:32:24Z

dbt_meshify/storage/file_content_editors.py

+    def add_group_to_model_yml(model_name: str, group: Group, models_yml: Dict[str, Any]):
+        """Add group and access configuration to a model's YAMl properties."""
+        # parse the yml file into a dictionary with model names as keys
+        models = resources_yml_to_dict(models_yml)
+        model_yml = models.get(model_name) or {"name": model_name, "columns": [], "config": {}}
+
+        model_yml.update({"group": group.name})
+        models[model_name] = process_model_yml(model_yml)
+
+        models_yml["models"] = list(models.values())
+        return models_yml


This method, add_access_to_model_yml, and add_group_to_yml are very similar in function -- specifically they're adding properties to a specific model property entry. This is fine for now, since this is a first draft. That said, this might be something to add to the IO refactoring work I've been pondering.

these were copy pasted in order to access each method separately, rather than only as part of add_access_and_group_to_yml -- I think the steps in all of these methods are overlapping and too verbose -- i want to shrink this file dramatically!
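
One possible direction for shrinking these is a single helper that merges arbitrary properties into a model entry. This is only a sketch: `set_model_properties` is a hypothetical name, it inlines the name-keyed lookup instead of calling `resources_yml_to_dict`, and it skips the key ordering that `process_model_yml` applies.

```python
from typing import Any, Dict


def set_model_properties(
    model_name: str, properties: Dict[str, Any], models_yml: Dict[str, Any]
) -> Dict[str, Any]:
    """Merge properties into one model's yml entry, creating the entry if absent."""
    # Index the existing entries by model name.
    models = {entry["name"]: entry for entry in models_yml.get("models", [])}
    model_yml = models.get(model_name) or {"name": model_name, "columns": [], "config": {}}
    model_yml.update(properties)
    models[model_name] = model_yml
    models_yml["models"] = list(models.values())
    return models_yml
```

Adding a group would then be `set_model_properties(name, {"group": group.name}, models_yml)`, and access could go through the same call with a different dictionary.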

nicholasyager

What a chonky feature! I did another 🔍 of this PR and found a few small areas of refactoring and a CLI UX oddity. That said, there were no flaming red flags along the way, so it's a ✅ from me! Let's get this out there so we can get feedback.

nicholasyager · 2023-07-07T14:37:04Z

dbt_meshify/main.py

+    target_directory = Path(create_path) if create_path else None
+    subproject_creator = DbtSubprojectCreator(
+        subproject=subproject, target_directory=target_directory
+    )


I am of the opinion that DbtSubprojectCreator should not allow None target_directory values. Instead, the calling method should be responsible for passing a valid target. This approach will allow us to reduce the complexity of the underlying API/class.

This is not something that needs to be tackled here, but rather in a refactor ticket.

that's good feedback, definitely agree -- IIRC the init handles this, but it's definitely not necessary to hide that behavior!
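
A sketch of what resolving the directory in the caller could look like. The fallback to a directory named after the new project is an assumption made for this example, not necessarily what the init does today, and `resolve_target_directory` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def resolve_target_directory(create_path: Optional[str], project_name: str) -> Path:
    """Pick the directory for the new subproject before constructing the creator."""
    # Assumption: with no --create-path, use a directory named after the project.
    return Path(create_path) if create_path else Path(project_name)
```

With this in main.py, the creator could declare `target_directory: Path` and drop its own None handling.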

nicholasyager · 2023-07-07T14:49:24Z

dbt_meshify/utilities/grouper.py

@@ -24,9 +24,9 @@ class ResourceGrouper:
    recommendations based on the reference characteristics for each resource.
    """

-    def __init__(self, project: DbtProject):


Great use of types!

nicholasyager · 2023-07-07T15:00:54Z

dbt_meshify/main.py

 @exclude
 @project_path
 @select
 @selector
-def split():
+def split(project_name, select, exclude, project_path, selector, create_path):


Since select now allows multiple arguments, we cannot have --select before the project_name argument.

(dbt-meshify-py3.11) > $ poetry run dbt-meshify split --select "+orders" revenue
Usage: dbt-meshify split [OPTIONS] PROJECT_NAME
Try 'dbt-meshify split --help' for help.

Error: Missing argument 'PROJECT_NAME'.

Instead, we need to order arguments/options specifically

(dbt-meshify-py3.11) > $ poetry run dbt-meshify split revenue --select "+orders"

I don't think this is a blocker per se. At the very least, documentation should be refined in a follow-up.

dave-connors-3 added 11 commits June 8, 2023 14:48

POC move a model

f316613

move dbt project back

b377725

move resource and move resource yml entry methods, lots of renames to be generic

0ffcb8e

move resources and their yml entries

6ee184e

attempt to solve some mypys

c7846d0

add some workarounds for source yml methods

b2a2ba1

remove breakpoint

f04f888

add method for writing the project file

cadeae9

add custom macro for testing

34469f4

method for moving custom macro files

02d3ef2

move packages yml

1e1f15f

dave-connors-3 commented Jun 16, 2023

dave-connors-3 commented Jun 16, 2023

dave-connors-3 added 2 commits June 16, 2023 10:44

small tweaks, unit tests for removing yml entries

b29c192

add add_yml_entry tests

131a31d

nicholasyager reviewed Jun 19, 2023

rename get_manifest_node

cd2293b

dave-connors-3 added 10 commits June 20, 2023 08:39

comments from review

8d69608

refactor initialize methods into separate class

dd614b7

refactor file manager for new use case and correct broken tests

eaef0db

tests for subproject initialization

59f2d93

rename file and update imports

897e869

method for updating sql refs to two arguments

5eb952a

python code editor method

7089466

add step for updating refs

e1fb68e

remove accidentally committed effects of meshify

53390a5

disambiguate access and group yml operations

e9db0e1

nicholasyager reviewed Jul 3, 2023

b-per reviewed Jul 3, 2023

b-per reviewed Jul 3, 2023

nicholasyager reviewed Jul 3, 2023

dave-connors-3 commented Jul 5, 2023

dave-connors-3 and others added 2 commits July 5, 2023 09:14

Apply suggestions from code review

9128ad6

Co-authored-by: Nicholas A. Yager
Co-authored-by: Benoit Perigaud

merge main

de3476b

dave-connors-3 mentioned this pull request Jul 6, 2023

add logging info #73

Merged

dave-connors-3 added 7 commits July 6, 2023 14:33

merge main

d907f25

fix subproject select resources method

db8c1e3

revert groups from errant commit

d7868ad

refactor test to clone project

e8f2791

delete yaml editor file

3db0141

add basic logging to split operation

c009159

update test project setup to include seeding db

307cf66

nicholasyager approved these changes Jul 7, 2023

dave-connors-3 mentioned this pull request Jul 7, 2023

update docs to reflect multiselect behavior of select #93

Closed

dave-connors-3 and others added 2 commits July 7, 2023 10:12

Update dbt_meshify/storage/file_content_editors.py

e20f99e

Co-authored-by: Nicholas A. Yager

Apply suggestions from code review

4d5e2df

Co-authored-by: Nicholas A. Yager

dave-connors-3 commented Jul 7, 2023

dave-connors-3 and others added 2 commits July 7, 2023 10:54

change tpye hint, use class method

dda0021

Update dbt_meshify/storage/file_content_editors.py

b222ef1

dave-connors-3 mentioned this pull request Jul 7, 2023

create specific errors for missing files #94

Closed

dave-connors-3 merged commit 764ad1f into main Jul 7, 2023

nicholasyager deleted the split-command branch July 25, 2023 20:35
