-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split command #63
Split command #63
Conversation
@nicholasyager -- this now also moves relevant custom macros and blindly copies the > dbt-meshify split my_project --select +orders
> cd my_project
> dbt deps && dbt compile you should get into a runnable state 🤞 let me know if that is the case! then we can work to actually make this code good which it currently is not |
dbt_meshify/dbt_projects.py
Outdated
# this one appears in the project yml, but i don't think it should be written | ||
contents.pop("query-comment") | ||
contents = filter_empty_dict_items(contents) | ||
self.file_manager.write_file(Path("dbt_project.yml"), contents) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may not want the file manager in this class -- to date, it's only in the DbtMeshConstructor
class that's operating on resource files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this feels like an odd level of abstraction. My mind goes to some other class that operates on DbtProject
s to write files. Open to other approaches, though!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think i'm with you hre -- would it make sense to just have a DbtSubprojectWriter
class or something to that effect?
from dbt_meshify.storage.file_manager import DbtFileManager
class DbtSubprojectWriter:
def __init__(self, subproject: DbtSubproject, ):
self.subproject = subproject
self.file_manager = DbtFileManager(read_project_path=subproject.parent.path, write_project_path=subproject.path)
def inititalize(self):
...
def write_project(self):
pass
def write_package_yml(self):
pass
def write_package_directory(self):
pass
pseudocode written with the knowledge that the filemanager stuff is wonky as hell
@dave-connors-3 I plan on looking at the code tomorrow evening, but in the meantime I did have a chance to noodle with the command as it's currently implemented. My specific invocation, for reference:
Here are some thoughts in no particular order:
|
dbt_meshify/dbt_projects.py
Outdated
# this one appears in the project yml, but i don't think it should be written | ||
contents.pop("query-comment") | ||
contents = filter_empty_dict_items(contents) | ||
self.file_manager.write_file(Path("dbt_project.yml"), contents) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this feels like an odd level of abstraction. My mind goes to some other class that operates on DbtProject
s to write files. Open to other approaches, though!
dbt_meshify/storage/yaml_editors.py
Outdated
def move_resource(self): | ||
""" | ||
move a resource file from one project to another | ||
|
||
""" | ||
current_path = self.get_resource_path() | ||
new_path = self.subdirectory / current_path | ||
new_path.parent.mkdir(parents=True, exist_ok=True) | ||
current_path.rename(new_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My gut reaction is that this is a little odd. I suspect that it would be more ergonomic/useful to pass the destination path in as an argument. Having that been said, I'm only ~45% confident in this opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't disagree! i think it's mostly dependent on how much flexibility we want to afford users on where resources land in the new project
@nicholasyager agree on all fronts! made the small adjustments you recommended, and will spend some time today refactoring the initialization into a separate class that operates on dbt projects rather than have dbt projects have knowledge of the filesystem. re: logging -- couldn't agree more, I'll open up a new issue so we can tackle it separately for all the commands we have! would love a dbt-esque logging experience in here too |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are some light comments. I'm going to review the file editors more in depth soon:tm: once I have some more bandwidth.
|
||
def process_model_yml(model_yml: Dict[str, Any]): | ||
"""Processes the yml contents to be written back to a file""" | ||
model_ordered_dict = OrderedDict.fromkeys( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would be possible to use the schema.json for manifests from dbt-core to track these keys automatically.
@@ -152,7 +159,19 @@ def get_catalog_entry(self, unique_id: str) -> Optional[CatalogTable]: | |||
|
|||
def get_manifest_node(self, unique_id: str) -> Optional[ManifestNode]: | |||
"""Returns the catalog entry for a model in the dbt project's catalog""" | |||
return self.manifest.nodes.get(unique_id) | |||
if unique_id.split(".")[0] in [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about exposure
and metric
/measure
? Are those not listed on purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this logic was to get nodes
vs the resources that are also other top-level keys in the manifest. Looking with fresh eyes, I would guess there's a dbt-core class that represents this type that we could leverage instead of a list of strings!
@@ -2,7 +2,8 @@ version: 2 | |||
|
|||
models: | |||
- name: customers | |||
description: Customer overview data mart, offering key details for each unique customer. One row per customer. | |||
description: Customer overview data mart, offering key details for each unique | |||
customer. One row per customer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the YAML files have some odd newlines showing up now.
x_proj_ref = "{{ ref('my_new_project', 'stg_orders') }}" | ||
child_sql = (Path(dest_project_path) / "models" / "marts" / "orders.sql").read_text() | ||
assert x_proj_ref in child_sql | ||
teardown_test_project() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would a teardown after an assert
work as expected? I'd think that it would stop at the assert
but it might just be my lack of pytest
knowledge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems to work locally!
return yaml.safe_load(yml_str) | ||
|
||
|
||
class TestRemoveResourceYml: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have some tests when using model versions as well
with open("test/profiles.yml", "w") as f: | ||
f.write(yaml.dump(test_project_profile)) | ||
if write_packages_yml: | ||
with open("test/packages.yml", "w") as f: | ||
f.write(yaml.dump(test_package_yml)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use pathlib
like in the other parts of the code rather than open
.
|
||
|
||
def teardown_new_project(): | ||
os.system("rm -rf test-projects/test") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😉 let's use shutil.rmtree
instead of os.system
.
More generally, it is now recommended to use subprocess.run
rather than os.system
when running commands from Python
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some more feedback here!
My tl;dr of the file manager is that it is sufficient, but there are a few opportunities for simplification. I also found a couple areas where defects can trickle in (raising exceptions instead of returning None, using specific exceptions, etc). One major benefit of having this functionality written is that it makes clear to me where we need cleaner APIs -- specifically around manipulating resources. I wonder if a pre-v1 step will be to create first-class classes for our resources and manage SerDe. Definitely a task for later!
Omissions:
- I've not dug into the yml files generated yet.
- I'm going to trust the updated tests.
Open questions:
- How is mypy holding up? I suspect there are a few typing issues around None values and iterables.
# find yml path for resoruces that are not defined | ||
yml_path = Path(self.node.patch_path.split("://")[1]) if self.node.patch_path else None | ||
else: | ||
yml_path = Path(self.node.original_file_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this should leverage get_resource_path()
|
||
if resource_path is None: | ||
# If this happens, then the model doesn't have a model file, either, which is cause for alarm. | ||
raise Exception(f"Unable to locate the file defining {self.node.name}. Aborting") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd love for this to raise a specific exception. Maybe a ModelMissingError
, since we cannot recover from this type of issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would that require a new ModelMissingError
class that extends Exception
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! Likely a good candidate for a future refactor.
if isinstance(models_yml, str): | ||
raise Exception(f"Unexpected string values in dumped model data in {yml_path}.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a little odd. Is there a more systematic way to ensure that read_file
returns the appropriate type?
if model_path is None: | ||
raise Exception(f"Unable to find path to model {self.node.name}. Aborting.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there ever be a time when get_resource_path
returns None, or do we want this to error out? My gut says erroring out is the preferred approach, that way do don't introduce surface area for defects.
def add_group_to_model_yml(model_name: str, group: Group, models_yml: Dict[str, Any]): | ||
"""Add group and access configuration to a model's YAMl properties.""" | ||
# parse the yml file into a dictionary with model names as keys | ||
models = resources_yml_to_dict(models_yml) | ||
model_yml = models.get(model_name) or {"name": model_name, "columns": [], "config": {}} | ||
|
||
model_yml.update({"group": group.name}) | ||
models[model_name] = process_model_yml(model_yml) | ||
|
||
models_yml["models"] = list(models.values()) | ||
return models_yml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method, add_access_to_model_yml
, and add_group_to_yml
are very similar in function -- specifically they're adding properties to a specific model property entry. This is fine for now, since this is a first draft. Having that been said, this might be something to add to the IO refactoring work I've been pondering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these were copy pasted in order to access each method separately, rather than only as part of add_access_and_group_to_yml
-- I think the steps in all of these methods are overlapping and too verbose -- i want to shrink this file dramatically!
Co-authored-by: Nicholas A. Yager <[email protected]> Co-authored-by: Benoit Perigaud <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What a chonky feature! I did another 🔍 of this PR and found a few small areas of refactoring and a CLI UX oddity. Having this been said, there were no flaming red flags along the way, so it's a ✅ from me! Let's get this out there so we can get feedback.
target_directory = Path(create_path) if create_path else None | ||
subproject_creator = DbtSubprojectCreator( | ||
subproject=subproject, target_directory=target_directory | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am of the opinion that DbtSubprojectCreator
should not allow None
target_directory
values. Instead, the calling method should be responsible for passing a valid target. This approach will allow us to reduce the complexity of the underlying API/class.
This is not something that needs to be tackled here, but rather in a refactor ticket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's good feedback, definitely agree -- IIRC the init handles this, but it's definitely not necessary to hide that behavior!
@@ -24,9 +24,9 @@ class ResourceGrouper: | |||
recommendations based on the reference characteristics for each resource. | |||
""" | |||
|
|||
def __init__(self, project: DbtProject): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great use of types!
@exclude | ||
@project_path | ||
@select | ||
@selector | ||
def split(): | ||
def split(project_name, select, exclude, project_path, selector, create_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since select
now allows multiple arguments, we cannot have --select
before the project_name
argument.
(dbt-meshify-py3.11) > $ poetry run dbt-meshify split --select "+orders" revenue
Usage: dbt-meshify split [OPTIONS] PROJECT_NAME
Try 'dbt-meshify split --help' for help.
Error: Missing argument 'PROJECT_NAME'.
Instead, we need to order arguments/options specifically
(dbt-meshify-py3.11) > $ poetry run dbt-meshify split revenue --select "+orders"
I don't this is a blocker per se. At the very least, documentation should be refined in a follow-up.
Co-authored-by: Nicholas A. Yager <[email protected]>
Co-authored-by: Nicholas A. Yager <[email protected]>
Opening up a PR to collaborate on the
split
command!Right now, this command:
select/exclude/selector
to identify which resources should be movedsubproject.initialize()
method.The
initialize()
method:.sql
,.py
, or.csv
file, move that file into the subdirectory, then move the resource's yml entry, if necessary.yml
defined resources, copy over only the resource yml entry.dbt_project.yml
These changes requires some new
DbtYmlEditor
methods, especially to handle the nuance of source files and make our methods more generic to other resource types that follow similar patterns. Additionally, there's a lot of variable renaming for accuracy's sake, so apologies in advance for lots of edits.Things this command does not do, but will need to before shipping:
group
s at all?)dependencies.yml
file to the parent projectI would really love some input here on syntax, logic, and design. This is not at all complete, but