Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exp init: create workspace structure by default, show workspace tree #7331

Merged
merged 5 commits into from
Feb 17, 2022

Conversation

skshetry
Copy link
Member

@skshetry skshetry commented Feb 1, 2022

This PR creates the workspace structure by default. That means that the code paths and data paths are created if they don't exist. If the path has any extension, the empty file will be created, otherwise, a directory is created.

Initially, this was to be created for the defaults only, but I have added the magic everywhere, so this is applied to passed arguments, defaults or config values too.

Second thing that PR adds is this output so that users are aware of the changes and their workspace structure:

Creating experiment project structure:
├── data
├── metrics.json
├── models
├── params.yaml
├── plots
└── src

Closes #7137, closes #7138 and closes #7139.

@skshetry skshetry marked this pull request as ready for review February 7, 2022 08:27
@skshetry skshetry requested a review from a team as a code owner February 7, 2022 08:27
@skshetry skshetry requested review from pared and dberenbaum February 7, 2022 08:27
@skshetry skshetry added A: experiments Related to dvc exp enhancement Enhances DVC ui user interface / interaction labels Feb 7, 2022
@skshetry skshetry self-assigned this Feb 7, 2022
@skshetry
Copy link
Member Author

skshetry commented Feb 7, 2022

We may want to reconsider the output for interactive mode now, especially for the one that shows workspace structure, since we create some of the files by default:

This command will guide you to set up a train stage in dvc.yaml.
See https://s.dvc.org/g/pipeline-files.

Enter the paths for dependencies and outputs of the command.
DVC assumes the following workspace structure:
├── data
├── metrics.json
├── models
├── params.yaml
├── plots
└── src

Path to a code file/directory [src, n to omit]:
'src' does not exist, the directory will be created.
Path to a data file/directory [data, n to omit]:
'data' does not exist, the directory will be created.
Path to a model file/directory [models, n to omit]:
Path to a parameters file [params.yaml, n to omit]:
Path to a metrics file [metrics.json, n to omit]:
Path to a plots file/directory [plots, n to omit]:

@dberenbaum
Copy link
Collaborator

Thanks, @skshetry! What about params? Will they be added separately after #4112? It might be confusing to explain what paths do and don't get created automatically. At least if params are created, it can be simplified to "inputs get created, outputs don't."

@dberenbaum
Copy link
Collaborator

Looks good!

Two issues to consider:

  1. "Creating experiment project structure" may be misleading because outputs are not created, and inputs are only created if they don't already exist.
  2. I agree that changes to the interactive output would be helpful, and specifically we should try to make interactive and non-interactive mode more consistent with each other.

Maybe we can just make the output more general?

$ dvc exp init --data raw python src/train.py
Experiment project structure:
├── metrics.json
├── models
├── params.yaml
├── plots
├── raw
└── src

'raw' does not exist, the directory will be created.
'src' does not exist, the directory will be created.
...
$ dvc exp init -i python src/train.py
Experiment project structure:
├── data
├── metrics.json
├── models
├── params.yaml
├── plots
└── src

Path to a code file/directory [src, n to omit]:
'src' does not exist, the directory will be created.
Path to a data file/directory [data, n to omit]: raw
'raw' does not exist, the directory will be created.
Path to a model file/directory [models, n to omit]:
Path to a parameters file [params.yaml, n to omit]:
Path to a metrics file [metrics.json, n to omit]:
Path to a plots file/directory [plots, n to omit]:
...

@dberenbaum dberenbaum mentioned this pull request Feb 8, 2022
10 tasks
@skshetry
Copy link
Member Author

skshetry commented Feb 9, 2022

Thanks, @skshetry! What about params? Will they be added separately after #4112? It might be confusing to explain what paths do and don't get created automatically. At least if params are created, it can be simplified to "inputs get created, outputs don't."

yeah, I was thinking of just doing that after #4112 as it'll be noop. But could do that now too, as params.yaml file is still read by default and is still usable.

@skshetry
Copy link
Member Author

skshetry commented Feb 9, 2022

'raw' does not exist, the directory will be created.
'src' does not exist, the directory will be created.

For non-interactive mode, I'll just merge these into a single message:

Creating raw, src and params.yaml

@skshetry
Copy link
Member Author

skshetry commented Feb 9, 2022

Also, should the message be following?

Using/Expecting/... Experiment project structure:

@skshetry
Copy link
Member Author

skshetry commented Feb 9, 2022

I have decided to not create params.yaml by default in this PR as it requires some changes and thoughts on interactive mode. Let's do that in a successive PR. @dberenbaum

@dberenbaum
Copy link
Collaborator

For interactive mode, I'll just merge these into a single message:

Do you mean for non-interactive mode, or am I misunderstanding?

Also, I would just keep them separate unless you have a strong reason to merge them. I don't see an exact guideline for this, but I prefer a single piece of information per line. It keeps the clarity of whether each path is a file or directory, it's more consistent, and it's easier to parse.

Using/Expecting/... Experiment project structure:

I think "Using" is fine.

@skshetry
Copy link
Member Author

skshetry commented Feb 9, 2022

For interactive mode, I'll just merge these into a single message:

Do you mean for non-interactive mode, or am I misunderstanding?

Yeah sorry, I meant non-interactive mode.

Also, I would just keep them separate unless you have a strong reason to merge them. I don't see an exact guideline for this, but I prefer a single piece of information per line. It keeps the clarity of whether each path is a file or directory, it's more consistent, and it's easier to parse.

Right, but is something being a file or a directory important information? It may cause visual noise without adding any meaningful information. I think that we created something is more important.

This is what the current implementation displays:

Creating data and src

@dberenbaum
Copy link
Collaborator

I have decided to not create params.yaml by default in this PR as it requires some changes and thoughts on interactive mode.

You could also address that in a separate PR first and come back to this if it's easier since it would make params have less special behavior that needs to be handled.


I'm seeing inconsistent behavior when using flags with/without interactive mode:

Example 1:

$ dvc exp init -f --data raw python src/train.py
Using experiment project structure:
├── metrics.json
├── models
├── params.yaml
├── plots
├── raw
└── src

Creating # What happened here? I expected it to say 'Creating raw.'
Created train stage in dvc.yaml. To run, use "dvc exp run".
See https://s.dvc.org/g/exp/run.

$ rmdir raw

$ dvc exp init -if --data raw python src/train.py
This command will guide you to set up a train stage in dvc.yaml.
See https://s.dvc.org/g/pipeline-files.

Enter the paths for dependencies and outputs of the command.
Using experiment project structure:
├── metrics.json
├── models
├── params.yaml
├── plots
├── raw
└── src

Path to a code file/directory [src, n to omit]:
Path to a model file/directory [models, n to omit]:
Path to a parameters file [params.yaml, n to omit]:
Path to a metrics file [metrics.json, n to omit]:
Path to a plots file/directory [plots, n to omit]:

─────────────────────────────────────────────────────────────────────────────────────────
train:
  cmd: python src/train.py
  deps:
  - raw
  - src
  outs:
  - models
  metrics:
  - metrics.json:
      cache: false
  plots:
  - plots:
      cache: false

Do you want to add the above contents to dvc.yaml? [Y/n]:
Created train stage in dvc.yaml. To run, use "dvc exp run".
See https://s.dvc.org/g/exp/run.

$ ls raw # This is created, but why wasn't there a message about it during `dvc exp init -i`?

Example 2:

$ dvc exp init -f --params params.txt python src/train.py
ERROR: Parameters file 'params.txt' does not exist.

$ This command will guide you to set up a train stage in dvc.yaml.
See https://s.dvc.org/g/pipeline-files.

Enter the paths for dependencies and outputs of the command.
Using experiment project structure:
├── data
├── metrics.json
├── models
├── params.txt
├── plots
└── src

Path to a code file/directory [src, n to omit]:
Path to a data file/directory [data, n to omit]:
Path to a model file/directory [models, n to omit]:
Path to a metrics file [metrics.json, n to omit]:
Path to a plots file/directory [plots, n to omit]:
ERROR: Parameters file 'params.txt' does not exist. # Why wait until the end to validate this?

We spent a lot of time developing the interactive mode initially, but I think that led to problems with the non-interactive mode and now inconsistencies between them. Do we need such different implementations for interactive and non-interactive modes? I think it creates a more consistent UI and reduces the risk of bugs or odd behavior in either mode if we combine as much as possible.

I would prefer a unified workflow like:

  1. Get template to determine expected defaults. Print default tree if in interactive mode. I would disallow --data and other override flags in interactive mode (I think it's confusing, overly complicated, and I can't imagine why someone would need this).
  2. Iterate over each dependency and output to collect overrides (from prompts in interactive mode) and create dependencies as needed (output should show creation of each as it happens).
  3. Pass context to stage create method. I think we should print the same thing in both modes here, most likely the updated tree.

I know this is asking for large changes, but hopefully it will be simpler and raise fewer issues/questions once complete. Let me know if you want to have a call to discuss.

@skshetry
Copy link
Member Author

skshetry commented Feb 16, 2022

I have decided to not create params.yaml by default in this PR as it requires some changes and thoughts on interactive mode.

You could also address that in a separate PR first and come back to this if it's easier since it would make params have less special behavior that needs to be handled.

This PR handles "all inputs are created", so this can already create params by empty. We just need to get rid of those checks that we have before. I don't want to make too many decisions or changes in a single PR. And I need this PR before we move to params, as it unifies input creations.

I'm seeing inconsistent behavior when using flags with/without interactive mode:

Great catch with the UI messages. The first one is an oversight on this PR. "no message" on overrides has been fixed, it'll be shown even in interactive and non-interactive mode.

The params issue will be fixed after we remove the checks for params file existence.

I would prefer a unified workflow like:

1. Get template to determine expected defaults. Print default tree if in interactive mode. I would disallow `--data` and other override flags in interactive mode (I think it's confusing, overly complicated, and I can't imagine why someone would need this).

This is something that will rarely be used. I prefer it as being an override, as ignoring them may be more confusing. Although I have no strong opinion.

2. Iterate over each dependency and output to collect overrides (from prompts in interactive mode) and create dependencies as needed (output should show creation of each as it happens).

Yeah, fixed this. Thanks.

3. Pass context to stage create method. I think we should print the same thing in both modes here, most likely the updated tree.

Makes sense, but how should we improve message before the prompts in an interactive mode?

@dberenbaum
Copy link
Collaborator

1. Get template to determine expected defaults. Print default tree if in interactive mode. I would disallow `--data` and other override flags in interactive mode (I think it's confusing, overly complicated, and I can't imagine why someone would need this).

This is something that will rarely be used. I prefer it as being an override, as ignoring them may be more confusing. Although I have no strong opinion.

Yeah, I agree that ignoring the arguments is a bad option. I meant that --interactive should be mutually exclusive with --data and other overrides (including --explicit). It signals that the overrides are intended for non-interactive mode.

It can also reduce some of the init_interactive() logic. It wouldn't be necessary to adjust the prompts or workspace tree. That would help ensure consistency between interactive and non-interactive modes.

3. Pass context to stage create method. I think we should print the same thing in both modes here, most likely the updated tree.

Makes sense, but how should we improve message before the prompts in an interactive mode?

IMO we can get rid of printing the tree before the prompts to simplify, especially if we are printing the updated tree at the end.

I also think we can get rid of all the if interactive/if not interactive conditionals after repo.stage.create() and always print:

Using experiment project structure:
├── metrics.json
├── models
├── mydata
├── params.yaml
├── plots
└── src

Created train stage in dvc.yaml. To run, use "dvc exp run".
See https://s.dvc.org/g/exp/run.

Having a confirmation prompt in interactive mode after printing the tree seems okay if you think it's helpful to keep that, but I don't know that it's necessary.

@skshetry
Copy link
Member Author

Yeah, I agree that ignoring the arguments is a bad option. I meant that --interactive should be mutually exclusive with --data and other overrides (including --explicit). It signals that the overrides are intended for non-interactive mode.

It can also reduce some of the init_interactive() logic. It wouldn't be necessary to adjust the prompts or workspace tree. That would help ensure consistency between interactive and non-interactive modes.

Right, but if you are using --data with --interactive, I think it's better to honor their request and think of it as an override, considering we already have support for that and documented the behavior in CLI help as:

  --interactive, -i     Prompt for values that are not provided

IMO we can get rid of printing the tree before the prompts to simplify, especially if we are printing the updated tree at the end.

I'll do this in the next PR.

I also think we can get rid of all the if interactive/if not interactive conditionals after repo.stage.create() and always print:

Yeah, let's get rid of it then. Also in the next PR. Thanks.

@skshetry
Copy link
Member Author

skshetry commented Feb 17, 2022

I am merging this PR. This is a release-blocker Blocks a release now until few issues that we are discussing are merged, especially the params one.

cc @efiop

@skshetry skshetry merged commit 497eac9 into iterative:main Feb 17, 2022
@skshetry skshetry deleted the exp-init-workspace branch February 17, 2022 08:33
@skshetry skshetry restored the exp-init-workspace branch April 27, 2022 03:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: experiments Related to dvc exp enhancement Enhances DVC ui user interface / interaction
Projects
No open projects
Archived in project
3 participants