Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

init: add subdir description #1022

Merged
merged 4 commits into from
Mar 9, 2020
Merged

Conversation

pared
Copy link
Contributor

@pared pared commented Feb 25, 2020

You may disregard these recommendations if you used the Edit on GitHub button from dvc.org to improve a doc in place.

❗ Please read the guidelines in the Contributing to the Documentation list if you make any substantial changes to the documentation or JS engine.

🐛 Please make sure to mention Fix #issue (if applicable) in the description of the PR. This causes GitHub to close it automatically when the PR is merged.

Please chose to allow us to edit your branch when creating the PR.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

Related to iterative/dvc#3257

@shcheklein shcheklein temporarily deployed to dvc-landing-multiple-dv-wbncgy February 25, 2020 09:11 Inactive
Copy link
Contributor

@jorgeorpinel jorgeorpinel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First review with some relevant questions for @pared. Please take a look and address what you can (at least the Qs).

We can and will take it over too, but not sure how long it will take to get to this. Thanks!

public/static/docs/command-reference/init.md Outdated Show resolved Hide resolved
public/static/docs/command-reference/init.md Outdated Show resolved Hide resolved
Comment on lines 33 to 34
- `--subdir` - initialize <abbr>DVC repository</abbr> in current directory and
allow to search for Git repository in parent directories
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allows initializing a <abbr>DVC repository</abbr> inside a directory of a Git repo. DVC in this _subdir_ DVC repo will search for Git in parent directories. This option should not be combined with `--no-scm` (above).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Questions though, @pared:

  • Can the parent repo be a DVC repo (having both .git/ and .dvc/)?
  • Will it traverse the hierarchy all the way up to / looking for .git/?

Thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgeorpinel

  • parent can be a DVC repo
  • yes, if there is no no-scm inside config, we will search the whole tree

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll add some notes about these details 🙂

public/static/docs/command-reference/init.md Outdated Show resolved Hide resolved
public/static/docs/command-reference/init.md Outdated Show resolved Hide resolved
public/static/docs/command-reference/init.md Outdated Show resolved Hide resolved
public/static/docs/command-reference/init.md Outdated Show resolved Hide resolved
@jorgeorpinel jorgeorpinel temporarily deployed to dvc-landing-multiple-dv-wbncgy February 26, 2020 02:15 Inactive
@shcheklein shcheklein temporarily deployed to dvc-landing-multiple-dv-wbncgy February 26, 2020 08:47 Inactive
@pared pared force-pushed the multiple_dvc_roots_support branch from ce56664 to ecf994b Compare February 26, 2020 09:38
@shcheklein shcheklein temporarily deployed to dvc-landing-multiple-dv-wbncgy February 26, 2020 09:39 Inactive
@pared
Copy link
Contributor Author

pared commented Feb 26, 2020

@jorgeorpinel introduced your suggestions

jorgeorpinel
jorgeorpinel previously approved these changes Feb 26, 2020
Copy link
Contributor

@jorgeorpinel jorgeorpinel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @pared!

@shcheklein there's a few possible improvements to this but I think is mergeable; I can address them in a later PR.

@shcheklein
Copy link
Member

@pared @jorgeorpinel

I was not following the DVC core PR, could you please clarify the following things. No need to add them right now in the PR, but the will help in taking this PR over. Would be great to have some answers here in any way your prefer :)

  1. Did we change anything about --no-scm (do we write configs, flags now)?
  2. Same question - do we write anything when we use subdir.
  3. Can I do nesting - one subdir, inside another, etc
  4. Can I do subsid in the root. Does it makes sense? What will happen.
  5. How having subdirs affect the flow (basically, why do we need them in the first place). Flow of the regular commands like pull, push, etc.
  6. What happens if I do no-scm with subdir?

Let's start with this. I think we need more clarify for this before we merge this. The way we've done it so far is too low level and technical, not clear the motivation and not clear how does it affect behavior.

Copy link
Member

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pared Please, see the comment. No need to change anything in the PR yet, just asking to share a bit more context.

@jorgeorpinel
Copy link
Contributor

  1. What happens if I do no-scm with subdir?

@shcheklein I had this same question in iterative/dvc#3257 (review), and Pawel's answer is

it is technically possible to do that. Still, I would leave this check (both options can't be used together) as using both flags at the same time indicates a misunderstanding of how they work.

@pared
Copy link
Contributor Author

pared commented Feb 27, 2020

@shcheklein
Answering your questions:

1. Did we change anything about --no-scm (do we write configs, flags now)?

Yes, now no-scm is a config flag now, and that change comes with behavior change:
now if no-scm is found in the config, we will use NoSCM class, always. Before it was possible to init with this flag and later initialize git, which was acknowledged. The config flag is no_scm. It can be changed using config:

dvc config core.no_scm True

2. subdir Same question - do we write anything when we use it.

No, subdir is the default behavior (unless no-scm is in config)

(2) Could you elaborate a bit please? What do you mean by default behavior? And why no-scm in config (and what config - in the same dir, in the root, global?) changes this?

Now logic works following way: if we do not find no_scm flag inside the config, we assume that
"There must be git somewhere" and proceed to find it similar way as git does, so we check current directory, if it is not git repository, we proceed with the check to the parent directory and so on...
The check for no_scm is done when initializing Repo instance, loading from repo's config, so it is possible to set no_scm globally.
I guess the confusion comes from the flag naming, initially, I named it --subrepo and what this flag tells dvc is "initialize dvc here(in current dir), but you are allowed to look for git in parent repositories"
That is also why this flag is mutually exclusive with --no-scm flag: no scm tells dvc not to use scm.
Discussion on flag naming can be found here: iterative/dvc#3257

3. Can I do nesting - one subdir, inside another, etc

Yes

4. Can I do subdir in the root. Does it makes sense? What will happen.

Yes, nothing will change, we will look for git root starting from root, .git will be found. So we can say that we are taking "closest" SCM in tree. In a way, a normal use case (Git + DVC in one dir) is just a specific subdir case.

(4) So does it mean that when I do dvc init in some subdirectory (no --subdir flag) it will create it will try to find the root and create .dvc in the root of the project?

No, this behavior stays as it was, so if we don't specify --subdir, we don't allow dvc to look for git up the tree. In case you described (just dvc init in subdir) we will get the InitError. Flag is here to make sure the user intentionally wants to create dvc repo below git.

5. How having subdirs affect the flow (basically, why do we need them in the first place).

Flow of Subdir does not influence normal use case, data sync operations will not be performed in subrepos, which are ignored during tree traversing.

does not influence normal use case

(5) what do you mean by normal here?

Sorry for that, what I meant, data sync commands without specific target. Stages for subdirs will not be gathered.

(5) _ can I still run dvc pull subdir/some.dvc from the root?_

Yes, though we will try to use current repo to perform the operation. That is a flaw, I did not think that through.

(5) what about dvc pipleine show, dvc metrics show, dvc gc, dvc diff etc, etc?

As above, we will receive some ambiguous errors.

6. What happens if I do no-scm with subdir?

Init error. Without this error, we would end up with NoSCM case, because it just happens that check for no-scm is earlier in code than searching for Git

Q: what was the motivation behind this? why do they contradict each other?

--subdir tells dvc to search for git, --no-scm tells it not to use scm. I guess --subdir might be, in the end, causing confusion as to what it does.

@shcheklein
Copy link
Member

@pared thanks! please, check the next round of questions to clarify the PR! Any details would be helpful. You can edit the same comment #1022 (comment)

@pared
Copy link
Contributor Author

pared commented Feb 28, 2020

@shcheklein Tried to provide some more broad answer.
You found a bug in the current implementation. I believe that we can proceed in 2 ways:

  1. Try to implement logic to support calling particular stage files and be aware that it can be in subdir.
  2. Forbid initializing subdir if dvc repo is found while searching for git root.

I believe that second option is appropriate, as original issue was about having multiple dvc repos inside one git repo.

@jorgeorpinel jorgeorpinel dismissed their stale review February 28, 2020 18:16

Sounds like we need several more changes.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Feb 28, 2020

(1) The config flag is no_scm. It can be changed using config:
dvc config core.no_scm True

  • Should be added to config ref

(2) subdir is the default behavior (unless no-scm is in config)... if we do not find no_scm flag inside the config, we assume that "There must be git somewhere"

Hmmm. I tried dvc init --subdir in an empty dir. Got ERROR: failed to initiate DVC - /.../test is not tracked by any supported SCM tool (e.g. Git). Also you later say "if we don't specify --subdir, we don't allow dvc to look for git up the tree" so I still don't understand what you mean by "subdir is the default behavior" 🙁

(4) ... subdir in the root?
Yes, nothing will change

👍

  • I'd just add this in the option description: "If run in a project root, this option is ignored."

(5) data sync commands without specific target. Stages for subdirs will not be gathered

Kind of like Git subrepos?

  • Important info to put in cmd refs as well 🙂

(5) run dvc pull subdir/some.dvc from the root... will try to use current repo to perform the operation.
That is a flaw... You found a bug

Should we open a separate dvc core issue to discuss #1022 (comment) ?

(6) why do they contradict each other?
I guess the confusion comes from the flag naming, initially, I named it --subrepo

--subrepo may also confusing, it sounds like its a full repo inside a full repo, but its not a full repo per se: its a DVC project inside a subdirectory of a Git/DVC repo. Maybe --sub-scm ? (I'd consider renaming --no-git and --sub-git)

Could making both --no-scm and this option automatic/implicit help?

@pared
Copy link
Contributor Author

pared commented Mar 3, 2020

@jorgeorpinel

so I still don't understand what you mean by "subdir is the default behavior"

It is the default behavior when handling SCM inside our code. init command is a different thing.
Yours did fail, because you tried initializing inside empty directory, and no git was up the tree right?

I'd just add this in the option description: "If run in a project root, this option is ignored."

Agree

Kind of like Git subrepos?

Yes

Should we open a separate dvc core issue to discuss #1022 (comment) ?

yes, i will do that

Could making both --no-scm and this option automatic/implicit help?

Probably yest, though I would discuss it, as for now --subdir guards user from unintentionally creating repo down the tree.

@pared
Copy link
Contributor Author

pared commented Mar 3, 2020

@jorgeorpinel

Important info to put in cmd refs as well

I would wait with adding information about that, as it's not decided yet whether we will support repo inside repo scenario after detecting the "targeted sync" bug.

Also, introduced the rest of your points. Please ping me if its ok, Ill squash them.

@pared pared force-pushed the multiple_dvc_roots_support branch from ecf994b to ddb46d4 Compare March 3, 2020 12:03
@shcheklein shcheklein requested a review from jorgeorpinel March 5, 2020 01:58
@shcheklein
Copy link
Member

@jorgeorpinel @pared I've update the doc and added more information about this options, why they are needed, etc. Please review and let me know WYT.

@jorgeorpinel feel free to put small fixes on top/merge/etc.

@shcheklein
Copy link
Member

May be we should consider moving some part to the User Guide.

@shcheklein shcheklein temporarily deployed to dvc-landing-multiple-dv-lm6q2t March 5, 2020 02:29 Inactive
@pared
Copy link
Contributor Author

pared commented Mar 5, 2020

@shcheklein I think those changes are desirable, they point "main" use case and provide info on additional possible workflows.

@shcheklein

This comment has been minimized.

@jorgeorpinel
Copy link
Contributor

I'm back guys, sorry for the delay.

I still don't understand what you mean by "subdir is the default behavior"
It is the default behavior when handling SCM inside our code. init command is a different thing.
... --subdir guards user from unintentionally creating repo down the tree

@pared OK so it doesn't affect the docs in this PR but I'm still trying to understand what "default behavior when handling SCM inside our code" means or what impacts it has. Can you elaborate a little? Maybe it impacts other docs...

Other than that we've taken over this PR and I'm trying to merge it ASAP. Thanks

Copy link
Contributor

@jorgeorpinel jorgeorpinel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will push some copy edits so this can be merged ASAP.

But here's a list of stuff pending from comments above:

  • Update sync commands per "data sync commands without specific target: Stages for subdirs will not be gathered"
  • May be we should consider moving some part to the User Guide.

I also think (already discussed with Ivan) that the result here is a little too long. We went from 2 paragraphs to 21 plus block codes. It's not necessarily a problem because all of this info is valuable, but I think in part it can be summarized at least a little, and possibly extract these advanced cases into the user guide. We even state that they are "rare" so I'm not sure it's worth explaining them in detail in the description of this cmd ref. Another option is to make them into examples and keep the direct links from the description. Food for thought.

@jorgeorpinel jorgeorpinel temporarily deployed to dvc-landing-multiple-dv-lm6q2t March 9, 2020 20:06 Inactive
@jorgeorpinel jorgeorpinel requested a review from shcheklein March 9, 2020 20:09
@jorgeorpinel jorgeorpinel merged commit efc377a into master Mar 9, 2020
@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Mar 9, 2020

Merged 🎉 thanks again @pared.

Let's continue the discussion in #1039, and address depending on priority.

Feel free to review my latest changes here and leave feedback in/for that continuation issue.

Thanks

@efiop efiop deleted the multiple_dvc_roots_support branch March 10, 2020 11:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants