Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

checkpoints: flag to exp push && push #6182

Closed
casperdcl opened this issue Jun 15, 2021 · 9 comments · Fixed by #6332
Closed

checkpoints: flag to exp push && push #6182

casperdcl opened this issue Jun 15, 2021 · 9 comments · Fixed by #6332
Assignees
Labels
A: experiments Related to dvc exp feature is a feature p1-important Important, aka current backlog of things to do product: VSCode Integration with VSCode extension

Comments

@casperdcl
Copy link
Contributor

casperdcl commented Jun 15, 2021

Provide a flag (e.g. dvclive.next_step(push=True) or make_checkpoint(push=True) or DVC_EXP_AUTO_PUSH=<git_remote_name_or_url|true>) to push things per-checkpoint.

Helps with resuming a long-running CI job after timeout.

Part of iterative/cml#560.

@casperdcl casperdcl added the product: VSCode Integration with VSCode extension label Jun 15, 2021
@casperdcl casperdcl added A: experiments Related to dvc exp p1-important Important, aka current backlog of things to do labels Jun 16, 2021
@efiop efiop added the feature is a feature label Jun 28, 2021
@pared pared assigned pared and karajan1001 and unassigned pared Jun 29, 2021
@karajan1001
Copy link
Contributor

karajan1001 commented Jul 10, 2021

@efiop, Some analysis after looking into it.

  1. This is a DVC only option. Shouldn't do any change in dvclive which would make it rely on DVC.
  2. It would be running after every turn of the stage running,  we can create a new monitor task for it like other dvclive ones for it.
  3. A new parameter in the stage file is needed to control this process.

Any suggestions for this?

@dberenbaum
Copy link
Collaborator

Can this be an argument to make_checkpoint()?

@pmrowla
Copy link
Contributor

pmrowla commented Jul 11, 2021

Given that this is a CI thing, it seems like maybe it should just be set through an env var?

something like like DVC_EXP_AUTO_PUSH=<git_remote_name_or_url>

@pmrowla
Copy link
Contributor

pmrowla commented Jul 13, 2021

Also note that for CML purposes, I think we will want both DVC cache and run-cache to be included with the automated exp push (DVC cache is already included by default on exp push but run cache is normally optional)

@dberenbaum
Copy link
Collaborator

dberenbaum commented Jul 22, 2021

Is this going to quickly blow up the number of exp refs on Github? Do we need some mechanism to delete once merged/closed?

Related:
#6006
#6329 (distant cousin 😄 )

@casperdcl
Copy link
Contributor Author

dvc exp gc --checkpoints?

@dberenbaum
Copy link
Collaborator

I think there's a few potential issues:

  1. For experiments with many checkpoints, I may want to squash the checkpoint commits so that we only keep the final checkpoint results, which I'm assuming is what you meant with dvc exp gc --checkpoints @casperdcl?
  2. For any experiment cml-pr that I close because the performance is bad, there's a bug, or any other reason, I may want to delete the experiment reference entirely.
  3. Similarly, for any experiment cml-pr that I merge, I may want to delete the experiment reference since it's now duplicated (this is the connection I was thinking wrt exp branch: Add a flag --remove or --delete to remove the persisted experiment  #6329).

The latter two could be automatically handled in the cml workflow, or they could be handled with dvc exp gc --cloud or dvc exp gc origin to occasionally clean up as needed. These would also drop the experiment references for the checkpoints mentioned in the first issue, but the commits would still be in the merged branch unless we have some way to squash them before merging.

@pmrowla
Copy link
Contributor

pmrowla commented Jul 23, 2021

Is this going to quickly blow up the number of exp refs on Github?

In terms of "will this break git/github", this is not an issue, there's no real limit to the number of refs git can handle, and the server will automatically pack them as needed.

But yes, it will inflate the number of exps that would show up when users use exp list w/their server URL

Do we need some mechanism to delete once merged/closed?

These commits go into the CML PR once the run is completed, correct? At that point, it seems to me that the exp ref is no longer needed and can just be removed (it doesn't matter whether or not the PR has been merged yet, the PR is it's own git ref).

These would also drop the experiment references for the checkpoints mentioned in the first issue, but the commits would still be in the merged branch unless we have some way to squash them before merging.

If PRs are being merged on the github side, github already provides the option to squash before merging

@dberenbaum
Copy link
Collaborator

Great, so it seems like all that's needed is a mechanism to delete the exp refs once the pr is ready?

karajan1001 added a commit that referenced this issue Aug 12, 2021
* Push running checkpoint to remote

Fix #6182

* Update dvc/repo/experiments/executor/base.py

Co-authored-by: Peter Rowlands (변기호) <[email protected]>

* Get the branch name

* Finish this PR

1. move env name to dvc/env.py.
2. add some tests for it.

* Update after finally commit

* Some problems found in the review

1. change the behavior of self remote
2. do not use string DVC_EXP_AUTO_PUSH
3. downgrade the logger level in auto push
4. use full branch ref

* Update dvc/repo/experiments/executor/base.py

Co-authored-by: Peter Rowlands (변기호) <[email protected]>

* add a remote check before auto push

* Split current one env into two

* Rename DVC_EXP_CHECKPOINT_PUSH to DVC_EXP_AUTO_PUSH

* value error

* Remove getenv_bool switch to env2bool

* Better on handling Dulwich exceptions

* Name changing removing prints

* Update dvc/env.py

Co-authored-by: Peter Rowlands (변기호) <[email protected]>

* Some changes in review

* Add validation to repo URL

* Add new type of SCMError

* Update dvc/repo/experiments/executor/base.py

Co-authored-by: Peter Rowlands (변기호) <[email protected]>

* Use spy tests call counts and args

* Update dvc/scm/git/backend/base.py

Co-authored-by: Peter Rowlands (변기호) <[email protected]>

* Update dvc/scm/git/backend/dulwich.py

Co-authored-by: Peter Rowlands (변기호) <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove validate_git_repo to validate_git_remote

Co-authored-by: Peter Rowlands (변기호) <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: experiments Related to dvc exp feature is a feature p1-important Important, aka current backlog of things to do product: VSCode Integration with VSCode extension
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants