Push running checkpoint to remote #6332
Is there a reason to have this explicit check? Would this qualify as a valid git remote? Is there some reason to think a user might set `DVC_EXP_GIT_REMOTE` to their local project's root dir?
Also, why fail for an invalid Git remote but only warn for this scenario?
`git_remote` is just a variable that is either a path/URL to a git repo (including local ones) or a remote name. If this check is hit it means it's a local path that points directly to the current DVC root directory, which in most cases is a valid git repo path.

This behavior can be useful - if you do `DVC_EXP_GIT_REMOTE=.` the end result would be that we auto-push DVC cache for your local experiment runs. (The git push step becomes a no-op, the same thing that happens if you do a CLI `git push` to your own current repo directory.)
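For reference, the kind of check being discussed could look roughly like the sketch below - `git_remote` and `root_dir` are illustrative names here, not the actual variables in this PR:

```python
import os


def points_at_current_repo(git_remote: str, root_dir: str) -> bool:
    """Return True if ``git_remote`` is a local path to ``root_dir`` itself.

    Hypothetical helper for illustration only - the real check in this PR
    may be implemented differently.
    """
    if "://" in git_remote or git_remote.startswith("git@"):
        # URL / SSH-style remotes can never be the local workspace path.
        return False
    return os.path.realpath(git_remote) == os.path.realpath(root_dir)


# DVC_EXP_GIT_REMOTE="." would hit this check: the git push becomes a no-op,
# but the DVC cache for local experiment runs still gets auto-pushed.
```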
I'm not sure this is helpful since a valid remote would also point to the current Git repo, right? Maybe something like "local workspace" makes more sense than "current Git repo"?
The local workspace is your current Git repo. `https://github.com/my/repo.git` is not your current Git repo - it's a remote Git repo, and potentially where your current Git repo is cloned from, but "current repo" would always be your current local workspace repo.
Why is this code block moved? `client.get_refs()` failing is not always going to be due to an invalid URL. `get_transport_and_path` does validation on whether or not the specified git remote exists, or whether or not an explicit URL is a valid git URL.

I get that you are trying to validate whether or not the URL is reachable before the experiment is run, but I don't think that is necessary at all. And even if you want to validate the URL by doing an `scm.iter_refs` call, you still don't need to move this code block to do it.

(And if `client.get_refs()` fails, it would be better to pass the actual exception for why it failed up the chain, instead of just assuming it was due to an invalid URL.)
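Something along these lines is what passing the actual failure up the chain could look like - a rough sketch only, where `GitRemoteError` is a placeholder name and not necessarily the exception class DVC uses:

```python
from dulwich.client import get_transport_and_path


class GitRemoteError(Exception):
    """Placeholder wrapper exception; DVC's actual class may differ."""


def get_remote_refs(url: str):
    client, path = get_transport_and_path(url)
    try:
        return client.get_refs(path)
    except Exception as exc:
        # Chain the real cause instead of assuming the URL was invalid.
        raise GitRemoteError(f"listing refs from '{url}' failed") from exc
```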
Because the first push will happen after one iteration, stopping at that point will leave a dirty workspace.
Right, but that's not any different than any other failure that may occur during an experiment run, or during `dvc repro`.

The user's pipeline code itself could fail, or the `dvc push` could fail due to any number of reasons, whether it's because of a network error, or because the remote config is actually invalid. (We don't validate being able to connect to a DVC remote; all we do is validate that a default remote exists in the DVC config.)

All of these cases will result in a "dirty" workspace, which is fine.
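That DVC-remote check is only a config lookup, roughly on the level of the sketch below. This is a simplified approximation - DVC reads its config through its own `Config` machinery (and merges local/global/system config files), not `configparser`:

```python
import configparser
import os


def has_default_dvc_remote(root_dir: str) -> bool:
    """Return True if `.dvc/config` names a default remote (core.remote).

    Simplified approximation of the existing "a default remote exists"
    check; no attempt is made to connect to the remote.
    """
    config_path = os.path.join(root_dir, ".dvc", "config")
    parser = configparser.ConfigParser()
    parser.read(config_path)
    return bool(parser.get("core", "remote", fallback=None))
```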
Yes, users' pipeline code could fail, but they will not consider that to be our problem. And in some cases, training one epoch can cost hours or even days. It would be really frustrating to get an error after training for such a long time.
My point is that we are not checking for other things that can fail in DVC, after the epoch has been generated.
Just as an example, we do not verify whether or not it is possible to actually push the cache data to the default DVC remote. If I have mis-configured my DVC remote using either a URL that doesn't exist, or with the wrong authentication credentials, we will fail after we have generated a commit and try to auto-push it.
If we are going to do this kind of strict validation on the git remote, then we should also be doing it on the DVC remote. But in my opinion, both of these checks would be excessive.
I also still do not think we should be failing at all in the event that the automatic push fails. We should wrap any exceptions that occur when we try to push, and log them if it fails, but continue running the pipeline. If the user has misconfigured the git URL, they can just do a manual `dvc exp push <correct_git_url> ...` after execution has completed, rather than needing to re-run an entire experiment.

But this is maybe a question for @dberenbaum.
(And yes, this manual push isn't possible in the CML use case, but in the CML use case the key differences are that the user has probably already tested their code locally and that the upstream git URL or remote should be coming from the CI provider itself (i.e. from the github/gitlab environment))
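Concretely, the warn-and-continue behaviour described above could look roughly like this - a sketch only, where the two callables stand in for the actual cache push and experiment-ref push rather than this PR's real functions:

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def auto_push(push_cache: Callable[[], None],
              push_refs: Callable[[], None],
              git_remote: str) -> None:
    """Best-effort push of a finished checkpoint; never fails the run."""
    try:
        push_cache()   # e.g. the `dvc push` step
        push_refs()    # e.g. pushing the experiment refs to `git_remote`
    except Exception as exc:
        logger.warning(
            "automatic push to '%s' failed: %s; "
            "retry manually with 'dvc exp push' after the run completes",
            git_remote, exc,
        )
```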
It looks like the changes are outdated, so it seems like some of the specifics have changed, and I'll focus on the general user experience.
I think there was a previous question from @karajan1001 about whether to validate the url before running the experiment. This validation seems useful and I see little downside from a user perspective. Since dvc knows the experiment will be pushed, it would be helpful to avoid issues before spending time to run the experiment. I also think it makes sense to throw an error if validation fails since auto pushing is specifically for cml and similar workflows, where leaving a dirty workspace isn't helpful.
It would actually seem useful to have some basic validation of both the git and dvc remotes. As @pmrowla mentioned, in cml the `DVC_EXP_GIT_REMOTE` is unlikely to be a problem since it's being set automatically. However, the dvc remote is configured by the user, and it will be problematic if the auto push fails in cml.

Feel free to push back with reasons to avoid this validation or with questions of how strictly to validate.
There are different possible levels of "validation". IMO validating that a `.git/config` remote named `origin` exists, or that the string `git@github.com/my/repo.git` is a valid git URL but `s3://some_non_git_url` is not, is fine. But I don't think we have to bother opening a test connection to `git@github.com/my/repo.git` before trying to push to it.

This is the exact same level as @karajan1001's current validation for the DVC remote. We check that a default DVC remote exists in the repo config, but we do not do anything to verify that the remote URL is reachable, or that we have the required list/write/move/delete permissions to actually push to it.
IMO this is fine. But trying to connect to each remote before we actually run anything is overkill, and in the event that it fails we can handle it at that time.
I don't think that automatic push errors in general should be considered a failure state for the entire workflow. A push failure is an error that the user can recover from by just re-trying `dvc exp push` again after their experiment has completed.

So in the event that github or aws goes down, or that a gdrive remote gets API throttled, or any other kind of possible network error occurs (including a remote being unreachable due to a user misconfiguration), we should just log that "automatic push failed with error: ..." and continue running the pipeline.

In this case, users should be testing their remote configurations themselves locally before they git commit things into their `.dvc/config`. And in the event that they misconfigure it, we should just warn them, and continue running the experiment (for the same reasons I already mentioned).
At the end of the day I guess this is a product decision, and if we want to get into creating a test file, `dvc push`ing it, and then removing it, obviously we can do that. But I think it's overkill and an unnecessary extra step.
How can we check that it's a valid git url? I agree that limiting the validation makes sense since it might not make sense to fail for random timeouts that have nothing to do with the remote configuration. Ideally, it seems like a retry might be helpful, but that's out of scope.
I think you mentioned before how this isn't helpful in the cml context since you won't be able to access the runner and debug. Are you thinking that, if there's something like a temporary network issue, then the next checkpoint might succeed, so the workflow should continue? Or are you thinking of other scenarios like remote executors where it will be possible to debug?
If there is a temporary network error for a single checkpoint push in the middle of the workflow, failing could be bad for the user since they might not be watching a long-running job and might return later to unexpectedly find that the job did not run to completion. That's why I agree that limiting validation makes sense, and I also agree that auto push errors should warn rather than interrupt the entire job.
However, in the initial validation, failing fast seems like a good user experience since they can debug quickly and resubmit their job. For the initial validation, I think it makes sense to do what we can to check if everything is configured correctly and working as expected and fail if not.
Since this feature is for @iterative/cml, it would be good to get their feedback on this.
All of the git backends we use have calls that can do this check (to verify that either a string remote name like `origin` exists, or that a URL conforms to what git expects a git URL to look like).

Basically we would just have to make this a standalone scm call:
dvc/dvc/scm/git/backend/dulwich.py
Lines 353 to 359 in fd76e09
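A standalone sketch of that check, built on the same dulwich calls - `validate_git_remote` is a hypothetical helper name, and the real backend method and exception types may differ:

```python
from dulwich.client import get_transport_and_path
from dulwich.repo import Repo


def validate_git_remote(root_dir: str, remote: str) -> None:
    """Cheap validation of a remote name or URL; no connection is opened.

    Raises ValueError if the remote name/URL looks unusable.
    """
    with Repo(root_dir) as repo:
        try:
            # If ``remote`` is a configured remote name, resolve its URL.
            url = repo.get_config().get(
                (b"remote", remote.encode()), b"url"
            ).decode()
        except KeyError:
            # Not a configured remote name - treat it as an explicit URL/path.
            url = remote
    try:
        get_transport_and_path(url)
    except Exception as exc:
        raise ValueError(f"'{remote}' is not a valid Git remote or URL") from exc
```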
Depending on the issue, yes it's entirely possible that after one exp commit fails to be pushed in CML, the subsequent push could succeed. But really I was just thinking of scenarios where users set up automatic pushing in general, and not necessarily in CI (this becomes relevant for local exp runs in the event that support for this gets added into dvclive)