remote: implement Google Drive #2040
Conversation
Force-pushed from 4524712 to d249f69
@ei-grad Looks great! 🔥 Is it possible to run some functional tests for gdrive locally, or do we need a real Drive account? Also, it looks like upload/download could be refactored a bit so they are easier to read ;) I know many other Remotes have a similar problem with their upload/download methods, but while we are at it, we could enhance this particular one a little by splitting it into separate sub-methods.
@efiop Thanks for the review! :)
I think we need a real account. And IIRC I read somewhere in their API docs that Google's policy does not prohibit creating a test account for this purpose. But I think it is also good to have full unit test coverage of this implementation.
Sure. Btw, it is untestable as-is right now, too. In progress.
Looking good!
Force-pushed from 87ad701 to 6d15884
Force-pushed from cfea397 to 620bb4e
Force-pushed from ac19c8a to 6b758c6
Just a status update: the latest feedback points have been addressed, and only a couple of questions remain. It was probably not right for me to mark the review conversations as resolved myself, sorry. Anyway, I'm in the middle of a PyDrive-related refactoring that changes a notable portion of the code, and …
Hey @ei-grad, thank you so much for working on this! If you would benefit from any help, e.g. writing tests or coding, please let me know.
@ei-grad Please take a look at the DeepSource complaints; I believe there are some valid ones.
@pytest.fixture()
def repo():
Do we need to use repo from the dvc code repo itself? Or can we use the dvc_repo fixture, as we do everywhere else?
These unit tests don't need a real repo, so it is a bit excessive to set up and tear down the dvc_repo fixture for each test. But it may be a good idea to create one temporary repo just for all of them. Would it be ok to make this fixture scope="module" and have it take/return dvc_repo?
Sounds good :)
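A module-scoped fixture along the lines discussed above could be sketched like this (illustrative only; the fixture body and the use of tmp_path_factory are assumptions, not the actual dvc_repo implementation):

```python
import pytest


@pytest.fixture(scope="module")
def repo(tmp_path_factory):
    # Created once per test module and shared by all tests in it,
    # instead of a fresh setup/teardown for every test function.
    path = tmp_path_factory.mktemp("gdrive-test-repo")
    yield path
    # Teardown (if any) runs after the last test in the module.
```

With scope="module", pytest caches the fixture value for the whole module instead of recreating it per test.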
"{} is not a folder".format("/".join(current_path)) | ||
) | ||
parent = metadata["id"] | ||
to_create = [part] + list(parts) |
should it be a sublist of parts here?
parts is a partially consumed iterator at this point, but yeah, I'll rewrite it.
The exception/break condition was also valid, but DeepSource also suggests that iterating over an iterator with the break/else construct and then reusing the loop variable is unreadable and bug-prone. :)
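The rewrite being discussed could go roughly like this (a sketch with illustrative names, not the actual PR code): keep an explicit index over a list of path parts instead of a partially consumed iterator, so the remaining parts to create become a plain slice:

```python
def split_existing(parts, exists):
    """Split path parts into (existing prefix, parts still to create).

    `exists` is a predicate checking whether a prefix of path parts
    already exists as a folder -- an illustrative stand-in for the
    Drive metadata lookup.
    """
    idx = 0
    # Advance while each longer prefix still exists as a folder.
    while idx < len(parts) and exists(parts[: idx + 1]):
        idx += 1
    return parts[:idx], parts[idx:]
```

This avoids the for/else-plus-loop-variable pattern that DeepSource flags, at the cost of one extra list materialization.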
TIMEOUT = (5, 60)

def __init__(
    self,
use kwargs instead of a long argument list here?
Hm. I'd rather pass an OAuth2 instance instead of its arguments. And this method probably needs a docstring, too.
Security notice:

It always adds the Authorization header to the requests, not paying
attention is request is for googleapis.com or not. It is just how
typo: is -> if
def session(self):
    """AuthorizedSession to communicate with https://googleapis.com

    Security notice:
it's probably easy to add a test/assert to check that the domain is intact
Great idea, thanks! What do you think if I'd just override the request() of AuthorizedSession?
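Wherever the check is hooked in, the domain test itself can be a tiny helper along these lines (a sketch; is_google_api is a hypothetical name, not from the PR):

```python
from urllib.parse import urlparse


def is_google_api(url):
    """Return True only for googleapis.com and its subdomains."""
    host = urlparse(url).hostname or ""
    return host == "googleapis.com" or host.endswith(".googleapis.com")
```

An overridden request() could then assert this predicate before attaching the Authorization header, so credentials never leak to other hosts.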
@@ -0,0 +1,17 @@
from dvc.remote.gdrive.utils import response_error_message
we should include from __future__ import unicode_literals everywhere
creds_id = self._get_creds_id(info["installed"]["client_id"])
return os.path.join(creds_storage_dir, creds_id)

def _get_storage_lock(self):
could you clarify this a little bit? Why do we need the lock, and how does it work?
self._thread_lock.acquire()
while time() - t0 < self.timeout:
    try:
        self._lock = zc.lockfile.LockFile(self.lock_file)
Is it a regular lockfile, or is there something specific? We are already using a different lockfile implementation elsewhere, so do we need zc here? If so, should we specify the dependency explicitly? Also, what happens if we break execution in the middle: will it start raising an exception? We should at least explain how to recover from it.
@shcheklein We are using zc.lockfile in other places too :) Not sure about the purpose of this lock though; do we actually write to the file it is protecting anywhere, @ei-grad?
    break
params["pageToken"] = data["nextPageToken"]

def get_metadata(self, path_info, fields=None):
could you please add a comment describing what this returns? It's really hard to understand; I'm just trying to see why the logic here is so complicated.
errors_count += 1
if errors_count >= 10:
    raise
sleep(1.0)
do we actually need this sleep here?
It is a must to wait some time between consecutive resumable upload requests. If one request failed due to a short network problem, all retry attempts would otherwise be exhausted in a very short time. Maybe the same exponential backoff policy should be used here as for the error handling in self.request, though it is not clear to me whether that would be the right solution for resumable uploads. The hardcoded 1-second sleep with 10 retries looks better, imho, but it is definitely not the best behavior either.
One possible solution could be to store the upload progress in DVC's state database, to make it possible to resume uploads across dvc runs, but that feels like overkill to me. Other backends don't handle connection interruptions / server errors during large-file uploads at all, if I'm not mistaken.
Yep, a DB is overkill for sure. As for retries: is it possible to use one of the many existing decorators? In both cases.
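A retry decorator of the kind being suggested could be sketched like this (hand-rolled for illustration; libraries such as tenacity provide the same thing ready-made, and all names and defaults here are illustrative):

```python
import time
from functools import wraps


def retry(tries=10, delay=1.0, backoff=2.0, exceptions=(Exception,)):
    """Retry the wrapped call with exponential backoff between attempts."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(tries):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries - 1:
                        raise  # out of attempts, re-raise the last error
                    time.sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator
```

The same decorator could then cover both the resumable-upload loop and the generic request error handling, with the backoff parameters tuned per call site.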
@ei-grad please check the DeepSource stuff as well; let's fix it (except obvious false positives).
def exists(self, path_info):
    return self.client.exists(path_info)

def batch_exists(self, path_infos, callback):
Closing due to inactivity in favor of #2551
Have you followed the guidelines in our Contributing document?
Does your PR affect documented changes, or does it add new functionality that should be documented? If yes, have you created a PR for dvc.org documenting it, or at least opened an issue for it? If so, please add a link to it.
Fix #2018