remote: implement Google Drive #2040

Closed

Contributor

@ei-grad ei-grad commented May 22, 2019

  • Have you followed the guidelines in our
    Contributing document?

  • Does your PR affect documented changes or does it add new functionality
    that should be documented? If yes, have you created a PR for
    dvc.org documenting it or at
    least opened an issue for it? If so, please add a link to it.


Fix #2018

@ei-grad ei-grad changed the title remote: implement Google Drive [WIP] remote: implement Google Drive May 22, 2019
@ei-grad ei-grad force-pushed the google-drive branch 5 times, most recently from 4524712 to d249f69 Compare May 24, 2019 10:45
@ei-grad
Contributor Author

ei-grad commented May 24, 2019

Review can be started - dvc pull / dvc push is working, I'm proceeding with tests.

dvc/config.py (outdated, resolved)
Contributor

@efiop efiop left a comment

@ei-grad Looks great! 🔥 Is it possible to do some func tests locally for gdrive? Or do we need a real drive account? Also, looks like upload/download could be refactored a bit so they are easier to read ;) I know that many other Remotes have a similar problem with upload/download methods, but just while we are at it, we could enhance this particular one a little bit by splitting it into separate sub-methods.

@ei-grad
Contributor Author

ei-grad commented May 27, 2019

@efiop Thanks for the review! :)

> Is it possible to do some func tests locally for gdrive? Or do we need a real drive account?

I think we need a real account. IIRC, I read somewhere in their API docs that Google's policy does not prohibit creating a test account for this purpose. But I think it is also good to have full unit test coverage of this implementation.

> upload/download could be refactored a bit so they are easier to read

Sure. Btw, it is also untestable as it stands. In progress.

@ei-grad ei-grad changed the title [WIP] remote: implement Google Drive remote: implement Google Drive May 30, 2019
Contributor

@efiop efiop left a comment

Looking good!

tests/func/test_gdrive.py (outdated, resolved)
dvc/path/gdrive.py (outdated, resolved)
tests/func/test_gdrive.py (outdated, resolved)
tests/func/test_gdrive.py (outdated, resolved)
dvc/remote/gdrive/__init__.py (outdated, resolved)
dvc/remote/gdrive/__init__.py (outdated, resolved)
@ei-grad ei-grad force-pushed the google-drive branch 2 times, most recently from 87ad701 to 6d15884 Compare June 1, 2019 13:08
@ei-grad ei-grad requested a review from efiop June 1, 2019 13:09
@ei-grad ei-grad force-pushed the google-drive branch 5 times, most recently from cfea397 to 620bb4e Compare June 1, 2019 18:50
@ei-grad ei-grad force-pushed the google-drive branch 4 times, most recently from ac19c8a to 6b758c6 Compare July 6, 2019 23:40
@ei-grad
Contributor Author

ei-grad commented Jul 16, 2019

Just a status update - the latest feedback points were addressed, only a couple of questions remain. It was probably not right for me to mark the review conversations as resolved myself, sorry. Anyway, I'm in the middle of a PyDrive-related refactoring that changes a notable portion of the code, and I'll probably come up with a new code review request later this week.

@vmarkovtsev

Hey @ei-grad thank you so much for working on this! If you would benefit from any help, e.g. writing tests or coding, please let me know.

@efiop
Contributor

efiop commented Aug 2, 2019

@ei-grad Please take a look at DeepSource complaints, I believe there are some valid ones.



@pytest.fixture()
def repo():
Contributor

Do we need to use repo from the dvc code repo itself? Or can we use dvc_repo fixture, as we do everywhere else?

Contributor Author

@ei-grad ei-grad Aug 6, 2019

These unit tests don't need the real repo, so it is a bit excessive to set up and tear down the dvc_repo fixture for each test. But it may be a good idea to create one temporary repo for all of them. Would it be OK to make this fixture scope="module" and have it take/return dvc_repo?

Contributor

Sounds good :)

"{} is not a folder".format("/".join(current_path))
)
parent = metadata["id"]
to_create = [part] + list(parts)
Member

should it be sublist of parts here?

Contributor Author

parts is a partially consumed iterator, but yeah, I'll rewrite it

Contributor Author

The exception/break condition was also valid, but DeepSource also suggests that iterating over an iterator with the break/else construct and then using the loop variable is hard to read and bug-prone. :)
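
One way the rewrite discussed above could look, as a minimal sketch rather than the PR's actual code: walk the path parts by index instead of a partially consumed iterator, so the parts still to be created are an explicit slice. The `lookup` callable, the `"root"` starting id, and the function name `split_existing` are assumptions made for illustration; the folder MIME type is the one Google Drive uses for folders.

```python
FOLDER_MIME_TYPE = "application/vnd.google-apps.folder"


def split_existing(parts, lookup, root_id="root"):
    """Return (parent_id, to_create) for a list of path parts.

    lookup(parent_id, name) is a hypothetical callable that returns the
    metadata dict of the child called `name`, or None if it does not exist.
    """
    parent = root_id
    for index, part in enumerate(parts):
        metadata = lookup(parent, part)
        if metadata is None:
            # Everything from this part onwards still has to be created.
            return parent, list(parts[index:])
        if metadata["mimeType"] != FOLDER_MIME_TYPE:
            raise ValueError(
                "{} is not a folder".format("/".join(parts[: index + 1]))
            )
        parent = metadata["id"]
    # Every part already exists as a folder.
    return parent, []
```

Returning from inside the loop avoids both the for/else construct and any use of the loop variable after the loop has finished.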

TIMEOUT = (5, 60)

def __init__(
self,
Member

use kwargs instead of a long list here?

Contributor Author

@ei-grad ei-grad Aug 6, 2019

Hm. I'd rather pass the OAuth2 instance instead of its arguments. And this method also needs a docstring, probably.

Security notice:

It always adds the Authorization header to the requests, not paying
attention is request is for googleapis.com or not. It is just how
Member

typo: is -> if

def session(self):
"""AuthorizedSession to communicate with https://googleapis.com

Security notice:
Member

it's probably easy to add a test/assert to check that domain is intact

Contributor Author

Great idea, thanks! What do you think about just overriding request() of AuthorizedSession?
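
A rough sketch of what such a domain check might look like, using only the standard library (Python 3 syntax). It is not the code from this PR: `check_url`, `GuardedSession`, and `ALLOWED_DOMAIN` are hypothetical names, and the wrapper stands in for a subclass whose overridden request() would run the same check before calling the parent implementation.

```python
from urllib.parse import urlparse

# Assumption: only googleapis.com and its subdomains may receive the token.
ALLOWED_DOMAIN = "googleapis.com"


def check_url(url):
    """Raise ValueError unless url is an https URL on ALLOWED_DOMAIN."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if parsed.scheme != "https":
        raise ValueError("refusing non-https URL: {}".format(url))
    if host != ALLOWED_DOMAIN and not host.endswith("." + ALLOWED_DOMAIN):
        raise ValueError("refusing to send credentials to {!r}".format(host))


class GuardedSession(object):
    """Hypothetical wrapper around any object with request(method, url, ...)."""

    def __init__(self, session):
        self._session = session

    def request(self, method, url, **kwargs):
        # Validate first, so the Authorization header never leaves the domain.
        check_url(url)
        return self._session.request(method, url, **kwargs)
```

Matching on the parsed hostname with a leading dot, rather than on a substring of the URL, keeps look-alike hosts such as googleapis.com.example.org from passing the check.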

@@ -0,0 +1,17 @@
from dvc.remote.gdrive.utils import response_error_message
Member

we should include from __future__ import unicode_literals everywhere

creds_id = self._get_creds_id(info["installed"]["client_id"])
return os.path.join(creds_storage_dir, creds_id)

def _get_storage_lock(self):
Member

could you clarify this a little bit? why do we need the lock, and how does it work.

self._thread_lock.acquire()
while time() - t0 < self.timeout:
try:
self._lock = zc.lockfile.LockFile(self.lock_file)
Member

is it a regular lockfile or is there something specific? we are using lockfile already somewhere, a different implementation, so do we need zc here? should we specify the dependency explicitly then? Also, what happens if we break the execution in the middle, it will start raising an exception? should we at least explain how to recover from it.

Contributor

@shcheklein We are using zc.lockfile in other places too :) Not sure about the purpose of this lock though, do we actually write to the file it is protecting anywhere @ei-grad ?

break
params["pageToken"] = data["nextPageToken"]

def get_metadata(self, path_info, fields=None):
Member

could you please add comment what does it return? really hard to understand this. just trying to understand why logic is so complicated here

errors_count += 1
if errors_count >= 10:
raise
sleep(1.0)
Member

do we actually need this sleep here?

Contributor Author

It is a must to wait some time between consecutive resumable upload requests. If one request fails due to a brief network problem, all the retry attempts would otherwise be exhausted within a very short time. Maybe the same exponential backoff policy should be used here as for error handling in self.request, though it is not clear to me whether that would be the right solution for resumable uploads. The hardcoded 1-second sleep with 10 retries looks better, imho, but it is definitely not the best behavior either.

One possible solution could be to store the upload process in DVC's state database to make it possible to resume uploads between dvc runs, but this feels like overkill to me. Other backends don't handle connection interruptions / server errors during large file uploads at all, if I'm not mistaken.

Member

Yep. DB is an overkill for sure. As for retries - is it possible to use some decorator out of many existing? In both cases.
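
For reference, a retry decorator of the kind mentioned above could be sketched as follows. This is a hand-rolled illustration under assumed defaults, not an existing library and not the code in this PR; the name `retry` and all of its parameters are hypothetical. With factor=1.0 it reproduces the fixed 1-second sleep, and with factor=2.0 it gives exponential backoff.

```python
import functools
import time


def retry(exceptions, attempts=10, initial_delay=1.0, factor=2.0,
          max_delay=60.0, sleep=time.sleep):
    """Retry the decorated function when it raises one of `exceptions`.

    The delay starts at initial_delay and is multiplied by factor after
    each failed attempt, capped at max_delay. The last failure is re-raised.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise
                    sleep(delay)
                    delay = min(delay * factor, max_delay)
        return wrapper
    return decorator
```

The `sleep` parameter exists only so that tests can inject a fake and avoid real waiting; the same decorator could then wrap both the resumable upload step and the generic request helper.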

Member

@shcheklein shcheklein left a comment

a few questions to address

@efiop @ei-grad what else do we need to get it done, guys?

@ei-grad you wanted to try some library as far as I remember, what's your take on it?

@shcheklein
Member

@ei-grad please check DeepSource stuff as well, let's fix it (except obvious false positives).

def exists(self, path_info):
return self.client.exists(path_info)

def batch_exists(self, path_infos, callback):
Member

@efiop do we need to update anything to support threading here - I mean status, etc? you are changing something with @pared as far as I understand.

Contributor

Yes, batch_exists will no longer be needed after #2375. As to threads in general, as long as self.client is thread-safe (looks like it is, but maybe @ei-grad could confirm/deny that), we will be fine.

@efiop efiop closed this Sep 29, 2019
@efiop
Contributor

efiop commented Sep 29, 2019

Closing due to inactivity in favor of #2551
