remote: implement Google Drive #2040

ei-grad · 2019-05-22T16:23:42Z

Have you followed the guidelines in our Contributing document?
Does your PR affect documented changes or does it add new functionality that should be documented? If yes, have you created a PR for dvc.org documenting it or at least opened an issue for it? If so, please add a link to it.

Fix #2018

ei-grad · 2019-05-24T10:47:42Z

Review can be started - dvc pull / dvc push is working, I'm proceeding with tests.

dvc/config.py

dvc/remote/gdrive/google-dvc-client-id.json

efiop

@ei-grad Looks great! 🔥 Is it possible to do some func tests locally for gdrive? Or do we need a real drive account? Also, looks like upload/download could be refactored a bit so they are easier to read ;) I know that many other Remotes have a similar problem with upload/download methods, but just while we are at it, we could enhance this particular one a little bit by splitting it into separate sub-methods.

ei-grad · 2019-05-27T12:23:12Z

@efiop Thanks for the review! :)

Is it possible to do some func tests locally for gdrive? Or do we need a real drive account?

I think we need a real account. And IIRC I read somewhere in their API docs that it is not prohibited by Google policy to create a test account for this purpose. But I think it is also good to have full unit-test coverage of this implementation.

upload/download could be refactored a bit so they are easier to read

Sure. Btw, it is untestable now also. In progress.

efiop

Looking good!

tests/func/test_gdrive.py

dvc/path/gdrive.py

tests/func/test_gdrive.py

tests/unit/remote/test_gdrive.py

dvc/remote/gdrive/__init__.py

ei-grad · 2019-07-16T16:44:27Z

Just a status update - the latest feedback points were addressed, and only a couple of questions remain. It was probably not the right thing for me to mark the review conversations as resolved myself, sorry. Anyway, I'm in the process of a PyDrive-related refactoring that changes a notable portion of the code, and I'll probably come up with a new code review request later this week.

vmarkovtsev · 2019-07-17T13:58:47Z

Hey @ei-grad thank you so much for working on this! If you would benefit from any help, e.g. writing tests or coding, please let me know.

efiop · 2019-08-02T09:06:02Z

@ei-grad Please take a look at DeepSource complaints, I believe there are some valid ones.

efiop · 2019-08-02T09:11:25Z

tests/unit/remote/gdrive/conftest.py

+
+
+@pytest.fixture()
+def repo():


Do we need to use repo from the dvc code repo itself? Or can we use dvc_repo fixture, as we do everywhere else?

These unit tests don't need the real repo, so it is a bit excessive to set up and tear down the dvc_repo fixture for each test. But it may be a good idea to create one temporary repo for all of them. Would it be ok to make this fixture scope="module" and take/return dvc_repo?

Sounds good :)

shcheklein · 2019-08-06T21:48:36Z

dvc/remote/gdrive/__init__.py

+                        "{} is not a folder".format("/".join(current_path))
+                    )
+                parent = metadata["id"]
+        to_create = [part] + list(parts)


should it be sublist of parts here?

parts is a partially consumed iterator, but yeah, I'll rewrite it

The exception/break condition was also valid, but DeepSource also suggests that iterating over an iterator with a break/else construct and then using the loop variable is hard to read and bug-prone. :)
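To illustrate the point about the partially consumed iterator, here is a minimal sketch, not the PR's actual code. The function names and the `existing` set (standing in for the folders already present on the drive) are hypothetical. The first function mirrors the break/else pattern under discussion, the second does the same split with an explicit index.

```python
def split_with_iterator(parts, existing):
    """Split parts into (found, to_create) using a partially consumed iterator.

    `existing` is a hypothetical set of path tuples that already exist.
    """
    it = iter(parts)
    current = []
    for part in it:
        current.append(part)
        if tuple(current) not in existing:
            break  # `part` is the first missing component
    else:
        # The loop finished without break: every component exists.
        return current, []
    # `it` now yields only the components after `part`.
    return current[:-1], [part] + list(it)


def split_with_index(parts, existing):
    """Same result, without relying on the loop variable after the loop."""
    parts = list(parts)
    for index in range(len(parts)):
        if tuple(parts[: index + 1]) not in existing:
            return parts[:index], parts[index:]
    return parts, []
```

Both functions return the same pair, but the second one makes it explicit which slice of `parts` still has to be created.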

shcheklein · 2019-08-06T22:09:26Z

dvc/remote/gdrive/client.py

+    TIMEOUT = (5, 60)
+
+    def __init__(
+        self,


use kwargs instead of a long list here?

Hm. I'd rather pass the OAuth2 instance instead of its arguments. And this method also needs a docstring, probably.

shcheklein · 2019-08-06T22:09:52Z

dvc/remote/gdrive/client.py

+        Security notice:
+
+        It always adds the Authorization header to the requests, not paying
+        attention is request is for googleapis.com or not. It is just how


typo: is -> if

shcheklein · 2019-08-06T22:26:07Z

dvc/remote/gdrive/client.py

+    def session(self):
+        """AuthorizedSession to communicate with https://googleapis.com
+
+        Security notice:


it's probably easy to add a test/assert to check that domain is intact

Great idea, thanks! What do you think about just overriding the request() of AuthorizedSession?
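A rough sketch of what such a domain check could look like. To stay self-contained it wraps any session-like object instead of subclassing AuthorizedSession, and the names `is_google_api_url` and `DomainCheckedSession` are hypothetical, not part of the PR or of google-auth.

```python
from urllib.parse import urlsplit


def is_google_api_url(url):
    """Return True only for https URLs on googleapis.com or its subdomains."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    on_google = host == "googleapis.com" or host.endswith(".googleapis.com")
    return parts.scheme == "https" and on_google


class DomainCheckedSession:
    """Hypothetical wrapper that refuses to send requests to other domains."""

    def __init__(self, session):
        # `session` is any object with a requests-style request() method.
        self._session = session

    def request(self, method, url, **kwargs):
        if not is_google_api_url(url):
            raise ValueError("refusing to send credentials to {}".format(url))
        return self._session.request(method, url, **kwargs)
```

With a check like this, the Authorization header would never leave for a host outside googleapis.com, because the request is rejected before the underlying session sees it.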

dvc/remote/gdrive/client.py

shcheklein · 2019-08-06T22:47:31Z

dvc/remote/gdrive/exceptions.py

@@ -0,0 +1,17 @@
+from dvc.remote.gdrive.utils import response_error_message


we should include from __future__ import unicode_literals everywhere

shcheklein · 2019-08-06T23:16:10Z

dvc/remote/gdrive/oauth2.py

+        creds_id = self._get_creds_id(info["installed"]["client_id"])
+        return os.path.join(creds_storage_dir, creds_id)
+
+    def _get_storage_lock(self):


could you clarify this a little bit? why do we need the lock, and how does it work.

shcheklein · 2019-08-06T23:25:06Z

dvc/remote/gdrive/waitable_lock.py

+        self._thread_lock.acquire()
+        while time() - t0 < self.timeout:
+            try:
+                self._lock = zc.lockfile.LockFile(self.lock_file)


Is it a regular lockfile or is there something specific? We are already using a lockfile somewhere, with a different implementation, so do we need zc here? Should we specify the dependency explicitly then? Also, what happens if we break the execution in the middle: will it start raising an exception? Should we at least explain how to recover from it?

@shcheklein We are using zc.lockfile in other places too :) Not sure about the purpose of this lock though, do we actually write to the file it is protecting anywhere @ei-grad ?
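For readers following the lock discussion, here is a minimal sketch of a lock that waits with a timeout, in the spirit of the diff above. It deliberately does not use zc.lockfile: it relies on exclusive file creation from the standard library, so the class and exception names are hypothetical. One consequence of this swapped-in technique is that an interrupted process leaves the lock file behind, and it must then be removed by hand.

```python
import os
import threading
from time import monotonic, sleep


class LockTimeout(Exception):
    """Raised when the lock could not be acquired within the timeout."""


class WaitableFileLock:
    """Hypothetical lock that polls until the lock file can be created."""

    def __init__(self, lock_file, timeout=5.0, poll=0.05):
        self.lock_file = lock_file
        self.timeout = timeout
        self.poll = poll
        self._thread_lock = threading.Lock()
        self._fd = None

    def acquire(self):
        deadline = monotonic() + self.timeout
        # Serialize threads of this process first, then compete across processes.
        if not self._thread_lock.acquire(timeout=self.timeout):
            raise LockTimeout(self.lock_file)
        try:
            while True:
                try:
                    flags = os.O_CREAT | os.O_EXCL | os.O_WRONLY
                    self._fd = os.open(self.lock_file, flags)
                    return
                except FileExistsError:
                    if monotonic() >= deadline:
                        raise LockTimeout(self.lock_file)
                    sleep(self.poll)
        except BaseException:
            self._thread_lock.release()
            raise

    def release(self):
        os.close(self._fd)
        self._fd = None
        os.remove(self.lock_file)
        self._thread_lock.release()

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.release()
```

Used as `with WaitableFileLock(path): ...`, the body runs only while the lock file exists and is owned by the caller.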

shcheklein · 2019-08-06T23:41:22Z

dvc/remote/gdrive/client.py

+                break
+            params["pageToken"] = data["nextPageToken"]
+
+    def get_metadata(self, path_info, fields=None):


Could you please add a comment on what it returns? It is really hard to understand this; I'm just trying to understand why the logic is so complicated here.

shcheklein · 2019-08-06T23:52:33Z

dvc/remote/gdrive/client.py

+                errors_count += 1
+                if errors_count >= 10:
+                    raise
+                sleep(1.0)


do we actually need this sleep here?

It is a must to wait some time between consecutive resumable upload requests. If one request fails due to a short network problem, then without a pause all retry attempts would fail within a short time. Maybe the same exponential backoff policy should be used here as for error handling in self.request, though it is not clear to me whether that would be the right solution for resumable uploads. The hardcoded 1 second sleep with 10 retries looks better, imho, but it is definitely not the best behavior either.

One possible solution could be to store the upload progress in DVC's state database to make it possible to resume uploads between dvc runs, but this feels like overkill to me. Other backends don't handle connection interruptions or server errors during large file uploads at all, if I'm not mistaken.

Yep. DB is overkill for sure. As for retries - is it possible to use one of the many existing decorators? In both cases.
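A hand-rolled sketch of the decorator idea, covering both policies mentioned in this thread: a fixed 1 second pause with 10 attempts (`backoff=1.0`) and exponential backoff (`backoff=2.0`). The `retry` name and its parameters are hypothetical; an existing retry library could serve the same purpose.

```python
import functools
from time import sleep


def retry(attempts=10, delay=1.0, backoff=1.0, exceptions=(Exception,),
          sleep_func=sleep):
    """Retry the wrapped call, sleeping between failed attempts.

    backoff=1.0 keeps a constant delay; backoff=2.0 doubles it each time.
    `sleep_func` is injectable so the behavior can be tested without waiting.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: propagate the last error
                    sleep_func(wait)
                    wait *= backoff
        return wrapper
    return decorator
```

For example, decorating a chunk upload helper with `@retry(attempts=10, delay=1.0)` would reproduce the hardcoded behavior from the diff, while `@retry(attempts=5, delay=1.0, backoff=2.0)` would wait 1, 2, 4 and 8 seconds between attempts.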

shcheklein

a few questions to address

@efiop @ei-grad what else do we need to get it done, guys?

@ei-grad you wanted to try some library as far as I remember, what's your take on it?

shcheklein · 2019-08-07T00:05:45Z

@ei-grad please check DeepSource stuff as well, let's fix it (except obvious false positives).

shcheklein · 2019-08-08T00:25:38Z

dvc/remote/gdrive/__init__.py

+    def exists(self, path_info):
+        return self.client.exists(path_info)
+
+    def batch_exists(self, path_infos, callback):


@efiop do we need to update anything to support threading here - I mean status, etc? you are changing something with @pared as far as I understand.

Yes, batch_exists will no longer be needed after #2375. As to threads in general, as long as self.client is thread-safe (looks like it is, but maybe @ei-grad could confirm/deny that) we will be fine.

efiop · 2019-09-29T21:34:36Z

Closing due to inactivity in favor of #2551

ei-grad changed the title ~~remote: implement Google Drive~~ [WIP] remote: implement Google Drive May 22, 2019

ei-grad force-pushed the google-drive branch 5 times, most recently from 4524712 to d249f69 May 24, 2019 10:45

ei-grad mentioned this pull request May 25, 2019

Google Drive remote documentation iterative/dvc.org#381

Closed

efiop reviewed May 27, 2019

dvc/config.py Outdated

efiop reviewed May 27, 2019

dvc/remote/gdrive/google-dvc-client-id.json

efiop reviewed May 27, 2019

ei-grad force-pushed the google-drive branch from 646f928 to 679ad9a May 30, 2019 15:34

ei-grad changed the title ~~[WIP] remote: implement Google Drive~~ remote: implement Google Drive May 30, 2019

efiop suggested changes May 31, 2019

tests/func/test_gdrive.py Outdated

dvc/path/gdrive.py Outdated