cloud versioning: slow #8359

Closed · dberenbaum opened this issue Sep 23, 2022 · 7 comments · Fixed by #8766

Labels: A: cloud-versioning (Related to cloud-versioned remotes), p1-important (Important, aka current backlog of things to do)

Comments

@dberenbaum (Collaborator)

Part of #7995

Description

Cloud-versioned remotes are much slower than traditional remotes.

Reproduce

$ git clone git@github.com:iterative/example-get-started.git
$ cd example-get-started
$ dvc pull
$ dvc remote add -f -d cloud s3://mybucket/remote
$ time dvc push
8 files pushed
dvc push  0.71s user 0.17s system 13% cpu 6.626 total
$ dvc remote modify cloud version_aware true
$ dvc remote modify cloud worktree true
$ time dvc push
Updating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
9 files pushed
dvc push  0.98s user 0.21s system 6% cpu 17.443 total
@dberenbaum added the A: cloud-versioning and p1-important labels on Sep 23, 2022
@dberenbaum added the p2-medium label and removed p1-important on Oct 12, 2022
@dberenbaum added the p1-important label and removed p2-medium on Oct 31, 2022
@dberenbaum (Collaborator, Author)

In the time it takes to complete a regular dvc push or aws s3 sync of a large directory, dvc push to a cloud-versioned remote hangs: it shows no output (even with -v) and makes no progress that I can tell. Tested with the contents of s3://dave-sandbox-versioning/coco/test2014.

@boscacci

Seconded. It would be nice to at least see some logging output with -v when pushing while worktree = true.

@dberenbaum (Collaborator, Author)

Another concern here is whether the current .dvc YAML structure is too slow to read and write when it tracks large directories (tens or hundreds of thousands of files).
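
For context, this is roughly the structure in question: a version-aware .dvc file records one entry per file in the tracked directory. An illustrative sketch with made-up values (the exact schema may differ from this):

```yaml
outs:
- path: cats-dogs
  files:
  - relpath: cat.1.jpg   # one entry like this per tracked file
    size: 16880
    md5: 3a0e45e1a0c4d6b8e9f1023456789abc   # made-up hash
    etag: 3a0e45e1a0c4d6b8e9f1023456789abc
    version_id: GiTmWrV4f.CzrmQ0BQC6dZAHhezXpqZh   # made-up S3 version ID
```

With tens or hundreds of thousands of files, just loading and dumping that list could plausibly dominate runtime.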

@dberenbaum (Collaborator, Author) commented Dec 2, 2022

Tested with 2800 images taken from https://github.com/iterative/dataset-registry/blob/master/use-cases/cats-dogs.dvc.

Results with cloud versioning:

$ time dvc push
2800 files pushed
dvc push  40.71s user 6.08s system 12% cpu 6:13.09 total
$ time dvc pull
A       cats-dogs/
1 file added and 2800 files fetched
dvc pull  24.05s user 6.45s system 15% cpu 3:17.54 total

Results without cloud versioning:

$ time dvc push 
2801 files pushed
dvc push -r docs/dvc use-cases/cats-dogs.dvc  24.75s user 3.50s system 37% cpu 1:15.45 total
$ time dvc pull
A       use-cases/cats-dogs/
1 file added and 2600 files fetched
dvc pull -r docs/dvc use-cases/cats-dogs.dvc  6.82s user 2.49s system 45% cpu 20.355 total

@dberenbaum (Collaborator, Author)

Results for a slightly larger dataset.

With cloud versioning:

$ time dvc push
6946 files pushed
dvc push  97.23s user 16.91s system 8% cpu 23:39.43 total
$ time dvc pull
M       test2014/
1 file modified and 6946 files fetched
dvc pull  66.84s user 21.57s system 11% cpu 13:00.08 total

Without cloud versioning:

$ time dvc push
6947 files pushed
dvc push  50.29s user 8.22s system 21% cpu 4:35.87 total
$ time dvc pull
M       test2014/
1 file modified and 6946 files fetched
dvc pull  22.32s user 8.36s system 33% cpu 1:30.41 total
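
Pulling the wall-clock totals from the two tests above into one view (slowdown = cloud-versioned total / regular total):

| Dataset | Operation | Regular remote | Cloud-versioned | Slowdown |
|---|---|---|---|---|
| cats-dogs (2800 files) | push | 1:15.45 | 6:13.09 | ~4.9x |
| cats-dogs (2800 files) | pull | 20.36s | 3:17.54 | ~9.7x |
| test2014 (6946 files) | push | 4:35.87 | 23:39.43 | ~5.1x |
| test2014 (6946 files) | pull | 1:30.41 | 13:00.08 | ~8.6x |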

I can follow up with profiling for each.

@dberenbaum (Collaborator, Author) commented Dec 22, 2022

Some profiling done with the data from s3://dave-sandbox-versioning/registry-cloud-versioned/cats-dogs/.

For regular cache, here's the zip of the yappi cachegrind output per thread:

cache.zip

Here's the same for version-aware:

version.zip

Edit: And here's worktree:

worktree.zip

Haven't looked through the output at all, but I notice that there are 45 threads/outputs from the regular cache but only 5 from the cloud-versioned ones. Is that expected?
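
For anyone reproducing these profiles, a minimal sketch of how per-thread cachegrind files like those in the zips can be generated with yappi. The in-process Repo.push() call and the remote name are assumptions for illustration, not the exact invocation used above:

```python
import yappi
from dvc.repo import Repo  # assumes this runs inside a DVC repository

yappi.start()
Repo(".").push(remote="versioned")  # hypothetical remote name
yappi.stop()

# Write one cachegrind file per thread, mirroring the per-thread
# outputs bundled in the zips above.
for thread in yappi.get_thread_stats():
    yappi.get_func_stats(ctx_id=thread.id).save(
        f"callgrind.dvc-push.t{thread.id}", type="callgrind"
    )
```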

@pmrowla self-assigned this on Dec 27, 2022
@pmrowla added this to the DVC project on Jan 4, 2023
@pmrowla moved this from Backlog to In Progress in DVC on Jan 4, 2023
@github-project-automation bot moved this to Backlog in DVC on Jan 4, 2023
@dberenbaum (Collaborator, Author)

After iterative/dvc-data#246, the only blocker should be pushing incremental changes to a version_aware remote (not an issue with worktree). For example, pushing when nothing has changed:

$ time dvc push -r cache
Everything is up to date.
dvc push -r cache  1.74s user 0.16s system 83% cpu 2.290 total
$ time dvc push -r worktree
Everything is up to date.
dvc push -r worktree  1.95s user 0.14s system 75% cpu 2.788 total
$ time dvc push -r versioned
Everything is up to date.
dvc push -r versioned  5.23s user 0.38s system 50% cpu 11.095 total

The version_aware remote hangs with a status bar stuck at 0% for most of the runtime. I'm guessing DVC is looking up each file by version, and this takes a long time. Is that needed? I think we discussed that it might not be necessary to validate that the data exists in the cloud if there is a version_id in the .dvc file. That might be a strong assumption to make, but checking status this way feels painfully slow compared to other remote types.
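
To make the suspected cost concrete, here's a hypothetical sketch (bucket name, key layout, and helper are placeholders, not DVC's actual code) of why a per-version existence check would be so much slower than a regular status check:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def exists_at_version(bucket: str, key: str, version_id: str) -> bool:
    """One HEAD request per tracked file: status becomes O(n) round trips."""
    try:
        s3.head_object(Bucket=bucket, Key=key, VersionId=version_id)
        return True
    except ClientError:
        return False

# A regular (content-addressed) remote can instead list a whole prefix,
# covering up to 1000 objects per request:
paginator = s3.get_paginator("list_objects_v2")
present = {
    obj["Key"]
    for page in paginator.paginate(Bucket="mybucket", Prefix="remote/")
    for obj in page.get("Contents", [])
}
```

If the version_id in the .dvc file were trusted as proof of existence, the per-file HEAD calls could be skipped entirely when checking status.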
