cloud versioning: `version_aware` status slow #8774
The main problem here is that for regular DVC we are able to speed up the status check with:

- remote size estimation (which lets us choose between listing the remote and querying objects individually), and
- the `.dir` optimization (checking only the `.dir` file for a directory instead of every file it contains).

But we can't use either of these optimizations with cloud versioning. Assuming that if we have a version ID we don't need to push is similar to the `.dir` optimization, but the difference there would be that we at least check one file in the remote (the `.dir` file), and for regular DVC remotes the expectation is that DVC is the only thing modifying the entire bucket (which is not necessarily the case with cloud versioning). Since we can't actually do any size estimation for versioned remotes, I suppose we could try running both the sequential […]

Originally posted by @pmrowla in #8766 (comment)
Can you explain more how the remote size estimation works? I didn't get it from this comment.
Which one do we do for […]?

I guess this is where most of the time difference comes from. What do you think about checking the first file entry in each .dvc file? And do you think it would be less of an issue if we implemented #7268?
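A rough illustration of that "first entry" idea (a sketch, not anything DVC implements today): parse the cloud-versioned .dvc file and spot-check only its first recorded file version as a cheap proxy for the whole output. The `files:` layout shown and the `version_exists` helper are assumptions made for the sketch.

```python
import yaml  # assuming PyYAML is available; .dvc files are YAML


def first_entry_probably_pushed(dvc_file_path, version_exists):
    """Spot-check only the first versioned entry in a .dvc file.

    `version_exists(relpath, version_id)` is a hypothetical callable that
    makes a single cloud request (e.g. a HEAD on one object version).
    If the first entry is present, optimistically treat the whole output
    as pushed instead of querying every file in the dataset.
    """
    with open(dvc_file_path) as f:
        meta = yaml.safe_load(f)

    files = meta["outs"][0].get("files") or []
    if not files:
        return False  # nothing recorded yet -> needs a real status check

    first = files[0]
    return version_exists(first["relpath"], first.get("version_id"))
```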
So depending on the # of objects whose status we have to check, and the total number of objects in the bucket, sometimes it is faster to query them individually and sometimes it is faster to list the entire remote. Essentially, for a relatively large # of objects to query and a relatively small total remote size, listing is faster. For a relatively small # of objects to query and a relatively large total remote size, it is faster to use individual queries.

For regular DVC remotes, we can get a rough estimate of the remote size, since we know the structure of the remote (content-addressable storage keyed on MD5 hashes). MD5 is evenly distributed, so objects in the remote are also evenly distributed. This means we can do a listing of a subset of the remote by object key prefix, and use the # of returned results for that subset to estimate the size of the entire remote. Basically, if we know how many objects fall under one prefix, we can extrapolate the total from that one subset, since the prefixes are evenly populated.

Once we have an estimated size, we can make a fairly good guess as to whether it will be faster for us to finish listing the rest of the remote, or just switch to using individual queries.

For cloud versioned remotes, there is no way for us to estimate the size of the remote, since we do not know anything about the # of objects within a given directory, and we do not know anything about the # of object versions for each object within a given directory.
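As a concrete illustration of the estimation trick described above, here is a minimal sketch (not DVC's actual code): sample one hash prefix, extrapolate the remote size from MD5's even distribution, and then compare a rough request-count cost for "finish listing" vs. "query each object". The `list_prefix` helper, the 2-character prefix width, and the cost constants are all assumptions.

```python
import math

PREFIX_WIDTH = 2                    # assumed CAS layout: "ab/cdef..." keyed on hex MD5
PREFIX_SPACE = 16 ** PREFIX_WIDTH   # 256 possible prefixes, evenly populated


def estimate_remote_size(list_prefix):
    """Estimate the total object count by listing a single hash prefix.

    `list_prefix(prefix)` is a hypothetical callable yielding the keys stored
    under `prefix` (e.g. one S3 ListObjectsV2 loop). Because MD5 is uniformly
    distributed, one prefix holds roughly 1/256 of the remote.
    """
    sample = sum(1 for _ in list_prefix("00"))
    return sample * PREFIX_SPACE


def plan_status_check(num_queries, estimated_remote_size,
                      list_page_size=1000, query_cost_per_object=1.0):
    """Rough cost model: list everything vs. query each object individually.

    Listing costs ~one request per `list_page_size` keys; individual checks
    cost one request per object. The constants are illustrative only.
    """
    listing_cost = math.ceil(estimated_remote_size / list_page_size)
    query_cost = num_queries * query_cost_per_object
    return "list" if listing_cost < query_cost else "query"
```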
For […]
With one of the selling points for cloud-versioned remotes being that we leave things human-readable and in their original structure, so that users can integrate them into existing workflows that may be adding/removing things in the bucket without using DVC, we can't make the same kind of assumptions that let us do things like the `.dir` optimization. The `.dir` optimization for regular remotes only works because we can make a relatively safe assumption that DVC is the only thing making modifications to the bucket (whether it's from […]).

Adding the force push workaround would help, but that's still really only a workaround, and it only helps when the user actually notices that some files were previously not pushed as expected.
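For contrast, a sketch of what the `.dir` shortcut relies on for regular remotes (hypothetical helper names, not DVC's API): if the directory manifest is in the remote, assume every file it references was pushed with it, which is only safe when DVC is the sole writer to the bucket.

```python
def dir_is_pushed(dir_hash, entry_hashes, remote_exists):
    """`.dir` shortcut for content-addressed remotes (sketch).

    `remote_exists(obj_hash)` is a hypothetical single-object existence check.
    The shortcut assumes the .dir manifest is only present in the remote once
    all referenced files were pushed, so finding the manifest lets us skip
    per-file checks -- valid only if nothing besides DVC modifies the bucket.
    """
    if remote_exists(dir_hash):
        return True  # manifest present -> assume its entries are too
    # otherwise fall back to checking each referenced file individually
    return all(remote_exists(h) for h in entry_hashes)
```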
Seems like dql has a similar problem, and AFAIK they plan to ask users when they want to actually reindex what's in the cloud, so it seems consistent for now to assume everything exists until forced to check again.

One problem I found when you included that assumption in #8766 is that when testing with both a […]
@dberenbaum there are WIP PRs for this you can try, from a venv with dvc main: […]
Thanks @pmrowla. I'm getting mixed results testing it: not much difference with the 2800-file dataset in s3://dave-sandbox-versioning/coco-small-test, but substantial improvement with the 7000-file dataset in s3://dave-sandbox-versioning/coco-small-test. Since the bigger datasets are obviously more important, it seems like this should be good enough for now.

More importantly, those changes seem to be breaking the status check with […]
After iterative/dvc-data#246, the only blocker should be pushing incremental changes to a `version_aware` remote (not an issue with `worktree`). For example, pushing when nothing has changed: […]

The `version_aware` remote hangs with a status bar stuck at 0% for most of the runtime. I'm guessing DVC is looking up each file by version and this takes a long time. Is it needed? I think we discussed that it might not be necessary to validate that the data exists on the cloud if there is a `version_id` in the .dvc file. That might be a strong assumption to make, but it feels painfully slow compared to other remote types to check the status.

Originally posted by @dberenbaum in #8359 (comment)
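A sketch of the shortcut being discussed (an assumption about the proposed behavior, not current DVC code): treat a recorded `version_id` as proof the object is already in the bucket and skip the per-file cloud lookup during status. Here `entry` mirrors one item of the `files:` list in a cloud-versioned .dvc file, and `version_exists` is a hypothetical per-object check.

```python
def needs_push(entry, version_exists, trust_version_ids=True):
    """Decide whether one versioned file still needs to be pushed.

    entry: dict like {"relpath": "foo.txt", "version_id": "abc123"} taken
    from a cloud-versioned .dvc file. `version_exists(relpath, version_id)`
    stands in for a single cloud request.
    """
    version_id = entry.get("version_id")
    if version_id is None:
        return True   # never uploaded, must push
    if trust_version_ids:
        return False  # assume the recorded version still exists remotely
    return not version_exists(entry["relpath"], version_id)
```

The obvious trade-off is that status would report everything as up to date even if an object version had been deleted out-of-band, which is exactly the concern behind the force-push workaround mentioned earlier in the thread.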