-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fs: add batch utils.exists implementation #186
Conversation
f3a3ef2
to
1495b96
Compare
1495b96
to
5bfa5c2
Compare
Codecov ReportBase: 67.30% // Head: 65.32% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #186 +/- ##
==========================================
- Coverage 67.30% 65.32% -1.98%
==========================================
Files 25 25
Lines 1826 1892 +66
Branches 265 289 +24
==========================================
+ Hits 1229 1236 +7
- Misses 550 601 +51
- Partials 47 55 +8
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
src/dvc_objects/fs/generic.py
Outdated
# NOTE: if we started a long running lsdir it will continue to run in | ||
# the background until the task completes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main issue here is that fsspec ls() (or find) only returns a complete result (instead of using iterators). So there's no straightforward way for us to short-circuit the fsspec ls call once it has been started. So if we call ls() on what ends up being a large directory with a large # of object versions, that call will keep running in the background thread until fsspec has gotten the entire listing and returned it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note check iterdir: https://github.com/fsspec/s3fs/blob/main/s3fs/core.py#L703
5bfa5c2
to
c80a03c
Compare
Adds naive
fs.utils.exists
implementation for querying batched file existence. Essentially it is a dumbed down and much less optimized alternative to what we do for querying ODB status.For
version_aware
remotes, onpush
we need to check whether or not version IDs that we have in the .dvc file still exist in the remote (old object versions could have been removed from the bucket for a variety of reasons). If the version ID we have no longer exists, we should push the object again and update the .dvc file with the new version ID.To check this existence we have two options:
fs.exists()
for each file we have to pushfs.ls(<parent dir(s)>, versions=True)
calls and get a listing of all object versions in a given directory (and then compare that list with the list of files we need to check)This presents the same problem we have w/regular remotes: Option 1 is faster when we have a relatively small number of files/versions to check and a bucket with a relatively large number of object versions. Option 2 is faster when we have a relatively large number of versions to check and a relatively small number of versions in the bucket.
For regular remotes we know the remote structure is evenly distributed according to md5 prefixes and can make an actual estimate as to which method will be faster, but for cloud versioning we have no idea what the dir structure of the remote looks like, and have no idea how many versions there will be for any given object in the remote. So essentially we cannot know beforehand which option to use when checking status for cloud versioned remotes.
This PR goes with the simplest solution I could think of and just does both operations in parallel until we have gotten a result for all the paths we are looking for.
related: iterative/dvc#8774