Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fs: add batch utils.exists implementation #186

Merged
merged 2 commits into from
Jan 19, 2023

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Jan 18, 2023

Adds naive fs.utils.exists implementation for querying batched file existence. Essentially it is a dumbed down and much less optimized alternative to what we do for querying ODB status.

For version_aware remotes, on push we need to check whether or not version IDs that we have in the .dvc file still exist in the remote (old object versions could have been removed from the bucket for a variety of reasons). If the version ID we have no longer exists, we should push the object again and update the .dvc file with the new version ID.

To check this existence we have two options:

  1. Check existence using fs.exists() for each file we have to push
  2. Use fs.ls(<parent dir(s)>, versions=True) calls and get a listing of all object versions in a given directory (and then compare that list with the list of files we need to check)

This presents the same problem we have w/regular remotes: Option 1 is faster when we have a relatively small number of files/versions to check and a bucket with a relatively large number of object versions. Option 2 is faster when we have a relatively large number of versions to check and a relatively small number of versions in the bucket.

For regular remotes we know the remote structure is evenly distributed according to md5 prefixes and can make an actual estimate as to which method will be faster, but for cloud versioning we have no idea what the dir structure of the remote looks like, and have no idea how many versions there will be for any given object in the remote. So essentially we cannot know beforehand which option to use when checking status for cloud versioned remotes.

This PR goes with the simplest solution I could think of and just does both operations in parallel until we have gotten a result for all the paths we are looking for.

related: iterative/dvc#8774

@pmrowla pmrowla self-assigned this Jan 18, 2023
@codecov-commenter
Copy link

codecov-commenter commented Jan 18, 2023

Codecov Report

Base: 67.30% // Head: 65.32% // Decreases project coverage by -1.98% ⚠️

Coverage data is based on head (c80a03c) compared to base (30d3681).
Patch coverage: 13.23% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #186      +/-   ##
==========================================
- Coverage   67.30%   65.32%   -1.98%     
==========================================
  Files          25       25              
  Lines        1826     1892      +66     
  Branches      265      289      +24     
==========================================
+ Hits         1229     1236       +7     
- Misses        550      601      +51     
- Partials       47       55       +8     
Impacted Files Coverage Δ
src/dvc_objects/fs/utils.py 30.64% <11.94%> (-11.03%) ⬇️
src/dvc_objects/fs/base.py 55.18% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

Comment on lines 464 to 465
# NOTE: if we started a long running lsdir it will continue to run in
# the background until the task completes
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main issue here is that fsspec ls() (or find) only returns a complete result (instead of using iterators). So there's no straightforward way for us to short-circuit the fsspec ls call once it has been started. So if we call ls() on what ends up being a large directory with a large # of object versions, that call will keep running in the background thread until fsspec has gotten the entire listing and returned it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmrowla pmrowla changed the title fs: add generic.status implementation fs: add batch utils.exists implementaiton Jan 19, 2023
@pmrowla pmrowla marked this pull request as ready for review January 19, 2023 08:01
@pmrowla pmrowla merged commit af41d3c into iterative:main Jan 19, 2023
@pmrowla pmrowla deleted the generic-status branch January 19, 2023 08:03
@pmrowla pmrowla changed the title fs: add batch utils.exists implementaiton fs: add batch utils.exists implementation Jan 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants