glob and rglob can be slow for large directories or lots of files #274
Comments
Yes, I agree with the sentiment that we want a much faster simple recursive directory listing function than what we currently expose. As you point out, we work to make CloudPaths and Paths as interchangeable as possible, which means not relying on an expanded API for core functionality. With that in mind, there are two APIs for recursively iterating over the contents of a directory in pathlib. First, in Python 3.12+ we will have Path.walk. We'll put that aside for now. Second, I think we want the current most common ways of recursively listing directories
The In my mind, shortcutting these common "list everything" recursive calls to Would that work for your case?
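For reference, the pathlib-compatible ways of recursively listing a directory mentioned in this comment can be sketched on a local temporary directory. This is only an illustration with plain `pathlib.Path`, not cloudpathlib's implementation; the helper names (`build_tree`, `list_with_rglob`, `list_with_glob`, `list_with_walk`) are made up for the example, and `Path.walk` is only used when the running Python provides it (3.12+).

```python
import tempfile
from pathlib import Path


def build_tree(root: Path) -> None:
    """Create a tiny nested tree: two files at the top level, one in a subdirectory."""
    (root / "sub").mkdir()
    for rel in ("a.txt", "b.txt", "sub/c.txt"):
        (root / rel).write_text("x")


def list_with_rglob(root: Path) -> set:
    # rglob("*") yields every entry below root, directories included.
    return set(root.rglob("*"))


def list_with_glob(root: Path) -> set:
    # glob("**/*") is the same listing spelled with an explicit recursive wildcard.
    return set(root.glob("**/*"))


def list_with_walk(root: Path) -> set:
    # Path.walk only exists on Python 3.12+; fall back to rglob on older versions.
    if not hasattr(root, "walk"):
        return list_with_rglob(root)
    found = set()
    for dirpath, dirnames, filenames in root.walk():
        found.update(dirpath / name for name in dirnames + filenames)
    return found


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_tree(root)
    results = [list_with_rglob(root), list_with_glob(root), list_with_walk(root)]
    names = sorted(p.relative_to(root).as_posix() for p in results[0])
```

All three spellings return the same set of entries here, which is why a fast path for the "list everything" patterns would not require any API beyond what pathlib already offers.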
Yes, that would work well.
That would be great, thanks!
After some consideration, I realized that without #176 it wouldn't work, since I would still need to filter out files.
Per discussion in #276, this needs some profiling to really get to the bottom of why
Similar problem arose for me
@Gilthans @RyanMarten We released a pretty big improvement here in #304. You can test it in version
Hello!
While gleefully using cloudpathlib, I needed a recursive iteration of files in a directory. This directory is large (6238 files), so my first approach, a recursive iterdir() + is_file(), took waaaay too long (likely due to #176).
I remembered that glob made better use of the cloud list calls, so I tried list(p.rglob("*")). In my directory, that took 17m:34s.
I then tried to 'cheat' and call [f for f, is_dir in p.client._list_dir(p, recursive=True) if not is_dir]. It took 1.435s.
I looked at the glob logic, but I still can't understand the reason for the discrepancy (using Google Cloud). This may warrant another issue.
However, I wonder if it might be a good idea to make _list_dir a public function in the meantime as a workaround.
Another option is to add recursive and/or files_only keywords to iterdir. This deviates from the pathlib API, but since these are added keywords, it might be OK?
I'm suggesting these options even though solving #176 would probably solve most issues, because these solutions are much simpler.
I'd of course be happy to send a PR.
WDYT?
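To make the comparison concrete, here is a minimal sketch of the three approaches described above. The function names (`files_via_iterdir`, `files_via_rglob`, `files_via_list_dir`) are hypothetical and exist only for this example. The first two are exercised on a local `pathlib.Path` so the snippet runs without cloud credentials. The third mirrors the private `client._list_dir` call quoted in this issue; it assumes that call yields `(path, is_dir)` pairs as used there, is not part of the public API, and is defined here but not executed.

```python
import tempfile
from pathlib import Path


def files_via_iterdir(path):
    """Recursive iterdir() + is_file(): one listing per directory plus one check per entry."""
    for child in path.iterdir():
        if child.is_file():
            yield child
        else:
            yield from files_via_iterdir(child)


def files_via_rglob(path):
    """rglob("*") followed by an is_file() filter to drop directories."""
    return (child for child in path.rglob("*") if child.is_file())


def files_via_list_dir(cloud_path):
    """The workaround from the issue. Relies on the private client._list_dir
    (assumed to yield (path, is_dir) pairs), which may change without notice."""
    listing = cloud_path.client._list_dir(cloud_path, recursive=True)
    return (f for f, is_dir in listing if not is_dir)


# Local demonstration; pass a CloudPath instead to compare timings on a bucket.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested" / "deeper").mkdir(parents=True)
    for rel in ("top.txt", "nested/mid.txt", "nested/deeper/leaf.txt"):
        (root / rel).write_text("x")

    by_iterdir = sorted(p.relative_to(root).as_posix() for p in files_via_iterdir(root))
    by_rglob = sorted(p.relative_to(root).as_posix() for p in files_via_rglob(root))
```

On a local disk the per-entry `is_file()` check is cheap. On a cloud backend each check may translate into a separate request, which is the likely cost referred to in #176.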