Find runs by glob-ing levels of subdirectories #1087
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -17,36 +17,182 @@ | |
from __future__ import division | ||
from __future__ import print_function | ||
|
||
import collections | ||
import os | ||
import re | ||
|
||
import six | ||
import tensorflow as tf | ||
|
||
_ESCAPE_GLOB_CHARACTERS_REGEX = re.compile('([*?[])') | ||
|
||
|
||
# TODO(chihuahua): Rename this method to use camel-case for GCS (Gcs). | ||
def IsGCSPath(path): | ||
return path.startswith("gs://") | ||
|
||
|
||
def IsCnsPath(path): | ||
return path.startswith("/cns/") | ||
|
||
|
||
def IsTensorFlowEventsFile(path): | ||
"""Check the path name to see if it is probably a TF Events file. | ||
|
||
Args: | ||
path: A file path to check if it is an event file. | ||
|
||
Raises: | ||
ValueError: If the path is an empty string. | ||
|
||
Returns: | ||
If path is formatted like a TensorFlowEventsFile. | ||
""" | ||
if not path: | ||
raise ValueError('Path must be a nonempty string') | ||
return 'tfevents' in tf.compat.as_str_any(os.path.basename(path)) | ||
|
||
|
||
def ListDirectoryAbsolute(directory): | ||
"""Yields all files in the given directory. The paths are absolute.""" | ||
return (os.path.join(directory, path) | ||
for path in tf.gfile.ListDirectory(directory)) | ||
|
||
|
||
def ListRecursively(top): | ||
def _EscapeGlobCharacters(path): | ||
"""Escapes the glob characters in a path. | ||
|
||
Python 3 has a glob.escape method, but python 2 lacks it, so we manually | ||
implement this method. | ||
|
||
Args: | ||
path: The absolute path to escape. | ||
|
||
Returns: | ||
The escaped path string. | ||
""" | ||
drive, path = os.path.splitdrive(path) | ||
return '%s%s' % (drive, _ESCAPE_GLOB_CHARACTERS_REGEX.sub(r'[\1]', path)) | ||
|
||
|
||
def ListRecursivelyViaGlobbing(top): | ||
"""Recursively lists all files within the directory. | ||
|
||
This method does not list subdirectories (in addition to regular files), and | ||
the file paths are all absolute. If the directory does not exist, this yields | ||
nothing. | ||
|
||
This method does so by glob-ing deeper and deeper directories, i.e. | ||
foo/*, foo/*/*, foo/*/*/* and so on until all files are listed. All file | ||
paths are absolute, and this method lists subdirectories too. | ||
|
||
For certain file systems, globbing via this method may prove | ||
significantly faster than recursively walking a directory. | ||
Specifically, file systems that implement analogs to TensorFlow's | ||
FileSystem.GetMatchingPaths method could save costly disk reads by using | ||
this method. However, for other file systems, this method might prove slower | ||
because the file system performs a walk per call to glob (in which case it | ||
might as well just perform 1 walk). | ||
|
||
Args: | ||
top: A path to a directory. | ||
|
||
Yields: | ||
A (dir_path, file_paths) tuple for each directory/subdirectory. | ||
""" | ||
current_glob_string = os.path.join(_EscapeGlobCharacters(top), '*') | ||
level = 0 | ||
|
||
while True: | ||
tf.logging.info('GlobAndListFiles: Starting to glob level %d', level) | ||
glob = tf.gfile.Glob(current_glob_string) | ||
tf.logging.info( | ||
'GlobAndListFiles: %d files glob-ed at level %d', len(glob), level) | ||
|
||
if not glob: | ||
# This subdirectory level lacks files. Terminate. | ||
FWIW, I thought of another optional optimization. Don't have to do it in this PR but maybe add a TODO(nickfelt) to implement at some point? The idea is that if at any point the glob returns files that are all in a single directory (i.e.
Great idea! Indeed, this should improve efficiency for that scenario! Done.
Awesome, thanks for implementing this! It'd probably be good to have a test that exercises this just to be sure it works at producing the same listing as the non-optimized version. Could be done by just having a structure like:
After listing
+1, I made the deep directory structure test this case. Specifically, see |
||
return | ||
|
||
# Map subdirectory to a list of files. | ||
pairs = collections.defaultdict(list) | ||
for file_path in glob: | ||
pairs[os.path.dirname(file_path)].append(file_path) | ||
for dir_name, file_paths in six.iteritems(pairs): | ||
yield (dir_name, tuple(file_paths)) | ||
|
||
if len(pairs) == 1: | ||
# If at any point the glob returns files that are all in a single | ||
# directory, replace the current globbing path with that directory as the | ||
# literal prefix. This should improve efficiency in cases where a single | ||
# subdir is significantly deeper than the rest of the subdirs. | ||
current_glob_string = os.path.join(list(pairs.keys())[0], '*') | ||
|
||
# Iterate to the next level of subdirectories. | ||
current_glob_string = os.path.join(current_glob_string, '*') | ||
level += 1 | ||
|
||
|
||
def ListRecursivelyViaWalking(top): | ||
"""Walks a directory tree, yielding (dir_path, file_paths) tuples. | ||
|
||
For each of `top` and its subdirectories, yields a tuple containing the path | ||
to the directory and the path to each of the contained files. Note that | ||
unlike os.walk()/tf.gfile.Walk(), this does not list subdirectories and the | ||
file paths are all absolute. | ||
unlike os.walk()/tf.gfile.Walk()/ListRecursivelyViaGlobbing, this does not | ||
list subdirectories. The file paths are all absolute. If the directory does | ||
not exist, this yields nothing. | ||
|
||
If the directory does not exist, this yields nothing. | ||
Walking may be incredibly slow on certain file systems. | ||
|
||
Args: | ||
top: A path to a directory.. | ||
top: A path to a directory. | ||
|
||
Yields: | ||
A list of (dir_path, file_paths) tuples. | ||
A (dir_path, file_paths) tuple for each directory/subdirectory. | ||
""" | ||
for dir_path, _, filenames in tf.gfile.Walk(top): | ||
yield (dir_path, (os.path.join(dir_path, filename) | ||
for filename in filenames)) | ||
|
||
|
||
def GetLogdirSubdirectories(path): | ||
Should there be a test for GetLogdirSubdirectories() itself? It can probably be lighter weight since it's not testing the full recursive walk algorithm, but something that checks the basic logic of returning only subdirs that contain at least one events file seems good to have.
Yes! Done. |
||
"""Obtains all subdirectories with events files. | ||
|
||
The order of the subdirectories returned is unspecified. The internal logic | ||
that determines order varies by scenario. | ||
|
||
Args: | ||
path: The path to a directory under which to find subdirectories. | ||
|
||
Returns: | ||
A tuple of absolute paths of all subdirectories each with at least 1 events | ||
Probably worth mentioning that the order of the subdirectories returned is unspecified, since the two different methods return the directories in different orders. Also, this could be done as a generator to match the old version of this function, with a little more work around the unique-ifying logic for the glob implementation. Do you think that would be useful?
Sounds great! |
||
file directly within the subdirectory. | ||
|
||
Raises: | ||
ValueError: If the path passed to the method exists and is not a directory. | ||
""" | ||
if not tf.gfile.Exists(path): | ||
# No directory to traverse. | ||
return () | ||
|
||
if not tf.gfile.IsDirectory(path): | ||
raise ValueError('GetLogdirSubdirectories: path exists and is not a ' | ||
'directory, %s' % path) | ||
|
||
if IsGCSPath(path) or IsCnsPath(path): | ||
# Glob-ing for files can be significantly faster than recursively | ||
# walking through directories for some file systems. | ||
tf.logging.info( | ||
'GetLogdirSubdirectories: Starting to list directories via glob-ing.') | ||
traversal_method = ListRecursivelyViaGlobbing | ||
else: | ||
# For other file systems, the glob-ing based method might be slower because | ||
# each call to glob could involve performing a recursive walk. | ||
tf.logging.info( | ||
'GetLogdirSubdirectories: Starting to list directories via walking.') | ||
traversal_method = ListRecursivelyViaWalking | ||
|
||
return ( | ||
subdir | ||
for (subdir, files) in traversal_method(path) | ||
if any(IsTensorFlowEventsFile(f) for f in files) | ||
) |
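As a rough, standalone illustration of the escaping and level-by-level globbing in the diff above, the sketch below uses only the Python standard library. It substitutes `glob.glob` for `tf.gfile.Glob`, which is an assumption made for portability and not what the PR does. The names `escape_glob_characters` and `glob_levels` are hypothetical; only the regex is copied from the diff.

```python
import glob
import os
import re

# Same pattern as _ESCAPE_GLOB_CHARACTERS_REGEX in the diff.
_ESCAPE_RE = re.compile('([*?[])')


def escape_glob_characters(path):
    """Wraps each of *, ? and [ in brackets so glob matches it literally."""
    drive, rest = os.path.splitdrive(path)
    return drive + _ESCAPE_RE.sub(r'[\1]', rest)


def glob_levels(top):
    """Yields the matches for top/*, top/*/*, ... until a level is empty.

    Assumption: glob.glob stands in for tf.gfile.Glob. Matches include
    subdirectories as well as regular files.
    """
    pattern = os.path.join(escape_glob_characters(top), '*')
    while True:
        matches = sorted(glob.glob(pattern))
        if not matches:
            # Nothing at this depth means nothing can exist any deeper.
            return
        yield matches
        # Descend one more level of subdirectories.
        pattern = os.path.join(pattern, '*')
```

Only `top` is escaped, because the appended `*` components must stay live wildcards. Note that the standard library's `glob` skips dot-prefixed names for `*`, so results may differ from a TensorFlow file system implementation.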
Hmm, it could be good to keep the word "recursive" in the new function names just to signal that they're listing the entire directory tree. It's clear in the docstring but not so much from the function name itself.
I wonder if it might make sense to call these ListRecursivelyViaGlobbing() and ListRecursivelyViaWalking()? While the implementations are different, functionally they are basically interchangeable aside from slightly different output structures (which actually could be made the same by just doing per-subdir grouping in the globbing function before yielding).
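The per-subdir grouping suggested here could look roughly like the following sketch. `group_by_directory` is a hypothetical helper name, not one from the PR. It takes the flat list of paths that a single glob call returns and reshapes it into `(dir_path, file_paths)` tuples, the same shape the walking function yields.

```python
import collections
import os


def group_by_directory(paths):
    """Groups a flat iterable of paths into (dir_path, file_paths) tuples."""
    grouped = collections.defaultdict(list)
    for path in paths:
        grouped[os.path.dirname(path)].append(path)
    # Dicts preserve insertion order on Python 3.7+, so directories come out
    # in the order in which they were first seen.
    return [(dir_path, tuple(files)) for dir_path, files in grouped.items()]
```

For example, `group_by_directory(['/a/x', '/b/y', '/a/z'])` puts both `/a` entries under a single `/a` tuple, so a caller can filter directories without caring which traversal produced them.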
Those names sound great! They indicate recursion. Furthermore, the methods (almost) return the same values now. ListRecursivelyViaGlobbing raises an exception when the directory does not exist.