-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Find runs by glob-ing levels of subdirectories #1087
Conversation
@@ -432,17 +471,3 @@ def GetAccumulator(self, run): | |||
""" | |||
with self._accumulators_mutex: | |||
return self._accumulators[run] | |||
|
|||
|
|||
def GetLogdirSubdirectories(path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, would it be more general to just replace LIstRecursively() with the iterative glob technique, rather than leaving it using Walk() and then getting rid of both copies of GetLogdirSubdirectories()? That way we don't have to duplicate the code in both multiplexers.
I think it doesn't really add overhead to return all the filenames and then do the filtering in a single pass later (in GetLogdirSubdirectories() like we do now) as long as we keep ListRecursively() as a generator the way it is now. We could even make GetLogdirSubdirectories() itself a generator too so that it doesn't block on finishing all the globbing, by just maintaining a dict of subdirs and only yielding a subdir for a given file when it's not yet in the dict.
Happy to send a PR myself for this alternative if that would be easiest, just let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. See updates.
I've done a bit more thinking and research here and this might be a bit more complicated than we initially thought. My original assumption was that lot of the performance benefit of the glob approach had to do with reducing back-and-forth across the python/C++ boundary. But this is belied by the fact that the existing gfile.Walk()-based implementation actually is fine on local disk - it executes quickly even for very large logdirs. My new interpretation is that the real speedup comes from reducing the number of round trips from local C++ code to a remote filesystem and back. Globbing achieves a huge reduction in round trips when the remote file system implements TF's FileSystem.GetMatchingPaths() in the equivalent of a single possibly-streaming RPC, which is true of CNS (an internal-only filesystem) and happens to also be essentially true for GCS (link to code). However, this isn't generally true for other remote TF FileSystem implementations or even for other internal Google filesystems. In those cases, each level of the glob actually does a full gfile.Walk() - Given that issue, I think we should probably make this behavior conditioned on the underlying filesystem implementation, and then enable it only when we know the filesystem is well-suited to it. The "correct" way to do that would arguably be adding another C++-level TF FileSystem method for this, but for now we could also just add a heuristic within our codebase that checks the known path prefixes. Note that if this interpretation of the speedup is correct, then writing our own C++ code won't necessarily add much additional speedup, because there the real limitation is what RPCs the remote filesystem exposes. With GCS, if we had access to the underlying RPC interface we could implement recursive listing in a single (up to pagination) RPC, so it could be worthwhile. But with CNS, the C++ code doesn't have access to any RPC that's more efficient for our purposes here than the one used by glob, and unlike GCS that one is limited to returning one directory level at a time, so there's not much overhead we could remove without literally writing new RPCs for CNS. |
Thank you, Nick! That makes sense to me. +1 to how the calls to These updates take into account the nuance of how the glob-ing based method makes listing files much faster for certain file systems (that implement analogs of These new changes move much logic into |
'GlobAndListFiles: %d files glob-ed at level %d', len(glob), level) | ||
|
||
if not glob: | ||
# This subdirectory level lacks files. Terminate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I thought of another optional optimization. Don't have to do it in this PR but maybe add a TODO(nickfelt) to implement at some point? The idea is that if at any point the glob returns files that are all in a single directory (i.e. len(set(map(os.path.dirname, glob))) == 1
) then we can replace the current globbing path with that directory as the literal prefix. We can think of this as basically saying that once the globbing is decending down a single branch in the directory tree, we can focus on that specific branch rather than still globbing through all the layers. This should improve efficiency in cases where a single subdir is significantly deeper than the rest of the log directory subdirs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea! Indeed, this should improve efficiency for that scenario! Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks for implementing this! It'd probably be good to have a test that exercises this just to be sure it works at producing the same listing as the non-optimized version. Could be done by just having a structure like:
- root
- dir1
- fileA
- dir2
- fileB
- dir2-child
- fileC
- dir2-grandchild1
- fileD
- dir2-grandchild2
- fileE
- dir1
After listing root/*/*/*
which should produce /root/dir2/dir2-child/fileC
and /root/dir2/dir2-child/dir2-grandchild1
and 2
, the "pruning" behavior should kick in and the next listing should be /root/dir2/dir2-child/*/*
which should produce /root/dir2/dir2-child/dir2-grandchild1/fileD
and dir2-grandchild2/fileE
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, I made the deep directory structure test this case. Specifically, see quuz/garply/corge
and quuz/garply/grault
.
@@ -64,32 +86,30 @@ def testListDirectoryAbsolute(self): | |||
(os.path.join(temp_dir, f) for f in expected_files), | |||
io_wrapper.ListDirectoryAbsolute(temp_dir)) | |||
|
|||
def testListRecursivelyForNestedFiles(self): | |||
def testGlobAndListFiles(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also have a test for this on an empty directory, or an otherwise non-deep directory?
Alternatively, another way to approach this could be to rearrange the globbing method a little so that it returns the exact same (subdir, files) tuples that the Walk()-based version does, and then just test the glob method by checking that it returns the exact same data (up to ordering) as the tried-and-true walk method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test the glob method by checking that it returns the exact same data (up to ordering) as the tried-and-true walk method.
Awesome idea. The 2 methods now return via the same format, but their tests are slightly different since GlobAndListFiles
excludes subdirectories from listings.
self.assertEqual(1, len(subdirectory_entries)) | ||
|
||
entry = subdirectory_entries[0] | ||
self.assertEqual(empty_dir, entry[0]) | ||
self.assertItemsEqual((), entry[1]) | ||
|
||
def _CreateDeepDirectoryStructure(self, top_directory): | ||
"""Creates a reasonable deep structure of subdirectories with files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This structure could probably be a little more complicated to fully exercise all the possible edge cases. Some things that might be good to include:
- Subdir that is non-empty but has no event files (should be ignored)
- Subdir that has no event files, but has a subdir that has event files (first level should be ignored, second level should be included)
- "Sibling" subdirs that both have event files (i.e. two subdirs at same depth)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, we now cover those cases. Comments now clarify them.
Yields: | ||
The absolute path of each file. | ||
""" | ||
current_glob_string = top |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional, but we may want to escape the original directory from globbing just in case somebody has an actual directory with glob-aware characters in it (fairly unlikely, but possible). In python 3.4 plus there's glob.escape() but unfortunately nothing in python 2.7. However, it's a short function, so we could just backport it:
https://github.com/python/cpython/blob/3.6/Lib/glob.py#L161
Feel free to also just leave this as a TODO(nickfelt).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for noting that detail! Done.
"""Recursively lists all files within the directory. | ||
|
||
This method does so by glob-ing deeper and deeper directories, ie | ||
/foo/*, foo/*/*, foo/*/*/* and so on until all files are listed. All file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove initial slash from "/foo/*" to match the other examples
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
path: The path to a directory under which to find subdirectories. | ||
|
||
Returns: | ||
A tuple of absolute paths of all subdirectories each with at least 1 events |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably worth mentioning that the order of the subdirectories returned is unspecified, since the two different methods return the directories in different orders.
Also, this could be done as a generator to match the old version of this function, with a little more work around the unique-ifying logic for the glob implementation. Do you think that would be useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds great! ListRecursivelyViaGlobbing
and ListRecursivelyViaWalking
now yield the same types of tuples. One difference is that the lists of absolute paths returned by ListRecursivelyViaGlobbing
include subdirectories.
def ListDirectoryAbsolute(directory): | ||
"""Yields all files in the given directory. The paths are absolute.""" | ||
return (os.path.join(directory, path) | ||
for path in tf.gfile.ListDirectory(directory)) | ||
|
||
|
||
def ListRecursively(top): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it could be good to keep the word "recursive" in the new function names just to signal that they're listing the entire directory tree. It's clear in the docstring but not so much from the function name itself.
I wonder it it might make sense to call these ListRecursivelyViaGlobbing() and ListRecursivelyViaWalking()? While the implementations are different, functionally they are basically interchangeable aside from slightly different output structures (which actually could be made the same by just doing per-subdir grouping in the globbing function before yielding).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those names sound great! They indicate recursion. Furthermore, the methods (almost) return the same values now. ListRecursivelyViaGlobbing
raises an exception when the directory does not exist.
@@ -193,7 +193,12 @@ def Fake(*args, **kwargs): | |||
|
|||
FakeFactory.has_been_called = False | |||
|
|||
for stub_name in ['ListDirectoryAbsolute', 'ListRecursively']: | |||
stub_names = [ | |||
'GlobAndListFiles', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs update?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pairs = collections.defaultdict(list) | ||
for file_path in glob: | ||
pairs[os.path.dirname(file_path)].append(file_path) | ||
for dir_name, file_paths in pairs.items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might as well use six.iteritems() to keep memory usage down, since we're yielding each one individually anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
if IsGCSPath(path) or IsCnsPath(path): | ||
# Glob-ing for files can be significantly faster than recursively | ||
# walking through directories for some file systems. | ||
# Use a dict comprehension to unique-ify subdirectories. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line of comment is stale
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
'bar/quux', | ||
# A directory that lacks events files, but contains 2 subdirectories | ||
# with events files (first level should be ignored, second level should | ||
# be included). corge and grault are thus subling events files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: sibling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
# be included). corge and grault are thus subling events files. | ||
'quuz/corge', | ||
'quuz/grault', | ||
# A directory with a glob character in its name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, but I think this would only have been affected by the lack of escaping if it was passed in as the original argument to ListRecursivelyViaGlobbing()? I.e. I don't think the test would fail if we didn't have escaping.
So maybe worth having a separate test for this that creates a "ba*" subdirectory of the tmpdir and then passes that as the directory to list. Also note that the glob "ba*" will actually match a literal "ba*" directory so by itself that doesn't indicate the issue; to test that it's escaped we'd need to have a sibling directory "bar" and check that we only recursively list "ba*" and not "bar". (An easier test for escaping might be to test that a literal path like "foo[bar]" is matched, since the glob interpretation only matches "foob" "fooa" or "foor" but not literal "foo[bar]".)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Thanks for the catch. We now add a test that globs a directory containing a *
in its name. The test checks that we only recursively list "ba*" and not "bar" - the other method would work too!
self._CreateDeepDirectoryStructure(temp_dir) | ||
expected = [ | ||
['', [ | ||
'events.out.tfevents.1473720381.meep.com', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally optional, but since this test is only about generic recursive directory listing, these names wouldn't have to be event file names. They could just be "a" "b" "c" etc. In that case, they might fit all on one line and be a bit easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the events files are hard to tell apart at a glance. How about we use the minimal semblance of an events file? ie, a.tfevents.1
, albeit the method contract does not care about file type ...
""" | ||
for dir_path, _, filenames in tf.gfile.Walk(top): | ||
yield (dir_path, (os.path.join(dir_path, filename) | ||
for filename in filenames)) | ||
|
||
|
||
def GetLogdirSubdirectories(path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be a test for GetLogdirSubdirectories() itself? It can probably be lighter weight since it's not testing the full recursive walk algorithm, but something that checks the basic logic of returning only subdirs that contain at least one events file seems good to have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Done.
'GlobAndListFiles: %d files glob-ed at level %d', len(glob), level) | ||
|
||
if not glob: | ||
# This subdirectory level lacks files. Terminate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks for implementing this! It'd probably be good to have a test that exercises this just to be sure it works at producing the same listing as the non-optimized version. Could be done by just having a structure like:
- root
- dir1
- fileA
- dir2
- fileB
- dir2-child
- fileC
- dir2-grandchild1
- fileD
- dir2-grandchild2
- fileE
- dir1
After listing root/*/*/*
which should produce /root/dir2/dir2-child/fileC
and /root/dir2/dir2-child/dir2-grandchild1
and 2
, the "pruning" behavior should kick in and the next listing should be /root/dir2/dir2-child/*/*
which should produce /root/dir2/dir2-child/dir2-grandchild1/fileD
and dir2-grandchild2/fileE
.
So there's good news and bad news. 👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there. 😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request. Note to project maintainer: This is a terminal state, meaning the |
Previously, TensorBoard found runs within the log directory by performing a `tf.gfile.Walk` across directories. That turned out to be very slow for users with expansive log directories (with many subdirectories and files) because `tf.gfile.Walk` makes many calls at the python level to C++ logic for reading data from disk (as opposed to performing much of that logic directly at the C++ level). Disk reads can be very costly for certain (such as remote and distributed) file systems. For those file systems, we find runs (subdirectories) much more quickly by iteratively glob-ing deeper and deeper directory levels. For a certain internal user, this reduced the time it took to find all runs from 20 minutes to 20 seconds. For other file systems, we preserve the previous behavior of performing a recursive walk aross directories. For certain file systems, using the glob-based method to find runs might actually be slower because those file systems do not implement analogs of TensorFlow's method for matching files on disk via a pattern, resulting in costly disk accesses. This change also re-organizes code to make that logic possible. We move code away from the 2 multiplexers into the io_wrapper helper module. I am generally not a huge of fan of large helper modules, but in this case, it seems desired because we have duplicate entire multiplexers and event accumulators. We also generalize the logic for listing files via blobbing into a `GlobAndListFiles` method.
Previously, TensorBoard found runs within the log directory by
performing a
tf.gfile.Walk
across directories. That turned out to bevery slow for users with expansive log directories (with many
subdirectories and files) because
tf.gfile.Walk
makes many calls atthe python level to C++ logic for reading data from disk (as opposed to
performing much of that logic directly at the C++ level). Disk reads can
be very costly for certain (such as remote and distributed) file systems.
We find runs (subdirectories) much more quickly by iteratively glob-ing
deeper and deeper directory levels. For a certain internal user, this
reduced the time it took to find all runs from 20 minutes to 20 seconds.
We already have tests that check whether the multiplexer correctly finds
subdirectories that are runs, so this pull request adds no further
tests. We preserve that correctness and just speed up that logic. This
change may result in us finding runs in different orders, but that is
fine - TensorBoard documents no guarantees in terms of the order in
which runs are discovered.
Thank you to @nfelt for this idea and the fruitful brainstorming
session.