Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Find runs by glob-ing levels of subdirectories #1087

Merged
merged 4 commits into from
Apr 4, 2018

Conversation

chihuahua
Copy link
Member

@chihuahua chihuahua commented Mar 31, 2018

Previously, TensorBoard found runs within the log directory by
performing a tf.gfile.Walk across directories. That turned out to be
very slow for users with expansive log directories (with many
subdirectories and files) because tf.gfile.Walk makes many calls at
the python level to C++ logic for reading data from disk (as opposed to
performing much of that logic directly at the C++ level). Disk reads can
be very costly for certain (such as remote and distributed) file systems.

We find runs (subdirectories) much more quickly by iteratively glob-ing
deeper and deeper directory levels. For a certain internal user, this
reduced the time it took to find all runs from 20 minutes to 20 seconds.

We already have tests that check whether the multiplexer correctly finds
subdirectories that are runs, so this pull request adds no further
tests. We preserve that correctness and just speed up that logic. This
change may result in us finding runs in different orders, but that is
fine - TensorBoard documents no guarantees in terms of the order in
which runs are discovered.

Thank you to @nfelt for this idea and the fruitful brainstorming
session.

@@ -432,17 +471,3 @@ def GetAccumulator(self, run):
"""
with self._accumulators_mutex:
return self._accumulators[run]


def GetLogdirSubdirectories(path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, would it be more general to just replace LIstRecursively() with the iterative glob technique, rather than leaving it using Walk() and then getting rid of both copies of GetLogdirSubdirectories()? That way we don't have to duplicate the code in both multiplexers.

I think it doesn't really add overhead to return all the filenames and then do the filtering in a single pass later (in GetLogdirSubdirectories() like we do now) as long as we keep ListRecursively() as a generator the way it is now. We could even make GetLogdirSubdirectories() itself a generator too so that it doesn't block on finishing all the globbing, by just maintaining a dict of subdirs and only yielding a subdir for a given file when it's not yet in the dict.

Happy to send a PR myself for this alternative if that would be easiest, just let me know.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. See updates.

@nfelt
Copy link
Contributor

nfelt commented Mar 31, 2018

I've done a bit more thinking and research here and this might be a bit more complicated than we initially thought. My original assumption was that lot of the performance benefit of the glob approach had to do with reducing back-and-forth across the python/C++ boundary. But this is belied by the fact that the existing gfile.Walk()-based implementation actually is fine on local disk - it executes quickly even for very large logdirs.

My new interpretation is that the real speedup comes from reducing the number of round trips from local C++ code to a remote filesystem and back. Globbing achieves a huge reduction in round trips when the remote file system implements TF's FileSystem.GetMatchingPaths() in the equivalent of a single possibly-streaming RPC, which is true of CNS (an internal-only filesystem) and happens to also be essentially true for GCS (link to code). However, this isn't generally true for other remote TF FileSystem implementations or even for other internal Google filesystems. In those cases, each level of the glob actually does a full gfile.Walk() - check if dir; if so, list it; recurse on results (it should be a depth-bounded walk, but the FileSystem default implementation doesn't actually do any such bounding...). So for those remote FileSystems we're now doing D+1 times as much work and sending D+1 times as many RPCs as the existing approach.

Given that issue, I think we should probably make this behavior conditioned on the underlying filesystem implementation, and then enable it only when we know the filesystem is well-suited to it. The "correct" way to do that would arguably be adding another C++-level TF FileSystem method for this, but for now we could also just add a heuristic within our codebase that checks the known path prefixes.

Note that if this interpretation of the speedup is correct, then writing our own C++ code won't necessarily add much additional speedup, because there the real limitation is what RPCs the remote filesystem exposes. With GCS, if we had access to the underlying RPC interface we could implement recursive listing in a single (up to pagination) RPC, so it could be worthwhile. But with CNS, the C++ code doesn't have access to any RPC that's more efficient for our purposes here than the one used by glob, and unlike GCS that one is limited to returning one directory level at a time, so there's not much overhead we could remove without literally writing new RPCs for CNS.

@chihuahua
Copy link
Member Author

chihuahua commented Mar 31, 2018

Thank you, Nick! That makes sense to me. +1 to how the calls to walk take big latency hits from individual disk accesses to CNS. PTAL

These updates take into account the nuance of how the glob-ing based method makes listing files much faster for certain file systems (that implement analogs of GetMatchingPaths), but might have a different effect for other file systems. Based on internal documentation for Glob patterns, the glob-ing based method could raise errors for certain file systems within Google because the method could use patterns with multiple *s.

These new changes move much logic into io_wrapper so that they can be generalized, tested, and placed in a file that organizationally makes sense. To emphasize the difference between glob-ing for all files and performing the recursive walk, we assign starkly different names to the GlobAndListFiles and WalkAndListFilesPerDirectory methods.

'GlobAndListFiles: %d files glob-ed at level %d', len(glob), level)

if not glob:
# This subdirectory level lacks files. Terminate.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I thought of another optional optimization. Don't have to do it in this PR but maybe add a TODO(nickfelt) to implement at some point? The idea is that if at any point the glob returns files that are all in a single directory (i.e. len(set(map(os.path.dirname, glob))) == 1) then we can replace the current globbing path with that directory as the literal prefix. We can think of this as basically saying that once the globbing is decending down a single branch in the directory tree, we can focus on that specific branch rather than still globbing through all the layers. This should improve efficiency in cases where a single subdir is significantly deeper than the rest of the log directory subdirs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea! Indeed, this should improve efficiency for that scenario! Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for implementing this! It'd probably be good to have a test that exercises this just to be sure it works at producing the same listing as the non-optimized version. Could be done by just having a structure like:

  • root
    • dir1
      • fileA
    • dir2
      • fileB
      • dir2-child
        • fileC
        • dir2-grandchild1
          • fileD
        • dir2-grandchild2
          • fileE

After listing root/*/*/* which should produce /root/dir2/dir2-child/fileC and /root/dir2/dir2-child/dir2-grandchild1 and 2, the "pruning" behavior should kick in and the next listing should be /root/dir2/dir2-child/*/* which should produce /root/dir2/dir2-child/dir2-grandchild1/fileD and dir2-grandchild2/fileE.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I made the deep directory structure test this case. Specifically, see quuz/garply/corge and quuz/garply/grault.

@@ -64,32 +86,30 @@ def testListDirectoryAbsolute(self):
(os.path.join(temp_dir, f) for f in expected_files),
io_wrapper.ListDirectoryAbsolute(temp_dir))

def testListRecursivelyForNestedFiles(self):
def testGlobAndListFiles(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also have a test for this on an empty directory, or an otherwise non-deep directory?

Alternatively, another way to approach this could be to rearrange the globbing method a little so that it returns the exact same (subdir, files) tuples that the Walk()-based version does, and then just test the glob method by checking that it returns the exact same data (up to ordering) as the tried-and-true walk method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test the glob method by checking that it returns the exact same data (up to ordering) as the tried-and-true walk method.

Awesome idea. The 2 methods now return via the same format, but their tests are slightly different since GlobAndListFiles excludes subdirectories from listings.

self.assertEqual(1, len(subdirectory_entries))

entry = subdirectory_entries[0]
self.assertEqual(empty_dir, entry[0])
self.assertItemsEqual((), entry[1])

def _CreateDeepDirectoryStructure(self, top_directory):
"""Creates a reasonable deep structure of subdirectories with files.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This structure could probably be a little more complicated to fully exercise all the possible edge cases. Some things that might be good to include:

  • Subdir that is non-empty but has no event files (should be ignored)
  • Subdir that has no event files, but has a subdir that has event files (first level should be ignored, second level should be included)
  • "Sibling" subdirs that both have event files (i.e. two subdirs at same depth)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, we now cover those cases. Comments now clarify them.

Yields:
The absolute path of each file.
"""
current_glob_string = top
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional, but we may want to escape the original directory from globbing just in case somebody has an actual directory with glob-aware characters in it (fairly unlikely, but possible). In python 3.4 plus there's glob.escape() but unfortunately nothing in python 2.7. However, it's a short function, so we could just backport it:
https://github.com/python/cpython/blob/3.6/Lib/glob.py#L161

Feel free to also just leave this as a TODO(nickfelt).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for noting that detail! Done.

"""Recursively lists all files within the directory.

This method does so by glob-ing deeper and deeper directories, ie
/foo/*, foo/*/*, foo/*/*/* and so on until all files are listed. All file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove initial slash from "/foo/*" to match the other examples

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

path: The path to a directory under which to find subdirectories.

Returns:
A tuple of absolute paths of all subdirectories each with at least 1 events
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth mentioning that the order of the subdirectories returned is unspecified, since the two different methods return the directories in different orders.

Also, this could be done as a generator to match the old version of this function, with a little more work around the unique-ifying logic for the glob implementation. Do you think that would be useful?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds great! ListRecursivelyViaGlobbing and ListRecursivelyViaWalking now yield the same types of tuples. One difference is that the lists of absolute paths returned by ListRecursivelyViaGlobbing include subdirectories.

def ListDirectoryAbsolute(directory):
"""Yields all files in the given directory. The paths are absolute."""
return (os.path.join(directory, path)
for path in tf.gfile.ListDirectory(directory))


def ListRecursively(top):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it could be good to keep the word "recursive" in the new function names just to signal that they're listing the entire directory tree. It's clear in the docstring but not so much from the function name itself.

I wonder it it might make sense to call these ListRecursivelyViaGlobbing() and ListRecursivelyViaWalking()? While the implementations are different, functionally they are basically interchangeable aside from slightly different output structures (which actually could be made the same by just doing per-subdir grouping in the globbing function before yielding).

Copy link
Member Author

@chihuahua chihuahua Apr 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those names sound great! They indicate recursion. Furthermore, the methods (almost) return the same values now. ListRecursivelyViaGlobbing raises an exception when the directory does not exist.

@@ -193,7 +193,12 @@ def Fake(*args, **kwargs):

FakeFactory.has_been_called = False

for stub_name in ['ListDirectoryAbsolute', 'ListRecursively']:
stub_names = [
'GlobAndListFiles',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs update?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

pairs = collections.defaultdict(list)
for file_path in glob:
pairs[os.path.dirname(file_path)].append(file_path)
for dir_name, file_paths in pairs.items():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well use six.iteritems() to keep memory usage down, since we're yielding each one individually anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if IsGCSPath(path) or IsCnsPath(path):
# Glob-ing for files can be significantly faster than recursively
# walking through directories for some file systems.
# Use a dict comprehension to unique-ify subdirectories.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line of comment is stale

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

'bar/quux',
# A directory that lacks events files, but contains 2 subdirectories
# with events files (first level should be ignored, second level should
# be included). corge and grault are thus subling events files.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: sibling

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

# be included). corge and grault are thus subling events files.
'quuz/corge',
'quuz/grault',
# A directory with a glob character in its name.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, but I think this would only have been affected by the lack of escaping if it was passed in as the original argument to ListRecursivelyViaGlobbing()? I.e. I don't think the test would fail if we didn't have escaping.

So maybe worth having a separate test for this that creates a "ba*" subdirectory of the tmpdir and then passes that as the directory to list. Also note that the glob "ba*" will actually match a literal "ba*" directory so by itself that doesn't indicate the issue; to test that it's escaped we'd need to have a sibling directory "bar" and check that we only recursively list "ba*" and not "bar". (An easier test for escaping might be to test that a literal path like "foo[bar]" is matched, since the glob interpretation only matches "foob" "fooa" or "foor" but not literal "foo[bar]".)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Thanks for the catch. We now add a test that globs a directory containing a * in its name. The test checks that we only recursively list "ba*" and not "bar" - the other method would work too!

self._CreateDeepDirectoryStructure(temp_dir)
expected = [
['', [
'events.out.tfevents.1473720381.meep.com',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally optional, but since this test is only about generic recursive directory listing, these names wouldn't have to be event file names. They could just be "a" "b" "c" etc. In that case, they might fit all on one line and be a bit easier to read.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the events files are hard to tell apart at a glance. How about we use the minimal semblance of an events file? ie, a.tfevents.1, albeit the method contract does not care about file type ...

"""
for dir_path, _, filenames in tf.gfile.Walk(top):
yield (dir_path, (os.path.join(dir_path, filename)
for filename in filenames))


def GetLogdirSubdirectories(path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a test for GetLogdirSubdirectories() itself? It can probably be lighter weight since it's not testing the full recursive walk algorithm, but something that checks the basic logic of returning only subdirs that contain at least one events file seems good to have.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Done.

'GlobAndListFiles: %d files glob-ed at level %d', len(glob), level)

if not glob:
# This subdirectory level lacks files. Terminate.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for implementing this! It'd probably be good to have a test that exercises this just to be sure it works at producing the same listing as the non-optimized version. Could be done by just having a structure like:

  • root
    • dir1
      • fileA
    • dir2
      • fileB
      • dir2-child
        • fileC
        • dir2-grandchild1
          • fileD
        • dir2-grandchild2
          • fileE

After listing root/*/*/* which should produce /root/dir2/dir2-child/fileC and /root/dir2/dir2-child/dir2-grandchild1 and 2, the "pruning" behavior should kick in and the next listing should be /root/dir2/dir2-child/*/* which should produce /root/dir2/dir2-child/dir2-grandchild1/fileD and dir2-grandchild2/fileE.

@googlebot
Copy link

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of the commit author(s) and merge this pull request when appropriate.

Previously, TensorBoard found runs within the log directory by
performing a `tf.gfile.Walk` across directories. That turned out to be
very slow for users with expansive log directories (with many
subdirectories and files) because `tf.gfile.Walk` makes many calls at
the python level to C++ logic for reading data from disk (as opposed to
performing much of that logic directly at the C++ level). Disk reads can
be very costly for certain (such as remote and distributed) file systems.

For those file systems, we find runs (subdirectories) much more quickly
by iteratively glob-ing deeper and deeper directory levels. For a
certain internal user, this reduced the time it took to find all runs
from 20 minutes to 20 seconds.

For other file systems, we preserve the previous behavior of performing
a recursive walk aross directories. For certain file systems, using the
glob-based method to find runs might actually be slower because those
file systems do not implement analogs of TensorFlow's method for
matching files on disk via a pattern, resulting in costly disk
accesses.

This change also re-organizes code to make that logic possible. We move
code away from the 2 multiplexers into the io_wrapper helper module. I
am generally not a huge of fan of large helper modules, but in this
case, it seems desired because we have duplicate entire multiplexers and
event accumulators. We also generalize the logic for listing files via
blobbing into a `GlobAndListFiles` method.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants