Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

path/filepath: add WalkDir (Walk using DirEntry) #42027

Closed
rsc opened this issue Oct 16, 2020 · 21 comments
Closed

path/filepath: add WalkDir (Walk using DirEntry) #42027

rsc opened this issue Oct 16, 2020 · 21 comments

Comments

@rsc
Copy link
Contributor

rsc commented Oct 16, 2020

There are a few annoyances with filepath.Walk, but the biggest problem is that it is needlessly inefficient. The new ReadDir API (#41467) provides a way to avoid the inefficiency in the implementation, but that must be paired with a new API that does not offer a FileInfo to the callback, since obtaining the FileInfo is the expensive part.

#41974 proposes a new API with an iterator object. That may or may not be a good idea.

If that one doesn't work out, here's a smaller change: add WalkDir that replaces FileInfo with DirEntry but otherwise behaves exactly the same as Walk:

type WalkDirFunc func(path string, entry fs.DirEntry, err error) error
    WalkDirFunc is the type of the function called for each file or directory
    visited by WalkDir. The path argument contains the argument to WalkDir as a
    prefix; that is, if WalkDir is called with "dir", which is a directory
    containing the file "a", the walk function will be called with argument
    "dir/a". The info argument is the fs.DirEntry for the named path.

    If there was a problem walking to the file or directory named by path, the
    incoming error will describe the problem and the function can decide how to
    handle that error (and WalkDir will not descend into that directory). In the
    case of an error, the info argument will be nil. If an error is returned,
    processing stops. The sole exception is when the function returns the
    special value SkipDir. If the function returns SkipDir when invoked on a
    directory, WalkDir skips the directory's contents entirely. If the function
    returns SkipDir when invoked on a non-directory file, WalkDir skips the
    remaining files in the containing directory.

func WalkDir(root string, fn WalkDirFunc) error
    WalkDir walks the file tree rooted at root, calling fn for each file or
    directory in the tree, including root. All errors that arise visiting files
    and directories are filtered by fn. The files are walked in lexical
    order, which makes the output deterministic but means that for very large
    directories WalkDir can be inefficient. WalkDir does not follow symbolic links.

The only changes here are s/Walk/WalkDir/g and s/FileInfo/DirEntry/g.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/243916 mentions this issue: io/fs: add Walk

@rsc
Copy link
Contributor Author

rsc commented Oct 21, 2020

I've retracted #41974. Let's focus on this (WalkDir) as the replacement for Walk.

What I like most about this API is that updating existing code requires almost no effort at all: change Walk to WalkDir, change FileInfo to DirEntry, maybe rename info to d, and in 90% of cases you're done and have a more efficient file system traversal that does the same thing as the original code.

@rsc
Copy link
Contributor Author

rsc commented Oct 21, 2020

There is one detail that still bothers me that might be worth changing.

Suppose you want to do a walk but ignore all testdata directories. In the WalkFunc you write:

if filepath.Base(path) == "testdata" {
    return filepath.SkipDir
}

That works fine. But behind the scenes, filepath.Walk already did a full directory read from testdata before calling the WalkFunc. Because the dir ended up being skipped, that ReadDir was entirely wasted effort. Often one reason to skip a directory is that it's big (like a cache). Doing a ReadDir on a big directory that you are going to skip is unfortunate. (It's nice that it's a ReadDir and not a Readdir, so you didn't spend tons of time calling Stat on every entry on Unix systems, but still, it's wasted effort.)

One of the complications of #41974 was defining that both entry and exit from a directory appeared in the iteration; the equivalent here would be calling the WalkFunc twice for a directory: both before and after. We clearly do not want to do that.

But I wonder if instead we should define that a directory read error (only) can result in a second call to the WalkFunc with the same path, to report the error. That is, to walk a directory, WalkDir does basically:

err := walkFn(path, d, nil)
if err != nil && err != SkipDir {
    return err // usual early out
}
if err != SkipDir {
    all, err := ReadDir(path)
    for _, child := range all {
        ... recursively handle path+/+child.Name() ...
    }
    if err != nil {
        walkFn(path, d, err)
    }
}

In addition to avoiding an expensive ReadDir that is not needed, this has the benefit of presenting the early children of a directory even if the directory read fails later in the directory. The current filepath.Walk throws away any children that were found when a read error also occurs. That's clearly a mistake, which would be good to fix in a new API but may be too subtle to fix in the existing API. And then the error is reported after the children that are available.

The one downside of course is that the WalkFunc is called twice for a directory with a read error: once for the existence of the directory itself, which is error-free, and then again when the directory read fails. This seems like a clearer separation of concerns, but at the cost of two calls with the same path.

Over on #41974 (comment), @ianlancetaylor wrote:

As far as I can tell the Exiting method exists only to clearly report an error from ReadDir on a directory. Since errors are rare, I agree that this seems like an undesirable API complication. Personally I think it would be OK for the error case to return the directory a second name--same Name, same IsDir--but with a non-nil Err.

The "extra callback only for directory read error" I'm suggesting here is the equivalent to what Ian suggested, but for the callback API. It seems like a reasonable solution to me.

What do other people think about doing this?

@bcmills
Copy link
Contributor

bcmills commented Oct 21, 2020

I expect the vast majority of WalkFunc implementations are going to do something along the lines of:

	if err != nil {
		log.Print(err)
		return nil
	}

or equivalent, or

	if err != nil {
		return err
	}

right at the beginning.

For the latter functions, the second call for a ReadDir error seems fine: it potentially does a little more work up-front for directories that are probably just going to waste, but errors are unusual so that's no big deal.

For the former functions — the ones that just log.Print and carry on — it's also no big deal. They'll produce a little more output, but they seem intended to produce best-effort output anyway.

The only WalkFuncs that will have a problem are the ones that don't return early in case of a non-nil error and are also not idempotent for a given input. But I suspect that those are rare compared to functions that return and functions that are idempotent (such as those that aggregate into a map).

@mpx
Copy link
Contributor

mpx commented Oct 22, 2020

Using a second WalkDirFunc call to report errors appears to be the best option for this function signature. That allows callers to process and consider descending some paths, and receive errors separately when there is an Open/ReadDir failure.

Adding filepath.WalkDir as a similar but faster API would be useful. That would make it relatively easy to upgrade code in the wild - I would use it now.

Just to clarify, is the intent to add a potentially different fs.FS walk API to the standard library later (eg, go1.17+)? I would prefer a better walk API for io/fs (or path/fspath?). This would benefit from broader experimention with fs.FS walk APIs before committing.

@rsc
Copy link
Contributor Author

rsc commented Oct 22, 2020

@mpx, I think io/fs needs a walk API at the start, and if filepath.WalkDir exists, then fs.WalkDir should too.
Just as the original io/fs proposal adopted filepath.Walk, if we create filepath.WalkDir, then the io/fs work will adopt WalkDir instead.

That doesn't preclude adding another one later, but it does make it unlikely without a really compelling case.
Most importantly, the combination of ReadDir and io/fs mean that some other general walker can be implemented without being required to have OS-specific code in it for efficiency, so having more sophisticated APIs in third-party packages is not as big a problem as it is today.

@jimmyfrasche
Copy link
Member

@rsc

That doesn't preclude adding another one later, but it does make it unlikely without a really compelling case.
Most importantly, the combination of ReadDir and io/fs mean that some other general walker can be implemented without being required to have OS-specific code in it for efficiency, so having more sophisticated APIs in third-party packages is not as big a problem as it is today.

That seems like an equally compelling argument for not including any official API (or starting it out in golang.org/x/ until it's worth is proven)

@tandr
Copy link

tandr commented Oct 23, 2020

(I have no beef in this game, but...)
Maybe try a small experiment to see what the gains are, by implementing something like
func collectFileList(roots []string, includePatterns []string, excludePatterns []string) []string
?

(One of just a couple cases I have ever used Walk API was to traverse, show to user, and later compress list of log files. The collect function did have params to include and exclude patterns, which could be applied to all subitems under "roots". (if include was nil or empty, it meant "everything", unless it is not in exclude). This wasn't a big exercise to do, but maybe a useful one here to see what you can gain with a new traversal api.)

@mpx
Copy link
Contributor

mpx commented Oct 23, 2020

I'm not convinced io/fs needs to have a walk implementation on day one - especially given how little time there is until the freeze to be confident of finding a good/better API. Anyone who needs to walk fs.FS will be able to implement something for go1.16 - I'm sure a few options will appear fairly quickly. Perhaps even added to golang.org/x?. Any inconvenience from the delay will be soon forgotten after a more considered API is added in go1.17.

The opportunity to explore and provide a better API should be balanced against the benefit of having a standard library implementation immediately. If the later wins, the current proposal seems relatively safe since it's close to the existing approach.

@bcmills
Copy link
Contributor

bcmills commented Oct 23, 2020

To me, the ability to easily convert callers using filepath.Walk to use io/fs and the DirEntry API is compelling. Any cleaner / simpler / different-style API is not likely to address that use-case, so it seems to me that we will want some API very close to filepath.Walk regardless of whether we also decide to add an iterator-based API.

WalkDir addresses the known use-case of “something close enough to filepath.Walk to make porting easy”, so I think we should go ahead and add it. Then we can decide the separate use-case of “something simpler or less error-prone than filepath.Walk” without the time pressure of holding up the migration of existing code to io/fs.

@rsc
Copy link
Contributor Author

rsc commented Oct 28, 2020

@mpx, I hear you, but as I noted above, I disagree. io/fs should be as capable as the existing library routines.
If we have filepath.WalkDir, we can have fs.WalkDir too.

@rsc
Copy link
Contributor Author

rsc commented Oct 28, 2020

Based on the discussion above, this (including the extra callback for reporting directory read errors) seems like a likely accept.

@mpx
Copy link
Contributor

mpx commented Oct 29, 2020

I agree w/@bcmills argument for supporting easy migration of existing code - this does seem like the best approach for now. Better APIs can still be provided elsewhere, and potentially considered for the standard library if there is enough benefit.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/266240 mentions this issue: path/filepath: add WalkDir

@qingyunha
Copy link
Contributor

// The files are walked in lexical order, which makes the output deterministic but means that for very
// large directories Walk can be inefficient.

I hava a question here may not related to this proposal. why we need this deterministic ? it is unfortunate for a large directory.
Seems Python os.scandir and GNU find does not sort the dir. Can we just relay on the order of ReadDir?

@mpx
Copy link
Contributor

mpx commented Oct 30, 2020

A lot of existing code benefits from assuming paths are walking in lexigraphical order. The proposed (sorted) API will continue to simplify new usage, and support easier migration of old code.

Ideally, it would be good to support some different tradeoffs. Eg, conserving memory vs file descriptors, performance vs ease of use,... However, this isn't practical with the proposed API since it would need to be parameterised.

It will be easy enough to create a different implementation/API supporting a different set of tradeoff when it matters. Eg, some code will see significant performance improvements from keeping file descriptors open and processing in DirEntry order. Maybe one of these APIs might be clean enough and desirable enough to add to the standard library in future.

@rsc
Copy link
Contributor Author

rsc commented Nov 4, 2020

No change in consensus, so accepted.

@rsc rsc changed the title proposal: path/filepath: add WalkDir (Walk using DirEntry) path/filepath: add WalkDir (Walk using DirEntry) Nov 4, 2020
@rsc rsc modified the milestones: Proposal, Backlog Nov 4, 2020
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/267719 mentions this issue: all: update to use filepath.WalkDir instead of filepath.Walk

@dmitshur
Copy link
Contributor

dmitshur commented Nov 5, 2020

Reopening because CL 266240 was reverted in CL 267798; it needs to be re-sent.

@dmitshur dmitshur reopened this Nov 5, 2020
@dmitshur dmitshur modified the milestones: Backlog, Go1.16 Nov 5, 2020
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/267887 mentions this issue: path/filepath: add WalkDir

gopherbot pushed a commit that referenced this issue Nov 6, 2020
This commit is a copy of filepath.WalkDir adapted to use fs.FS
instead of the native OS file system. It is the last implementation
piece of the io/fs proposal.

The original io/fs proposal was to adopt filepath.Walk, but we
have since introduced the more efficient filepath.WalkDir (#42027),
so this CL adopts that more efficient option instead.

(The changes in path/filepath bring the two copies more in line
with each other. The main change is unembedding the field
in statDirEntry, so that the fs.DirEntry passed to the WalkDirFunc
for the root of the tree does not have any extra methods.)

For #41190.

Change-Id: I9359dfcc110338c0ec64535f22cafb38d0b613a6
Reviewed-on: https://go-review.googlesource.com/c/go/+/243916
Trust: Russ Cox <[email protected]>
Run-TryBot: Russ Cox <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Reviewed-by: Rob Pike <[email protected]>
gopherbot pushed a commit that referenced this issue Dec 2, 2020
Now that filepath.WalkDir is available, it is more efficient
and should be used in place of filepath.Walk.
Update the tree to reflect best practices.

As usual, the code compiled with Go 1.4 during bootstrap is excluded.
(In this CL, that's only cmd/dist.)

For #42027.

Change-Id: Ib0f7b1e43e50b789052f9835a63ced701d8c411c
Reviewed-on: https://go-review.googlesource.com/c/go/+/267719
Trust: Russ Cox <[email protected]>
Run-TryBot: Russ Cox <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Reviewed-by: Rob Pike <[email protected]>
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/285595 mentions this issue: doc/go1.16.html: mention path/filepath/WalkDir

gopherbot pushed a commit that referenced this issue Jan 22, 2021
For #40700
For #42027

Change-Id: Ifb73050dfdab21784fa52d758ad9c408e6489684
Reviewed-on: https://go-review.googlesource.com/c/go/+/285595
Trust: Ian Lance Taylor <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>
wedaly added a commit to aretext/aretext that referenced this issue Apr 17, 2021
Go 1.16 adds a more efficient routine to walk the filesystem.
See golang/go#42027

This speeds up traversal of a directory with 70K files from
~1.3 seconds to ~0.9 seconds.
@golang golang locked and limited conversation to collaborators Jan 22, 2022
@rsc rsc moved this to Accepted in Proposals Aug 10, 2022
@rsc rsc added this to Proposals Aug 10, 2022
@rsc rsc removed this from Proposals Oct 19, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants