Initial draft of file enumeration design doc. #24

JeremyKuhne · 2017-11-30T00:19:01Z

This is a proposal to provide a performant extensibility point for file enumeration in System.IO.

This is based on the existing internal Windows implementation, see the CoreFX master branch for some portion of this in action. A branch implemented to spec (for Windows only) can be found here: https://github.com/JeremyKuhne/corefx/tree/findpublic

@terrajobst, @pjanotti, @danmosemsft

poke · 2017-11-30T07:40:27Z

I have to say that the API looks quite complicated. The fact that your sample is basically building a helper method on top of the proposed API kind of shows that consuming it might be too difficult and it is maybe not meant as a high-level way to enumerate the file system after all. I fear that people will just end up using the existing methods which they can then filter using comfortable LINQ instead of dealing with this new API. And I certainly don’t believe this new API should be limited to those that care about its benefits—ideally, everybody should be able to use it without having to invest time to learn completely new API styles.

When I saw the FindPredicates.NotDotOrDotDot(ref findData) && !FindPredicates.IsDirectory(ref findData) part, I was wondering whether it wasn’t possible to at least make those extensions methods, so one could do findData.NotDotOrDotDot().IsNotDirectory() which would make it fluent and a lot easier to use. But I don’t know if extension methods support refs (yet)?

Wouldn’t we be able to convert this to a fluid LINQ-like API, so we could do something like FindFiles(directory, recursive: true).Where(f => f.NotDotOrDotDot() && !f.IsDirectory()).AsFullPath()? Or will that not allow us to avoid allocations?

svick · 2017-12-03T15:37:34Z

My comments:

The API description lists several types as internal, I think those were supposed to be public.
What's the point of having a FindEnumerable constructor, and also Directory.Enumerate? If they're different, that's confusing. If they're the same, then I think the constructor is unnecessary.
Would it make sense to include state inside RawFindData? That would make the signatures and lambdas simpler, especially if state was not used.
Why is RawFindData named like that? Wouldn't just FindData suffice?
If common predicates and transforms are going to be provided, I think their signatures should be the same as what the methods accept, so that you can use them as method groups, e.g.: Directory.Enumerate<string, object>(path, FindTransforms.AsFullPath<object>, FindPredicates.IsDirectory<object>).
If state is considered optional, then it could make sense to provide overloads where you don't have to specify its type (so the example above would become just Directory.Enumerate<string>(path, FindTransforms.AsFullPath, FindPredicates.IsDirectory)).
NotDotOrDotDot is fun to say out loud but I think it's unclear what it's supposed to do. If it's meant to ignore files whose names start with a dot, which are considered hidden on UNIX, maybe it would make sense to introduce some platform-independent notion of "hidden files"?

@poke An important part of the proposal is limiting recursion. How would you do that using a LINQ-like API?

JeremyKuhne · 2017-12-04T20:02:25Z

I have to say that the API looks quite complicated.

@poke It is a bit complicated due to the need to keep allocations to a minimum. It is meant to be an advanced extension point that others can build solutions on for high-impact scenarios that aren't broadly applicable. Think MSBuild trying to find all files with a set of extensions under a certain folder tree as one example.

@svick

The API description lists several types as internal, I think those were supposed to be public.

Thanks, I'll fix them. This started from our current internal Windows implementation. :)

What's the point of having a FindEnumerable constructor, and also Directory.Enumerate?

We don't need to have both if we seal the class. I'm not settled on what to do there.

Would it make sense to include state inside RawFindData? That would make the signatures and lambdas simpler, especially if state was not used.

dotnet/corefx#25691 explores doing this.

Why is RawFindData named like that? Wouldn't just FindData suffice?

I don't think we would introduce a higher level abstraction, so FindData is probably ok.

If common predicates and transforms are going to be provided, I think their signatures should be the same as what the methods accept

Was trying to encourage leveraging that anonymous method delegates are cached. I'll definitely give that more thought.

If state is considered optional, then it could make sense to provide overloads where you don't have to specify its type

Seems rational at first- I'll try it out

NotDotOrDotDot is fun to say out loud but I think it's unclear what it's supposed to do.

This one is a struggle. Could go with NotJustDotOrDotDot?

CZEMacLeod · 2017-12-04T20:43:28Z

@svick @JeremyKuhne
Assume NotDotOrDotDot is a predicate that excludes the current directory and parent directory entries that are returning in directory enumerations e.g. Not . and .. I think the name is fine as is, although it might make more sense in fluent form as @poke suggests? In which case it might be findData.ExceptDotAndDotDot()?
If this is referring to Linux type hidden files/folders starting with a . then the name is confusing.
Looking at your example case, should the predicate include the Not at all?

public static FindEnumerable<string, string> GetFiles(string directory,
    string expression = "*",
    bool recursive = false)
{
    return new FindEnumerable<string, string>(
        directory,
        (ref RawFindData findData, string expr) => FindTransforms.AsFullPath(ref findData),
        (ref RawFindData findData, string expr) =>
        {
            return !FindPredicates.DotOrDotDot(ref findData)
                && !FindPredicates.IsDirectory(ref findData)
                && DosMatcher.MatchPattern(expr, findData.FileName, ignoreCase: true);
        },
        state: DosMatcher.TranslateExpression(expression),
        options: recursive ? FindOptions.Recurse : FindOptions.None);
}

Also, aren't . and .. directories and would be excluded by the IsDirectory predicate anyway?

One other thing is that if RawFindData can contain directory entries for enumerating and filtering directories as well as files, then perhaps FileName is not the right propery name?

JeremyKuhne · 2017-12-04T21:03:37Z

it might make more sense in fluent form as @poke suggests?

It would make more sense, but would defeat key design goals, notably to minimize allocations. The struct has to be a ref struct to make this possible. If we had dotnet/csharplang#186 it would be worth exploring.

Also, aren't . and .. directories and would be excluded by the IsDirectory predicate anyway?

They are, good catch.

perhaps FileName is not the right propery name?

Perhaps Name would be ok.

CZEMacLeod · 2017-12-04T21:24:40Z

@JeremyKuhne Just finished reading dotnet/csharplang#186 so I now understand why this doesn't (currently) have fluent / extension methods.
I must admit I mainly write in vb.net and didn't realise that C# couldn't do ByRef extension methods.

JeremyKuhne · 2017-12-04T21:30:16Z

@CZEMacLeod It is one of my top wishlist items. That and dotnet/csharplang#187. :)

CZEMacLeod · 2017-12-04T22:20:34Z

@JeremyKuhne I notice that some of this comes off the back of MSBuild 'Globbing' (as an aside - is that actually a word?)
Not wanting to take this off-topic at all but I was involved in a discussion on a very similar issue on an asp.net core equivalent of the old asp.net optimization. The feature request was to enable recursively searching for files to bundle using patterns.
It was noted there that this should be able to work over IFileProvider rather that the raw filesystem to allow for the same functionality when using 'virtual' files of one sort or another - e.g. embedded resources, zips, dynamically generated files, etc.
Do you think it might be possible to ensure that this 'concrete' filesystem API can also work this way so we don't need to learn 2 (or more) APIs?
For reference the conversation is Shazwazza/Smidge#47
I believe Nuget also uses (used?) an implementation of IFileProvider in its pack and restore commands to determine files to include - and again includes a lot of pattern matching.
It might be worth considering the scenarios for NuGet and ASP.Net in this spec.

JeremyKuhne · 2017-12-04T22:50:33Z

'Globbing' (as an aside - is that actually a word?)

It is. https://en.wikipedia.org/wiki/Glob_%28programming%29

Do you think it might be possible to ensure that this 'concrete' filesystem API can also work this way so we don't need to learn 2 (or more) APIs?

I'm open to suggestions on how we could make FindData more flexible without creating too much overhead. Biggest thing is keeping allocations down, but speed is important too. As System.IO currently has a version of this proposal, we can experiment and see real-world impact of design options. I've been measuring impact on >400K file sets on multiple types of drives (SSD, HDD, DriveSpace arrays...).

terrajobst · 2017-12-04T23:17:44Z

accepted/file-enumeration.md

+
+Recursive enumeration is also problematic in that there is no way to handle error states such as access issues or cycles created by links.
+
+These restrictions have a significant impact on file system intensive applications, a key example being MSBuild. We can address these restrictions and provide performant, configurable enumeration.


We can address these restrictions and provide performant, configurable enumeration. [](start = 113, length = 83)

You want to make sure that your intro has a concrete proposal. Maybe something like:

This proposes a new set of primitive file and directory traversal APIs that are optimized for providing more flexibility while keeping the overhead to a minimum so that enumeration becomes both more prowerful as well as more performant. #Closed

terrajobst · 2017-12-04T23:21:23Z

accepted/file-enumeration.md

+
+These restrictions have a significant impact on file system intensive applications, a key example being MSBuild. We can address these restrictions and provide performant, configurable enumeration.
+
+## Scenarios and User Experience


Scenarios and User Experience [](start = 3, length = 29)

Your scenarios way too short. You want to make those headings and show some sample code consuming the APIs you're proposing. Scenarios are meant to illustrate the value your APIs are adding. They shouldn't be longer than a few paragraphs but they also shouldn't be that abstract. #Closed

terrajobst · 2017-12-04T23:22:08Z

accepted/file-enumeration.md

+## Requirements
+
+1. Filtering based on common file system data is possible
+	a. Name


a. [](start = 1, length = 2)

Doesn't render on GitHub as an enumeration. Use numbers or bullet points instead. #Closed

terrajobst · 2017-12-04T23:24:11Z

accepted/file-enumeration.md

+	a. Name
+	b. Attributes
+	c. Time stamps
+    d. File size


d. File size [](start = 3, length = 13)

What doesn't pop here is how the filter an be provided. I think you want to call out explicitly that the filter is a custom function and thus supports all sorts of filtering criteria, including calling custom functions and arbitrary combinators (AND, OR, NOT etc). #Closed

terrajobst · 2017-12-04T23:25:26Z

accepted/file-enumeration.md

+	b. Attributes
+	c. Time stamps
+    d. File size
+2. Result transforms can be of any type


Result transforms can be of any type [](start = 3, length = 36)

I think this needs more details. Basically, you need to call out how this would differ, from, say, regular Linq. #Closed

terrajobst · 2017-12-04T23:26:26Z

accepted/file-enumeration.md

+	c. Time stamps
+    d. File size
+2. Result transforms can be of any type
+3. We provide common filters and transforms


We provide common filters and transforms [](start = 3, length = 40)

I'd explicity list the ones you think are common, such as recursively searching by file name prefix, suffix, and extensions (or whatever the ones are you have in mind) #Closed

terrajobst · 2017-12-04T23:27:39Z

accepted/file-enumeration.md

+    d. File size
+2. Result transforms can be of any type
+3. We provide common filters and transforms
+4. Recursive behavior is configurable


ecursive behavior is configurable [](start = 4, length = 33)

I think you want more detail here. I assume you want ON, OFF, and delegate so that user code can use the criteria listed in (1) decide whether to walk a sub tree. #Closed

terrajobst · 2017-12-04T23:28:08Z

accepted/file-enumeration.md

+2. Result transforms can be of any type
+3. We provide common filters and transforms
+4. Recursive behavior is configurable
+5. Error handling is configurable


Error handling is configurable [](start = 3, length = 30)

Presumably you want to call out that this should be able to avoid throwing & catching exeptions. #Closed

terrajobst · 2017-12-04T23:29:00Z

accepted/file-enumeration.md

+1. MSBuild can custom filter filesystem entries with limited allocations and form the results in any desired format.
+2. Users can build custom enumerations utilizing completely custom or provided commonly used filters and transforms.
+
+## Requirements


This section is intended to be empty; the requirements you want to have should go under goals and the ones you want to scope out under non-goals. #Closed

terrajobst · 2017-12-04T23:30:39Z

accepted/file-enumeration.md

+### Non-Goals
+
+1. API will not expose platform specific data
+3. Error handling configuration is fully customizable


Error handling configuration is fully customizable [](start = 3, length = 50)

That needs more detail as a goal says Error handling is configurable while a non-goal says Error handling configuration is fully customizable. You need to draw say enough to that the reader can draw a line in their head of what's in and what's out #Closed

Got it

In reply to: 154807326 [](ancestors = 154807326)

danmoseley · 2017-12-04T23:31:12Z

accepted/file-enumeration.md

+            FindOptions options = FindOptions.None);
+    }
+
+    public static class DirectoryInfo


DirectoryInfo is not static of course #Closed

Yup, misstype

In reply to: 154807425 [](ancestors = 154807425)

terrajobst · 2017-12-04T23:32:20Z

accepted/file-enumeration.md

+    /// <summary>
+    /// Delegate for filtering out find results.
+    /// </summary>
+    internal delegate bool FindPredicate<TState>(ref RawFindData findData, TState state);


internal [](start = 4, length = 8)

Presumably you mean public, not internal, right? #Closed

terrajobst · 2017-12-04T23:32:27Z

accepted/file-enumeration.md

+    /// <summary>
+    /// Delegate for transforming raw find data into a result.
+    /// </summary>
+    internal delegate TResult FindTransform<TResult, TState>(ref RawFindData findData, TState state);


internal [](start = 4, length = 8)

Presumably you mean public, not internal, right? #Closed

Yep, copy paste fail

In reply to: 154807630 [](ancestors = 154807630)

danmoseley · 2017-12-04T23:32:51Z

accepted/file-enumeration.md

+
+```
+
+### DosMatcher


I assume you would have a RegexMatcher as well.

Ideally. We might have to have less than ideal perf to start with if we don't get the Span overloads on Regex at first.

In reply to: 154807693 [](ancestors = 154807693)

terrajobst · 2017-12-04T23:34:27Z

accepted/file-enumeration.md

+}
+```
+
+### Samples


Samples [](start = 4, length = 7)

I'd take both the DosMatcher as well as this sample and make it part of a scenario. #Closed

Ok

In reply to: 154807983 [](ancestors = 154807983)

terrajobst

This doc is a great start. Before merging, I think we need to tweak a few things:

Scenarios are too vague
Goals and non-goals need to be clarified a bit more

But I love where this is heading!

danmoseley · 2017-12-04T23:37:54Z

accepted/file-enumeration.md

+Getting full path of all files matching a given name pattern (close to what FindFiles does, but returning the full path):
+
+``` C#
+public static FindEnumerable<string, string> GetFiles(string directory,


This sample has a heavy cognitive load:

Tuples

ref

Two delegates, likely wrapping other BCL API, which intellisense won't help find

concept of passing in a state at the start of an enumeration

It's the ultimate API, but it is likely intimidating for someone who simply wants look for files matching a regex, without caring about every drop of performance

For sure. Doing this doesn't preclude us adding simpler overloads, and I would actually expect to eventually.

In reply to: 154808558 [](ancestors = 154808558)

I think it would be helpful to propose those here.

danmoseley · 2017-12-04T23:38:46Z

@KrzysztofCwalina I'd be interested in your feedback on the API.

/cc @stephentoub who's out.

svick · 2017-12-04T23:44:34Z

Regarding NotJustDotOrDotDot: I get that . and .. are commonly included in directory listings. But are there actually cases when they are useful when enumerating a directory? If not, NotJustDotOrDotDot could be implicit, making any issues with naming it moot.

poke · 2017-12-05T10:47:38Z

@svick

@poke An important part of the proposal is limiting recursion. How would you do that using a LINQ-like API?

Not sure what exactly you mean here, but I’m thinking of a IQueryable way of constructing a query first before executing it against the file system. So you would just create a query configuration using a fluent syntax which could then be executed in a very similar way to this API.

@JeremyKuhne

@poke It is a bit complicated due to the need to keep allocations to a minimum.

Is this about allocations while enumerating the files, or would you also want to reduce allocations for the query itself?

It is meant to be an advanced extension point that others can build solutions on for high-impact scenarios that aren't broadly applicable.

I understand that but I still believe that we should try to design an API that will allow others to use it just as well for less critical applications. At least as long as we’re still in an early design phase, we could at least try to evaluate other options here.

Since dotnet/csharplang#186 came up a few times now; what is the time plan for this API? Would waiting for / building upon ref extension methods be an acceptable thing here? That might allow us to make everything a bit simpler.

Anyway, I’m thinking of an implementation like this (rough and unflexible sketch, just to show the direction):

public class FindEnumeratorQuery
{
    private string _directory;
    private string _state;
    private FindOptions _options = FindOptions.None;
    private IList<FindPredicate<string>> _predicates = new List<FindPredicate<string>>();
    private FindTransform<TResult, TState> _transform;

    public FindEnumeratorQuery(string directory, bool recursive = false)
    {
        _directory = directory;
        if (recursive)
            _options |= FindOptions.Recurse;
    }

    public FindEnumeratorQuery MatchPattern(string expression, bool ignoreCase = false)
    {
        _state = DosMatcher.TranslateExpression(expression);
        _predicates.Add((ref RawFindData findData, string expr) => DosMatcher.MatchPattern(expr, findData.FileName, ignoreCase: ignoreCase));
    }
    public FindEnumeratorQuery IsDirectory() { … }
    public FindEnumeratorQuery IsNotDirectory() { … }

    public FindEnumeratorQuery AsFullPath() { … }

    public FindEnumerable<string, string> GetFiles<TResult>()
    {
        return new FindEnumerable<string, string>(_directory, _transform,
            CombinePredicates(_predicates),
            state: _state, options: _options);
    }

    private static FindPredicate<string> CombinePredicates(IEnumerable<FindPredicate<string>> predicates) {…}
}

I believe that, if we accept the allocations in the query construction, we could actually design this in a way that the API becomes more friendly without hurting the performance during the actual enumeration.

MichalStrehovsky · 2017-12-05T11:44:44Z

The goal of this is to provide the best performance we can in a cross-platform abstracted way

I don't see the abstraction here; this API basically assumes that the best way to query the filesystem on the underlying platform is with a FindFirstFile-like API.

The approach that @poke is going for (with IQueryable-like fluent syntax) would let the underlying implementation choose the best possible way to get the information user asks for.

The design as proposed here is focusing too much on making just the managed parts fast, forgoing the fact that the underlying platform also needs to do allocations (and I/O, plus potentially networking operations) to satisfy the query. We might be able to get significant savings from limiting that work, but I don't see how the provider could do that, given an API surface that doesn't let the provider know what the user is going for. The time spent doing that extra I/O is orders of magnitude bigger than time spent doing a Gen0 collection of a couple short lived objects (that are per query - not per result).

Even just looking at Windows - with the proposed API, do we know whether we should call FindFirstFile with FIND_FIRST_EX_LARGE_FETCH?

More to be made, pushing what I have to save progress.

jnm2 · 2017-12-05T20:43:18Z

Even just looking at Windows - with the proposed API, do we know whether we should call FindFirstFile with FIND_FIRST_EX_LARGE_FETCH?

I have business code calling FindFirstFile with FIND_FIRST_EX_LARGE_FETCH because each of the alternatives I tried was measured to be thousands of times slower for network shares with thousands of files. I hope you either make this the default or provide control over it.

terrajobst · 2017-12-05T22:51:41Z

accepted/file-enumeration.md

+## Requirements
+
+
+### Goals


Goals [](start = 1, length = 8)

The primary goal seems to be:

Provide an API shape that is as expressive as the current API shape while also reducing the number of allocations to a minimum. We're willing to compromise some usability, but the complexity should be about the same as many of the other Span<T>-based APIs. #Closed

terrajobst · 2017-12-05T23:14:27Z

accepted/file-enumeration.md

+	- Predicate based on FileData
+5. Can avoid throwing access denied exceptions
+
+### Non-Goals


One non-goal to call out is that we don't intend to replace the existing IO APIs -- these are meant to be advanced APIs for folks that really have to care about performance. #Closed

JeremyKuhne · 2017-12-06T01:06:14Z

@poke

Is this about allocations while enumerating the files, or would you also want to reduce allocations for the query itself?

I believe that, if we accept the allocations in the query construction

It is primarily about reducing string allocations for results we don't care about- which is the primary measured perf problem we're trying to solve. We also, however, want to keep the allocations constrained on construction where possible. Additionally we want to keep performance high. Using anonymous methods gets you caching of the delegates, which is part of why I settled on the current pattern after trying a number of others. String.Create() also lets you at internals, which helped shape this.

@MichalStrehovsky

I don't see the abstraction here; this API basically assumes that the best way to query the filesystem on the underlying platform is with a FindFirstFile-like API.

This API is trying to provide the set of data that goes into making a FileSystemInfo, letting you provide logic that filters out results while pushing string allocations as late as possible. That is the abstraction, and the problem it is meant to solve. The managed allocations are the biggest impact, by far, on measured directory iteration performance in key workloads like MSBuild and VS. Making the change as shown (which is currently in Core) improved our own perf two-fold on large filesets (checking thousands or hundreds of thousands of files). We can and will provide common predicates for both platforms and optimize further on those where possible/warranted by providing platform specific implementations that get at more internals.

The design as proposed here is focusing too much on making just the managed parts fast

That is what this proposal is. It isn't to provide direct access to the OS APIs- it's to provide access to the internals of our iterator. I don't see a way to extend the providing of RawFindData on the iterator without making this way more complicated or defeating the allocation reductions goal.

The time spent doing that extra I/O

What extra I/O? There is no way to filter out the disk access for a directory enumeration. The file system has to get all of the data to make a decision. How many extra allocations you have up the stack is what makes all of the difference. I removed a fundamental block on Windows by simply not calling FindFirstFile. It has to copy the NT structure into a Win32 structure and doesn't have context for a "session". Additionally, it allocates a new 4K buffer for every single directory you access.

@jnm2

I have business code calling FindFirstFile with FIND_FIRST_EX_LARGE_FETCH because each of the alternatives I tried was measured to be thousands of times slower for network shares with thousands of files.

I've skipped FindFirstFile entirely and use our own 4K buffer. I saw no impact on local searches when I played with various sizes here. I'm happy to make the buffer size configurable and would love to get some real world data on the impact of buffer sizes on shares. If you have the opportunity to play with tweaking what we have please do. I'll try and experiment some here as well.

https://github.com/dotnet/corefx/blob/master/src/System.IO.FileSystem/src/System/IO/FindEnumerable.Windows.cs#L114

We could provide a single flag option for UseLargeBuffer or something similar. Again, data to inform that would be fantastic.

Clean up flag goals a bit.

JeremyKuhne · 2017-12-07T22:31:07Z

@davidwrighton If you have a chance to look at this I'd appreciate it. I'm also interested in your thoughts on the impact of trying to put an interface on the data struct and passing the interface through the delegates rather than the struct itself (for testability, etc.).

terrajobst · 2017-12-09T03:11:12Z

accepted/file-enumeration.md

+
+To get full paths for files, one would currently have to do the following:
+
+```


Add charp to get syntax highlighting

Lay out a potential variant for FindData to allow unit testing.

terrajobst · 2017-12-11T17:10:38Z

accepted/file-enumeration.md

+
+### Potential testability variant
+
+Having `FindData` be a struct is critical to meet the performance goals for the feature. To allow testing of Filters and Predicates we could also expose an interface of `IFindData`:


We should probably discuss in the API review in more detail, but I feel I'm not sure this minor tweak is worth the complexity. It seems in order to support testing, we'd need to allow passing something around that can serve to produce FindData<T> instances. In other words, for testing it seems just being able to to query a single find data is very limited, compared to an IFileSystem (whatever that would look like).

Also add some more details for future thoughts.

migueldeicaza · 2017-12-12T16:19:25Z

My comments:

I find this new API difficult to read, I am hoping that this can live in a parallel universe and not in the core of the .NET frameworks.

Just how much memory is being allocated today in these scenarios that you are looking for, and what are the savings with this design?
Has someone looked into rewriting the alloc-happy code in the framework to avoid this, like Mono used to?

The examples shown also rely on APIs that require FileInfos to be created, and are themselves not a pinnacle of low-allocation work, starting with the IEnumerable right there, and the use of LINQ ☺

The first one, can be rewritten without allocating FileInfos, like this:

foreach (var t in Directory.GetFiles ("/tmp")) Console.WriteLine (Path.GetFullPath (t))

The second example, filtering for extensions that is given is a poor implementation, if you want to match on extensions, you can let Directory do the work:

foreach (var t in Directory.GetFiles (“*.ext”)) Console.WriteLine (Path.GetFullPath (t))

The last one addresses all three items from the list of pros on that document.

I checked both of those, like this, and not a single FileInfo was created:

mac$ MONO_ENV_OPTIONS='--trace=T:System.IO.FileInfo' csharp -e 'foreach (var t in System.IO.Directory.GetFiles ("/etc", "*.conf")) Console.WriteLine (System.IO.Path.GetFullPath (t))' | grep FileInfo
mac$ MONO_ENV_OPTIONS='--trace=T:System.IO.FileInfo' csharp -e 'foreach (var t in System.IO.Directory.GetFiles ("/etc", "*")) Console.WriteLine (System.IO.Path.GetFullPath (t))' | grep FileInfo

As the doc points out, recursive directory scanning will be more expensive on Unix, no matter what you do, as you need to probe each file listed that you get back from the OS. But if we move the scanning to unmanaged code, no FileInfos will ever be produced, nor stats surfaced on Unix.

I just looked at the implementation in CoreFX (which we just took for Mono) and it looks allocation happy by default, something that the original Mono version did not do. Mono used to do the work in unmanaged code, and surfaced a properly-sized array, while the new implementation creates a List<T> to scan, and then does a ToArray() at the end.

Some questions:

danmoseley · 2017-12-12T16:31:45Z

@migueldeicaza did the questions you mention get cut off? Great feedback BTW

JeremyKuhne · 2017-12-12T19:44:01Z

I find this new API difficult to read, I am hoping that this can live in a parallel universe and not in the core of the .NET frameworks.

We won't be getting rid of the existing APIs. This is not intended for general day-to-day use- we'll keep the "simple" APIs as the primary entry points. It's meant to allow advanced users such as MSBuild to do custom filtering, caching, and transforms without writing their own native implementations.

The first one, can be rewritten without allocating FileInfos, like this

Yes, but you trade in extra string allocations and an expensive normalization call for every entry. Same for the second example. I kept these initial two relatively simple to illustrate usage. I also did not fully optimize the third example to keep it conceptually simpler (e.g. I wouldn’t use Linq in reality). I’ll add some comments.

I just looked at the implementation in CoreFX (which we just took for Mono) and it looks allocation happy by default

For the CoreFX implementation are you looking at what is in master? Master was changed recently to use an internal implementation that looks like this proposal (for Windows). Even with the changes, our APIs that return arrays currently use IEnumerable internally. I've opened an issue about potentially optimizing those scenarios. (https://github.com/dotnet/corefx/issues/25863)

JeremyKuhne · 2017-12-12T21:33:59Z

Note: This is based on the existing internal Windows implementation, see the CoreFX master branch for some portion of this in action. A branch implemented to spec (for Windows only) can be found here: https://github.com/JeremyKuhne/corefx/tree/findpublic.

terrajobst · 2017-12-13T01:08:37Z

@migueldeicaza

By the way, merging doesn't mean we're ignoring your feedback. It just means we've accepted the business goal and are committed to ship the feature. The design will likely evolve, especially after an API review. So the discussion is hardly over!

powercode · 2018-01-30T12:56:39Z

PowerShell, specifically the FileSystemProvider, would be a customer for this to.

PowerShell today has perf issues with enumerating files recursively and getting the Mode and LinkType synthetic properties.

It would be good to make sure that these new APIs will work well with that core scenario. @SteveL-MSFT.

Initial draft of file enumeration design doc.

9178340

JeremyKuhne mentioned this pull request Dec 4, 2017

Explore making state part of the find data dotnet/corefx#25691

Closed

terrajobst reviewed Dec 4, 2017

View reviewed changes

danmoseley reviewed Dec 4, 2017

View reviewed changes

terrajobst reviewed Dec 4, 2017

View reviewed changes

danmoseley reviewed Dec 4, 2017

View reviewed changes

terrajobst reviewed Dec 4, 2017

View reviewed changes

terrajobst suggested changes Dec 4, 2017

View reviewed changes

danmoseley reviewed Dec 4, 2017

View reviewed changes

Updates per feedback

fea0bb2

More to be made, pushing what I have to save progress.

terrajobst reviewed Dec 5, 2017

View reviewed changes

JeremyKuhne added 3 commits December 6, 2017 15:10

Rewrite scenarios and goals/non goals

32f6440

Add a flag for skipping locking

cb3288b

Clean up flag goals a bit.

Minor cleanup

76ae3d5

JeremyKuhne added 2 commits December 7, 2017 17:06

Some Q&A

7edad28

Missing <TState> on FindData struct.

295198b

terrajobst reviewed Dec 9, 2017

View reviewed changes

terrajobst approved these changes Dec 9, 2017

View reviewed changes

Add testable FindData variant

9a010fc

Lay out a potential variant for FindData to allow unit testing.

terrajobst reviewed Dec 11, 2017

View reviewed changes

Move future discussion to it's own header

9bdf78a

Also add some more details for future thoughts.

Add more Q&A

50271fb

JeremyKuhne merged commit a279afe into dotnet:master Dec 13, 2017

JeremyKuhne deleted the enumeration branch December 13, 2017 00:51

powercode mentioned this pull request Jan 30, 2018

Get-ChildItem -Recurse treats literals passed to -Path like wildcard expressions PowerShell/PowerShell#5699

Closed

This was referenced Jan 31, 2020

Investigate improving GetFiles array perf. dotnet/runtime#24427

Closed

Add file enumeration extensibility points dotnet/runtime#24429

Closed


		Recursive enumeration is also problematic in that there is no way to handle error states such as access issues or cycles created by links.

		These restrictions have a significant impact on file system intensive applications, a key example being MSBuild. We can address these restrictions and provide performant, configurable enumeration.


		These restrictions have a significant impact on file system intensive applications, a key example being MSBuild. We can address these restrictions and provide performant, configurable enumeration.

		## Scenarios and User Experience


		To get full paths for files, one would currently have to do the following:

		```


		### Potential testability variant

		Having `FindData` be a struct is critical to meet the performance goals for the feature. To allow testing of Filters and Predicates we could also expose an interface of `IFindData`:

Initial draft of file enumeration design doc. #24

Initial draft of file enumeration design doc. #24

Conversation

JeremyKuhne commented Nov 30, 2017 • edited Loading

poke commented Nov 30, 2017

svick commented Dec 3, 2017

JeremyKuhne commented Dec 4, 2017

CZEMacLeod commented Dec 4, 2017

JeremyKuhne commented Dec 4, 2017

CZEMacLeod commented Dec 4, 2017

JeremyKuhne commented Dec 4, 2017

CZEMacLeod commented Dec 4, 2017

JeremyKuhne commented Dec 4, 2017

terrajobst Dec 4, 2017 • edited Loading

Choose a reason for hiding this comment

terrajobst Dec 4, 2017 • edited Loading

Choose a reason for hiding this comment

terrajobst Dec 4, 2017 • edited Loading

Choose a reason for hiding this comment

terrajobst Dec 4, 2017 • edited Loading

Choose a reason for hiding this comment

terrajobst Dec 4, 2017 • edited Loading

Choose a reason for hiding this comment

terrajobst Dec 4, 2017 • edited Loading

Choose a reason for hiding this comment

terrajobst Dec 4, 2017 • edited Loading

Choose a reason for hiding this comment

terrajobst Dec 4, 2017 • edited Loading

Choose a reason for hiding this comment

terrajobst Dec 4, 2017 • edited Loading

Choose a reason for hiding this comment

terrajobst Dec 4, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

danmoseley Dec 4, 2017 • edited by terrajobst Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

terrajobst Dec 4, 2017 • edited Loading

Choose a reason for hiding this comment

terrajobst Dec 4, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

terrajobst Dec 4, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

terrajobst left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

danmoseley commented Dec 4, 2017

svick commented Dec 4, 2017

poke commented Dec 5, 2017

MichalStrehovsky commented Dec 5, 2017

jnm2 commented Dec 5, 2017 • edited Loading

terrajobst Dec 5, 2017 • edited Loading

Choose a reason for hiding this comment

Goals [](start = 1, length = 8)

terrajobst Dec 5, 2017 • edited Loading

Choose a reason for hiding this comment

JeremyKuhne commented Dec 6, 2017

JeremyKuhne commented Dec 7, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

migueldeicaza commented Dec 12, 2017

danmoseley commented Dec 12, 2017

JeremyKuhne commented Dec 12, 2017

JeremyKuhne commented Dec 12, 2017

terrajobst commented Dec 13, 2017

powercode commented Jan 30, 2018

JeremyKuhne commented Nov 30, 2017 •

edited

Loading

terrajobst Dec 4, 2017 •

edited

Loading

terrajobst Dec 4, 2017 •

edited

Loading

terrajobst Dec 4, 2017 •

edited

Loading

terrajobst Dec 4, 2017 •

edited

Loading

terrajobst Dec 4, 2017 •

edited

Loading

terrajobst Dec 4, 2017 •

edited

Loading

terrajobst Dec 4, 2017 •

edited

Loading

terrajobst Dec 4, 2017 •

edited

Loading

terrajobst Dec 4, 2017 •

edited

Loading

terrajobst Dec 4, 2017 •

edited

Loading

danmoseley Dec 4, 2017 •

edited by terrajobst

Loading

terrajobst Dec 4, 2017 •

edited

Loading

terrajobst Dec 4, 2017 •

edited

Loading

terrajobst Dec 4, 2017 •

edited

Loading

terrajobst left a comment •

edited

Loading

jnm2 commented Dec 5, 2017 •

edited

Loading

terrajobst Dec 5, 2017 •

edited

Loading

terrajobst Dec 5, 2017 •

edited

Loading