
Initial draft of file enumeration design doc. #24

Merged: 10 commits into dotnet:master, Dec 13, 2017

Conversation

@JeremyKuhne (Member) commented Nov 30, 2017

This is a proposal to provide a performant extensibility point for file enumeration in System.IO.

This is based on the existing internal Windows implementation; see the CoreFX master branch for some portion of this in action. A branch implemented to spec (for Windows only) can be found here: https://github.com/JeremyKuhne/corefx/tree/findpublic

@terrajobst, @pjanotti, @danmosemsft

@poke commented Nov 30, 2017

I have to say that the API looks quite complicated. The fact that your sample is basically building a helper method on top of the proposed API kind of shows that consuming it might be too difficult and it is maybe not meant as a high-level way to enumerate the file system after all. I fear that people will just end up using the existing methods which they can then filter using comfortable LINQ instead of dealing with this new API. And I certainly don’t believe this new API should be limited to those that care about its benefits—ideally, everybody should be able to use it without having to invest time to learn completely new API styles.

When I saw the FindPredicates.NotDotOrDotDot(ref findData) && !FindPredicates.IsDirectory(ref findData) part, I was wondering whether it wasn’t possible to at least make those extensions methods, so one could do findData.NotDotOrDotDot().IsNotDirectory() which would make it fluent and a lot easier to use. But I don’t know if extension methods support refs (yet)?

Wouldn't we be able to convert this to a fluent LINQ-like API, so we could do something like FindFiles(directory, recursive: true).Where(f => f.NotDotOrDotDot() && !f.IsDirectory()).AsFullPath()? Or will that not allow us to avoid allocations?

@svick commented Dec 3, 2017

My comments:

  1. The API description lists several types as internal, I think those were supposed to be public.
  2. What's the point of having a FindEnumerable constructor, and also Directory.Enumerate? If they're different, that's confusing. If they're the same, then I think the constructor is unnecessary.
  3. Would it make sense to include state inside RawFindData? That would make the signatures and lambdas simpler, especially if state was not used.
  4. Why is RawFindData named like that? Wouldn't just FindData suffice?
  5. If common predicates and transforms are going to be provided, I think their signatures should be the same as what the methods accept, so that you can use them as method groups, e.g.: Directory.Enumerate<string, object>(path, FindTransforms.AsFullPath<object>, FindPredicates.IsDirectory<object>).
  6. If state is considered optional, then it could make sense to provide overloads where you don't have to specify its type (so the example above would become just Directory.Enumerate<string>(path, FindTransforms.AsFullPath, FindPredicates.IsDirectory)).
  7. NotDotOrDotDot is fun to say out loud but I think it's unclear what it's supposed to do. If it's meant to ignore files whose names start with a dot, which are considered hidden on UNIX, maybe it would make sense to introduce some platform-independent notion of "hidden files"?

@poke An important part of the proposal is limiting recursion. How would you do that using a LINQ-like API?

@JeremyKuhne (Member Author)

I have to say that the API looks quite complicated.

@poke It is a bit complicated due to the need to keep allocations to a minimum. It is meant to be an advanced extension point that others can build solutions on for high-impact scenarios that aren't broadly applicable. Think MSBuild trying to find all files with a set of extensions under a certain folder tree as one example.
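For example, a rough sketch of that MSBuild-style scenario on top of the proposed shapes might look like the following (GetFilesWithExtensions is just an illustrative name, and FileName is assumed to be string-like here; the real member may well be a span):

public static FindEnumerable<string, string[]> GetFilesWithExtensions(string directory,
    params string[] extensions)
{
    return new FindEnumerable<string, string[]>(
        directory,
        (ref RawFindData findData, string[] _) => FindTransforms.AsFullPath(ref findData),
        (ref RawFindData findData, string[] exts) =>
        {
            // Skip directories; keep entries whose name ends with one of the extensions.
            if (FindPredicates.IsDirectory(ref findData))
                return false;
            foreach (string ext in exts)
            {
                if (findData.FileName.EndsWith(ext, StringComparison.OrdinalIgnoreCase))
                    return true;
            }
            return false;
        },
        state: extensions,
        options: FindOptions.Recurse);
}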

@svick

The API description lists several types as internal, I think those were supposed to be public.

Thanks, I'll fix them. This started from our current internal Windows implementation. :)

What's the point of having a FindEnumerable constructor, and also Directory.Enumerate?

We don't need to have both if we seal the class. I'm not settled on what to do there.

Would it make sense to include state inside RawFindData? That would make the signatures and lambdas simpler, especially if state was not used.

dotnet/corefx#25691 explores doing this.

Why is RawFindData named like that? Wouldn't just FindData suffice?

I don't think we would introduce a higher level abstraction, so FindData is probably ok.

If common predicates and transforms are going to be provided, I think their signatures should be the same as what the methods accept

I was trying to encourage leveraging the fact that anonymous method delegates are cached. I'll definitely give that more thought.
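A minimal sketch of the compiler behavior being alluded to (as of the C# 7.x compilers):

static bool IsPositive(int value) => value > 0;

static void Example()
{
    // Non-capturing lambda: the compiler caches the delegate instance in a static
    // field, so repeated executions of this line reuse the same object.
    Func<int, bool> viaLambda = value => value > 0;

    // Method group conversion: allocates a new delegate object each time it runs.
    Func<int, bool> viaMethodGroup = IsPositive;
}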

If state is considered optional, then it could make sense to provide overloads where you don't have to specify its type

Seems rational at first; I'll try it out.

NotDotOrDotDot is fun to say out loud but I think it's unclear what it's supposed to do.

This one is a struggle. Could go with NotJustDotOrDotDot?

@CZEMacLeod

@svick @JeremyKuhne
Assuming NotDotOrDotDot is a predicate that excludes the current directory and parent directory entries that are returned in directory enumerations (i.e. . and ..), I think the name is fine as is, although it might make more sense in fluent form as @poke suggests? In which case it might be findData.ExceptDotAndDotDot()?
If this is referring to Linux type hidden files/folders starting with a . then the name is confusing.
Looking at your example case, should the predicate include the Not at all?

public static FindEnumerable<string, string> GetFiles(string directory,
    string expression = "*",
    bool recursive = false)
{
    return new FindEnumerable<string, string>(
        directory,
        (ref RawFindData findData, string expr) => FindTransforms.AsFullPath(ref findData),
        (ref RawFindData findData, string expr) =>
        {
            return !FindPredicates.DotOrDotDot(ref findData)
                && !FindPredicates.IsDirectory(ref findData)
                && DosMatcher.MatchPattern(expr, findData.FileName, ignoreCase: true);
        },
        state: DosMatcher.TranslateExpression(expression),
        options: recursive ? FindOptions.Recurse : FindOptions.None);
}

Also, aren't . and .. directories and would be excluded by the IsDirectory predicate anyway?

One other thing is that if RawFindData can contain directory entries for enumerating and filtering directories as well as files, then perhaps FileName is not the right property name?

@JeremyKuhne (Member Author)

it might make more sense in fluent form as @poke suggests?

It would make more sense, but would defeat key design goals, notably to minimize allocations. The struct has to be a ref struct to make this possible. If we had dotnet/csharplang#186 it would be worth exploring.
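Roughly, these are the ref struct restrictions that get in the way (sketch only; assume RawFindData is a ref struct over the native find buffer):

public ref struct RawFindData
{
    public ReadOnlySpan<char> FileName { get; }   // sketch of one member
}

// Each of these is a compile error for a ref struct, which is what rules out the
// fluent/LINQ-style shapes:
//
//   IEnumerable<RawFindData> results;             // can't be a generic type argument
//   object boxed = default(RawFindData);          // can't be boxed
//   Func<char> f = () => findData.FileName[0];    // can't be captured by a lambda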

Also, aren't . and .. directories and would be excluded by the IsDirectory predicate anyway?

They are, good catch.

perhaps FileName is not the right property name?

Perhaps Name would be ok.

@CZEMacLeod

@JeremyKuhne Just finished reading dotnet/csharplang#186 so I now understand why this doesn't (currently) have fluent / extension methods.
I must admit I mainly write in vb.net and didn't realise that C# couldn't do ByRef extension methods.

@JeremyKuhne (Member Author)

@CZEMacLeod It is one of my top wishlist items. That and dotnet/csharplang#187. :)

@CZEMacLeod

@JeremyKuhne I notice that some of this comes off the back of MSBuild 'Globbing' (as an aside - is that actually a word?)
Not wanting to take this off-topic at all but I was involved in a discussion on a very similar issue on an asp.net core equivalent of the old asp.net optimization. The feature request was to enable recursively searching for files to bundle using patterns.
It was noted there that this should be able to work over IFileProvider rather than the raw filesystem to allow for the same functionality when using 'virtual' files of one sort or another - e.g. embedded resources, zips, dynamically generated files, etc.
Do you think it might be possible to ensure that this 'concrete' filesystem API can also work this way so we don't need to learn 2 (or more) APIs?
For reference the conversation is Shazwazza/Smidge#47
I believe Nuget also uses (used?) an implementation of IFileProvider in its pack and restore commands to determine files to include - and again includes a lot of pattern matching.
It might be worth considering the scenarios for NuGet and ASP.Net in this spec.

@JeremyKuhne (Member Author)

'Globbing' (as an aside - is that actually a word?)

It is. https://en.wikipedia.org/wiki/Glob_%28programming%29

Do you think it might be possible to ensure that this 'concrete' filesystem API can also work this way so we don't need to learn 2 (or more) APIs?

I'm open to suggestions on how we could make FindData more flexible without creating too much overhead. Biggest thing is keeping allocations down, but speed is important too. As System.IO currently has a version of this proposal, we can experiment and see real-world impact of design options. I've been measuring impact on >400K file sets on multiple types of drives (SSD, HDD, DriveSpace arrays...).


Recursive enumeration is also problematic in that there is no way to handle error states such as access issues or cycles created by links.

These restrictions have a significant impact on file system intensive applications, a key example being MSBuild. We can address these restrictions and provide performant, configurable enumeration.

@terrajobst (Member), Dec 4, 2017, on "We can address these restrictions and provide performant, configurable enumeration.":

You want to make sure that your intro has a concrete proposal. Maybe something like:

This proposes a new set of primitive file and directory traversal APIs that are optimized for providing more flexibility while keeping the overhead to a minimum so that enumeration becomes both more powerful as well as more performant. #Closed



## Scenarios and User Experience

@terrajobst (Member), Dec 4, 2017, on "Scenarios and User Experience":

Your scenarios are way too short. You want to make those headings and show some sample code consuming the APIs you're proposing. Scenarios are meant to illustrate the value your APIs are adding. They shouldn't be longer than a few paragraphs, but they also shouldn't be that abstract. #Closed

## Requirements

1. Filtering based on common file system data is possible
a. Name

@terrajobst (Member), Dec 4, 2017, on "a.":

Doesn't render on GitHub as an enumeration. Use numbers or bullet points instead. #Closed

b. Attributes
c. Time stamps
d. File size

@terrajobst (Member), Dec 4, 2017, on "d. File size":

What doesn't pop here is how the filter can be provided. I think you want to call out explicitly that the filter is a custom function and thus supports all sorts of filtering criteria, including calling custom functions and arbitrary combinators (AND, OR, NOT, etc.). #Closed

2. Result transforms can be of any type

@terrajobst (Member), Dec 4, 2017, on "Result transforms can be of any type":

I think this needs more detail. Basically, you need to call out how this would differ from, say, regular LINQ. #Closed

3. We provide common filters and transforms

@terrajobst (Member), Dec 4, 2017, on "We provide common filters and transforms":

I'd explicitly list the ones you think are common, such as recursively searching by file name prefix, suffix, and extension (or whatever the ones are you have in mind). #Closed

4. Recursive behavior is configurable

@terrajobst (Member), Dec 4, 2017, on "Recursive behavior is configurable":

I think you want more detail here. I assume you want ON, OFF, and a delegate so that user code can use the criteria listed in (1) to decide whether to walk a subtree. #Closed

5. Error handling is configurable

@terrajobst (Member), Dec 4, 2017, on "Error handling is configurable":

Presumably you want to call out that this should be able to avoid throwing & catching exceptions. #Closed

1. MSBuild can custom filter filesystem entries with limited allocations and form the results in any desired format.
2. Users can build custom enumerations utilizing completely custom or provided commonly used filters and transforms.

## Requirements

@terrajobst (Member), Dec 4, 2017:

This section is intended to be empty; the requirements you want to have should go under goals and the ones you want to scope out under non-goals. #Closed

### Non-Goals

1. API will not expose platform specific data
3. Error handling configuration is fully customizable

@terrajobst (Member), Dec 4, 2017, on "Error handling configuration is fully customizable":

That needs more detail, as a goal says "Error handling is configurable" while a non-goal says "Error handling configuration is fully customizable". You need to say enough that the reader can draw a line in their head of what's in and what's out. #Closed

@JeremyKuhne (Member Author):

Got it

FindOptions options = FindOptions.None);
}

public static class DirectoryInfo

@danmoseley (Member), Dec 4, 2017:

DirectoryInfo is not static of course #Closed

@JeremyKuhne (Member Author):

Yup, a typo.

/// <summary>
/// Delegate for filtering out find results.
/// </summary>
internal delegate bool FindPredicate<TState>(ref RawFindData findData, TState state);

@terrajobst (Member), Dec 4, 2017, on "internal":

Presumably you mean public, not internal, right? #Closed

/// <summary>
/// Delegate for transforming raw find data into a result.
/// </summary>
internal delegate TResult FindTransform<TResult, TState>(ref RawFindData findData, TState state);

@terrajobst (Member), Dec 4, 2017, on "internal":

Presumably you mean public, not internal, right? #Closed

@JeremyKuhne (Member Author):

Yep, copy paste fail


```

### DosMatcher

Member:

I assume you would have a RegexMatcher as well.

@JeremyKuhne (Member Author):

Ideally. We might have to accept less-than-ideal perf to start with if we don't get the Span overloads on Regex at first.

}
```

### Samples

@terrajobst (Member), Dec 4, 2017, on "Samples":

I'd take both the DosMatcher as well as this sample and make it part of a scenario. #Closed

@JeremyKuhne (Member Author):

Ok

@terrajobst (Member) left a comment:

This doc is a great start. Before merging, I think we need to tweak a few things:

  • Scenarios are too vague
  • Goals and non-goals need to be clarified a bit more

But I love where this is heading!

Getting full path of all files matching a given name pattern (close to what FindFiles does, but returning the full path):

``` C#
public static FindEnumerable<string, string> GetFiles(string directory,

Member:

This sample has a heavy cognitive load:

  • Tuples
  • ref
  • Two delegates, likely wrapping other BCL API, which intellisense won't help find
  • concept of passing in a state at the start of an enumeration

It's the ultimate API, but it is likely intimidating for someone who simply wants to look for files matching a regex without caring about every drop of performance.

@JeremyKuhne (Member Author):

For sure. Doing this doesn't preclude us adding simpler overloads, and I would actually expect to eventually.

Member:

I think it would be helpful to propose those here.

@danmoseley (Member)

@KrzysztofCwalina I'd be interested in your feedback on the API.

/cc @stephentoub who's out.

@svick commented Dec 4, 2017

Regarding NotJustDotOrDotDot: I get that . and .. are commonly included in directory listings. But are there actually cases when they are useful when enumerating a directory? If not, NotJustDotOrDotDot could be implicit, making any issues with naming it moot.

@poke commented Dec 5, 2017

@svick

@poke An important part of the proposal is limiting recursion. How would you do that using a LINQ-like API?

Not sure what exactly you mean here, but I'm thinking of an IQueryable way of constructing a query first before executing it against the file system. So you would just create a query configuration using a fluent syntax which could then be executed in a very similar way to this API.


@JeremyKuhne

@poke It is a bit complicated due to the need to keep allocations to a minimum.

Is this about allocations while enumerating the files, or would you also want to reduce allocations for the query itself?

It is meant to be an advanced extension point that others can build solutions on for high-impact scenarios that aren't broadly applicable.

I understand that, but I still believe we should try to design an API that others can use just as well for less critical applications. As long as we're still in an early design phase, we could at least try to evaluate other options here.

Since dotnet/csharplang#186 came up a few times now; what is the time plan for this API? Would waiting for / building upon ref extension methods be an acceptable thing here? That might allow us to make everything a bit simpler.


Anyway, I'm thinking of an implementation like this (rough and inflexible sketch, just to show the direction):

public class FindEnumeratorQuery
{
    private string _directory;
    private string _state;
    private FindOptions _options = FindOptions.None;
    private IList<FindPredicate<string>> _predicates = new List<FindPredicate<string>>();
    private FindTransform<string, string> _transform;

    public FindEnumeratorQuery(string directory, bool recursive = false)
    {
        _directory = directory;
        if (recursive)
            _options |= FindOptions.Recurse;
    }

    public FindEnumeratorQuery MatchPattern(string expression, bool ignoreCase = false)
    {
        _state = DosMatcher.TranslateExpression(expression);
        _predicates.Add((ref RawFindData findData, string expr) => DosMatcher.MatchPattern(expr, findData.FileName, ignoreCase: ignoreCase));
        return this;
    }
    public FindEnumeratorQuery IsDirectory() { … }
    public FindEnumeratorQuery IsNotDirectory() { … }

    public FindEnumeratorQuery AsFullPath() { … }

    public FindEnumerable<string, string> GetFiles()
    {
        return new FindEnumerable<string, string>(_directory, _transform,
            CombinePredicates(_predicates),
            state: _state, options: _options);
    }

    private static FindPredicate<string> CombinePredicates(IEnumerable<FindPredicate<string>> predicates) {…}
}

I believe that, if we accept the allocations in the query construction, we could actually design this in a way that the API becomes more friendly without hurting the performance during the actual enumeration.
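To illustrate, consuming that hypothetical query type could then look something like:

var csFiles = new FindEnumeratorQuery(@"C:\source\repo", recursive: true)
    .MatchPattern("*.cs", ignoreCase: true)
    .IsNotDirectory()
    .AsFullPath()
    .GetFiles();

foreach (string fullPath in csFiles)
    Console.WriteLine(fullPath);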

@MichalStrehovsky (Member)

The goal of this is to provide the best performance we can in a cross-platform abstracted way

I don't see the abstraction here; this API basically assumes that the best way to query the filesystem on the underlying platform is with a FindFirstFile-like API.

The approach that @poke is going for (with IQueryable-like fluent syntax) would let the underlying implementation choose the best possible way to get the information user asks for.

The design as proposed here is focusing too much on making just the managed parts fast, forgoing the fact that the underlying platform also needs to do allocations (and I/O, plus potentially networking operations) to satisfy the query. We might be able to get significant savings from limiting that work, but I don't see how the provider could do that, given an API surface that doesn't let the provider know what the user is going for. The time spent doing that extra I/O is orders of magnitude bigger than time spent doing a Gen0 collection of a couple short lived objects (that are per query - not per result).

Even just looking at Windows - with the proposed API, do we know whether we should call FindFirstFile with FIND_FIRST_EX_LARGE_FETCH?

Commit pushed: "More to be made, pushing what I have to save progress."
@jnm2 (Contributor) commented Dec 5, 2017

Even just looking at Windows - with the proposed API, do we know whether we should call FindFirstFile with FIND_FIRST_EX_LARGE_FETCH?

I have business code calling FindFirstFile with FIND_FIRST_EX_LARGE_FETCH because each of the alternatives I tried was measured to be thousands of times slower for network shares with thousands of files. I hope you either make this the default or provide control over it.
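For reference, the Win32 call pattern being discussed looks roughly like this (the constants, struct layout, and signatures are the documented Win32 ones; the managed wrapper itself is only an illustrative sketch):

using System;
using System.IO;
using System.Runtime.InteropServices;

static class LargeFetchEnumerator
{
    const int FindExInfoBasic = 1;              // FINDEX_INFO_LEVELS: skip the short (8.3) name
    const int FindExSearchNameMatch = 0;        // FINDEX_SEARCH_OPS
    const int FIND_FIRST_EX_LARGE_FETCH = 0x2;  // ask the OS for larger buffers per round trip
    static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
    struct WIN32_FIND_DATA
    {
        public uint dwFileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public uint nFileSizeHigh;
        public uint nFileSizeLow;
        public uint dwReserved0;
        public uint dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)] public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)] public string cAlternateFileName;
    }

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern IntPtr FindFirstFileExW(string lpFileName, int fInfoLevelId,
        out WIN32_FIND_DATA lpFindFileData, int fSearchOp, IntPtr lpSearchFilter, int dwAdditionalFlags);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern bool FindNextFileW(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FindClose(IntPtr hFindFile);

    public static void PrintFileNames(string directory)
    {
        IntPtr handle = FindFirstFileExW(Path.Combine(directory, "*"), FindExInfoBasic,
            out WIN32_FIND_DATA data, FindExSearchNameMatch, IntPtr.Zero, FIND_FIRST_EX_LARGE_FETCH);
        if (handle == INVALID_HANDLE_VALUE)
            return;
        try
        {
            do { Console.WriteLine(data.cFileName); }
            while (FindNextFileW(handle, out data));
        }
        finally { FindClose(handle); }
    }
}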

## Requirements


### Goals

@terrajobst (Member), Dec 5, 2017, on "Goals":

The primary goal seems to be:

  • Provide an API shape that is as expressive as the current API shape while also reducing the number of allocations to a minimum. We're willing to compromise some usability, but the complexity should be about the same as many of the other Span<T>-based APIs. #Closed

- Predicate based on FileData
5. Can avoid throwing access denied exceptions

### Non-Goals

@terrajobst (Member), Dec 5, 2017:

One non-goal to call out is that we don't intend to replace the existing IO APIs -- these are meant to be advanced APIs for folks that really have to care about performance. #Closed

@JeremyKuhne (Member Author)

@poke

Is this about allocations while enumerating the files, or would you also want to reduce allocations for the query itself?

I believe that, if we accept the allocations in the query construction

It is primarily about reducing string allocations for results we don't care about, which is the primary measured perf problem we're trying to solve. We also, however, want to keep the allocations constrained on construction where possible. Additionally, we want to keep performance high. Using anonymous methods gets you caching of the delegates, which is part of why I settled on the current pattern after trying a number of others. String.Create() also lets you get at the internals, which helped shape this.
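As a minimal sketch of the String.Create() pattern (illustrative only; the real code writes from the native find buffer rather than from existing strings):

static string CombinePath(string directory, string fileName)
{
    // One allocation for the final string; characters are written directly into it.
    int length = directory.Length + 1 + fileName.Length;
    return string.Create(length, (dir: directory, name: fileName), (span, state) =>
    {
        state.dir.AsSpan().CopyTo(span);
        span[state.dir.Length] = Path.DirectorySeparatorChar;
        state.name.AsSpan().CopyTo(span.Slice(state.dir.Length + 1));
    });
}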

@MichalStrehovsky

I don't see the abstraction here; this API basically assumes that the best way to query the filesystem on the underlying platform is with a FindFirstFile-like API.

This API is trying to provide the set of data that goes into making a FileSystemInfo, letting you provide logic that filters out results while pushing string allocations as late as possible. That is the abstraction, and the problem it is meant to solve. The managed allocations are the biggest impact, by far, on measured directory iteration performance in key workloads like MSBuild and VS. Making the change as shown (which is currently in Core) improved our own perf two-fold on large filesets (checking thousands or hundreds of thousands of files). We can and will provide common predicates for both platforms and optimize further on those where possible/warranted by providing platform specific implementations that get at more internals.

The design as proposed here is focusing too much on making just the managed parts fast

That is what this proposal is. It isn't to provide direct access to the OS APIs; it's to provide access to the internals of our iterator. I don't see a way to extend how RawFindData is provided by the iterator without making this way more complicated or defeating the allocation-reduction goal.

The time spent doing that extra I/O

What extra I/O? There is no way to filter out the disk access for a directory enumeration. The file system has to get all of the data to make a decision. How many extra allocations you have up the stack is what makes all of the difference. I removed a fundamental block on Windows by simply not calling FindFirstFile. It has to copy the NT structure into a Win32 structure and doesn't have context for a "session". Additionally, it allocates a new 4K buffer for every single directory you access.

@jnm2

I have business code calling FindFirstFile with FIND_FIRST_EX_LARGE_FETCH because each of the alternatives I tried was measured to be thousands of times slower for network shares with thousands of files.

I've skipped FindFirstFile entirely and use our own 4K buffer. I saw no impact on local searches when I played with various sizes here. I'm happy to make the buffer size configurable and would love to get some real world data on the impact of buffer sizes on shares. If you have the opportunity to play with tweaking what we have please do. I'll try and experiment some here as well.

https://github.com/dotnet/corefx/blob/master/src/System.IO.FileSystem/src/System/IO/FindEnumerable.Windows.cs#L114

We could provide a single flag option for UseLargeBuffer or something similar. Again, data to inform that would be fantastic.

@JeremyKuhne (Member Author)

@davidwrighton If you have a chance to look at this I'd appreciate it. I'm also interested in your thoughts on the impact of trying to put an interface on the data struct and passing the interface through the delegates rather than the struct itself (for testability, etc.).


To get full paths for files, one would currently have to do the following:

```

Member:

Add csharp to get syntax highlighting.

Commit pushed: "Lay out a potential variant for FindData to allow unit testing."

### Potential testability variant

Having `FindData` be a struct is critical to meet the performance goals for the feature. To allow testing of Filters and Predicates we could also expose an interface of `IFindData`:

Member:

We should probably discuss in the API review in more detail, but I'm not sure this minor tweak is worth the complexity. It seems that in order to support testing, we'd need to allow passing something around that can serve to produce FindData<T> instances. In other words, for testing, just being able to query a single find data seems very limited compared to an IFileSystem (whatever that would look like).

Commit pushed: "Also add some more details for future thoughts."
@migueldeicaza

My comments:

I find this new API difficult to read, I am hoping that this can live in a parallel universe and not in the core of the .NET frameworks.

  • Just how much memory is being allocated today in these scenarios that you are looking for, and what are the savings with this design?
  • Has someone looked into rewriting the alloc-happy code in the framework to avoid this, like Mono used to?

The examples shown also rely on APIs that require FileInfos to be created, and are themselves not a pinnacle of low-allocation work, starting with the IEnumerable right there, and the use of LINQ ☺

The first one can be rewritten without allocating FileInfos, like this:

foreach (var t in Directory.GetFiles ("/tmp")) Console.WriteLine (Path.GetFullPath (t))

The second example given, filtering for extensions, is a poor implementation; if you want to match on extensions, you can let Directory do the work:

foreach (var t in Directory.GetFiles ("/tmp", "*.ext")) Console.WriteLine (Path.GetFullPath (t))

The last one addresses all three items from the list of pros on that document.

I checked both of those, like this, and not a single FileInfo was created:

mac$ MONO_ENV_OPTIONS='--trace=T:System.IO.FileInfo' csharp -e 'foreach (var t in System.IO.Directory.GetFiles ("/etc", "*.conf")) Console.WriteLine (System.IO.Path.GetFullPath (t))' | grep FileInfo
mac$ MONO_ENV_OPTIONS='--trace=T:System.IO.FileInfo' csharp -e 'foreach (var t in System.IO.Directory.GetFiles ("/etc", "*")) Console.WriteLine (System.IO.Path.GetFullPath (t))' | grep FileInfo

As the doc points out, recursive directory scanning will be more expensive on Unix, no matter what you do, as you need to probe each file listed that you get back from the OS. But if we move the scanning to unmanaged code, no FileInfos will ever be produced, nor stats surfaced on Unix.

I just looked at the implementation in CoreFX (which we just took for Mono) and it looks allocation happy by default, something that the original Mono version did not do. Mono used to do the work in unmanaged code, and surfaced a properly-sized array, while the new implementation creates a List<T> to scan, and then does a ToArray() at the end.

Some questions:

@danmoseley (Member)

@migueldeicaza did the questions you mention get cut off? Great feedback BTW

@JeremyKuhne (Member Author)

I find this new API difficult to read, I am hoping that this can live in a parallel universe and not in the core of the .NET frameworks.

We won't be getting rid of the existing APIs. This is not intended for general day-to-day use; we'll keep the "simple" APIs as the primary entry points. It's meant to allow advanced users such as MSBuild to do custom filtering, caching, and transforms without writing their own native implementations.

The first one, can be rewritten without allocating FileInfos, like this

Yes, but you trade that for extra string allocations and an expensive normalization call for every entry. Same for the second example. I kept these initial two relatively simple to illustrate usage. I also did not fully optimize the third example to keep it conceptually simpler (e.g. I wouldn't use LINQ in reality). I'll add some comments.

I just looked at the implementation in CoreFX (which we just took for Mono) and it looks allocation happy by default

For the CoreFX implementation are you looking at what is in master? Master was changed recently to use an internal implementation that looks like this proposal (for Windows). Even with the changes, our APIs that return arrays currently use IEnumerable internally. I've opened an issue about potentially optimizing those scenarios. (https://github.com/dotnet/corefx/issues/25863)

@JeremyKuhne (Member Author)

Note: This is based on the existing internal Windows implementation, see the CoreFX master branch for some portion of this in action. A branch implemented to spec (for Windows only) can be found here: https://github.com/JeremyKuhne/corefx/tree/findpublic.

@JeremyKuhne JeremyKuhne merged commit a279afe into dotnet:master Dec 13, 2017
@JeremyKuhne JeremyKuhne deleted the enumeration branch December 13, 2017 00:51
@terrajobst (Member)

@migueldeicaza

By the way, merging doesn't mean we're ignoring your feedback. It just means we've accepted the business goal and are committed to ship the feature. The design will likely evolve, especially after an API review. So the discussion is hardly over!

@powercode

PowerShell, specifically the FileSystemProvider, would be a customer for this too.

PowerShell today has perf issues with enumerating files recursively and getting the Mode and LinkType synthetic properties.

It would be good to make sure that these new APIs will work well with that core scenario. @SteveL-MSFT.
