Initial draft of file enumeration design doc. #24
I have to say that the API looks quite complicated. The fact that your sample is basically building a helper method on top of the proposed API kind of shows that consuming it might be too difficult and it is maybe not meant as a high-level way to enumerate the file system after all. I fear that people will just end up using the existing methods which they can then filter using comfortable LINQ instead of dealing with this new API. And I certainly don’t believe this new API should be limited to those that care about its benefits—ideally, everybody should be able to use it without having to invest time to learn completely new API styles. When I saw the Wouldn’t we be able to convert this to a fluid LINQ-like API, so we could do something like |
My comments:
@poke An important part of the proposal is limiting recursion. How would you do that using a LINQ-like API? |
@poke It is a bit complicated due to the need to keep allocations to a minimum. It is meant to be an advanced extension point that others can build solutions on for high-impact scenarios that aren't broadly applicable. Think MSBuild trying to find all files with a set of extensions under a certain folder tree as one example.
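For illustration, the MSBuild-style case (all files with a given set of extensions under a folder tree) might look roughly like the sketch below. It is written against the `System.IO.Enumeration` types available in .NET Core 2.1 and later, not the draft `FindData`/`FindEnumerable` names discussed in this thread; `ExtensionSearch` and `FindByExtensions` are hypothetical names. The point is that a full-path string is only allocated for entries that pass the filter.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Enumeration;

public static class ExtensionSearch
{
    // Hypothetical helper: returns full paths of files under 'root' whose
    // extension matches one of 'extensions' (for example ".cs", ".csproj").
    public static IEnumerable<string> FindByExtensions(string root, params string[] extensions)
    {
        return new FileSystemEnumerable<string>(
            root,
            // Transform: runs only for entries accepted by the predicate below.
            (ref FileSystemEntry entry) => entry.ToFullPath(),
            new EnumerationOptions { RecurseSubdirectories = true })
        {
            // Filter: inspects the entry in place, without creating strings.
            ShouldIncludePredicate = (ref FileSystemEntry entry) =>
            {
                if (entry.IsDirectory) return false;
                ReadOnlySpan<char> ext = Path.GetExtension(entry.FileName);
                foreach (string candidate in extensions)
                {
                    if (ext.Equals(candidate, StringComparison.OrdinalIgnoreCase)) return true;
                }
                return false;
            }
        };
    }
}
```

Capturing `extensions` in the lambda costs one closure allocation per query, not per result.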
Thanks, I'll fix them. This started from our current internal Windows implementation. :)
We don't need to have both if we seal the class. I'm not settled on what to do there.
dotnet/corefx#25691 explores doing this.
I don't think we would introduce a higher level abstraction, so FindData is probably ok.
Was trying to encourage leveraging that anonymous method delegates are cached. I'll definitely give that more thought.
Seems rational at first- I'll try it out
This one is a struggle. Could go with |
@svick @JeremyKuhne
Also, aren't . and .. directories and would be excluded by the IsDirectory predicate anyway? One other thing is that if |
It would make more sense, but would defeat key design goals, notably to minimize allocations. The struct has to be a ref struct to make this possible. If we had dotnet/csharplang#186 it would be worth exploring.
They are, good catch.
Perhaps |
@JeremyKuhne Just finished reading dotnet/csharplang#186 so I now understand why this doesn't (currently) have fluent / extension methods. |
@CZEMacLeod It is one of my top wishlist items. That and dotnet/csharplang#187. :) |
@JeremyKuhne I notice that some of this comes off the back of MSBuild 'Globbing' (as an aside - is that actually a word?) |
It is. https://en.wikipedia.org/wiki/Glob_%28programming%29
I'm open to suggestions on how we could make FindData more flexible without creating too much overhead. Biggest thing is keeping allocations down, but speed is important too. As System.IO currently has a version of this proposal, we can experiment and see real-world impact of design options. I've been measuring impact on >400K file sets on multiple types of drives (SSD, HDD, DriveSpace arrays...). |
accepted/file-enumeration.md
Outdated
Recursive enumeration is also problematic in that there is no way to handle error states such as access issues or cycles created by links.

These restrictions have a significant impact on file system intensive applications, a key example being MSBuild. We can address these restrictions and provide performant, configurable enumeration.

We can address these restrictions and provide performant, configurable enumeration.
You want to make sure that your intro has a concrete proposal. Maybe something like:
This proposes a new set of primitive file and directory traversal APIs that are optimized for providing more flexibility while keeping the overhead to a minimum so that enumeration becomes both more powerful as well as more performant. #Closed
accepted/file-enumeration.md
Outdated
These restrictions have a significant impact on file system intensive applications, a key example being MSBuild. We can address these restrictions and provide performant, configurable enumeration.

## Scenarios and User Experience

Scenarios and User Experience
Your scenarios are way too short. You want to make those headings and show some sample code consuming the APIs you're proposing. Scenarios are meant to illustrate the value your APIs are adding. They shouldn't be longer than a few paragraphs but they also shouldn't be that abstract. #Closed
accepted/file-enumeration.md
Outdated
## Requirements

1. Filtering based on common file system data is possible
a. Name

a.
Doesn't render on GitHub as an enumeration. Use numbers or bullet points instead. #Closed
accepted/file-enumeration.md
Outdated
a. Name
b. Attributes
c. Time stamps
d. File size

d. File size
What doesn't pop here is how the filter can be provided. I think you want to call out explicitly that the filter is a custom function and thus supports all sorts of filtering criteria, including calling custom functions and arbitrary combinators (AND, OR, NOT, etc.). #Closed
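To make the combinator point concrete, here is a minimal sketch of AND/OR/NOT over filter delegates. It assumes the `System.IO.Enumeration` types available in .NET Core 2.1 and later, not the draft names in this proposal; `EntryFilters`, `EntryPredicate`, and `Names` are hypothetical. Because the entry is a ref struct passed by `ref`, `Func<T, bool>` cannot be used, so a custom delegate type stands in.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Enumeration;

public static class EntryFilters
{
    // Hypothetical delegate type: a predicate over the by-ref entry.
    public delegate bool EntryPredicate(ref FileSystemEntry entry);

    public static EntryPredicate And(EntryPredicate a, EntryPredicate b) =>
        (ref FileSystemEntry e) => a(ref e) && b(ref e);

    public static EntryPredicate Or(EntryPredicate a, EntryPredicate b) =>
        (ref FileSystemEntry e) => a(ref e) || b(ref e);

    public static EntryPredicate Not(EntryPredicate a) =>
        (ref FileSystemEntry e) => !a(ref e);

    // Enumerates names of entries directly under 'root' that pass 'filter'.
    public static IEnumerable<string> Names(string root, EntryPredicate filter)
    {
        return new FileSystemEnumerable<string>(
            root,
            (ref FileSystemEntry e) => e.FileName.ToString())
        {
            ShouldIncludePredicate = (ref FileSystemEntry e) => filter(ref e)
        };
    }
}
```

A caller could then write `EntryFilters.And(isFile, EntryFilters.Not(isLarge))`, where `isFile` and `isLarge` are ordinary lambdas over `FileSystemEntry`. Each combinator allocates a closure when the query is built, which is a trade-off against the allocation goals discussed in this thread.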
accepted/file-enumeration.md
Outdated
b. Attributes
c. Time stamps
d. File size
2. Result transforms can be of any type

Result transforms can be of any type
I think this needs more details. Basically, you need to call out how this would differ from, say, regular LINQ. #Closed
accepted/file-enumeration.md
Outdated
c. Time stamps
d. File size
2. Result transforms can be of any type
3. We provide common filters and transforms

We provide common filters and transforms
I'd explicitly list the ones you think are common, such as recursively searching by file name prefix, suffix, and extensions (or whatever the ones are you have in mind). #Closed
accepted/file-enumeration.md
Outdated
d. File size
2. Result transforms can be of any type
3. We provide common filters and transforms
4. Recursive behavior is configurable

Recursive behavior is configurable
I think you want more detail here. I assume you want ON, OFF, and a delegate so that user code can use the criteria listed in (1) to decide whether to walk a subtree. #Closed
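As a sketch of the "delegate" option, the following assumes the `System.IO.Enumeration` types available in .NET Core 2.1 and later, not the draft API in this proposal; `RecursionControl` and `FilesSkipping` are hypothetical names. Recursion is switched on through the options object, and a separate predicate decides per directory whether to descend.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Enumeration;

public static class RecursionControl
{
    // Hypothetical helper: returns all files under 'root', but never descends
    // into directories named 'skippedDirectoryName' (for example "obj").
    public static IEnumerable<string> FilesSkipping(string root, string skippedDirectoryName)
    {
        return new FileSystemEnumerable<string>(
            root,
            (ref FileSystemEntry e) => e.ToFullPath(),
            // ON/OFF switch for recursion.
            new EnumerationOptions { RecurseSubdirectories = true })
        {
            // Only files are returned as results.
            ShouldIncludePredicate = (ref FileSystemEntry e) => !e.IsDirectory,
            // Delegate: called for each directory to decide whether to walk it.
            ShouldRecursePredicate = (ref FileSystemEntry e) =>
                !e.FileName.Equals(skippedDirectoryName, StringComparison.OrdinalIgnoreCase)
        };
    }
}
```

The include and recurse predicates are independent, so a directory can be walked without being returned as a result.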
accepted/file-enumeration.md
Outdated
2. Result transforms can be of any type
3. We provide common filters and transforms
4. Recursive behavior is configurable
5. Error handling is configurable

Error handling is configurable
Presumably you want to call out that this should be able to avoid throwing and catching exceptions. #Closed
accepted/file-enumeration.md
Outdated
1. MSBuild can custom filter filesystem entries with limited allocations and form the results in any desired format.
2. Users can build custom enumerations utilizing completely custom or provided commonly used filters and transforms.

## Requirements

This section is intended to be empty; the requirements you want to have should go under goals and the ones you want to scope out under non-goals. #Closed
accepted/file-enumeration.md
Outdated
### Non-Goals

1. API will not expose platform specific data
3. Error handling configuration is fully customizable

Error handling configuration is fully customizable
That needs more detail, as a goal says "Error handling is configurable" while a non-goal says "Error handling configuration is fully customizable". You need to say enough that the reader can draw a line in their head between what's in and what's out. #Closed
accepted/file-enumeration.md
Outdated
FindOptions options = FindOptions.None);
}

public static class DirectoryInfo

DirectoryInfo is not static of course #Closed
accepted/file-enumeration.md
Outdated
/// <summary>
/// Delegate for filtering out find results.
/// </summary>
internal delegate bool FindPredicate<TState>(ref RawFindData findData, TState state);

internal
Presumably you mean `public`, not `internal`, right? #Closed
accepted/file-enumeration.md
Outdated
/// <summary>
/// Delegate for transforming raw find data into a result.
/// </summary>
internal delegate TResult FindTransform<TResult, TState>(ref RawFindData findData, TState state);

internal
Presumably you mean `public`, not `internal`, right? #Closed
|
||
``` | ||
|
||
### DosMatcher |
I assume you would have a RegexMatcher as well.
Ideally. We might have to have less than ideal perf to start with if we don't get the Span overloads on Regex at first.
accepted/file-enumeration.md
Outdated
} | ||
``` | ||
|
||
### Samples |
Samples
I'd take both the `DosMatcher` as well as this sample and make it part of a scenario. #Closed
This doc is a great start. Before merging, I think we need to tweak a few things:
- Scenarios are too vague
- Goals and non-goals need to be clarified a bit more
But I love where this is heading!
accepted/file-enumeration.md
Outdated
Getting full path of all files matching a given name pattern (close to what FindFiles does, but returning the full path): | ||
|
||
``` C# | ||
public static FindEnumerable<string, string> GetFiles(string directory, |
This sample has a heavy cognitive load:
- Tuples
- ref
- Two delegates, likely wrapping other BCL API, which intellisense won't help find
- concept of passing in a state at the start of an enumeration
It's the ultimate API, but it is likely intimidating for someone who simply wants to look for files matching a regex, without caring about every drop of performance.
For sure. Doing this doesn't preclude us adding simpler overloads, and I would actually expect to eventually.
I think it would be helpful to propose those here.
@KrzysztofCwalina I'd be interested in your feedback on the API. /cc @stephentoub who's out. |
Regarding |
Not sure what exactly you mean here, but I’m thinking of a
Is this about allocations while enumerating the files, or would you also want to reduce allocations for the query itself?
I understand that, but I still believe that we should try to design an API that will allow others to use it just as well for less critical applications. As long as we’re still in an early design phase, we could at least try to evaluate other options here. Since dotnet/csharplang#186 has come up a few times now: what is the time plan for this API? Would waiting for / building upon ref extension methods be an acceptable thing here? That might allow us to make everything a bit simpler. Anyway, I’m thinking of an implementation like this (rough and inflexible sketch, just to show the direction):
I believe that, if we accept the allocations in the query construction, we could actually design this in a way that the API becomes more friendly without hurting the performance during the actual enumeration. |
I don't see the abstraction here; this API basically assumes that the best way to query the filesystem on the underlying platform is with a FindFirstFile-like API. The approach that @poke is going for (with The design as proposed here is focusing too much on making just the managed parts fast, forgoing the fact that the underlying platform also needs to do allocations (and I/O, plus potentially networking operations) to satisfy the query. We might be able to get significant savings from limiting that work, but I don't see how the provider could do that, given an API surface that doesn't let the provider know what the user is going for. The time spent doing that extra I/O is orders of magnitude bigger than time spent doing a Gen0 collection of a couple short lived objects (that are per query - not per result). Even just looking at Windows - with the proposed API, do we know whether we should call FindFirstFile with FIND_FIRST_EX_LARGE_FETCH? |
More to be made, pushing what I have to save progress.
I have business code calling |
## Requirements

### Goals

Goals
The primary goal seems to be:
- Provide an API shape that is as expressive as the current API shape while also reducing the number of allocations to a minimum. We're willing to compromise some usability, but the complexity should be about the same as many of the other `Span<T>`-based APIs. #Closed
accepted/file-enumeration.md
Outdated
- Predicate based on FileData
5. Can avoid throwing access denied exceptions

### Non-Goals

One non-goal to call out is that we don't intend to replace the existing IO APIs -- these are meant to be advanced APIs for folks that really have to care about performance. #Closed
It is primarily about reducing string allocations for results we don't care about- which is the primary measured perf problem we're trying to solve. We also, however, want to keep the allocations constrained on construction where possible. Additionally we want to keep performance high. Using anonymous methods gets you caching of the delegates, which is part of why I settled on the current pattern after trying a number of others.
This API is trying to provide the set of data that goes into making a
That is what this proposal is. It isn't to provide direct access to the OS APIs- it's to provide access to the internals of our iterator. I don't see a way to extend the providing of
What extra I/O? There is no way to filter out the disk access for a directory enumeration. The file system has to get all of the data to make a decision. How many extra allocations you have up the stack is what makes all of the difference. I removed a fundamental block on Windows by simply not calling FindFirstFile. It has to copy the NT structure into a Win32 structure and doesn't have context for a "session". Additionally, it allocates a new 4K buffer for every single directory you access.
I've skipped FindFirstFile entirely and use our own 4K buffer. I saw no impact on local searches when I played with various sizes here. I'm happy to make the buffer size configurable and would love to get some real world data on the impact of buffer sizes on shares. If you have the opportunity to play with tweaking what we have please do. I'll try and experiment some here as well. We could provide a single flag option for |
Clean up flag goals a bit.
@davidwrighton If you have a chance to look at this I'd appreciate it. I'm also interested in your thoughts on the impact of trying to put an interface on the data struct and passing the interface through the delegates rather than the struct itself (for testability, etc.). |
accepted/file-enumeration.md
Outdated
|
||
To get full paths for files, one would currently have to do the following: | ||
|
||
``` |
Add `csharp` to get syntax highlighting
Lay out a potential variant for FindData to allow unit testing.
accepted/file-enumeration.md
Outdated
### Potential testability variant

Having `FindData` be a struct is critical to meet the performance goals for the feature. To allow testing of Filters and Predicates we could also expose an interface of `IFindData`:

We should probably discuss this in the API review in more detail, but I'm not sure this minor tweak is worth the complexity. It seems that, in order to support testing, we'd need to allow passing something around that can serve to produce `FindData<T>` instances. In other words, for testing, just being able to query a single find data seems very limited compared to an `IFileSystem` (whatever that would look like).
Also add some more details for future thoughts.
My comments: I find this new API difficult to read, I am hoping that this can live in a parallel universe and not in the core of the .NET frameworks.
The examples shown also rely on APIs that require FileInfos to be created, and are themselves not a pinnacle of low-allocation work, starting with the IEnumerable right there, and the use of LINQ ☺ The first one can be rewritten without allocating FileInfos, like this:
The second example, filtering for extensions, is a poor implementation as given. If you want to match on extensions, you can let Directory do the work:
The last one addresses all three items from the list of pros on that document. I checked both of those, like this, and not a single FileInfo was created:
As the doc points out, recursive directory scanning will be more expensive on Unix, no matter what you do, as you need to probe each file listed that you get back from the OS. But if we move the scanning to unmanaged code, no FileInfos will ever be produced, nor stats surfaced on Unix. I just looked at the implementation in CoreFX (which we just took for Mono) and it looks allocation happy by default, something that the original Mono version did not do. Mono used to do the work in unmanaged code, and surfaced a properly-sized array, while the new implementation creates a Some questions: |
@migueldeicaza did the questions you mention get cut off? Great feedback BTW |
We won't be getting rid of the existing APIs. This is not intended for general day-to-day use- we'll keep the "simple" APIs as the primary entry points. It's meant to allow advanced users such as MSBuild to do custom filtering, caching, and transforms without writing their own native implementations.
Yes, but you trade in extra string allocations and an expensive normalization call for every entry. Same for the second example. I kept these initial two relatively simple to illustrate usage. I also did not fully optimize the third example to keep it conceptually simpler (e.g. I wouldn’t use Linq in reality). I’ll add some comments.
For the CoreFX implementation are you looking at what is in master? Master was changed recently to use an internal implementation that looks like this proposal (for Windows). Even with the changes, our APIs that return arrays currently use IEnumerable internally. I've opened an issue about potentially optimizing those scenarios. (https://github.com/dotnet/corefx/issues/25863) |
Note: This is based on the existing internal Windows implementation, see the CoreFX master branch for some portion of this in action. A branch implemented to spec (for Windows only) can be found here: https://github.com/JeremyKuhne/corefx/tree/findpublic. |
By the way, merging doesn't mean we're ignoring your feedback. It just means we've accepted the business goal and are committed to ship the feature. The design will likely evolve, especially after an API review. So the discussion is hardly over! |
PowerShell, specifically the FileSystemProvider, would be a customer for this too. PowerShell today has perf issues with enumerating files recursively and getting the It would be good to make sure that these new APIs will work well with that core scenario. @SteveL-MSFT.
This is a proposal to provide a performant extensibility point for file enumeration in System.IO.
This is based on the existing internal Windows implementation, see the CoreFX master branch for some portion of this in action. A branch implemented to spec (for Windows only) can be found here: https://github.com/JeremyKuhne/corefx/tree/findpublic
@terrajobst, @pjanotti, @danmosemsft