Initial draft of file enumeration design doc. #24

Merged
merged 10 commits into from
Dec 13, 2017
248 changes: 248 additions & 0 deletions accepted/file-enumeration.md
@@ -0,0 +1,248 @@
# Extensible File Enumeration

**PM** [Immo Landwerth](https://github.com/terrajobst) | **Dev** [Jeremy Kuhne](https://github.com/jeremykuhne)

Enumerating files in .NET provides limited configurability. You can specify a simple DOS-style pattern and whether or not to look recursively. More complicated filtering requires post-filtering all results, which can introduce a significant performance drain.

Recursive enumeration is also problematic in that there is no way to handle error states such as access issues or cycles created by links.
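
To make the post-filtering cost concrete, here is a minimal sketch that uses only the existing `Directory.EnumerateFiles` API. The class name, method name, and the size-based filter are illustrative assumptions, not part of the proposal.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PostFilterSketch
{
    // Returns the full paths of "*.log" files larger than minBytes.
    // The name pattern is the only filter applied during enumeration;
    // the size check has to run afterwards, once per returned path.
    public static List<string> LargeLogs(string root, long minBytes)
    {
        var results = new List<string>();

        // With AllDirectories, a subdirectory that denies access can surface
        // as an exception from this loop, and the caller cannot skip it.
        foreach (string path in Directory.EnumerateFiles(root, "*.log", SearchOption.AllDirectories))
        {
            // A path string and a FileInfo are allocated for every candidate,
            // including the candidates that are then discarded.
            var info = new FileInfo(path);
            if (info.Length > minBytes)
            {
                results.Add(path);
            }
        }

        return results;
    }
}
```

Every rejected file still costs a string, a `FileInfo`, and a second trip to the file system for its length, which is the overhead the proposed predicate-based design aims to avoid.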

These restrictions have a significant impact on file-system-intensive applications, a key example being MSBuild. This document proposes a new set of primitive file and directory traversal APIs that provide more flexibility while keeping overhead to a minimum, so that enumeration becomes both more powerful and more performant.

## Scenarios and User Experience
Member @terrajobst, Dec 4, 2017:

Your scenarios are way too short. You want to make those headings and show some sample code consuming the APIs you're proposing. Scenarios are meant to illustrate the value your APIs are adding. They shouldn't be longer than a few paragraphs, but they also shouldn't be that abstract.


1. MSBuild can custom filter filesystem entries with limited allocations and form the results in any desired format.
2. Users can build custom enumerations utilizing completely custom or provided commonly used filters and transforms.

## Requirements
Member @terrajobst, Dec 4, 2017:

This section is intended to be empty; the requirements you want to have should go under goals and the ones you want to scope out under non-goals.



### Goals
Member @terrajobst, Dec 5, 2017:

The primary goal seems to be:

  • Provide an API shape that is as expressive as the current API shape while also reducing the number of allocations to a minimum. We're willing to compromise some usability, but the complexity should be about the same as many of the other Span<T>-based APIs.



1. Custom filtering based on common file system data
   - Name
   - Attributes
   - Time stamps
   - File size
2. Result transforms can be of any desired output type
   - Like LINQ Select(), but keeps FileData on the stack
3. API minimizes allocations
4. API is a cross-platform abstraction
5. We provide common filters and transforms
   - To file/directory name
   - To full path
   - To File/Directory/FileSystemInfo
   - DOS-style filters (legacy `*/?` with DOS semantics, e.g. `*.` is all files without an extension)
   - Simple Regex filter
   - Simpler globbing (`*/?` without DOS-style variants)
   - Set of extensions (`*.c`, `*.cpp`, `*.cc`, `*.cxx`, etc.)
6. Recursive behavior is configurable
   - On/Off
   - Predicate based on FileData
7. Can avoid throwing access denied exceptions
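
As an illustration of the "set of extensions" filter listed above, the following sketch checks a file name held in a `ReadOnlySpan<char>` against several extensions without creating a string for the name. The helper name and its signature are assumptions made for this example; the proposal does not define them.

```csharp
using System;

public static class ExtensionFilterSketch
{
    // Hypothetical helper: returns true when the name ends with one of the
    // given extensions (each written with its leading dot), ignoring case.
    // The name stays a span, so no string is allocated for rejected entries.
    public static bool HasAnyExtension(ReadOnlySpan<char> name, string[] extensions)
    {
        foreach (string extension in extensions)
        {
            if (name.EndsWith(extension.AsSpan(), StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }
        }

        return false;
    }
}
```

A predicate built this way could pass the extension array as the enumeration state, so the delegate itself would not need to capture anything.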

### Non-Goals
Member @terrajobst, Dec 5, 2017:

One non-goal to call out is that we don't intend to replace the existing IO APIs -- these are meant to be advanced APIs for folks that really have to care about performance.


1. API will not expose platform-specific data
2. Error handling configuration is fully customizable
Member @terrajobst, Dec 4, 2017:

That needs more detail, as a goal says error handling is configurable while a non-goal says error handling configuration is fully customizable. You need to say enough that the reader can draw a line in their head of what's in and what's out.

Member Author:

Got it


## Design

### Proposed API surface

``` C#
namespace System.IO
{
    /// <summary>
    /// Delegate for filtering out find results.
    /// </summary>
    public delegate bool FindPredicate<TState>(ref RawFindData findData, TState state);

    /// <summary>
    /// Delegate for transforming raw find data into a result.
    /// </summary>
    public delegate TResult FindTransform<TResult, TState>(ref RawFindData findData, TState state);

    [Flags]
    public enum FindOptions
    {
        None = 0x0,

        // Enumerate subdirectories
        Recurse = 0x1,

        // Skip files/directories when access is denied
        IgnoreAccessDenied = 0x2,

        // Future: Add flags for tracking cycles, etc.
    }

    public class FindEnumerable<TResult, TState> : CriticalFinalizerObject, IEnumerable<TResult>, IEnumerator<TResult>
    {
        public FindEnumerable(
            string directory,
            FindTransform<TResult, TState> transform,
            FindPredicate<TState> predicate,
            // Only used if FindOptions.Recurse is set. Default is to always recurse.
            FindPredicate<TState> recursePredicate = null,
            TState state = default,
            FindOptions options = FindOptions.None);
    }

    public static class Directory
    {
        public static IEnumerable<TResult> Enumerate<TResult, TState>(
            string path,
            FindTransform<TResult, TState> transform,
            FindPredicate<TState> predicate,
            FindPredicate<TState> recursePredicate = null,
            TState state = default,
            FindOptions options = FindOptions.None);
    }

    public class DirectoryInfo
    {
        // Instance method: enumerates the directory this DirectoryInfo represents,
        // so no path parameter is needed.
        public IEnumerable<TResult> Enumerate<TResult, TState>(
            FindTransform<TResult, TState> transform,
            FindPredicate<TState> predicate,
            FindPredicate<TState> recursePredicate = null,
            TState state = default,
            FindOptions options = FindOptions.None);
    }

    /// <summary>
    /// Used for processing and filtering find results.
    /// </summary>
    public ref struct RawFindData
    {
        // This will have private members that hold the native data and
        // will lazily fill in data for properties where such data is not
        // immediately available in the current platform's native results.

        // The full path to the directory the current result is in
        public string Directory { get; }

        // The full path to the starting directory for enumeration
        public string OriginalDirectory { get; }

        // The path to the starting directory as passed to the enumerable constructor
        public string OriginalUserDirectory { get; }

        // Note: using a span allows us to reduce unneeded allocations
        public ReadOnlySpan<char> FileName { get; }
        public FileAttributes Attributes { get; }
        public long Length { get; }

        public DateTime CreationTimeUtc { get; }
        public DateTime LastAccessTimeUtc { get; }
        public DateTime LastWriteTimeUtc { get; }
    }
}
```
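
The calling pattern of the surface above (a transform, a predicate, and a state value passed in up front) can be tried out with a small stand-in. This is only a sketch under stated assumptions: `EntrySketch`, `PredicateSketch`, `TransformSketch`, and `EnumerateSketch` are hypothetical names, and the stand-in is layered on the existing `Directory.EnumerateFileSystemEntries`, so it shows the shape of the API but none of its allocation savings.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for the proposed RawFindData. It wraps a path string
// instead of native find data, so it does not reduce allocations.
public readonly ref struct EntrySketch
{
    private readonly string _fullPath;

    public EntrySketch(string fullPath) { _fullPath = fullPath; }

    public ReadOnlySpan<char> FileName => Path.GetFileName(_fullPath.AsSpan());
    public string FullPath => _fullPath;
}

public delegate bool PredicateSketch<TState>(ref EntrySketch entry, TState state);
public delegate TResult TransformSketch<TResult, TState>(ref EntrySketch entry, TState state);

public static class EnumerateSketch
{
    public static IEnumerable<TResult> Enumerate<TResult, TState>(
        string directory,
        TransformSketch<TResult, TState> transform,
        PredicateSketch<TState> predicate,
        TState state = default)
    {
        foreach (string path in Directory.EnumerateFileSystemEntries(directory))
        {
            if (TryProcess(path, transform, predicate, state, out TResult result))
            {
                yield return result;
            }
        }
    }

    // The ref struct is created and consumed inside this non-iterator helper,
    // so it never has to survive across a yield.
    private static bool TryProcess<TResult, TState>(
        string path,
        TransformSketch<TResult, TState> transform,
        PredicateSketch<TState> predicate,
        TState state,
        out TResult result)
    {
        var entry = new EntrySketch(path);
        if (predicate(ref entry, state))
        {
            result = transform(ref entry, state);
            return true;
        }

        result = default;
        return false;
    }

    // Usage: the extension travels as state, so neither lambda captures a local.
    public static IEnumerable<string> EntriesWithExtension(string directory, string extension)
    {
        return Enumerate<string, string>(
            directory,
            (ref EntrySketch entry, string ext) => entry.FullPath,
            (ref EntrySketch entry, string ext) =>
                entry.FileName.EndsWith(ext.AsSpan(), StringComparison.OrdinalIgnoreCase),
            state: extension);
    }
}
```

Passing the state explicitly is what lets both delegates stay capture-free; a lambda that closed over `extension` would need a closure object for each call to `EntriesWithExtension`.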

### Transforms & Predicates

We'll provide common predicates and transforms for building searches.

``` C#
namespace System.IO
{
    internal static partial class FindPredicates
    {
        internal static bool NotDotOrDotDot(ref RawFindData findData)
        internal static bool IsDirectory(ref RawFindData findData)
    }

    public static partial class FindTransforms
    {
        public static DirectoryInfo AsDirectoryInfo(ref RawFindData findData)
        public static FileInfo AsFileInfo(ref RawFindData findData)
        public static FileSystemInfo AsFileSystemInfo(ref RawFindData findData)
        public static string AsFileName(ref RawFindData findData)
        public static string AsFullPath(ref RawFindData findData)
    }
}
```
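
Native directory listings typically include the special `.` and `..` entries, which is why a predicate like `NotDotOrDotDot` is useful. A possible body for it is sketched below. As an assumption for this example, the check takes the file name span directly rather than `ref RawFindData`, since that type exists only in this proposal.

```csharp
using System;

public static class SpecialNameSketch
{
    // Returns false only for the names "." and "..".
    // Works on the span directly, so no string is allocated for the name.
    public static bool NotDotOrDotDot(ReadOnlySpan<char> fileName)
    {
        bool isDot = fileName.Length == 1 && fileName[0] == '.';
        bool isDotDot = fileName.Length == 2 && fileName[0] == '.' && fileName[1] == '.';
        return !(isDot || isDotDot);
    }
}
```

Names that merely start with a dot, such as `.gitignore`, pass the check because only the exact one- and two-character names are rejected.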

### DosMatcher
Member:

I assume you would have a RegexMatcher as well.

Member Author:

Ideally. We might have to have less than ideal perf to start with if we don't get the Span overloads on Regex at first.


We currently have an implementation of the algorithm used for matching files on Windows in FileSystemWatcher. Providing this publicly will allow matching names consistently across platforms according to Windows rules, if such behavior is desired.

``` C#
namespace System.IO
{
    public static class DosMatcher
    {
        /// <summary>
        /// Change '*' and '?' to '&lt;', '&gt;' and '"' to match Win32 behavior. For compatibility, Windows
        /// changes some wildcards to provide a closer match to historical DOS 8.3 filename matching.
        /// </summary>
        public unsafe static string TranslateExpression(string expression)

        /// <summary>
        /// Return true if the given expression matches the given name.
        /// </summary>
        public unsafe static bool MatchPattern(string expression, ReadOnlySpan<char> name, bool ignoreCase = true)
    }
}
```
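
`DosMatcher` is a proposed type, so it cannot be called as written. To show the translate-then-match flow in runnable form, the sketch below relies on `System.IO.Enumeration.FileSystemName`, which later .NET versions (.NET Core 2.1 and up) provide with a comparable pair of methods. Treating it as a substitute for `DosMatcher` is an assumption of this example, and the wrapper class name is hypothetical.

```csharp
using System;
using System.IO.Enumeration;

public static class DosMatchSketch
{
    // Translates a user-supplied pattern to its Win32 form, then matches a
    // file name against it, ignoring case.
    public static bool Matches(string pattern, string name)
    {
        // When matching many names, translate once and reuse the result.
        string translated = FileSystemName.TranslateWin32Expression(pattern);

        return FileSystemName.MatchesWin32Expression(
            translated.AsSpan(),
            name.AsSpan(),
            ignoreCase: true);
    }
}
```

Because the name is accepted as a span, the same call fits inside a predicate that only has the entry's `FileName` span available.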

### Samples
Member @terrajobst, Dec 4, 2017:

I'd take both the DosMatcher as well as this sample and make it part of a scenario.

Member Author:

Ok


Getting the full path of all files matching a given name pattern (close to what FindFiles does, but returning the full path):

``` C#
public static FindEnumerable<string, string> GetFiles(string directory,
    string expression = "*",
    bool recursive = false)
{
    return new FindEnumerable<string, string>(
        directory,
        (ref RawFindData findData, string expr) => FindTransforms.AsFullPath(ref findData),
        (ref RawFindData findData, string expr) =>
        {
            return !FindPredicates.IsDirectory(ref findData)
                && DosMatcher.MatchPattern(expr, findData.FileName, ignoreCase: true);
        },
        state: DosMatcher.TranslateExpression(expression),
        options: recursive ? FindOptions.Recurse : FindOptions.None);
}
```

Member:

This sample has a heavy cognitive load:

  • Tuples
  • ref
  • Two delegates, likely wrapping other BCL API, which IntelliSense won't help find
  • concept of passing in a state at the start of an enumeration

It's the ultimate API, but it is likely intimidating for someone who simply wants to look for files matching a regex, without caring about every drop of performance.

Member Author:

For sure. Doing this doesn't preclude us adding simpler overloads, and I would actually expect to eventually.

Member:

I think it would be helpful to propose those here.

### Existing API summary

``` C#
namespace System.IO
{
    public static class Directory
    {
        public static IEnumerable<string> EnumerateDirectories(string path, string searchPattern, SearchOption searchOption);
        public static IEnumerable<string> EnumerateFiles(string path, string searchPattern, SearchOption searchOption);
        public static IEnumerable<string> EnumerateFileSystemEntries(string path, string searchPattern, SearchOption searchOption);
        public static string[] GetDirectories(string path, string searchPattern, SearchOption searchOption);
        public static string[] GetFiles(string path, string searchPattern, SearchOption searchOption);
        public static string[] GetFileSystemEntries(string path, string searchPattern, SearchOption searchOption);
    }

    public sealed class DirectoryInfo : FileSystemInfo
    {
        public IEnumerable<DirectoryInfo> EnumerateDirectories(string searchPattern, SearchOption searchOption);
        public IEnumerable<FileInfo> EnumerateFiles(string searchPattern, SearchOption searchOption);
        public IEnumerable<FileSystemInfo> EnumerateFileSystemInfos(string searchPattern, SearchOption searchOption);
        public DirectoryInfo[] GetDirectories(string searchPattern, SearchOption searchOption);
        public FileInfo[] GetFiles(string searchPattern, SearchOption searchOption);
        public FileSystemInfo[] GetFileSystemInfos(string searchPattern, SearchOption searchOption);
    }

    public enum SearchOption
    {
        AllDirectories,
        TopDirectoryOnly
    }
}
```


## Q & A