Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate directory enumeration performance #8396

Closed
KirillOsenkov opened this issue Feb 3, 2023 · 10 comments
Closed

Investigate directory enumeration performance #8396

KirillOsenkov opened this issue Feb 3, 2023 · 10 comments
Labels

Comments

@KirillOsenkov
Copy link
Member

I noticed we're using standard Directory.EnumerateFiles() to enumerate files for globs. It's not very efficient, and also runs the risks of throwing when it hits directories or files it can't access.

Sample first-chance exception:

System.UnauthorizedAccessException: Access to the path 'C:\Documents and Settings' is denied.
   at void System.IO.__Error.WinIOError(int errorCode, string maybeFullPath)
   at void System.IO.FileSystemEnumerableIterator<TSource>.CommonInit()
   at new System.IO.FileSystemEnumerableIterator<TSource>(string path, string originalUserPath, string searchPattern, SearchOption searchOption, SearchResultHandler<TSource> resultHandler, bool checkHost)
   at IEnumerable<string> System.IO.Directory.EnumerateFiles(string path, string searchPattern, SearchOption searchOption)
   at IEnumerable<string> Microsoft.Build.Shared.FileSystem.ManagedFileSystem.EnumerateFiles(string path, string searchPattern, SearchOption searchOption)
   at IEnumerable<string> Microsoft.Build.Shared.FileSystem.MSBuildOnWindowsFileSystem.EnumerateFiles(string path, string searchPattern, SearchOption searchOption)
   at IEnumerable<string> Microsoft.Build.Shared.FileSystem.CachingFileSystemWrapper.EnumerateFiles(string path, string searchPattern, SearchOption searchOption)
   at IReadOnlyList<string> Microsoft.Build.Shared.FileMatcher.GetAccessibleFiles(IFileSystem fileSystem, string path, string filespec, string projectDirectory, bool stripProjectDirectory)
   at IReadOnlyList<string> Microsoft.Build.Shared.FileMatcher.GetAccessibleFileSystemEntries(IFileSystem fileSystem, FileSystemEntity entityType, string path, string pattern, string projectDirectory, bool stripProjectDirectory)
   at Microsoft.Build.Shared.FileMatcher(IFileSystem fileSystem, ConcurrentDictionary<string, IReadOnlyList<string>> fileEntryExpansionCache)+(FileSystemEntity entityType, string path, string pattern, string projectDirectory, bool stripProjectDirectory) => { } x 2
   at TValue System.Collections.Concurrent.ConcurrentDictionary<TKey, TValue>.GetOrAdd(TKey key, Func<TKey, TValue> valueFactory)
   at Microsoft.Build.Shared.FileMatcher(IFileSystem fileSystem, ConcurrentDictionary<string, IReadOnlyList<string>> fileEntryExpansionCache)+(FileSystemEntity type, string path, string pattern, string directory, bool stripProjectDirectory) => { }
   at IEnumerable<string> Microsoft.Build.Shared.FileMatcher.GetFilesForStep(RecursiveStepResult stepResult, RecursionState recursionState, string projectDirectory, bool stripProjectDirectory)
   at void Microsoft.Build.Shared.FileMatcher.GetFilesRecursive(ConcurrentStack<List<string>> listOfFiles, RecursionState recursionState, string projectDirectory, bool stripProjectDirectory, IList<RecursionState> searchesToExclude, Dictionary<string, List<RecursionState>> searchesToExcludeInSubdirs, TaskOptions taskOptions)
   at void Microsoft.Build.Shared.FileMatcher.GetFilesRecursive(ConcurrentStack<List<string>> listOfFiles, RecursionState recursionState, string projectDirectory, bool stripProjectDirectory, IList<RecursionState> searchesToExclude, Dictionary<string, List<RecursionState>> searchesToExcludeInSubdirs, TaskOptions taskOptions)+(string subdir) => { }
   at ParallelLoopResult System.Threading.Tasks.Parallel.ForEachWorker<TSource, TLocal>(IEnumerable<TSource> source, ParallelOptions parallelOptions, Action<TSource> body, Action<TSource, ParallelLoopState> bodyWithState, Action<TSource, ParallelLoopState, long> bodyWithStateAndIndex, Func<TSource, ParallelLoopState, TLocal, TLocal> bodyWithStateAndLocal, Func<TSource, ParallelLoopState, long, TLocal, TLocal> bodyWithEverything, Func<TLocal> localInit, Action<TLocal> localFinally)+(int i) => { }
   at ParallelLoopResult System.Threading.Tasks.Parallel.ForWorker<TLocal>(int fromInclusive, int toExclusive, ParallelOptions parallelOptions, Action<int> body, Action<int, ParallelLoopState> bodyWithState, Func<int, ParallelLoopState, TLocal, TLocal> bodyWithLocal, Func<TLocal> localInit, Action<TLocal> localFinally)+() => { }
   at void System.Threading.Tasks.Task.InnerInvokeWithArg(Task childTask)
   at void System.Threading.Tasks.Task.ExecuteSelfReplicating(Task root)+() => { }
   at void System.Threading.Tasks.Task.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
   at bool System.Threading.ThreadPoolWorkQueue.Dispatch()
   at bool System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()

Also seeing the same for C:\Config.Msi when we accidentally enumerate the whole drive due to some property being empty and the glob ends up starting with a \.

I've had success with directly calling the Win32 API in parallel to reduce allocations, achieving up to 2x speed and 0.5x allocations:
https://github.com/KirillOsenkov/Benchmarks/blob/8556f92c07b9a3d211a7e72b776c324aff7e24b7/src/Tests/DirectoryEnumeration.cs#L12-L15

Also it seems that this approach doesn't run into exceptions when trying to access inaccessible directories, unlike the BCL one.

Feel free to experiment with this benchmark, steal the source, try on real-world builds, see if you can tune it further, submit PRs if you can make it even faster ;)

The first place I would try this is in FileMatcher (see the stack above). Also, looking at the stack, I'd measure getting rid of the ConcurrentDictionary and try a simple collection with a lock around it. I often get much better results with a simple lock around simple collections.

I'm noticing we do have a ManagedFileSystem abstraction, so I guess we can try replacing the implementation in a single place and see if it can make our builds faster wholesale.

One potential concern is that the parallelism in the new method does a lot of thrashing, so not sure how this performs on an HDD. But then again, do we care about HDDs anymore?

@KirillOsenkov
Copy link
Member Author

@davkean points out there's a high perf directory enumeration logic here:
https://www.fuget.org/packages/Microsoft.IO.Redist/6.0.0/lib/net472/Microsoft.IO.Redist.dll/Microsoft.IO.Enumeration/FileSystemEnumerator%601

@danmoseley
Copy link
Member

@JeremyKuhne

@JeremyKuhne
Copy link
Member

@KirillOsenkov feel free to hit me up to talk about this. I specifically had MSBuild in mind when I wrote this and provided a .NET Framework build.

@davkean
Copy link
Member

davkean commented Feb 3, 2023

@JeremyKuhne I told him the same thing internally :)

@rainersigwald
Copy link
Member

As @ladipro pointed out internally we should be using it on at least some codepaths: #6771. And indeed I do see Microsoft.IO.Redist used for a trivial <I Include="C:\**\*.txt" />--though it does throw the first-chance exception mentioned.

System.UnauthorizedAccessException
  HResult=0x80070005
  Message=Access to the path 'C:\Config.Msi' is denied.
  Source=Microsoft.IO.Redist
  StackTrace:
   at Microsoft.IO.Enumeration.FileSystemEnumerator`1.CreateDirectoryHandle(String path, Boolean ignoreNotFound) in /_/src/libraries/System.Private.CoreLib/src/System/IO/Enumeration/FileSystemEnumerator.Windows.cs:line 196
 	Microsoft.IO.Redist.dll!Microsoft.IO.Enumeration.FileSystemEnumerator<System.__Canon>.CreateDirectoryHandle(string path, bool ignoreNotFound) Line 193	C#
 	Microsoft.IO.Redist.dll!Microsoft.IO.Enumeration.FileSystemEnumerator<System.__Canon>.Init() Line 50	C#
 	Microsoft.IO.Redist.dll!Microsoft.IO.Enumeration.FileSystemEnumerable<string>.FileSystemEnumerable(string directory, Microsoft.IO.Enumeration.FileSystemEnumerable<string>.FindTransform transform, Microsoft.IO.EnumerationOptions options, bool isNormalized) Line 38	C#
 	Microsoft.IO.Redist.dll!Microsoft.IO.Enumeration.FileSystemEnumerableFactory.UserFiles(string directory, string expression, Microsoft.IO.EnumerationOptions options) Line 124	C#
 	Microsoft.IO.Redist.dll!Microsoft.IO.Directory.InternalEnumeratePaths(string path, string searchPattern, Microsoft.IO.SearchTarget searchTarget, Microsoft.IO.EnumerationOptions options) Line 181	C#
>	Microsoft.Build.dll!Microsoft.Build.Shared.FileSystem.ManagedFileSystem.EnumerateFiles.AnonymousMethod__10_0(string path, string searchPattern, Microsoft.IO.SearchOption searchOption) Line 88	C#

@KirillOsenkov can you give some details on what you were doing when you hit that exception? What version of MSBuild? Using MSBuild.exe or dotnet build or VS or something else? Evaluation time or during execution? And so on.

@rainersigwald
Copy link
Member

I think we could simplify our GetAccessibleFiles and related methods

/// <summary>
/// Same as Directory.EnumerateFiles(...) except that files that
/// aren't accessible are skipped instead of throwing an exception.
///
/// Other exceptions are passed through.
/// </summary>

Using MS.IO.Redist's EnumerationOptions.IgnoreInaccessible (when available).

@danmoseley
Copy link
Member

I see a bunch of string creating (Substring for example) that could be avoided with wiring spans through this part of the code.

And eventually perhaps this will allow some of the custom globbing to be removed
dotnet/runtime#21362

@KirillOsenkov
Copy link
Member Author

I did add Microsoft.IO.Redist to my benchmark and it is indeed even faster than my handcrafted approach! Kudos to Jeremy!

https://github.com/KirillOsenkov/Benchmarks/blob/f2c45821c2cf7243b040d2c1db5904bab8134cf8/src/Tests/DirectoryEnumeration.cs#L12-L16

I did follow up with the original stack that I'd pasted here, and it's from 2021 😱 My apologies. Most of this issue is now invalid as we have transitioned to Microsoft.IO.Redist!

Remaining issues:

  • Pass IgnoreInaccessible
  • investigate smaller perf issues as indicated by Dan in the previous reply

I won't be offended if we close this issue outright or mark it as low priority ;)

Apologies I should have checked the MSBuild version before filing the issue.

@JeremyKuhne
Copy link
Member

I had figured that MSBuild could, in theory, translate it's \**\ logic into a customized enumerator. I'll also note that I made a raw entry point into error handling if you need to concoct something fancier for the errors: ContinueOnError.

@KirillOsenkov KirillOsenkov closed this as not planned Won't fix, can't repro, duplicate, stale Jul 13, 2023
@AR-May AR-May added the triaged label Feb 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants