fs.Dir.deleteTree: Optimize for non-deeply-nested directories #13073
Conversation
This allows for avoiding an unnecessary lseek (or equivalent) call in places where it can be known that the fd has not had its cursor modified yet.
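As a usage-level illustration, here is a hedged sketch (not code from this PR): since the handle was just opened by `openIterableDir`, its cursor is known to be at the start, so the initial reset/lseek that `iterate` would otherwise arrange for can be skipped by using the new `iterateAssumeFirstIteration` entry point.

```zig
const std = @import("std");

// Hedged sketch: count the entries of a freshly opened directory. Because we
// just opened the handle ourselves, the cursor has not moved, so
// iterateAssumeFirstIteration can skip the lseek that iterate would perform.
fn countEntries(parent: std.fs.Dir, sub_path: []const u8) !usize {
    var iterable = try parent.openIterableDir(sub_path, .{});
    defer iterable.close();

    var it = iterable.iterateAssumeFirstIteration();
    var count: usize = 0;
    while (try it.next()) |_| count += 1;
    return count;
}
```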
There are two parts to this:

1. The `deleteFile` call on the `sub_path` has been moved outside the loop, since if the first call fails with `IsDir` then it's very likely that all the subsequent calls will do the same. Instead, if the `openIterableDir` call ever hits `NotDir` after the `deleteFile` hit `IsDir`, then we assume that the tree was deleted at some point and can consider the `deleteTree` a success.
2. Inside the `dir_it.next()` loop, we look at `entry.kind` and only try the relevant operation (`deleteFile`/`openIterableDir`), but always fall back to the other if we get the relevant error (`NotDir`/`IsDir`).
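A hedged sketch of the per-entry dispatch in part 2 (simplified, with illustrative names; not the exact code from this PR):

```zig
const std = @import("std");

// Try the operation suggested by entry.kind, but fall back to the other one
// when the kind turns out to be stale (the entry may have been replaced by
// another process between readdir and unlink).
// Returns true if the entry still needs to be handled as a directory.
fn deleteEntrySketch(parent: std.fs.Dir, entry: std.fs.IterableDir.Entry) !bool {
    switch (entry.kind) {
        .Directory => return true, // descend via openIterableDir instead
        else => {
            parent.deleteFile(entry.name) catch |err| switch (err) {
                error.IsDir => return true, // it was a directory after all
                error.FileNotFound => return false, // someone else removed it
                else => |e| return e,
            };
            return false;
        },
    }
}
```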
`deleteTree` now uses a stack-allocated stack for the first 16 nested directories, and then falls back to the previous implementation (which only keeps 1 directory open at a time) when it runs out of room in its stack. This allows the function to perform as well as a recursive implementation for most use-cases without needing allocation or introducing the possibility of stack overflow.
```zig
    iter: IterableDir.Iterator,
};

var stack = std.BoundedArray(StackItem, 16){};
```
The `16` here is totally arbitrary; no idea what types of things should be taken into account when determining what this size should be.

Might be worth noting that `@sizeOf(StackItem)` is 8248 on Linux (all platforms will have similar sizes because of the `buf` field of `IterableDir.Iterator`), so `@sizeOf(std.BoundedArray(StackItem, 16))` ends up being 131976, or ~128.9KiB.
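For anyone who wants to reproduce these numbers, here's a hedged sketch; the `StackItem` fields are inferred from the diff fragments in this thread, and the exact sizes vary by platform and std version:

```zig
const std = @import("std");

// Fields inferred from the diff fragments shown in this PR.
const StackItem = struct {
    name: []const u8,
    parent_dir: std.fs.Dir,
    iter: std.fs.IterableDir.Iterator,
};

pub fn main() void {
    std.debug.print("@sizeOf(StackItem) = {d}\n", .{@sizeOf(StackItem)});
    std.debug.print("@sizeOf(BoundedArray(StackItem, 16)) = {d}\n", .{
        @sizeOf(std.BoundedArray(StackItem, 16)),
    });
}
```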
I think 128.9 KiB is significantly more than a standard library function should expect to be available on the stack. For example, the total thread stack size on alpine linux is 128 KiB: https://ariadne.space/2021/06/25/understanding-thread-stack-sizes-and-how-alpine-is-different/
`128.9KiB` sounds like a whopping lot for deleting files, when the default stack size can be below 328k (see docker-library/cassandra#192) for CI/desktop setups.

I don't think there is any good way to automatically use "more stack", so defaulting to minimal stack usage and providing tooling/docs on how to measure stack sizes for the targets (for common desktop stuff) sounds like the best option here.

This would also be nice to document as a comparison against other common `deleteTree` implementations: other implementations commonly keep up to 10 nesting levels of file descriptors open, at a performance cost, to prevent other processes from doing weird things with the tree currently being deleted. It would be nice to document that this function does not prevent that.
It seems to me that 8 KiB is quite large for the buffer used by `IterableDir.Iterator`. Perhaps that should be reduced?
What do you think a reasonable limit would be here? A `BoundedArray` size of 4 would be ~32KiB, for reference.

EDIT: Yeah, I am unsure why `buf` is so large for `IterableDir.Iterator`.

EDIT#2: Seems to be to allow for reading a lot of `dirent`'s per `getdents64` (or the equivalent) syscall.
> EDIT#2: Seems to be to allow for reading a lot of `dirent`'s per `getdents64` (or the equivalent) syscall.
Yes, that seems to be the idea, though my sense is that 1-2 KiB would be pretty much as effective as the current 8 in practice.
> my sense is that 1-2 KiB would be pretty much as effective as the current 8 in practice.
You appear to be correct. `1024` and `2048` are pretty much the same from my minimal testing on Linux, and `8192` only wins out on directories that have huge numbers of entries (e.g. 25,000 files in a single directory).

Benchmarks on the dir I've been using (41,209 files from `npx create-react-app my-app`):
```
Benchmark #1: ./walk-1024 my-app
  Time (mean ± σ):     29.5 ms ± 2.1 ms    [User: 4.5 ms, System: 24.8 ms]
  Range (min … max):   24.6 ms … 36.6 ms   94 runs

Benchmark #2: ./walk-2048 my-app
  Time (mean ± σ):     29.8 ms ± 1.9 ms    [User: 4.9 ms, System: 24.7 ms]
  Range (min … max):   24.9 ms … 34.5 ms   98 runs

Benchmark #3: ./walk-4096 my-app
  Time (mean ± σ):     30.2 ms ± 2.5 ms    [User: 4.8 ms, System: 25.2 ms]
  Range (min … max):   25.7 ms … 37.4 ms   107 runs

Benchmark #4: ./walk-8192 my-app
  Time (mean ± σ):     30.6 ms ± 2.6 ms    [User: 5.7 ms, System: 24.6 ms]
  Range (min … max):   26.7 ms … 41.1 ms   107 runs

Summary
  './walk-1024 my-app' ran
    1.01 ± 0.10 times faster than './walk-2048 my-app'
    1.03 ± 0.11 times faster than './walk-4096 my-app'
    1.04 ± 0.11 times faster than './walk-8192 my-app'
```
Benchmark on a single dir that contains 25,000 files in it:
```
Benchmark #1: ./walk-1024 dir-with-tons-of-entries
  Time (mean ± σ):     9.1 ms ± 2.0 ms    [User: 0.8 ms, System: 8.3 ms]
  Range (min … max):   5.5 ms … 13.2 ms   270 runs

Benchmark #2: ./walk-2048 dir-with-tons-of-entries
  Time (mean ± σ):     9.2 ms ± 2.0 ms    [User: 0.8 ms, System: 8.4 ms]
  Range (min … max):   5.4 ms … 12.5 ms   475 runs

Benchmark #3: ./walk-4096 dir-with-tons-of-entries
  Time (mean ± σ):     9.5 ms ± 1.8 ms    [User: 0.7 ms, System: 8.9 ms]
  Range (min … max):   5.4 ms … 12.7 ms   255 runs

Benchmark #4: ./walk-8192 dir-with-tons-of-entries
  Time (mean ± σ):     8.9 ms ± 2.1 ms    [User: 1.0 ms, System: 7.9 ms]
  Range (min … max):   5.3 ms … 13.0 ms   253 runs

Summary
  './walk-8192 dir-with-tons-of-entries' ran
    1.02 ± 0.34 times faster than './walk-1024 dir-with-tons-of-entries'
    1.04 ± 0.33 times faster than './walk-2048 dir-with-tons-of-entries'
    1.08 ± 0.33 times faster than './walk-4096 dir-with-tons-of-entries'
```
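For context, here's a hedged sketch of what a walk program along the lines of the `./walk-NNNN` binaries above might look like; the actual benchmark programs aren't shown in this thread, and the size variants would come from patching the iterator's `buf` length in `lib/std/fs.zig`:

```zig
const std = @import("std");

// Recursively visit every entry under the directory given as argv[1] and
// print the entry count. Built once per patched buf size for the benchmarks.
pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const args = try std.process.argsAlloc(allocator);
    defer std.process.argsFree(allocator, args);
    if (args.len < 2) return error.MissingDirArgument;

    var dir = try std.fs.cwd().openIterableDir(args[1], .{});
    defer dir.close();

    var walker = try dir.walk(allocator);
    defer walker.deinit();

    var count: usize = 0;
    while (try walker.next()) |_| count += 1;
    std.debug.print("{d} entries\n", .{count});
}
```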
EDIT: Went ahead and reduced `buf` to `1024`. This means that `@sizeOf(std.BoundedArray(StackItem, 16))` is 17288, or ~16.9KiB. Still unsure if that's too large or what a reasonable size to shoot for would be.
I think 16.9 KiB is quite reasonable as a default. Thanks for taking the time to benchmark this!
Very nice work on this PR; I think this is the right direction for the standard library to take!
The naming of the fallback method is not nice for targets with very limited stack size. I don't think we should be that greedy with the user's stack; let the user opt in instead, so usage isn't broken on Docker/low-stack-size platforms.

If recursion is necessary, please add a limit so we can at least detect other processes doing weird things, like moving or adding files into the tree being deleted. Other implementations guard against that (call it limited safety) at significant performance cost.
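A hedged sketch of the opt-in shape being discussed; the branch name later in this thread ("…deleteTree/deleteTreeMinStackSize") suggests the low-stack fallback is exposed under its own name, but the exact name and signature here are an assumption, not confirmed by this conversation:

```zig
const std = @import("std");

pub fn main() !void {
    // Default: bounded stack of open directories (fast path), falling back
    // automatically once nesting exceeds the fixed-size buffer.
    try std.fs.cwd().deleteTree("zig-cache");

    // Hypothetical opt-in variant for targets with very small thread stacks
    // (e.g. musl/Alpine defaults), keeping only one directory open at a time.
    try std.fs.cwd().deleteTreeMinStackSize("zig-cache");
}
```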
```
@@ -2035,54 +2076,215 @@ pub const Dir = struct {
    /// this function recursively removes its entries and then tries again.
    /// This operation is not atomic on most file systems.
```
missing: does not follow symlinks.
lib/std/fs.zig
```zig
top.parent_dir.deleteDir(top.name) catch |err| switch (err) {
    error.FileNotFound => {},
    error.DirNotEmpty => {
        // reset the iterator and try again
```
This should have a default and user-configurable upper bound on the number of deletion attempts. Otherwise, one cannot detect other buggy processes continuously inserting directories or files.
Could you open a new issue for this? I believe the problem you're describing exists with the current `master` implementation as well, so I'd like to consider it out of the scope of this PR as I'm not trying to alter functionality, just increase the speed.
This was sized large so that `getdents` (and other platforms' equivalents) could return large numbers of entries per syscall. However, some benchmarking indicates that the larger 8192 sizing doesn't actually lead to performance gains outside of edge cases with extremely large numbers of entries in a single directory (e.g. 25,000 files in one directory), and even then the gains are minimal ('./walk-8192 dir-with-tons-of-entries' ran 1.02 ± 0.34 times faster than './walk-1024 dir-with-tons-of-entries'). Note: Sizes 1024 and 2048 had similar performance characteristics, so the smaller of the two was chosen.
We don't control `sub_path`, so it may contain directory components; therefore, `NotDir` is a potential error when acting on `sub_path`.

Windows requires the directory handle to be closed before attempting to delete the directory, so now we do that and then re-open it if we need to retry (from getting `DirNotEmpty` when trying to delete).
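A hedged sketch of that close-then-retry dance; the structure and names are illustrative, not the exact PR code:

```zig
const std = @import("std");

// On Windows, deleteDir cannot succeed while we still hold an open handle to
// the directory, so any handle is closed before the delete and only re-opened
// when a retry is needed.
fn deleteDirRetrying(parent: std.fs.Dir, name: []const u8) !void {
    while (true) {
        parent.deleteDir(name) catch |err| switch (err) {
            error.DirNotEmpty => {
                // New entries appeared (or it wasn't emptied yet): re-open the
                // directory, clear it out again, close it, then retry.
                var iterable = try parent.openIterableDir(name, .{});
                // ... delete the remaining entries here (elided) ...
                iterable.close();
                continue;
            },
            error.FileNotFound => return, // already gone; treat as success
            else => |e| return e,
        };
        return;
    }
}
```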
…deleteTree/deleteTreeMinStackSize
A summary of the recent changes and some explanations:
Also did some basic benchmarking of the new implementation on Windows:
Thanks for the patch! This looks almost ready to merge, however, I believe I did find a regression.
```diff
@@ -301,7 +301,7 @@ pub const IterableDir = struct {
         .macos, .ios, .freebsd, .netbsd, .dragonfly, .openbsd, .solaris => struct {
             dir: Dir,
             seek: i64,
-            buf: [8192]u8, // TODO align(@alignOf(os.system.dirent)),
+            buf: [1024]u8, // TODO align(@alignOf(os.system.dirent)),
```
I believe these bufs need to be at least PATH_MAX. Consider if a directory contains a file with the longest possible name. In such a case, getdents would fail with EINVAL.

I suggest putting this code somewhere:

```zig
comptime assert(buf.len >= @sizeOf(os.system.dirent) + MAX_PATH_BYTES);
```
I believe the relevant max length here is the max filename length (i.e. an individual component) rather than `PATH_MAX`. From a cursory search, the filename max length is 255 on Linux and 256 on Windows, so the reduced `buf` size should (hopefully) not cause an issue.

I'll investigate this more to be sure though, as I'm planning on working on #8268 as well, which is directly related.
Great point - in this case I think it's fine to merge this as is.
does that mean lowering it to even 512 or 256 should be fine?
@nektro 256 would be a problem since `@sizeOf(os.system.dirent)` is > 256 on Linux. 512 would work but would need to be benchmarked to ensure it doesn't regress performance (see #13073 (comment)).

Thanks for doing this work!
From @andrewrk in #13070 (comment):
This PR implements this suggestion and (I think) accomplishes everything in the list. It is just as fast as the recursive implementation in #13070 for directories that stay within the fixed-size stack buffer, and only slows down (temporarily) for directories that exceed that level of nesting.
See #13048 for more context. Benchmarking is done with the same folder structure mentioned here:
Linux benchmarks:
(note: the test directory's nesting level stays within the fixed-size buffer)
Benchmarks against the recursive version from #13070 and `rm -rf`:

`strace` is identical to the recursive version as long as the nesting stays within the fixed-size buffer. And the syscall count doesn't increase too much if the fallback is only needed for a few things. The directory I'm testing with has 10 nested directories max; here's the `strace -c` after setting `deleteTree`'s `BoundedArray` size to 8:

and here it is with the `BoundedArray` size set to 4:

This also includes some optimizations for what is now the fallback implementation of `deleteTree`, in that it eliminates some redundant `deleteFile` calls from the implementation while not changing its functionality (see #13048 (comment) and the commit message in the list below for more info). This leads to a decent reduction in syscalls but a pretty insignificant decrease in wallclock time.

Before:

After:

(Note also that the `lseek` syscalls are gone; this is via the introduction of `IterableDir.iterateAssumeFirstIteration`, which the new `deleteFile` implementation also takes advantage of.)

Benchmark results:
This also adds `IterableDir.Iterator.reset`, which was helpful here to allow for resetting the iterator without needing to store the `IterableDir` in order to re-create a new iterator.

Closes #13048 if merged.
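A hedged usage sketch of `Iterator.reset` (illustrative only; the exact behavior is defined by the implementation in this PR):

```zig
const std = @import("std");

// Rewind an iterator in place instead of keeping the IterableDir around just
// to call iterate() again.
fn countEntriesTwice(iterable: std.fs.IterableDir) !usize {
    var it = iterable.iterate();
    var count: usize = 0;
    while (try it.next()) |_| count += 1;

    it.reset(); // start over from the beginning on the next call to next()
    while (try it.next()) |_| count += 1;
    return count;
}
```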