Does parallelism actually improve performance when doing file-system traversal? #2472
-
I recently built a command line tool called erdtree which does parallel file-system traversal. Initially I came in with the thought that parallelism would help: even though the disk processes user-space requests serially, saturating the disk's queue with requests would improve throughput, as it could process things in aggregate without needing to wait for the next request.

Recently, however, after performing some crude benchmarks, I've found that increasing the thread count in my program actually hurts performance. Worth noting that I'm on an 8-core machine as I write this.

It is also worth noting that computing disk usages for directories requires knowledge of the file-system hierarchy, and that information is lost when doing parallel traversal. This means I have to do some calculations after the traversal step to reconstruct the tree back on the main thread, but that shouldn't be the bottleneck.

All in all it could be something to do with my code, but any insight on why parallelism would be beneficial in file-system traversal would be much appreciated.
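The reconstruction step described above (recovering per-directory disk usage after a parallel traversal has produced results in arbitrary order) can be sketched like this. This is a hypothetical helper, not erdtree's actual code: it assumes the traversal yields a flat list of (file path, size) pairs and rolls each file's size up into every ancestor directory.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Given a flat, unordered list of (file path, size) pairs, as a
/// parallel traversal might produce, compute the total size of every
/// directory by adding each file's size to all of its ancestors.
fn directory_sizes(files: &[(PathBuf, u64)]) -> HashMap<PathBuf, u64> {
    let mut sizes: HashMap<PathBuf, u64> = HashMap::new();
    for (path, size) in files {
        let mut dir = path.parent();
        while let Some(d) = dir {
            *sizes.entry(d.to_path_buf()).or_insert(0) += size;
            dir = d.parent();
        }
    }
    sizes
}

fn main() {
    // Hypothetical traversal output, deliberately out of order.
    let files = vec![
        (PathBuf::from("root/a/b/y.txt"), 5),
        (PathBuf::from("root/z.txt"), 1),
        (PathBuf::from("root/a/x.txt"), 10),
    ];
    let sizes = directory_sizes(&files);
    assert_eq!(sizes[Path::new("root")], 16);
    assert_eq!(sizes[Path::new("root/a")], 15);
    assert_eq!(sizes[Path::new("root/a/b")], 5);
    println!("ok");
}
```

This post-processing is linear in the number of files times tree depth, which supports the claim that it is unlikely to be the bottleneck.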
-
I'm somewhat confused here. Your narrative suggests that your program gets slower as the thread count increases, but your data suggests something more interesting than that. If I'm reading what you've posted correctly, your run with 1 thread takes ~1.4s, a run with 4 threads takes 0.8s and a run with 8 threads takes 1s. So both 4 and 8 threads are faster than 1 thread, but 4 threads is faster than 8 threads. To me, that suggests parallelism helps quite a bit. Your run with 4 threads gives you an almost 2x improvement! But let's see what it looks like in the context of ripgrep:
So I actually get a pretty similar result as you. It looks like things are optimal with 4 threads, and then they steadily get slower as the thread count is increased. This is perhaps a manifestation of something like Amdahl's law.

Basically, think about it by analogizing it with human interaction. Is it easier to plan a gathering among 4 of your friends or 32 of them? With 4 friends, the cost of communicating amongst each other is often very low, and it's usually possible to find a day that works for all of them. But with 32? Forget about it. You'll be going back and forth between people as it's likely the entire group does not know each other. You'll likely need to sacrifice something, say, by picking a date that doesn't work for everyone, because coordinating between so many people to select a date that works for everyone is so difficult. It's much the same thing with parallelism. If you put too many cooks in the kitchen, then the very act of communicating with all of them starts to have an appreciable effect on bottom-line performance. But as you can see, some parallelism helps.

Ah... But I did a little trick above to simplify the task. I passed
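The "optimal at 4 threads, slower after" shape can be sketched with a toy model: Amdahl's law by itself only caps the speedup, so to get an actual slowdown you need a coordination cost that grows with the thread count (in the spirit of the Universal Scalability Law). All constants below are illustrative, not measured from ripgrep or erdtree:

```rust
/// Toy scaling model: normalized run time on `n` threads, combining
/// Amdahl's law with a coordination cost that grows with `n`.
/// The constants are purely illustrative.
fn model_time(n: f64) -> f64 {
    let p = 0.95; // fraction of the work that parallelizes
    let c = 0.01; // per-thread coordination overhead
    (1.0 - p) + p / n + c * n
}

fn main() {
    let counts = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0];
    for n in counts {
        println!("{:2} threads -> {:.3} time units", n as u32, model_time(n));
    }
    // With these constants the optimum is an intermediate thread count:
    // adding threads first helps (the p/n term shrinks), then the
    // coordination term c*n dominates and run time climbs again.
    let best = counts
        .into_iter()
        .min_by(|a, b| model_time(*a).partial_cmp(&model_time(*b)).unwrap())
        .unwrap();
    assert!(model_time(best) < model_time(1.0));
    assert!(model_time(best) < model_time(32.0));
    println!("fastest at {} threads", best);
}
```

Different values of `p` and `c` move the optimum around, which is one way to read the benchmarks above: the optimal thread count depends on how much sequential work each thread gets relative to coordination overhead.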
In this case, 8 threads is faster than 4 threads! Things don't really start getting slower until you get to 16 threads. Why? Because the unit of sequential work per thread in
One thing worth calling out here is that all of my above commands were run with the directory tree clearly cached into memory. ripgrep really isn't causing any kind of disk accesses here. Now if you're specifically looking to benchmark the case where the directory tree is cold and its metadata needs to actually be read from disk, then that is really an entirely separate thing. Parallelism may well still help there, but its effects and bottom-line differences will likely differ from the in-memory case. That sort of benchmark is not really in my wheelhouse. I kind of just take the perspective that if you're waiting for disk, things are likely to be slow anyway. (Although that is increasingly becoming false, with the rise of PCIe SSDs that can do 6GB/s reads.)

Anywho, I'm less sure of how all this applies to your specific use case, but hopefully this analysis will help you. Thanks for the good question!
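The warm-cache effect mentioned above is easy to observe directly: time the same traversal twice in one process, and the second run reads metadata from the OS page/dentry cache rather than the disk. A minimal sketch, assuming a Unix-like system with such a cache (`walk` is a hypothetical sequential walker, not ripgrep's):

```rust
use std::fs;
use std::path::Path;
use std::time::Instant;

/// Sequentially count entries in a directory tree. Symlinks are not
/// followed, so cycles cannot cause infinite recursion.
fn walk(dir: &Path, count: &mut u64) {
    if let Ok(entries) = fs::read_dir(dir) {
        for entry in entries.flatten() {
            *count += 1;
            if entry.file_type().map(|t| t.is_dir()).unwrap_or(false) {
                walk(&entry.path(), count);
            }
        }
    }
}

fn main() {
    // On a tree that wasn't recently touched, the first run is
    // typically much slower than the second (cache-warmed) run.
    for run in 1..=2 {
        let mut count = 0;
        let start = Instant::now();
        walk(Path::new("."), &mut count);
        println!("run {run}: {count} entries in {:?}", start.elapsed());
    }
}
```

For a genuinely cold benchmark you would need to evict the cache between runs (on Linux, by writing to `/proc/sys/vm/drop_caches` as root), which is exactly why cold-cache results are a separate measurement from everything shown above.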