Does parallelism actually improve performance when doing file-system traversal? #2472
-
I recently built a command line tool called erdtree which does parallel file-system traversal. Initially I came in with the thought that parallelism would help: even though the disk processes user-space requests serially, saturating the disk's queue with requests would improve throughput, as it could process things in aggregate without needing to wait for the next request.

Recently, however, after performing some crude benchmarks, I've found that increasing the thread count in my program actually hurts performance. Worth noting that I'm on an 8-core machine as I write this.

It is also worth noting that computing disk usages for directories requires knowledge of the file-system hierarchy, and that information is lost when doing parallel traversal. This means I have to do some calculations after the traversal step to reconstruct the tree back on the main thread, but that shouldn't be the bottleneck.

All in all it could be something to do with my code, but any insight on why parallelism would be beneficial in file-system traversal would be much appreciated.
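The reconstruction step described above (recovering per-directory disk usage after a parallel traversal has produced results in arbitrary order) can be sketched like this. This is a hypothetical helper, not erdtree's actual code: it assumes the traversal yields a flat list of (file path, size) pairs and rolls each file's size up into every ancestor directory.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Given a flat, unordered list of (file path, size) pairs, as a
/// parallel traversal might produce, compute the total size of every
/// directory by adding each file's size to all of its ancestors.
fn directory_sizes(files: &[(PathBuf, u64)]) -> HashMap<PathBuf, u64> {
    let mut sizes: HashMap<PathBuf, u64> = HashMap::new();
    for (path, size) in files {
        let mut dir = path.parent();
        while let Some(d) = dir {
            *sizes.entry(d.to_path_buf()).or_insert(0) += size;
            dir = d.parent();
        }
    }
    sizes
}

fn main() {
    // Hypothetical traversal output, deliberately out of order.
    let files = vec![
        (PathBuf::from("root/a/b/y.txt"), 5),
        (PathBuf::from("root/z.txt"), 1),
        (PathBuf::from("root/a/x.txt"), 10),
    ];
    let sizes = directory_sizes(&files);
    assert_eq!(sizes[Path::new("root")], 16);
    assert_eq!(sizes[Path::new("root/a")], 15);
    assert_eq!(sizes[Path::new("root/a/b")], 5);
    println!("ok");
}
```

This post-processing is linear in the number of files times tree depth, which supports the claim that it is unlikely to be the bottleneck.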
-
I'm somewhat confused here. Your narrative suggests that your program gets slower as the thread count increases, but your data suggests something more interesting than that. If I'm reading what you've posted correctly, your run with 1 thread takes ~1.4s, a run with 4 threads takes 0.8s and a run with 8 threads takes 1s. So both 4 and 8 threads are faster than 1 thread, but 4 threads is faster than 8 threads. To me, that suggests parallelism helps quite a bit. Your run with 4 threads gives you an almost 2x improvement! But let's see what it looks like in the context of ripgrep:
So I actually get a pretty similar result as you. It looks like things are optimal with 4 threads, and then they steadily get slower as the thread count is increased. This is perhaps a manifestation of something like Amdahl's law.

Basically, think about it by analogizing it with human interaction. Is it easier to plan a gathering among 4 of your friends or 32 of them? With 4 friends, the cost of communicating amongst each other is often very low, and it's usually possible to find a day that works for all of them. But with 32? Forget about it. You'll be going back and forth between people as it's likely the entire group does not know each other. You'll likely need to sacrifice something, say, by picking a date that doesn't work for everyone, because coordinating between so many people to select a date that works for everyone is so difficult. It's much the same thing with parallelism. If you put too many cooks in the kitchen, then the very act of communicating with all of them starts to have an appreciable effect on bottom-line performance. But as you can see, some parallelism helps.

Ah... But I did a little trick above to simplify the task. I passed
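The "optimal at 4 threads, slower after" shape can be sketched with a toy model: Amdahl's law by itself only caps the speedup, so to get an actual slowdown you need a coordination cost that grows with the thread count (in the spirit of the Universal Scalability Law). All constants below are illustrative, not measured from ripgrep or erdtree:

```rust
/// Toy scaling model: normalized run time on `n` threads, combining
/// Amdahl's law with a coordination cost that grows with `n`.
/// The constants are purely illustrative.
fn model_time(n: f64) -> f64 {
    let p = 0.95; // fraction of the work that parallelizes
    let c = 0.01; // per-thread coordination overhead
    (1.0 - p) + p / n + c * n
}

fn main() {
    let counts = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0];
    for n in counts {
        println!("{:2} threads -> {:.3} time units", n as u32, model_time(n));
    }
    // With these constants the optimum is an intermediate thread count:
    // adding threads first helps (the p/n term shrinks), then the
    // coordination term c*n dominates and run time climbs again.
    let best = counts
        .into_iter()
        .min_by(|a, b| model_time(*a).partial_cmp(&model_time(*b)).unwrap())
        .unwrap();
    assert!(model_time(best) < model_time(1.0));
    assert!(model_time(best) < model_time(32.0));
    println!("fastest at {} threads", best);
}
```

Different values of `p` and `c` move the optimum around, which is one way to read the benchmarks above: the optimal thread count depends on how much sequential work each thread gets relative to coordination overhead.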
In this case, 8 threads is faster than 4 threads! Things don't really start getting slower until you get to 16 threads. Why? Because the unit of sequential work per thread in
One thing worth calling out here is that all of my above commands were run with the directory tree clearly cached into memory. ripgrep really isn't causing any kind of disk accesses here. Now if you're specifically looking to benchmark the case where the directory tree is cold and its metadata needs to actually be read from disk, then that is really an entirely separate thing. Parallelism may well still help there, but its effects and bottom-line differences will likely differ from the in-memory case. That sort of benchmark is not really in my wheelhouse. I kind of just take the perspective that if you're waiting for disk, things are likely to be slow anyway. (Although that is increasingly becoming false, with the rise of PCIe SSDs that can do 6GB/s reads.)

Anywho, I'm less sure of how all this applies to your specific use case, but hopefully this analysis will help you. Thanks for the good question!
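The warm-cache effect mentioned above is easy to observe directly: time the same traversal twice in one process, and the second run reads metadata from the OS page/dentry cache rather than the disk. A minimal sketch, assuming a Unix-like system with such a cache (`walk` is a hypothetical sequential walker, not ripgrep's):

```rust
use std::fs;
use std::path::Path;
use std::time::Instant;

/// Sequentially count entries in a directory tree. Symlinks are not
/// followed, so cycles cannot cause infinite recursion.
fn walk(dir: &Path, count: &mut u64) {
    if let Ok(entries) = fs::read_dir(dir) {
        for entry in entries.flatten() {
            *count += 1;
            if entry.file_type().map(|t| t.is_dir()).unwrap_or(false) {
                walk(&entry.path(), count);
            }
        }
    }
}

fn main() {
    // On a tree that wasn't recently touched, the first run is
    // typically much slower than the second (cache-warmed) run.
    for run in 1..=2 {
        let mut count = 0;
        let start = Instant::now();
        walk(Path::new("."), &mut count);
        println!("run {run}: {count} entries in {:?}", start.elapsed());
    }
}
```

For a genuinely cold benchmark you would need to evict the cache between runs (on Linux, by writing to `/proc/sys/vm/drop_caches` as root), which is exactly why cold-cache results are a separate measurement from everything shown above.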