Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify mkdir_p (and make it faster in some cases) #60

Merged
merged 2 commits into from
Oct 14, 2021

Conversation

deivid-rodriguez
Copy link
Contributor

There's two commits here:

  • The first one simplifies building the list of paths that need to be created. Previously the exit condition for the until loop was that File.dirname(path) == path (root path), but that's unnecessary. It's enough to exit the loop once we find a directory that already exists while traversing the path upwards.

  • The second one removes an optimization that doesn't seem to affect much the case it's meant to optimize (creating a folder that doesn't exist already under a folder that already exists), but does affect other use cases (for example, when the target folder already exists). To measure the effect of this commit, I copied the method with the optimization removed as FileUtils.mkidr_p_new, and run the following micro benchmark:

require "benchmark/ips"

Benchmark.ips do |x|
  x.report("old mkdir_p - exists") do
    FileUtils.mkdir_p "/tmp"
  end

  x.report("new_mkdir_p - exists") do
    FileUtils.mkdir_p_new "/tmp"
  end

  x.compare!
end

FileUtils.rm_rf "/tmp/foo"

Benchmark.ips do |x|
  x.report("old mkdir_p - doesnt exist, parent exists") do
    FileUtils.mkdir_p "/tmp/foo"
    FileUtils.rm_rf "/tmp/foo"
  end

  x.report("new_mkdir_p - doesnt exist, parent exists") do
    FileUtils.mkdir_p_new "/tmp/foo"
    FileUtils.rm_rf "/tmp/foo"
  end

  x.compare!
end

Benchmark.ips do |x|
  x.report("old mkdir_p - doesnt exist, parent either") do
    FileUtils.mkdir_p "/tmp/foo/bar"
    FileUtils.rm_rf "/tmp/foo"
  end

  x.report("new_mkdir_p - doesnt exist, parent either") do
    FileUtils.mkdir_p_new "/tmp/foo/bar"
    FileUtils.rm_rf "/tmp/foo"
  end

  x.compare!
end

Benchmark.ips do |x|
  x.report("old mkdir_p - more levels") do
    FileUtils.mkdir_p "/tmp/foo/bar/baz"
    FileUtils.rm_rf "/tmp/foo"
  end

  x.report("new_mkdir_p - more levels") do
    FileUtils.mkdir_p_new "/tmp/foo/bar/baz"
    FileUtils.rm_rf "/tmp/foo"
  end

  x.compare!
end

Results were:

$ ruby -Ilib -rfileutils bench.rb 
Warming up --------------------------------------
old mkdir_p - exists    15.914k i/100ms
new_mkdir_p - exists    46.512k i/100ms
Calculating -------------------------------------
old mkdir_p - exists    161.461k (± 3.2%) i/s -    811.614k in   5.032315s
new_mkdir_p - exists    468.192k (± 2.9%) i/s -      2.372M in   5.071225s

Comparison:
new_mkdir_p - exists:   468192.1 i/s
old mkdir_p - exists:   161461.0 i/s - 2.90x  (± 0.00) slower

Warming up --------------------------------------
old mkdir_p - doesnt exist, parent exists
                         2.142k i/100ms
new_mkdir_p - doesnt exist, parent exists
                         1.961k i/100ms
Calculating -------------------------------------
old mkdir_p - doesnt exist, parent exists
                         21.242k (± 6.7%) i/s -    107.100k in   5.069206s
new_mkdir_p - doesnt exist, parent exists
                         19.682k (± 4.2%) i/s -    100.011k in   5.091961s

Comparison:
old mkdir_p - doesnt exist, parent exists:    21241.7 i/s
new_mkdir_p - doesnt exist, parent exists:    19681.7 i/s - same-ish: difference falls within error

Warming up --------------------------------------
old mkdir_p - doesnt exist, parent either
                       945.000  i/100ms
new_mkdir_p - doesnt exist, parent either
                         1.002k i/100ms
Calculating -------------------------------------
old mkdir_p - doesnt exist, parent either
                          9.689k (± 4.4%) i/s -     49.140k in   5.084342s
new_mkdir_p - doesnt exist, parent either
                         10.806k (± 4.6%) i/s -     54.108k in   5.020714s

Comparison:
new_mkdir_p - doesnt exist, parent either:    10806.3 i/s
old mkdir_p - doesnt exist, parent either:     9689.3 i/s - 1.12x  (± 0.00) slower

Warming up --------------------------------------
old mkdir_p - more levels
                       702.000  i/100ms
new_mkdir_p - more levels
                       775.000  i/100ms
Calculating -------------------------------------
old mkdir_p - more levels
                          7.046k (± 3.5%) i/s -     35.802k in   5.087548s
new_mkdir_p - more levels
                          7.685k (± 5.5%) i/s -     38.750k in   5.061351s

Comparison:
new_mkdir_p - more levels:     7685.1 i/s
old mkdir_p - more levels:     7046.4 i/s - same-ish: difference falls within error

Doing it this way is simpler and it doesn't end up adding "/" to the
list of folders, so it doesn't need to be removed later.
I think it's debatable which is the most common usage of
`FileUtils.mkdir_p`, but even assuming the most common use case is
creating a folder when it doesn't previously exist but the parent does,
this optimization doesn't seem to have a noticiable effect there while
harming other use cases.

For benchmarks, I created this script

```ruby
require "benchmark/ips"

Benchmark.ips do |x|
  x.report("old mkdir_p - exists") do
    FileUtils.mkdir_p "/tmp"
  end

  x.report("new_mkdir_p - exists") do
    FileUtils.mkdir_p_new "/tmp"
  end

  x.compare!
end

FileUtils.rm_rf "/tmp/foo"

Benchmark.ips do |x|
  x.report("old mkdir_p - doesnt exist, parent exists") do
    FileUtils.mkdir_p "/tmp/foo"
    FileUtils.rm_rf "/tmp/foo"
  end

  x.report("new_mkdir_p - doesnt exist, parent exists") do
    FileUtils.mkdir_p_new "/tmp/foo"
    FileUtils.rm_rf "/tmp/foo"
  end

  x.compare!
end

Benchmark.ips do |x|
  x.report("old mkdir_p - doesnt exist, parent either") do
    FileUtils.mkdir_p "/tmp/foo/bar"
    FileUtils.rm_rf "/tmp/foo"
  end

  x.report("new_mkdir_p - doesnt exist, parent either") do
    FileUtils.mkdir_p_new "/tmp/foo/bar"
    FileUtils.rm_rf "/tmp/foo"
  end

  x.compare!
end

Benchmark.ips do |x|
  x.report("old mkdir_p - more levels") do
    FileUtils.mkdir_p "/tmp/foo/bar/baz"
    FileUtils.rm_rf "/tmp/foo"
  end

  x.report("new_mkdir_p - more levels") do
    FileUtils.mkdir_p_new "/tmp/foo/bar/baz"
    FileUtils.rm_rf "/tmp/foo"
  end

  x.compare!
end
```

and copied the method with the "optimization" removed as
`FileUtils.mkdir_p_new`. The results are as below:

```
Warming up --------------------------------------
old mkdir_p - exists    15.914k i/100ms
new_mkdir_p - exists    46.512k i/100ms
Calculating -------------------------------------
old mkdir_p - exists    161.461k (± 3.2%) i/s -    811.614k in   5.032315s
new_mkdir_p - exists    468.192k (± 2.9%) i/s -      2.372M in   5.071225s

Comparison:
new_mkdir_p - exists:   468192.1 i/s
old mkdir_p - exists:   161461.0 i/s - 2.90x  (± 0.00) slower

Warming up --------------------------------------
old mkdir_p - doesnt exist, parent exists
                         2.142k i/100ms
new_mkdir_p - doesnt exist, parent exists
                         1.961k i/100ms
Calculating -------------------------------------
old mkdir_p - doesnt exist, parent exists
                         21.242k (± 6.7%) i/s -    107.100k in   5.069206s
new_mkdir_p - doesnt exist, parent exists
                         19.682k (± 4.2%) i/s -    100.011k in   5.091961s

Comparison:
old mkdir_p - doesnt exist, parent exists:    21241.7 i/s
new_mkdir_p - doesnt exist, parent exists:    19681.7 i/s - same-ish: difference falls within error

Warming up --------------------------------------
old mkdir_p - doesnt exist, parent either
                       945.000  i/100ms
new_mkdir_p - doesnt exist, parent either
                         1.002k i/100ms
Calculating -------------------------------------
old mkdir_p - doesnt exist, parent either
                          9.689k (± 4.4%) i/s -     49.140k in   5.084342s
new_mkdir_p - doesnt exist, parent either
                         10.806k (± 4.6%) i/s -     54.108k in   5.020714s

Comparison:
new_mkdir_p - doesnt exist, parent either:    10806.3 i/s
old mkdir_p - doesnt exist, parent either:     9689.3 i/s - 1.12x  (± 0.00) slower

Warming up --------------------------------------
old mkdir_p - more levels
                       702.000  i/100ms
new_mkdir_p - more levels
                       775.000  i/100ms
Calculating -------------------------------------
old mkdir_p - more levels
                          7.046k (± 3.5%) i/s -     35.802k in   5.087548s
new_mkdir_p - more levels
                          7.685k (± 5.5%) i/s -     38.750k in   5.061351s

Comparison:
new_mkdir_p - more levels:     7685.1 i/s
old mkdir_p - more levels:     7046.4 i/s - same-ish: difference falls within error
```

I think it's better to keep the code simpler is the optimization is not
so clear like in this case.
@deivid-rodriguez
Copy link
Contributor Author

I also wanted to remove all code using remove_trailing_slash which seems unnecessary too but I found the motivation at 1dc822f (it's necessary to support NetBSD/Alpha 1.6.1).

@deivid-rodriguez
Copy link
Contributor Author

Hei @jeremyevans, I noticed that the code I'm modifying in the first commit here is relatively new. I think my solution doesn't change your idea in #53, just makes the code simpler. Let me know what you think, in case you have some time. Thanks!

Copy link
Contributor

@jeremyevans jeremyevans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this approach. It makes the code simpler and shouldn't change the behavior, creating only the minimum number of directories necessary.

@hsbt hsbt merged commit 8a6de48 into ruby:master Oct 14, 2021
@deivid-rodriguez deivid-rodriguez deleted the speed_up_mkdir_p branch October 14, 2021 08:09
@deivid-rodriguez
Copy link
Contributor Author

Thanks!

@mame
Copy link
Member

mame commented Jul 27, 2022

@deivid-rodriguez Actually it changed the behavior unfortunately. https://bugs.ruby-lang.org/issues/18941 Could you check it out?

@deivid-rodriguez
Copy link
Contributor Author

It seems some Windows specific issue, right? I don't have much time right now but I will try to have a look!

@deivid-rodriguez
Copy link
Contributor Author

Oh, wait, I see the problem now. Let me fix it!

@deivid-rodriguez
Copy link
Contributor Author

So, I don't have a Windows machine right now to try this but this is my guess.

Intuitively I think we could restore the previous behavior by checking File.dirname(path) == path as the condition to terminate the loop, as the previous comment suggested:

 until path == stack.last   # dirname("/")=="/", dirname("C:/")=="C:/"

However, that doesn't work for me:

irb(main):003:0> File.dirname("A:")
=> "."

Maybe that's because I'm not on Windows, but this check in Pathname sources used to detect being on Windows suggests that it doesn't work there either: https://github.com/ruby/pathname/blob/cd0d25aa70b4bfa2d6b42f33db8af29a6e75bf0c/lib/pathname.rb#L38-L42.

So I'm not sure, I guess we need to try it on Windows.

But in any case, I think reverting e842a0e would also fix the issue because it would always eagerly try to create the directory and generate the expected error in this case.

@deivid-rodriguez
Copy link
Contributor Author

@mame #100 should fix it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants