Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fast add stuff #2039

Merged
merged 13 commits into from
Dec 8, 2015
Merged

Fast add stuff #2039

merged 13 commits into from
Dec 8, 2015

Conversation

whyrusleeping
Copy link
Member

This is based on the flushing disable PR i put up the other day since I needed some changes from there.

This makes ipfs add use the mfs code under the hood to make adds a bit faster.

License: MIT
Signed-off-by: Jeromy <[email protected]>
License: MIT
Signed-off-by: Jeromy <[email protected]>
License: MIT
Signed-off-by: Jeromy <[email protected]>
License: MIT
Signed-off-by: Jeromy <[email protected]>
@jbenet jbenet added the status/in-progress In progress label Dec 5, 2015
added QmaowqjedBkUrMUXgzt9c2ZnAJncM9jpJtkFfgdFstGr5a planets/.charon.txt
added QmU4zFD5eJtRBsWC63AvpozM9Atiadg9kPVTuTrnCYJiNF planets/.pluto.txt
added QmZy3khu7qf696i5HtkgL2NotsCZ8wzvNZJ1eUdA5n8KaV planets/mars.txt
added QmQnv4m3Q5512zgVtpbJ9z85osQrzZzGRn934AGh6iVEXz planets/venus.txt
added Qmf6rbs5GF85anDuoxpSAdtuZPM9D2Yt3HngzjUVSQ7kDV planets/.asteroids
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should sort the files, otherwise non-determinism.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are sorted on input, but yeah. probably makes sense to sort it in the tests

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the outputs are based on concurrent behavior then I guess we have to give up sorting in the output and that's probably ok.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, but theres no concurrency here at this point. Its pretty fast without it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah let's sort on tests. we should probably warn in ipfs add --help that the output order is not deterministic.

License: MIT
Signed-off-by: Jeromy <[email protected]>
@whyrusleeping
Copy link
Member Author

@jbenet could use some CR here. I'll squash and rework the commits while addressing CR

const notRecursiveFmtStr = "'%s' is a directory, use the '-%s' flag to specify directories"
const dirNotSupportedFmtStr = "Invalid path '%s', argument '%s' does not support directories"

func appendFile(fpath string, argDef *cmds.Argument, recursive bool) (files.File, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏 thank you very much for the simplification

License: MIT
Signed-off-by: Jeromy <[email protected]>
@@ -29,11 +29,6 @@ test_expect_success "'ipfs repo gc' succeeds" '
ipfs repo gc >gc_out_actual
'

test_expect_success "'ipfs repo gc' looks good (patch root)" '
PATCH_ROOT=QmQXirSbubiySKnqaFyfs5YzziXRB5JEVQVjU6xsd7innr &&
grep "removed $PATCH_ROOT" gc_out_actual
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this removed? not understanding why it no longer removes this root node

@jbenet
Copy link
Member

jbenet commented Dec 6, 2015

@whyrusleeping some comments above -- otherwise this all LGTM. fantastic improvement!



}
if file == nil {
break
}

node, err := params.AddFile(file)
err = params.addFile(file)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok so i think directories should add every file in a new goroutine (with rate limiting). i had started on this earlier, but aborted. the general idea was every dir -> files fan out uses goroutines, like:

ctx, cancel := context.WithCancel(params.ctx)
defer cancel()
// this context will cancel out all the children if we error out early.
// would need to wire the addFile funcs with a ctx. OR we _could_ cancel 
// the entire context of thea adder instead...

errs := make(chan error, 10) // some room to avoid waiting for errors
                // dont think we have access to read len(dir) though

for {
  file, err := dir.NextFile()
  if err != nil && err != io.EOF {
    return err // defer cancel unblocks other children
  }
  if file == nil {
    break // done with files
  }

  <-params.concurrency // adder-global parameter
  go func(file) {
    defer func() {
      params.concurrency<- struct{}{} // done
    }()

    err := params.addFile(ctx, file) // ctx cancel should unblock addFile

    // return even nil errors, so we force the parent to wait
    // until _all_ children are done during non-error ops.
    // but bail on ctx.Done.
    select {
    case errs<- err:
    case ctx.Done():
    }
  }(file)
}

for err := range errs {
  if err == nil {
    continue
  }
  if _, ok := err.(*hiddenFileError); ok {
    continue // hidden file error, skip file
  }
  // error!
  return err // defer cancel unblocks + terminates all children.
}

// all children done.
return nil

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is hard to do right, this code will deadlock when it encounters a directory deeper than your concurrency factor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could add something like:

if file.IsDir() {
    params.addFile(ctx, file)
} else {
   // concurrency thing
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i realized that after posting, we can fix that by bumping up the concurrency level one notch in dirs (i.e. semaup at the top of inside addDir, and defer semadown) -- better than this one o/ because this doesn't allow concurrent add of dirs-- and complicates the code by doing two different things (double the error handling, etc).

@jbenet
Copy link
Member

jbenet commented Dec 6, 2015

@whyrusleeping general comment: i think adding files should be concurrent because a single add should leverage as much concurrent io + hashing as possible (right now it sequentially trades off between io and hashing).

I think adding https://github.com/ipfs/go-ipfs/pull/2039/files#r46767388 will fix it

License: MIT
Signed-off-by: Jeromy <[email protected]>
License: MIT
Signed-off-by: Jeromy <[email protected]>
License: MIT
Signed-off-by: Jeromy <[email protected]>
@whyrusleeping
Copy link
Member Author

@jbenet would you be opposed to having the concurrency stuff done in a separate PR? This one is already getting pretty large

@jbenet
Copy link
Member

jbenet commented Dec 6, 2015

@whyrusleeping my thought is "it's almost there!" -- but sure, if it's easier for you

added QmQkib3f9XNX5sj6WEahLUPFpheTcwSRJwUCSvjcv8b9by _jo7
added Qme987pqNBhZZXy4ckeXiR7zaRQwBabB7fTgHurW2yJfNu 4r93
added QmVPwNy8pZegpsNmsjjZvdTQn4uCeuZgtzhgWhRSQWjK9x gnz66h
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still needs sorting?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or, i guess it doesn't matter what order they're in? maybe just restore the original order to make sure/

@whyrusleeping
Copy link
Member Author

good to merge?

@jbenet jbenet mentioned this pull request Dec 7, 2015
14 tasks
License: MIT
Signed-off-by: Jeromy <[email protected]>
License: MIT
Signed-off-by: Jeromy <[email protected]>
jbenet added a commit that referenced this pull request Dec 8, 2015
@jbenet jbenet merged commit fba5fca into dev0.4.0 Dec 8, 2015
@jbenet jbenet deleted the fast-add-stuff branch December 8, 2015 07:10
@rht
Copy link
Contributor

rht commented Dec 16, 2015

It sure is fast
outdata
(filesize 1KB each)

Memory growth:
memory

However, it is still possible to get to ~O(git) soon if intermediate root nodes are not created (even if the hashings are delayed in this PR) #1964 (comment).

@jbenet
Copy link
Member

jbenet commented Dec 16, 2015

great stats, thanks @rht.

would be awesome to get these auto-generated by running one script. could
add it under tests/bench/ or something

On Tue, Dec 15, 2015 at 10:28 PM rht [email protected] wrote:

It sure is fast
[image: outdata]
https://cloud.githubusercontent.com/assets/395821/11830563/f2d7ad10-a3d8-11e5-868b-f43bce3e4441.png
(filesize 1KB each)

Memory growth:
[image: memory]
https://cloud.githubusercontent.com/assets/395821/11830592/3a881550-a3d9-11e5-877d-a85d6b11f485.png

However, it is still possible to get to ~O(git) soon if intermediate root
nodes are not created (even if the hashings are delayed in this PR) #1964
(comment)
#1964 (comment).


Reply to this email directly or view it on GitHub
#2039 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants