Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use StreamWriter in bulk loader #3542

Merged
merged 11 commits into from
Jun 10, 2019
Merged

Use StreamWriter in bulk loader #3542

merged 11 commits into from
Jun 10, 2019

Conversation

manishrjain
Copy link
Contributor

@manishrjain manishrjain commented Jun 7, 2019

This PR switched transaction based writes to Badger to StreamWriter.

This PR also refactors bulk loader code as follows:

  • Simplify shuffler and reducer code and merge them into one, i.e. reducer.
  • Remove shuffler.go file.
  • Remove metrics.go file.
  • The channel based heap merge was expensive. Switched that with a simple map entries iterator.

With these changes, the 21M dataset now takes 2 mins to load from the original 3 mins.


This change is Reviewable

@manishrjain manishrjain requested a review from a team as a code owner June 7, 2019 22:51
dgraph/cmd/bulk/loader.go Outdated Show resolved Hide resolved
dgraph/cmd/bulk/count_index.go Show resolved Hide resolved
dgraph/cmd/bulk/reduce.go Show resolved Hide resolved
dgraph/cmd/bulk/reduce.go Outdated Show resolved Hide resolved
dgraph/cmd/bulk/reduce.go Show resolved Hide resolved
dgraph/cmd/bulk/reduce.go Outdated Show resolved Hide resolved
Copy link
Contributor

@martinmr martinmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 8 files at r1, 20 of 21 files at r2.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @mangalaman93 and @manishrjain)


dgraph/cmd/bulk/reduce.go, line 47 at r2 (raw file):

func (r *reducer) run() error {
	shardDirs := shardDirs(r.opt.TmpDir)

minor: having the variable have the same name as the function makes this a bit confusing. Maybe the method can be named something like getShardDirs or the variable can be simply named dirs.


dgraph/cmd/bulk/reduce.go, line 68 at r2 (raw file):

			writer := db.NewStreamWriter()
			if err := writer.Prepare(); err != nil {
				panic(err)

why are you using panic here instead of x.Check like in some other places?


dgraph/cmd/bulk/reduce.go, line 213 at r2 (raw file):

		keyChanged := !bytes.Equal(prevKey, me.Key)
		if keyChanged && plistLen > 0 {

I am assuming the keys are returned in order so when the key changes we are done with the count for this key. Is that right?

If so, maybe adding a small comment explaining this invariant would be helpful for future readers of the code.

@manishrjain manishrjain merged commit d336e41 into master Jun 10, 2019
@manishrjain manishrjain deleted the mrjn/bulk-stream-writer branch June 10, 2019 23:36
danielmai pushed a commit that referenced this pull request Jul 15, 2019
Bug fix for bulk loader changes introduced in #3542.

Fixes #3607.

Signed-off-by: பாலாஜி ஜின்னா <[email protected]>
dna2github pushed a commit to dna2fork/dgraph that referenced this pull request Jul 19, 2019
This PR switched transaction based writes to Badger to StreamWriter and brings in Badger master into vendor.

This PR also refactors bulk loader code as follows:

- Simplify shuffler and reducer code and merge them into one, i.e. reducer.
- Remove shuffler.go file.
- Remove metrics.go file.
- The channel based heap merge was expensive. Switched that with a simple map entries iterator.

With these changes, the 21M dataset now takes 2 mins to load from the original 3 mins.


Changes:
* Simplified shuffler and reducer code. But, encountered an issue where key sorted order changes due to version append in Badger.
* Working code after StreamWriter integration.
* Vendor Badger in, because it contains fixes to StreamWriter.
* Fix build breakages caused by importing Badger.
dna2github pushed a commit to dna2fork/dgraph that referenced this pull request Jul 19, 2019
Bug fix for bulk loader changes introduced in dgraph-io#3542.

Fixes dgraph-io#3607.

Signed-off-by: பாலாஜி ஜின்னா <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants