Use StreamWriter in bulk loader #3542

manishrjain · 2019-06-07T22:51:35Z

This PR switches the bulk loader from transaction-based writes to Badger over to Badger's StreamWriter.

This PR also refactors bulk loader code as follows:

Simplify shuffler and reducer code and merge them into one, i.e. reducer.
Remove shuffler.go file.
Remove metrics.go file.
The channel-based heap merge was expensive. Replaced it with a simple map entries iterator.

With these changes, the 21M dataset now loads in 2 minutes, down from the original 3 minutes.


dgraph/cmd/bulk/loader.go

dgraph/cmd/bulk/count_index.go

dgraph/cmd/bulk/reduce.go

martinmr

Reviewed 4 of 8 files at r1, 20 of 21 files at r2.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @mangalaman93 and @manishrjain)

dgraph/cmd/bulk/reduce.go, line 47 at r2 (raw file):

func (r *reducer) run() error {
	shardDirs := shardDirs(r.opt.TmpDir)

Minor: the variable has the same name as the function, which makes this a bit confusing. Maybe the function could be named something like getShardDirs, or the variable could simply be named dirs.

dgraph/cmd/bulk/reduce.go, line 68 at r2 (raw file):

			writer := db.NewStreamWriter()
			if err := writer.Prepare(); err != nil {
				panic(err)

Why are you using panic here instead of x.Check, as in some other places?

dgraph/cmd/bulk/reduce.go, line 213 at r2 (raw file):

		keyChanged := !bytes.Equal(prevKey, me.Key)
		if keyChanged && plistLen > 0 {

I am assuming the keys are returned in order so when the key changes we are done with the count for this key. Is that right?

If so, maybe adding a small comment explaining this invariant would be helpful for future readers of the code.
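The invariant in question can be sketched as follows, assuming entries arrive sorted by key. This is an illustration only: `mapEntry`, `keyCount` and `countPerKey` are hypothetical names, and the real reducer does more than count.

```go
package main

import (
	"bytes"
	"fmt"
)

// mapEntry is a hypothetical stand-in for a map entry; only the key matters.
type mapEntry struct {
	Key []byte
}

// keyCount records how many entries were seen for one key.
type keyCount struct {
	Key   []byte
	Count int
}

// countPerKey relies on the input being sorted by key: all entries for a key
// are adjacent, so a key change means the previous key's count is final.
func countPerKey(sorted []mapEntry) []keyCount {
	var out []keyCount
	var prevKey []byte
	plistLen := 0
	for _, me := range sorted {
		keyChanged := !bytes.Equal(prevKey, me.Key)
		if keyChanged && plistLen > 0 {
			// Flush the finished key before starting on the new one.
			out = append(out, keyCount{Key: prevKey, Count: plistLen})
			plistLen = 0
		}
		prevKey = me.Key
		plistLen++
	}
	// The last key never sees a "key changed" event, so flush it here.
	if plistLen > 0 {
		out = append(out, keyCount{Key: prevKey, Count: plistLen})
	}
	return out
}

func main() {
	var entries []mapEntry
	for _, k := range []string{"a", "a", "b", "c", "c", "c"} {
		entries = append(entries, mapEntry{Key: []byte(k)})
	}
	for _, kc := range countPerKey(entries) {
		fmt.Printf("%s=%d\n", kc.Key, kc.Count)
	}
}
```

If the input were not sorted (for example a, b, a), the same key would be flushed more than once with partial counts, which is why the ordering assumption is worth a comment in the code.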

Bug fix for bulk loader changes introduced in #3542. Fixes #3607. Signed-off-by: பாலாஜி ஜின்னா <[email protected]>

This PR switched transaction based writes to Badger to StreamWriter and brings in Badger master into vendor.

This PR also refactors bulk loader code as follows:
- Simplify shuffler and reducer code and merge them into one, i.e. reducer.
- Remove shuffler.go file.
- Remove metrics.go file.
- The channel based heap merge was expensive. Switched that with a simple map entries iterator.

With these changes, the 21M dataset now takes 2 mins to load from the original 3 mins.

Changes:
* Simplified shuffler and reducer code. But, encountered an issue where key sorted order changes due to version append in Badger.
* Working code after StreamWriter integration.
* Vendor Badger in, because it contains fixes to StreamWriter.
* Fix build breakages caused by importing Badger.

manishrjain added 3 commits June 6, 2019 18:11

Simplified shuffler and reducer code. But, encountered an issue where key sorted order changes due to version append in Badger.

cd7a833

Working code after StreamWriter integration.

fe8acd2

Everything works with stream writer

2704d21

manishrjain requested a review from a team as a code owner June 7, 2019 22:51

golangcibot reviewed Jun 7, 2019

manishrjain added 7 commits June 7, 2019 15:59

Self review

ca81a07

Vendor Badger in, because it contains fixes to StreamWriter.

6402ee1

Remove defer

d57dedd

Merge branch 'master' into mrjn/bulk-stream-writer

7a10a63

Error wording fix

069095c

Fix build breakages caused by importing Badger.

c39a351

Add a TODO

c044d26

manishrjain requested review from martinmr and mangalaman93 June 7, 2019 23:49

martinmr suggested changes Jun 10, 2019

Address Martin's review

eb808a9

manishrjain merged commit d336e41 into master Jun 10, 2019

manishrjain deleted the mrjn/bulk-stream-writer branch June 10, 2019 23:36

danielmai mentioned this pull request Jun 27, 2019

Query error after bulk load #3607

Closed

danielmai pushed a commit that referenced this pull request Jul 15, 2019

Set UserMeta in bulk loader. (#3649)

85e0b45

Bug fix for bulk loader changes introduced in #3542. Fixes #3607. Signed-off-by: பாலாஜி ஜின்னா <[email protected]>

ashish-goswami mentioned this pull request Sep 27, 2019

Use Stream Writer of Badger in Bulk Loader #3463

Closed
