Introduce StreamDone in bulk loader #4297

ashish-goswami · 2019-11-20T10:51:18Z

Depends on #4200

During reduce phase, we insert predicate in sorted order to Badger instances. Each predicate has one underlying stream in stream writer and each stream has one underlying table builder. Each table builder has at least 1MB of buffer with it. Previously we had no way to close a stream(hence underlying table builder was in memory until flush is called on steam writer). With stream done support in BadgerV2.0, we can closed stream once we are done writing a predicate.

This change can help us reduce memory usage when we have large number of predicates.
I did benchmarking by creating RDF file using below program:

package main

import (
	"fmt"
	"os"
)

var (
	pc = 10000
	rc = 10000
)

func main() {
  filename := fmt.Sprintf("file-%d-%d", pc, rc)
	f, err := os.Create(filename)
	if err != nil {
		fmt.Println("unable to open file ", err)
		return
	}
	for i := 0; i < pc; i++ {
		for j := 0; j < rc; j++ {
			a := fmt.Sprintf("_:id%d <pred%d> \"val\" .", i, j)
			f.Write([]byte(a))
			f.Write([]byte("\n"))
		}
	}
	f.Sync()
	f.Close()
}

Above program creates a RDF file with 10000 predicates and each predicate has 10000 records for it(Total 100M records). I ran bulk loader using below command with empty schema.

dgraph_master bulk -f file-10000-10000.rdf.gz -s empty.schema

System configuration: 16 core, 64GB RAM

Here are my observations:

Current Master:
Map phase peak memory usage(RSS): ~6GB
Reduce phase peak memory usage(RSS): ~20GB
Time to complete: 4min31sec

This PR:
Map phase peak memory usage(RSS): ~6GB
Reduce phase peak memory usage(RSS): ~7.2GB
Time to complete: 4min16sec

This change is

animesh2049

Reviewable status: 0 of 53 files reviewed, 2 unresolved discussions (waiting on @animesh2049, @ashish-goswami, @mangalaman93, @manishrjain, and @martinmr)

a discussion (no related file):

dgraph/cmd/bulk/reduce.go, line 168 at r1 (raw file):

	}

	addDone := func(l *bpb.KVList, doneSteams []uint32) *bpb.KVList {

Any specific reason for having a return value? this function can silently modify l and return.

martinmr

but I'll let somebody else give final approval.

Reviewed 53 of 53 files at r1.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @animesh2049, @ashish-goswami, @mangalaman93, and @manishrjain)

dgraph/cmd/bulk/reduce.go, line 168 at r1 (raw file):

Previously, animesh2049 (Animesh Chandra Pathak) wrote…

Any specific reason for having a return value? this function can silently modify l and return.

Agree. Also, note that if you make the change, it's usually a convention to have l be the last argument (since it's kind of acting as an output).

ashish-goswami

Reviewable status: 52 of 53 files reviewed, 2 unresolved discussions (waiting on @animesh2049, @ashish-goswami, @mangalaman93, @manishrjain, and @martinmr)

dgraph/cmd/bulk/reduce.go, line 168 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

Agree. Also, note that if you make the change, it's usually a convention to have l be the last argument (since it's kind of acting as an output).

Done.

mangalaman93

Reviewable status: 0 of 1 files reviewed, 3 unresolved discussions (waiting on @animesh2049, @ashish-goswami, @manishrjain, and @martinmr)

dgraph/cmd/bulk/reduce.go, line 168 at r2 (raw file):

	// Once we have processed all records from single stream, we can mark that stream as done.
	// This will close underlying table builder in Badger for stream. Since we preallocate 1 MB
	// of memory for each table builder, this can result in memory save in case we have large

saving*

mangalaman93

Reviewed 1 of 9 files at r2.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @animesh2049, @ashish-goswami, and @manishrjain)

ashish-goswami

Reviewable status: 0 of 1 files reviewed, 3 unresolved discussions (waiting on @animesh2049, @ashish-goswami, @mangalaman93, and @manishrjain)

dgraph/cmd/bulk/reduce.go, line 168 at r2 (raw file):

Previously, mangalaman93 (Aman Mangal) wrote…

saving*

fixed

manishrjain

Reviewed 1 of 1 files at r3.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @animesh2049, @ashish-goswami, and @mangalaman93)

dgraph/cmd/bulk/reduce.go, line 173 at r3 (raw file):

	// write call. This can also be optimised if required.
	addDone := func(doneSteams []uint32, l *bpb.KVList) {
		for _, SID := range doneSteams {

streamId

dgraph/cmd/bulk/reduce.go, line 191 at r3 (raw file):

			for _, kv := range list.Kv {
				setStreamId(kv)
				updateDoneStreams(kv.StreamId)

inline the func logic.

ashish-goswami

Reviewable status: 0 of 1 files reviewed, 5 unresolved discussions (waiting on @animesh2049, @ashish-goswami, @mangalaman93, and @manishrjain)

dgraph/cmd/bulk/reduce.go, line 173 at r3 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

streamId

Done.

dgraph/cmd/bulk/reduce.go, line 191 at r3 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

inline the func logic.

Done.

ashish-goswami added 2 commits November 20, 2019 15:15

Import latest Badger and update import paths

ea0b04a

Introduce StreamDone

166a357

ashish-goswami requested review from martinmr, manishrjain, mangalaman93 and animesh2049 November 20, 2019 10:51

ashish-goswami requested a review from a team as a code owner November 20, 2019 10:51

animesh2049 approved these changes Nov 20, 2019

View reviewed changes

martinmr approved these changes Nov 21, 2019

View reviewed changes

Address review comments

9ab4976

ashish-goswami commented Nov 22, 2019

View reviewed changes

ashish-goswami added 2 commits November 25, 2019 16:29

Merge remote-tracking branch 'origin/master' into ashish/bulkloader

9f75bbc

Set Badger cache size to 1MB

c7a5961

mangalaman93 requested review from animesh2049 and martinmr November 25, 2019 12:06

mangalaman93 requested changes Nov 25, 2019

View reviewed changes

mangalaman93 reviewed Nov 25, 2019

View reviewed changes

Minor fix

e2eefa9

ashish-goswami commented Nov 25, 2019

View reviewed changes

manishrjain approved these changes Nov 25, 2019

View reviewed changes

Address review comments

e46f17e

ashish-goswami commented Nov 26, 2019

View reviewed changes

ashish-goswami merged commit ae679e6 into master Nov 28, 2019

ashish-goswami deleted the ashish/bulkloader branch November 28, 2019 11:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce StreamDone in bulk loader #4297

Introduce StreamDone in bulk loader #4297

ashish-goswami commented Nov 20, 2019 •

edited by manishrjain

Loading

animesh2049 left a comment

martinmr left a comment

ashish-goswami left a comment

mangalaman93 left a comment

mangalaman93 left a comment

ashish-goswami left a comment

manishrjain left a comment

ashish-goswami left a comment

Introduce StreamDone in bulk loader #4297

Introduce StreamDone in bulk loader #4297

Conversation

ashish-goswami commented Nov 20, 2019 • edited by manishrjain Loading

animesh2049 left a comment

Choose a reason for hiding this comment

martinmr left a comment

Choose a reason for hiding this comment

ashish-goswami left a comment

Choose a reason for hiding this comment

mangalaman93 left a comment

Choose a reason for hiding this comment

mangalaman93 left a comment

Choose a reason for hiding this comment

ashish-goswami left a comment

Choose a reason for hiding this comment

manishrjain left a comment

Choose a reason for hiding this comment

ashish-goswami left a comment

Choose a reason for hiding this comment

ashish-goswami commented Nov 20, 2019 •

edited by manishrjain

Loading