Introduce StreamDone in bulk loader #4297
Conversation
Reviewable status: 0 of 53 files reviewed, 2 unresolved discussions (waiting on @animesh2049, @ashish-goswami, @mangalaman93, @manishrjain, and @martinmr)
dgraph/cmd/bulk/reduce.go, line 168 at r1 (raw file):
}
addDone := func(l *bpb.KVList, doneSteams []uint32) *bpb.KVList {
Any specific reason for having a return value? This function can silently modify l and return.
but I'll let somebody else give final approval.
Reviewed 53 of 53 files at r1.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @animesh2049, @ashish-goswami, @mangalaman93, and @manishrjain)
dgraph/cmd/bulk/reduce.go, line 168 at r1 (raw file):
Previously, animesh2049 (Animesh Chandra Pathak) wrote…
Any specific reason for having a return value? This function can silently modify l and return.
Agree. Also, note that if you make the change, it's usually a convention to have l be the last argument (since it's kind of acting as an output).
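For illustration, a hedged sketch of the shape being suggested here (in-place mutation, output-like parameter last, no return value). The KV fields are assumptions based on Badger v2's pb package, not the PR's exact diff:

```go
// Assumed import: bpb "github.com/dgraph-io/badger/v2/pb"
addDone := func(doneStreams []uint32, l *bpb.KVList) {
	for _, streamID := range doneStreams {
		// A KV with StreamDone set tells the stream writer this stream is finished.
		l.Kv = append(l.Kv, &bpb.KV{StreamId: streamID, StreamDone: true})
	}
}
```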
Reviewable status: 52 of 53 files reviewed, 2 unresolved discussions (waiting on @animesh2049, @ashish-goswami, @mangalaman93, @manishrjain, and @martinmr)
dgraph/cmd/bulk/reduce.go, line 168 at r1 (raw file):
Previously, martinmr (Martin Martinez Rivera) wrote…
Agree. Also, note that if you make the change, it's usually a convention to have l be the last argument (since it's kind of acting as an output).
Done.
Reviewable status: 0 of 1 files reviewed, 3 unresolved discussions (waiting on @animesh2049, @ashish-goswami, @manishrjain, and @martinmr)
dgraph/cmd/bulk/reduce.go, line 168 at r2 (raw file):
// Once we have processed all records from single stream, we can mark that stream as done.
// This will close underlying table builder in Badger for stream. Since we preallocate 1 MB
// of memory for each table builder, this can result in memory save in case we have large
saving*
Reviewed 1 of 9 files at r2.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @animesh2049, @ashish-goswami, and @manishrjain)
Reviewable status: 0 of 1 files reviewed, 3 unresolved discussions (waiting on @animesh2049, @ashish-goswami, @mangalaman93, and @manishrjain)
dgraph/cmd/bulk/reduce.go, line 168 at r2 (raw file):
Previously, mangalaman93 (Aman Mangal) wrote…
saving*
fixed
Reviewed 1 of 1 files at r3.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @animesh2049, @ashish-goswami, and @mangalaman93)
dgraph/cmd/bulk/reduce.go, line 173 at r3 (raw file):
// write call. This can also be optimised if required.
addDone := func(doneSteams []uint32, l *bpb.KVList) {
	for _, SID := range doneSteams {
streamId
dgraph/cmd/bulk/reduce.go, line 191 at r3 (raw file):
for _, kv := range list.Kv {
	setStreamId(kv)
	updateDoneStreams(kv.StreamId)
inline the func logic.
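As an aside, one plausible inlined form of this loop, assuming updateDoneStreams simply records the previous stream as done whenever the stream ID changes between consecutive KVs (predicates arrive in sorted order); this is an illustration, not the PR's actual code:

```go
var prevStreamID uint32
var doneStreams []uint32
for _, kv := range list.Kv {
	setStreamId(kv) // assigns kv.StreamId based on the predicate, as in the PR
	// Predicates are sorted, so a new stream ID means the previous stream has
	// received all of its records and can be marked done.
	if prevStreamID != 0 && kv.StreamId != prevStreamID {
		doneStreams = append(doneStreams, prevStreamID)
	}
	prevStreamID = kv.StreamId
}
```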
Reviewable status: 0 of 1 files reviewed, 5 unresolved discussions (waiting on @animesh2049, @ashish-goswami, @mangalaman93, and @manishrjain)
dgraph/cmd/bulk/reduce.go, line 173 at r3 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
streamId
Done.
dgraph/cmd/bulk/reduce.go, line 191 at r3 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
inline the func logic.
Done.
Depends on #4200
During the reduce phase, we insert predicates in sorted order into Badger instances. Each predicate has one underlying stream in the stream writer, and each stream has one underlying table builder. Each table builder has at least 1MB of buffer with it. Previously we had no way to close a stream (hence the underlying table builder stayed in memory until flush was called on the stream writer). With stream done support in Badger v2.0, we can close a stream once we are done writing a predicate. This change can help us reduce memory usage when we have a large number of predicates.
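To make the mechanism concrete, here is a hedged sketch of the write path under those assumptions (Badger v2's StreamWriter and a StreamDone flag on pb.KV); identifiers such as kvs, doneStreams, addDone, and writer are placeholders, not the PR's exact code:

```go
// kvs: the key/values produced for the predicates just reduced (placeholder).
// doneStreams: streams (predicates) with no more records coming (placeholder).
list := &bpb.KVList{Kv: kvs}
addDone(doneStreams, list) // appends one KV with StreamDone: true per finished stream

// On seeing StreamDone, Badger's stream writer can finish that stream and free
// its preallocated table-builder buffer instead of holding it until Flush.
if err := writer.Write(list); err != nil {
	return err
}
```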
I did benchmarking by creating an RDF file using the below program:
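The program itself isn't reproduced in this excerpt; below is a hypothetical sketch of a generator that would produce an equivalent file (the triple format, blank-node naming, and output file name are assumptions):

```go
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Create("file-10000-10000.rdf.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	gz := gzip.NewWriter(f)
	defer gz.Close()
	w := bufio.NewWriter(gz)
	defer w.Flush()

	const numPredicates, numRecords = 10000, 10000
	for p := 0; p < numPredicates; p++ {
		for r := 0; r < numRecords; r++ {
			// One triple per record: 10000 x 10000 = 100M triples in total.
			fmt.Fprintf(w, "_:node_%d_%d <pred_%d> \"value_%d\" .\n", p, r, p, r)
		}
	}
}
```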
The above program creates an RDF file with 10000 predicates, and each predicate has 10000 records (100M records in total). I ran the bulk loader using the below command with an empty schema:
dgraph_master bulk -f file-10000-10000.rdf.gz -s empty.schema
System configuration: 16 cores, 64GB RAM
Here are my observations:
Current Master:
Map phase peak memory usage (RSS): ~6GB
Reduce phase peak memory usage (RSS): ~20GB
Time to complete: 4min31sec
This PR:
Map phase peak memory usage (RSS): ~6GB
Reduce phase peak memory usage (RSS): ~7.2GB
Time to complete: 4min16sec