Live & bulk loader changes #2961

codexnull · 2019-02-01T01:50:16Z

Fixes #2889, #2848, #2927.

Summary of loader changes:

Support live loading JSON files
Support creating UID on the fly based on one or more other JSON fields in both loaders
Support loading RDF or JSON stream instead of requiring files in live loader
Auto-detect compressed load data instead of requiring extension in filename in both loaders
Auto-detect JSON load data instead of requiring extension in filename in both loaders
Lots of refactoring of live and bulk loaders to share code for the same functionality

This change is

…les there.

…ecessary.

…generic test data filenames.

… name. Improve live loader messages.

golangcibot · 2019-02-01T01:51:13Z

z/exec.go

+			}
+		}
+
+		cmd[i].Start()


Error return value of (*os/exec.Cmd).Start is not checked (from errcheck)

martinmr

Reviewed 21 of 21 files at r1.
Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @codexnull and @golangcibot)

dgraph/cmd/bulk/loader.go, line 40 at r1 (raw file):

	"github.com/dgraph-io/dgraph/x"
	"github.com/dgraph-io/dgraph/xidmap"
	"google.golang.org/grpc"

Why is this line being moved?

dgraph/cmd/bulk/loader.go, line 69 at r1 (raw file):

	shardOutputDirs []string
	keyFields       []string
}

If you are going to have two fields that look the same, there should probably a comment explaining the difference to avoid confusion in the future.

Personally, if I understand the difference, I would call keyFields something like parsedKeyFields.

dgraph/cmd/live/load_json_test.go, line 39 at r1 (raw file):

var testDataDir string
var dg *dgo.Dgraph
var tmpDir string

move these into a var block

dgraph/cmd/live/run.go, line 69 at r1 (raw file):

var Live x.SubCommand

var keyFields []string

also move all these vars into a var block

dgraph/cmd/live/run.go, line 174 at r1 (raw file):

}

// forward file to the RDF or JSON processor as appropriate

change it to "processFile forwards file to the ...."

edgraph/nquads_from_json.go, line 422 at r1 (raw file):

func replaceBlankIds(nqs []*api.NQuad, keyFields *[]string) error {
	field2idx := make(map[string]int)

I think this would be a bit more readable if it was called fieldToIdx

edgraph/nquads_from_json.go, line 450 at r1 (raw file):

	if err == nil {
		if nqs[0].Subject == "_:blank-0" {
			if keyFields == nil {

maybe assert this at the very beginning and exit early if keyFields == nil?

loadfile/chunk.go, line 62 at r1 (raw file):

func (rdfChunker) Begin(r *bufio.Reader) error {
	return nil

Add a comment on why this chunker just returns nil.

loadfile/chunk.go, line 68 at r1 (raw file):

	batch.Grow(1 << 20)
	for lineCount := 0; lineCount < 1e5; lineCount++ {

maybe assign these constants to a variable and explain why these values were chosen.

loadfile/chunk.go, line 122 at r1 (raw file):

func (rdfChunker) End(r *bufio.Reader) error {
	return nil

a comment explaining why this function is empty would also be helpful

loadfile/chunk.go, line 145 at r1 (raw file):

func (jsonChunker) Chunk(r *bufio.Reader) (*bytes.Buffer, error) {
	out := new(bytes.Buffer)
	out.Grow(1 << 20)

Also this constant but I assume since it has the same value as above it represents the same quantity.

z/exec.go, line 67 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

Error return value of (*os/exec.Cmd).Start is not checked (from errcheck)

Fix this

z/json.go, line 91 at r1 (raw file):

func sortJSONArray(a []interface{}) uint64 {
	h := uint64(0)

var h uint64

codexnull

Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @golangcibot and @martinmr)

dgraph/cmd/bulk/loader.go, line 40 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

Why is this line being moved?

Because the automatic formatting in my editor decided to do it. Sometimes it rearranges imports and I don't always catch it.

dgraph/cmd/bulk/loader.go, line 69 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

If you are going to have two fields that look the same, there should probably a comment explaining the difference to avoid confusion in the future.

Personally, if I understand the difference, I would call keyFields something like parsedKeyFields.

Done.

dgraph/cmd/live/load_json_test.go, line 39 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

var testDataDir string
var dg *dgo.Dgraph
var tmpDir string
move these into a var block

OK. Is that just because it's the preferred Go style?

dgraph/cmd/live/run.go, line 69 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

also move all these vars into a var block

Done.

dgraph/cmd/live/run.go, line 174 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

change it to "processFile forwards file to the ...."

Done.

edgraph/nquads_from_json.go, line 422 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

I think this would be a bit more readable if it was called fieldToIdx

Done.

edgraph/nquads_from_json.go, line 450 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

maybe assert this at the very beginning and exit early if keyFields == nil?

It's OK for keyFields to be nil most of the time. Only when a JSON map does not contain a uid field (which is what the "_:blank-0" check is for) is it required.

loadfile/chunk.go, line 62 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

Add a comment on why this chunker just returns nil.

Done.

loadfile/chunk.go, line 68 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

	batch.Grow(1 << 20)
	for lineCount := 0; lineCount < 1e5; lineCount++ {
maybe assign these constants to a variable and explain why these values were chosen.

Those values were there already, but my guess is they are arbitrary.

loadfile/chunk.go, line 122 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

a comment explaining why this function is empty would also be helpful

Done.

loadfile/chunk.go, line 145 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

Also this constant but I assume since it has the same value as above it represents the same quantity.

Yes, this I just copied for consistency.

z/json.go, line 91 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

var h uint64

Done.

martinmr

Reviewed 7 of 7 files at r2.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @codexnull, @golangcibot, and @martinmr)

dgraph/cmd/live/load_json_test.go, line 39 at r1 (raw file):

Previously, codexnull (Javier Alvarado) wrote…

OK. Is that just because it's the preferred Go style?

I believe so

loadfile/chunk.go, line 61 at r2 (raw file):

}

// RDF files don't require any special processing at the beginning of the file

period at the end

loadfile/chunk.go, line 122 at r2 (raw file):

}

// RDF files don't require any special processing at the end of the file

period at the end

z/json.go, line 91 at r2 (raw file):

func sortJSONArray(a []interface{}) uint64 {
	h := uint64(0)

var h uint64 here as well

manishrjain

Reviewed 14 of 21 files at r1, 7 of 7 files at r2.
Reviewable status: all files reviewed, 30 unresolved discussions (waiting on @codexnull, @golangcibot, and @martinmr)

dgraph/cmd/bulk/loader.go, line 204 at r2 (raw file):

			defer thr.Done()

			r, cleanup_fn := x.FileReader(file)

r, cleanup := x.FileReader(file)

defer cleanup()

dgraph/cmd/bulk/mapper.go, line 120 at r2 (raw file):

}

func (m *mapper) run(inputFormat int, keyFields *[]string) {

Slices are pointers by default. No need to pass in a pointer to a slice.

If you're changing the slice, then the Go way is to return the updated slice as a result.

dgraph/cmd/bulk/mapper.go, line 135 at r2 (raw file):

			}

			gqlNqs := make([]gql.NQuad, len(nqs))

What is this for? Are you copying them over for some reason?

dgraph/cmd/bulk/run.go, line 51 at r2 (raw file):

	flag := Bulk.Cmd.Flags()
	flag.StringP("rdfs", "r", "",
		"Location of RDF data to load.")

I think the previous comment was clearer than this one.

dgraph/cmd/bulk/run.go, line 54 at r2 (raw file):

	// would be nice to use -j to match -r, but already used by --num_go_routines
	flag.String("jsons", "",
		"Location of JSON data to load.")

same here. This isn't as clear -- unless you changed how we detect files.

dgraph/cmd/bulk/run.go, line 94 at r2 (raw file):

	flag.String("custom_tokenizers", "",
		"Comma separated list of tokenizer plugins")
	flag.StringP("key", "k", "", "Comma-separated list of JSON fields to identify a uid")

Add (Optional) prefix in the description.

... to identify a node uniquely.

dgraph/cmd/bulk/run.go, line 184 at r2 (raw file):

	for _, f := range strings.Split(opt.KeyFields, ",") {
		opt.parsedKeyFields = append(opt.parsedKeyFields, strings.TrimSpace(f))

We should avoid having 2 fields in opt, which both represent the same information. You can just overwrite opt.KeyFields here.

dgraph/cmd/live/run.go, line 85 at r2 (raw file):

	flag := Live.Cmd.Flags()
	flag.StringP("rdfs", "r", "", "Location of RDF or JSON file(s) to load")

Rename to files, and f. What do you expect the pattern to be? like this:

/*.json.gz? Whatever it is, mention it in the description so it is clear. ___ *[dgraph/cmd/live/run.go, line 179 at r2](https://reviewable.io/reviews/dgraph-io/dgraph/2961#-LXuWoxX2EeASxWdBWgZ:-LXuWoxX2EeASxWdBWg_:bvbuzff) ([raw file](https://github.com/dgraph-io/dgraph/blob/116a7943ec9e9618ea97d5f93a0eaad25152b3d6/dgraph/cmd/live/run.go#L179)):* > ```Go > > rd, cleanup_fn := x.FileReader(file) > defer cleanup_fn() > ```

Go uses camelcase. But, as mentioned above, no need for _fn. Just call it cleanup. Similar to how context.WithCancel works.

dgraph/cmd/live/run.go, line 189 at r2 (raw file):

	} else {
		isJson, err = x.IsJSONData(rd)
		if isJson {

Shouldn't you check for error first? But, also, why try to guess? Just complain and crash out. Users' responsibility to give us the right files.

dgraph/cmd/live/run.go, line 199 at r2 (raw file):

}

func (l *loader) nextNquads(batch []*api.NQuad, nqs []*api.NQuad) []*api.NQuad {

Typically these things just get inlined into the loops, instead of being funcs on their own.

dgraph/cmd/live/run.go, line 211 at r2 (raw file):

		mu := api.Mutation{Set: batch[:opt.batchSize]}
		l.reqs <- mu
		batch = batch[opt.batchSize:]

I smell some funkiness here. The way batch is being passed around doesn't look good. Yeah, just inline this in the caller.

Also, after passing batch to a channel, create a new batch, don't modify it anymore. This particular case might just work, but in general, a recipe for disaster.

dgraph/cmd/live/run.go, line 217 at r2 (raw file):

}

func (l *loader) finalNquads(batch []*api.NQuad) {

Inline this func.

dgraph/cmd/live/run.go, line 253 at r2 (raw file):

}

func (l *loader) processJsonFile(ctx context.Context, rd *bufio.Reader) error {

Not a fan of 2-line functions. Rather inline it and avoid the code jumps required to read this code.

dgraph/cmd/live/run.go, line 258 at r2 (raw file):

}

func (l *loader) processRdfFile(ctx context.Context, rd *bufio.Reader) error {

And here.

dgraph/cmd/live/run.go, line 325 at r2 (raw file):

	for _, f := range strings.Split(opt.keyFields, ",") {
		keyFields = append(keyFields, strings.TrimSpace(f))

Some other place, this was being set back into the opt? (looks like in bulk) But comments there apply here as well. Overwrite opt.keyFields, decrease a variable.

edgraph/nquads_from_json.go, line 437 at r2 (raw file):

	for _, nq := range nqs {
		nq.Subject = str

This won't work if another node is pointing to this node. The ObjectId which was a variable before would have been correctly set by xidToUid map, but now the subjects get changed, leaving those pointers dangling.

loadfile/chunk.go, line 62 at r1 (raw file):

Previously, codexnull (Javier Alvarado) wrote…

Done.

I don't see the changes required by Martin.

loadfile/chunk.go, line 17 at r2 (raw file):

 */

package loadfile

package chunker.

Also, move rdf parsing within this package.

loadfile/chunk.go, line 214 at r2 (raw file):

	}

	nqs, err := edgraph.JsonToNquads(chunkBuf.Bytes(), keyFields)

Move it to this package.

x/file.go, line 105 at r2 (raw file):

// and decompressed automatically even without the gz extension. The caller is responsible for
// calling the returned cleanup function when done with the reader.
func FileReader(file string) (rd *bufio.Reader, cleanup func()) {

Move this to chunker pkg.

z/client.go, line 2 at r2 (raw file):

/*
 * Copyright 2017-2018 Dgraph Labs, Inc. and Contributors

2019

z/exec.go, line 2 at r2 (raw file):

/*
 * Copyright 2017-2018 Dgraph Labs, Inc. and Contributors

2019

z/json.go, line 2 at r2 (raw file):

/*
 * Copyright 2017-2018 Dgraph Labs, Inc. and Contributors

2019

golangcibot · 2019-02-05T01:57:25Z

dgraph/cmd/bulk/mapper.go

@@ -139,7 +140,7 @@ func (m *mapper) run(inputFormat int) {
 					}
 				}

-				m.processNQuad(nq)
+				m.processNQuad(gql.NQuad{nq})


gql.NQuad composite literal uses unkeyed fields (from govet)

codexnull

Reviewable status: 12 of 21 files reviewed, 31 unresolved discussions (waiting on @codexnull, @golangcibot, @manishrjain, and @martinmr)

dgraph/cmd/bulk/loader.go, line 204 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

r, cleanup := x.FileReader(file)

defer cleanup()

Done.

dgraph/cmd/bulk/mapper.go, line 120 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Slices are pointers by default. No need to pass in a pointer to a slice.

If you're changing the slice, then the Go way is to return the updated slice as a result.

Done.

dgraph/cmd/bulk/mapper.go, line 135 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

What is this for? Are you copying them over for some reason?

chunker.Parse() returns a slice of *api.NQuad, but the call to mapper.processNquad() below needs a gql.NQuad, so this code was converting from a slice of *api.NQuad to a slice of gql.NQuad. Looking at the code again, though, it's not necessary to create an array of gql.NQuad; each one can be just converted at the mapper.processNquad() invocation.

dgraph/cmd/bulk/mapper.go, line 143 at r3 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

gql.NQuad composite literal uses unkeyed fields (from govet)

Done.

dgraph/cmd/bulk/run.go, line 51 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

I think the previous comment was clearer than this one.

The reason I changed the wording is because for the live loader, the argument isn't necessarily a directory. It may be the rdf file directly, a comma-separated list of file, or a directory containing the rdf files. I changed the bulk loader to behave the same way as the live loader.

dgraph/cmd/bulk/run.go, line 54 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

same here. This isn't as clear -- unless you changed how we detect files.

The argument here may also be a file, a list of files, or a directory.

dgraph/cmd/bulk/run.go, line 94 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Add (Optional) prefix in the description.

... to identify a node uniquely.

Aren't all options optional? :)

Removing the --key option since the feature requires more thought.

dgraph/cmd/bulk/run.go, line 184 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

We should avoid having 2 fields in opt, which both represent the same information. You can just overwrite opt.KeyFields here.

I don't think we can do that given that opt.KeyFields is the string given as the option argument (e.g. "last,first,middle") and opt.parsedKeyFields is a slice of the individual fields (e.g. ["last","first","middle"]), but I'll see if I can move parsedKeyFields out of opt to some other (non-global variable) spot.

dgraph/cmd/live/run.go, line 85 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Rename to files, and f. What do you expect the pattern to be? like this:
/*.json.gz? Whatever it is, mention it in the description so it is clear.

I kept --rdfs because I didn't want to break backward compatibility. Should we keep it and deprecate it? I will change the option to --files in the bulk loader so that they are consistent.

dgraph/cmd/live/run.go, line 179 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Go uses camelcase. But, as mentioned above, no need for _fn. Just call it cleanup. Similar to how context.WithCancel works.

Done.

dgraph/cmd/live/run.go, line 189 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Shouldn't you check for error first? But, also, why try to guess? Just complain and crash out. Users' responsibility to give us the right files.

I check isJson first because it can only be true if err is nil and I wanted to let errors fall through to the return at the end of the function, but I'll change it to check err since that seems to be the preferred go idiom.

As for detecting JSON data, the reason is so that the user can do something like "curl | dgraph live ..." to stream data in (that is, there's no filename to check for an extension). Would it be better to make the user specify something like --format=json instead?

dgraph/cmd/live/run.go, line 199 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Typically these things just get inlined into the loops, instead of being funcs on their own.

OK

dgraph/cmd/live/run.go, line 211 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

I smell some funkiness here. The way batch is being passed around doesn't look good. Yeah, just inline this in the caller.

Also, after passing batch to a channel, create a new batch, don't modify it anymore. This particular case might just work, but in general, a recipe for disaster.

batch is not a mutation, it's just a regular slice that holds a queue of nquads. When there are batchSize or more nquads in the queue, then the mutation is created, the leading batchSize items are copied to the mutation and removed from the queue, and the mutation is sent off through the channel and forgotten about. I think the queue will be unnecessary when I inline the functions

dgraph/cmd/live/run.go, line 253 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Not a fan of 2-line functions. Rather inline it and avoid the code jumps required to read this code.

Done.

dgraph/cmd/live/run.go, line 258 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

And here.

Done.

dgraph/cmd/live/run.go, line 325 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Some other place, this was being set back into the opt? (looks like in bulk) But comments there apply here as well. Overwrite opt.keyFields, decrease a variable.

Yes, the bulk and live loaders have different opt structures. Removing this feature since it requires more thought.

edgraph/nquads_from_json.go, line 437 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

This won't work if another node is pointing to this node. The ObjectId which was a variable before would have been correctly set by xidToUid map, but now the subjects get changed, leaving those pointers dangling.

Removing this feature since it requires more thought.

loadfile/chunk.go, line 61 at r2 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

period at the end

Done.

loadfile/chunk.go, line 122 at r2 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

period at the end

Done.

z/exec.go, line 67 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

Fix this

Done.

z/json.go, line 91 at r2 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

var h uint64 here as well

Done.

golangcibot · 2019-02-06T02:03:47Z

dgraph/cmd/bulk/mapper.go

@@ -31,6 +31,7 @@ import (
 	"sync/atomic"

 	"github.com/dgraph-io/dgo/protos/api"
+	"github.com/dgraph-io/dgraph/chunker"


File is not goimports-ed (from goimports)

…f files.

codexnull

Reviewable status: 10 of 30 files reviewed, 32 unresolved discussions (waiting on @golangcibot, @manishrjain, and @martinmr)

dgraph/cmd/bulk/mapper.go, line 135 at r2 (raw file):

Previously, codexnull (Javier Alvarado) wrote…

chunker.Parse() returns a slice of *api.NQuad, but the call to mapper.processNquad() below needs a gql.NQuad, so this code was converting from a slice of *api.NQuad to a slice of gql.NQuad. Looking at the code again, though, it's not necessary to create an array of gql.NQuad; each one can be just converted at the mapper.processNquad() invocation.

Done.

dgraph/cmd/bulk/mapper.go, line 34 at r4 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

File is not goimports-ed (from goimports)

Done.

dgraph/cmd/bulk/run.go, line 184 at r2 (raw file):

Previously, codexnull (Javier Alvarado) wrote…

I don't think we can do that given that opt.KeyFields is the string given as the option argument (e.g. "last,first,middle") and opt.parsedKeyFields is a slice of the individual fields (e.g. ["last","first","middle"]), but I'll see if I can move parsedKeyFields out of opt to some other (non-global variable) spot.

Removing the --key option since the feature requires more thought.

dgraph/cmd/live/run.go, line 85 at r2 (raw file):

Previously, codexnull (Javier Alvarado) wrote…

I kept --rdfs because I didn't want to break backward compatibility. Should we keep it and deprecate it? I will change the option to --files in the bulk loader so that they are consistent.

Done.

dgraph/cmd/live/run.go, line 217 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Inline this func.

Done.

x/file.go, line 105 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Move this to chunker pkg.

Done.

z/client.go, line 2 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

2019

Done.

z/exec.go, line 2 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

2019

Done.

z/json.go, line 2 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

2019

Done.

loadfile/chunk.go, line 17 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

package chunker.

Also, move rdf parsing within this package.

Done.

loadfile/chunk.go, line 214 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Move it to this package.

Done.

…nnel.

manishrjain

Made and pushed a small change to fix an issue. Good to go.

Reviewed 3 of 9 files at r3, 8 of 15 files at r4, 12 of 13 files at r5, 1 of 1 files at r6.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @codexnull, @golangcibot, @manishrjain, and @martinmr)

dgraph/cmd/live/run.go, line 189 at r2 (raw file):

Previously, codexnull (Javier Alvarado) wrote…

I check isJson first because it can only be true if err is nil and I wanted to let errors fall through to the return at the end of the function, but I'll change it to check err since that seems to be the preferred go idiom.

As for detecting JSON data, the reason is so that the user can do something like "curl | dgraph live ..." to stream data in (that is, there's no filename to check for an extension). Would it be better to make the user specify something like --format=json instead?

OK.

dgraph/cmd/live/run.go, line 221 at r5 (raw file):

			batch = append(batch, nqs...)
			for len(batch) >= opt.batchSize {
				mu := api.Mutation{Set: batch[:opt.batchSize]}

send the whole batch out, and create a new batch. Don't modify a slice after its ownership has changed. I've pushed a change to this

edgraph/server.go, line 35 at r5 (raw file):

	"github.com/dgraph-io/dgo/y"

	nqjson "github.com/dgraph-io/dgraph/chunker/json"

cjson? short of chunker/json?

Fixes dgraph-io#2889, dgraph-io#2927. Summary of loader changes: * Support live loading JSON files * Support loading RDF or JSON stream instead of requiring files in live loader * Auto-detect compressed load data instead of requiring extension in filename in both loaders * Auto-detect JSON load data instead of requiring extension in filename in both loaders * Lots of refactoring of live and bulk loaders to share code for the same functionality

Javier Alvarado added 30 commits January 11, 2019 16:11

Accept directory as argument to --rdfs and look for .rdf and .json fi…

6145526

…les there.

Merge branch 'master' into javier/issue2889_live_load_json

6e395be

Add x.FileReader() function to return a reader that decompresses if n…

98e0ea1

…ecessary.

Forward file to the RDF or JSON processor as appropriate.

3ca188a

Working version?

de3da8f

Add live test stub.

c4291db

Ensure pending nquads are submitted after EOF.

5866bd7

Change references to RDFs to N-Quads to be more correct.

86c421f

Merge master.

9cb6dca

Export TMPDIR in test.sh so that hopefully tests will use it. Change …

bdfcd85

…generic test data filenames.

Change test data schema to reflect changes merged from master.

1e9c530

Live JSON load test.

c495a01

Auto-detect JSON in load files with no .rdf or .json extension in the…

f499943

… name. Improve live loader messages.

Fix bug reading streamed JSON. Add test of loading streamed JSON.

7b5ad3b

Live load testing improvements.

405c60a

First attempt at autogenerating uid from key fields.

37c3437

Replace RDFs in message with N-Quads.

afe44d4

Add debugging code.

9da9d8e

Fix bug assigning blank ids. Move more testing code to testing package.

fce4ff6

Add test of live loading multiple JSON files.

fbad4cf

Add test of live loading JSON without UID field. Testing improvements.

8c073d5

Add tests of live JSON load with auto-uid.

f828d57

Merge branch 'master' into javier/issue2889_live_load_json

44b1a73

Small restartCluster fix.

cc59eed

Merge master.

f6b25ca

Rename fields to be more correct or specific.

06fd748

Improved comments.

a23a169

Remove test.sh change that should be part of another branch.

4495702

Fix bugs catching errors.

2a94576

Make auto-added blank uid fields per file.

c9833fd

Javier Alvarado added 2 commits January 31, 2019 14:53

Add --key option to bulk loader for parity with live loader.

17125d9

Fix batching, which was broken by live/bulk refactoring.

8e7b581

codexnull requested a review from a team February 1, 2019 01:50

golangcibot reviewed Feb 1, 2019

View reviewed changes

z/exec.go Outdated

}

}

cmd[i].Start()

Copy link

golangcibot Feb 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error return value of (*os/exec.Cmd).Start is not checked (from errcheck)

martinmr suggested changes Feb 1, 2019

View reviewed changes

Javier Alvarado added 2 commits February 1, 2019 13:34

Parse key fields earlier in the process.

1c2c9ef

Cleanup suggestions from PR.

116a794

codexnull commented Feb 1, 2019

View reviewed changes

martinmr suggested changes Feb 1, 2019

View reviewed changes

manishrjain suggested changes Feb 4, 2019

View reviewed changes

PR review fixes.

8f16e04

golangcibot reviewed Feb 5, 2019

View reviewed changes

Javier Alvarado added 2 commits February 4, 2019 18:01

Fix warning.

7685150

Remove --key support since it requires more thought.

dd237b1

codexnull commented Feb 6, 2019

View reviewed changes

Javier Alvarado added 5 commits February 5, 2019 16:40

Remove two-line functions processJsonFile and processRdfFile.

db7f34f

Inline nextNquads() and finalNquads() into processLoadFile().

915eef8

Rename package loadfile to chunker.

d09e925

Move package rdf to chunker/rdf.

94f355b

WIP...

9f39ac9

golangcibot reviewed Feb 6, 2019

View reviewed changes

Javier Alvarado added 3 commits February 5, 2019 23:25

Move json and rdf nquad parsing under chunker.

6872833

Move FileReader() and IsJSONData() from x package to chunker.

ec990e2

Rename live --rdfs option to --files since it can load more than *.rd…

ac41b58

…f files.

codexnull commented Feb 6, 2019

View reviewed changes

Fix a bug where batch slice was being modified after sending to a cha…

6143636

…nnel.

manishrjain approved these changes Feb 6, 2019

View reviewed changes

codexnull merged commit bcc7796 into master Feb 6, 2019

codexnull deleted the javier/issue2889_live_load_json branch March 1, 2019 22:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Live & bulk loader changes #2961

Live & bulk loader changes #2961

codexnull commented Feb 1, 2019 •

edited

Loading

golangcibot Feb 1, 2019

martinmr left a comment

codexnull left a comment

martinmr left a comment

manishrjain left a comment

golangcibot Feb 5, 2019

codexnull left a comment

golangcibot Feb 6, 2019

codexnull left a comment

manishrjain left a comment

Live & bulk loader changes #2961

Live & bulk loader changes #2961

Conversation

codexnull commented Feb 1, 2019 • edited Loading

golangcibot Feb 1, 2019

Choose a reason for hiding this comment

martinmr left a comment

Choose a reason for hiding this comment

codexnull left a comment

Choose a reason for hiding this comment

martinmr left a comment

Choose a reason for hiding this comment

manishrjain left a comment

Choose a reason for hiding this comment

golangcibot Feb 5, 2019

Choose a reason for hiding this comment

codexnull left a comment

Choose a reason for hiding this comment

golangcibot Feb 6, 2019

Choose a reason for hiding this comment

codexnull left a comment

Choose a reason for hiding this comment

manishrjain left a comment

Choose a reason for hiding this comment

codexnull commented Feb 1, 2019 •

edited

Loading