
Refactor reindexing code to only reindex specific tokenizers. #2883

Merged: 6 commits into master on Jan 15, 2019

Conversation

@martinmr (Contributor) commented Jan 9, 2019

Currently during reindexing, the whole index is being deleted and
rebuilt, even if there are tokenizers that did not change between
schemas. This change allows reindexing to happen only for tokenizers
that need it.



@martinmr requested a review from manishrjain on January 9, 2019 01:44
posting/index.go: outdated review thread (resolved)
@manishrjain (Contributor) left a comment


Reviewable status: 0 of 4 files reviewed, 9 unresolved discussions (waiting on @golangcibot, @martinmr, and @manishrjain)


posting/index.go, line 81 at r1 (raw file):

// convenience as manually getting the tokenizers in every test that need them
// would add a lot of extra code.
func indexTokens(attr, lang string, src types.Val) ([]string, error) {

Put this code in the _test.go file.


posting/index.go, line 91 at r1 (raw file):

// original uid -> value edge.
// TODO - See if we need to pass op as argument as t should already have Op.
func (txn *Txn) addIndexMutationsForTokenizers(ctx context.Context, tokenizers []tok.Tokenizer,

Revert naming change.


posting/index.go, line 92 at r1 (raw file):

// TODO - See if we need to pass op as argument as t should already have Op.
func (txn *Txn) addIndexMutationsForTokenizers(ctx context.Context, tokenizers []tok.Tokenizer,
	t *pb.DirectedEdge, p types.Val, op pb.DirectedEdge_Op) error {

These args are getting out of hand. Let's create a struct to house them so we can cut the args down to two (ctx and obj). Also, no need to rename the functions.
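
For illustration, a minimal sketch (in Go) of the kind of struct this comment asks for, built from the types already visible in the quoted signature; the struct name, field names, and method body are illustrative, not necessarily what the PR ended up with:

// Illustrative sketch only: bundle the long argument list into one struct so
// callers pass (ctx, info) instead of five separate parameters.
type indexMutationInfo struct {
	tokenizers []tok.Tokenizer    // tokenizers to generate index tokens with
	edge       *pb.DirectedEdge   // the original uid -> value edge
	val        types.Val          // the value being indexed
	op         pb.DirectedEdge_Op // the mutation operation
}

func (txn *Txn) addIndexMutations(ctx context.Context, info *indexMutationInfo) error {
	// ...same body as before, reading its inputs from info...
	return nil
}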


posting/index.go, line 96 at r1 (raw file):

	uid := t.Entity
	x.AssertTrue(uid != 0)
	tokens, err := indexTokensForTokenizers(attr, t.GetLang(), tokenizers, p)

This can take the obj as described above.


posting/index.go, line 124 at r1 (raw file):

	// Get the complete list of tokenizers from the schema.
	tokenizers := schema.State().Tokenizer(t.Attr)
	return txn.addIndexMutationsForTokenizers(ctx, tokenizers, t, p, op)

If the struct doesn't have tokenizers specified (empty), then you can automatically do this. No need to create a helper func.
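
In code, the suggested fallback is roughly the sketch below, reusing the hypothetical info struct from above; only the schema.State().Tokenizer call comes from the quoted diff:

// If the caller did not specify tokenizers, fall back to the complete list
// from the schema, which matches the current behavior.
if len(info.tokenizers) == 0 {
	info.tokenizers = schema.State().Tokenizer(info.edge.Attr)
}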


posting/index.go, line 598 at r1 (raw file):

}

type indexRebuildInfo struct {

This seems like it could be housed in IndexRebuild struct?


posting/index.go, line 604 at r1 (raw file):

}

func needsIndexRebuild(old *pb.SchemaUpdate, current *pb.SchemaUpdate) indexRebuildInfo {

Make this a func of IndexRebuild. Call it calculateTokenizers or calculateActions or something.
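
Sketched in Go, the suggestion amounts to something like the following; the receiver field names (OldSchema, CurrentSchema) are assumptions about what IndexRebuild holds, not taken from the diff:

// Hedged sketch: turn the free function into a method on IndexRebuild.
func (rb *IndexRebuild) calculateTokenizers() indexRebuildInfo {
	// Delegate to the existing comparison so the logic itself stays unchanged.
	return needsIndexRebuild(rb.OldSchema, rb.CurrentSchema)
}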


posting/index.go, line 677 at r1 (raw file):

		}
	}
	return indexRebuildInfo{

You can define this above, so you don't need to define tokenizersToDelete, etc. Also, make([]string, 0) has no benefit over var x []string; the latter is the natural way to write it.
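
On the slice point, both forms behave identically under append; the var form is simply the idiomatic way to declare a slice that starts empty. A self-contained example:

package main

import "fmt"

func main() {
	allocated := make([]string, 0) // empty but non-nil; the allocation buys nothing here
	var idiomatic []string         // nil slice; append allocates when first used

	allocated = append(allocated, "term")
	idiomatic = append(idiomatic, "term")
	fmt.Println(allocated, idiomatic) // [term] [term]
}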

posting/index.go: 4 outdated review threads (resolved)
@martinmr (Contributor, Author) left a comment


Reviewable status: 0 of 4 files reviewed, 12 unresolved discussions (waiting on @golangcibot, @manishrjain, and @gitlw)


posting/index.go, line 81 at r1 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Put this code in the _test.go file.

Done.


posting/index.go, line 81 at r1 (raw file):

Previously, gitlw (Lucas Wang) wrote…

It seems this is only being used inside the index_test.go file; can we move this function into that test file?

Done.


posting/index.go, line 91 at r1 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Revert naming change.

Done.


posting/index.go, line 92 at r1 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

These args are going out of hands. Let's create a struct, which can house these, so we can cut down the args to 2 (ctx and obj). Also, no need to rename the functions.

Done.


posting/index.go, line 96 at r1 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

This can take the obj as described above.

Done.


posting/index.go, line 120 at r1 (raw file):

Previously, gitlw (Lucas Wang) wrote…

To better distinguish this method from the following one, maybe it can be renamed to addIndexMutationsForAllTokenizers?

I ended up removing this special case so there is only one method called addIndexMutations. See comment below.


posting/index.go, line 124 at r1 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

If the struct doesn't have tokenizers specified (empty), then you can automatically do this. No need to create a helper func.

I don't think this is a good idea since we might pass an empty list of tokenizers by mistake and end up rebuilding the entire index. But I agree that the special function is not great so I've removed it. Now the callers are responsible for setting the list of tokenizers.


posting/index.go, line 437 at r1 (raw file):

Previously, gitlw (Lucas Wang) wrote…

rename to deleteIndexForAllTokenizers?

Renamed to deleteEntireIndex, which is a bit shorter.


posting/index.go, line 598 at r1 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

This seems like it could be housed in IndexRebuild struct?

Not really. This struct is only being used for normal indices (not count or reverse) so I think it would be confusing to put it there. Also, it's generated in a different place than the IndexRebuild struct.


posting/index.go, line 604 at r1 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Make this a func of IndexRebuild. Call it calculateTokenizers or calculateActions or something.

I made all the need... functions methods of IndexRebuild. I kept the name for consistency, since it's still determining whether the rebuild needs to happen at all, and sometimes it does nothing with the tokenizers, so that name would be confusing.


posting/index.go, line 664 at r1 (raw file):

Previously, gitlw (Lucas Wang) wrote…

If we change the prevTokens and currTokens from map[string]bool to map[string]struct{} (since they are essentially sets), then we can reuse the x.Diff method to get the differences.

Done.
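
For readers unfamiliar with the pattern, here is a self-contained sketch of treating the tokenizer names as sets; the diff helper below is an illustrative stand-in for the repository's x.Diff, whose exact signature is not shown in this thread:

package main

import "fmt"

// diff is a stand-in for x.Diff: it returns the keys present in a but not in b.
func diff(a, b map[string]struct{}) []string {
	var out []string
	for k := range a {
		if _, ok := b[k]; !ok {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	prevTokens := map[string]struct{}{"term": {}, "exact": {}}
	currTokens := map[string]struct{}{"term": {}, "fulltext": {}}

	fmt.Println(diff(prevTokens, currTokens)) // tokenizers to delete: [exact]
	fmt.Println(diff(currTokens, prevTokens)) // tokenizers to rebuild: [fulltext]
}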


posting/index.go, line 677 at r1 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Can define this above. So, you don't need to define tokenizersToDelete, etc. Also, make([]string, 0) has no benefit over var x []string, latter is natural way to write it.

Done.

@manishrjain (Contributor) left a comment


:lgtm:

Reviewed 2 of 4 files at r1, 1 of 2 files at r2, 2 of 2 files at r3.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @golangcibot, @martinmr, and @gitlw)


posting/index.go, line 124 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

I don't think this is a good idea since we might pass an empty list of tokenizers by mistake and end up rebuilding the entire index. But I agree that the special function is not great so I've removed it. Now the callers are responsible for setting the list of tokenizers.

Rebuilding the entire index is the current behavior. If that's the worst case scenario, that's not a bad tradeoff to achieve code simplicity.


posting/index.go, line 437 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

Renamed to deleteEntireIndex which is a bit shorter.

I renamed it to deleteAllTokens and the one below to deleteTokensFor(attr, tokenizer).

@martinmr (Contributor, Author) left a comment


Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @golangcibot, @manishrjain, and @gitlw)


posting/index.go, line 124 at r1 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Rebuilding the entire index is the current behavior. If that's the worst case scenario, that's not a bad tradeoff to achieve code simplicity.

We would still need to pass the list of tokenizers in the arguments (even if it's nil), so I don't think this would reduce the complexity by much. Right now this function is only called a couple of times, so having the caller get the tokenizers is not much trouble.


posting/index.go, line 437 at r1 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

I renamed it to deleteAllTokens and below one to deleteTokensFor(attr, tokenizer)

Thanks.

@martinmr merged commit de73f40 into master on Jan 15, 2019
@martinmr deleted the martinmr/reindexing-by-tokenizer branch on January 15, 2019 22:27
martinmr added a commit that referenced this pull request Jan 29, 2019
dna2github pushed a commit to dna2fork/dgraph that referenced this pull request Jul 19, 2019
Labels: none yet
4 participants