Fuzzy match support #2916

srfrog · 2019-01-19T03:21:55Z

This PR adds support for fuzzy matching to root queries and filters. It requires a trigram index on the predicate.

It adds a new function, match(), that uses the trigram index to make fuzzy comparisons.
Fuzzy terms are compared character by character, in order.

Examples:

// schema:
title: string @index(trigram) .

// max Levenshtein distance is 2
{
  q(func:match(title, "film", 2)) {
    title
  }
}
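
To make "max Levenshtein distance is 2" concrete, here is a minimal Go sketch of an edit-distance check. The names levenshtein, min3, and withinDistance are hypothetical and this is not the matchFuzzy code from this PR; it is only a plain dynamic-programming version that assumes single-rune insertions, deletions, and substitutions each cost 1.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// levenshtein returns the edit distance between a and b, counting
// single-rune insertions, deletions, and substitutions.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev // the row just filled becomes prev
	}
	return prev[len(rb)]
}

// withinDistance reports whether value is at most maxDist edits from term.
// Hypothetical helper, named for this example only.
func withinDistance(term, value string, maxDist int) bool {
	return levenshtein(term, value) <= maxDist
}

func main() {
	for _, v := range []string{"film", "fill", "firm", "filming", "movie"} {
		fmt.Printf("%-8s distance=%d match=%v\n",
			v, levenshtein("film", v), withinDistance("film", v, 2))
	}
}
```

With a limit of 2, "fill" and "firm" (one substitution each) are accepted, while "filming" (three insertions) is not.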

No breaking changes are expected.

Closes #1883

…d code cleanups

…ue-1883_fuzzy_match_support_in_dgraph

srfrog

Reviewable status: 0 of 12 files reviewed, 7 unresolved discussions (waiting on @martinmr)

systest/queries_test.go, line 579 at r2 (raw file):

Previously, srfrog (Gus) wrote…

Move this down? Ok

Done.

tok/tokens.go, line 50 at r2 (raw file):

Previously, srfrog (Gus) wrote…

Because the caller carries the cost of the check, and typically we send a slice of arguments. But there's room for an optimization here.

I removed this func and added a general GetTokens func that uses the tokenizer ID; its args are variadic, so callers don't need to send a slice. In another change I will refactor the code to use it as needed.

worker/tokens.go, line 76 at r2 (raw file):

Previously, srfrog (Gus) wrote…

Actually, I'll get rid of the switch and use an if. Thanks.

Done.

martinmr

Reviewable status: 0 of 12 files reviewed, 1 unresolved discussion (waiting on @srfrog)

systest/queries_test.go, line 579 at r2 (raw file):

Previously, srfrog (Gus) wrote…

Done.

Sorry, I meant having the code like

	tests := []struct {in, out, failure string} {

cause otherwise there's a line with just }{

srfrog

Reviewable status: 0 of 12 files reviewed, 1 unresolved discussion (waiting on @martinmr)

systest/queries_test.go, line 579 at r2 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

Sorry, I meant having the code like
	tests := []struct {in, out, failure string} {
cause otherwise there's a line with just }{

Ah, I see. I'll leave it like that for clarity and consistency with the other tests.

…actor, which would cause early termination of the algo to save CPU resources. Remove the fuzzysearch lib.

This change allows setting a second integer argument in match() to set the max Levenshtein distance threshold. If no value is set, the default value of 8 is used.

Updated test for new matchFuzzy using threshold.

manishrjain

Few more changes required.

Reviewed 3 of 8 files at r2, 3 of 6 files at r3, 1 of 6 files at r4, 3 of 3 files at r5, 1 of 1 files at r6.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @martinmr and @srfrog)

vendor/vendor.json, line 655 at r6 (raw file):

		{
			"checksumSHA1": "9fTIdD63nJT3Y4QvHtw9dCBhzzE=",
			"path": "github.com/lithammer/fuzzysearch/fuzzy",

Needs to be removed.

worker/task.go, line 1195 at r6 (raw file):

			// convert data from binary to appropriate format
			strVal, err := types.Convert(val, types.StringID)
			if err == nil && matchFuzzy(matchQuery, strVal.Value.(string), int(arg.srcFn.threshold)) {

Convert the threshold to int once, and just use for all the uids.
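
A minimal sketch of that suggestion, assuming the threshold is stored as an int64 (the real field type is not shown here). filterValues and lengthDiffWithin are hypothetical names; the matcher is a stand-in that only compares lengths, so the example stays focused on where the conversion happens.

```go
package main

import "fmt"

// lengthDiffWithin is a hypothetical stand-in for a fuzzy matcher. It only
// compares byte lengths, which keeps the example short.
func lengthDiffWithin(term, value string, maxDist int) bool {
	d := len(term) - len(value)
	if d < 0 {
		d = -d
	}
	return d <= maxDist
}

// filterValues keeps the values accepted by the matcher. The threshold is
// converted to int once, before the loop, and reused for every value.
func filterValues(term string, values []string, threshold int64) []string {
	maxDist := int(threshold)
	var kept []string
	for _, v := range values {
		if lengthDiffWithin(term, v, maxDist) {
			kept = append(kept, v)
		}
	}
	return kept
}

func main() {
	fmt.Println(filterValues("film", []string{"film", "films", "filmography"}, 2))
}
```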

worker/task.go, line 1507 at r6 (raw file):

		l := len(q.SrcFunc.Args)
		if l == 0 || l > 2 {
			return nil, x.Errorf("Function '%s' requires at most 2 arguments, but got %d (%v)",

at most? Isn't it always 2 args?

worker/task.go, line 1516 at r6 (raw file):

		fc.intersectDest = needsIntersect(f)
		// Max Levenshtein distance
		fc.threshold = 8

Don't add artificial limits. Remove this.

Note that the matching is only done on lists which have been returned by the trigram index. So, we're not matching against the universal data set. Additionally, the cost of reading these lists from the disk is a lot more than running a limited computation in memory.

In the worst case, we're returning all the uids that were returned by the trigram index, which is a manageable list.
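
The two-step shape described above (the index narrows the uid list, then matching runs only on that list) can be sketched roughly as follows. This is not Dgraph's trigram index; trigrams, buildIndex, and candidates are hypothetical names, and taking the union of the postings for each trigram of the term is an assumption made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// trigrams returns the distinct 3-rune substrings of s, in order of first use.
func trigrams(s string) []string {
	r := []rune(s)
	seen := map[string]bool{}
	var out []string
	for i := 0; i+3 <= len(r); i++ {
		t := string(r[i : i+3])
		if !seen[t] {
			seen[t] = true
			out = append(out, t)
		}
	}
	return out
}

// buildIndex maps each trigram to the uids whose value contains it.
func buildIndex(values map[uint64]string) map[string][]uint64 {
	idx := map[string][]uint64{}
	for uid, v := range values {
		for _, t := range trigrams(v) {
			idx[t] = append(idx[t], uid)
		}
	}
	return idx
}

// candidates returns the sorted uids that share at least one trigram with
// term. Only these uids would be passed on to the distance check.
func candidates(idx map[string][]uint64, term string) []uint64 {
	seen := map[uint64]bool{}
	var out []uint64
	for _, t := range trigrams(term) {
		for _, uid := range idx[t] {
			if !seen[uid] {
				seen[uid] = true
				out = append(out, uid)
			}
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func main() {
	values := map[uint64]string{1: "film", 2: "filming", 3: "movie", 4: "firm"}
	idx := buildIndex(values)
	fmt.Println(candidates(idx, "film"))
}
```

With these sample values, only uids 1 and 2 share a trigram with "film", so the in-memory comparison would run on two values rather than four.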

worker/task.go, line 1524 at r6 (raw file):

				return nil, x.Errorf("Levenshtein distance value must be an int, got %v", s)
			}
			if max > 0 && max < 8 {

if less than zero, return an error.
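
A minimal sketch of that check, assuming the distance arrives as a string argument. parseMaxDistance is a hypothetical name and the error messages are illustrative, not the ones used in worker/task.go.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseMaxDistance parses the max-distance argument and rejects values that
// are not integers or that are negative. Hypothetical helper for this example.
func parseMaxDistance(s string) (int, error) {
	n, err := strconv.ParseInt(s, 10, 32)
	if err != nil {
		return 0, fmt.Errorf("max distance must be an int, got %q", s)
	}
	if n < 0 {
		return 0, fmt.Errorf("max distance must not be negative, got %d", n)
	}
	return int(n), nil
}

func main() {
	for _, arg := range []string{"2", "0", "-1", "two"} {
		n, err := parseMaxDistance(arg)
		fmt.Printf("arg=%q value=%d err=%v\n", arg, n, err)
	}
}
```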

srfrog

Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @manishrjain and @martinmr)

vendor/vendor.json, line 655 at r6 (raw file):