Refactor bleve tokenizer usage #2738

srfrog · 2018-11-08T04:57:30Z

This PR refactors bleve term and fulltext tokenizers used in predicate indexes.

Adds support for base language stop tokens and stemmer filters of all bleve language tokenizers.
Deprecates the language index error for languages that don't have fulltext tokenizer.
Removes snowball dependency
Deprecates contrib script that generates the stopword dictionaries
Update bleve to the current stable version, while reducing its footprint in alpha.
Moderate performance gains for fulltext tokenizing.

Closes #2622
Closes #2601

Replaces #2602

This change is

removed language-specific fulltext indexes and using single one "fulltext". added proxy for language stop tokens that supports all bleve languages. added proxy for language stemmers that supports all bleve languages. saving state, still need to update index code and update vendor.

added some debugging logs.

updates tokenizers to use language tags. reduced bleve code use significantly. improved stop tokens and stemmer proxies. replaced lang aliases with language bases for better matching.

posting/index.go

fixed bug with term tokenizer.

added tokenizer language context. other minor cleanups and tweaks.

danielmai

Reviewed 1 of 1 files at r3.
Reviewable status: 1 of 165 files reviewed, 2 unresolved discussions (waiting on @golangcibot, @srfrog, and @manishrjain)

wiki/content/query-language/index.md, line 371 at r3 (raw file):

Dgraph uses [bleve](https://github.com/blevesearch/bleve) for its full text search indexing. See also the bleve language specific [stop word lists](https://github.com/blevesearch/bleve/tree/master/analysis/lang).

Following table contains all supported languages, corresponding country-codes, stop words and stemming filtering support.

Mention "stemming" before "stop words" since that's the order they're presented in the table below.

manishrjain

Reviewable status: 1 of 165 files reviewed, 12 unresolved discussions (waiting on @golangcibot, @srfrog, and @manishrjain)

posting/index.go, line 65 at r3 (raw file):

				"Consider switching to hash for better performance.\n", attr)
		}
		toks, err := tok.BuildTokens(sv.Value, it.SetLang(lang))

getLangTokenizer(t, lang) a helper func.

tok/bleve.go, line 91 at r3 (raw file):

}

func getFullTextTokens(s, lang string) ([]string, error) {

Make this part of the fulltext tokenizer struct. So, we can keep the objects we're receiving from bleveCache in that struct, instead of creating new stopx and stemmerx structs.

tok/langbase.go, line 55 at r3 (raw file):

		return lang
	}
	tag, _ := language.Parse(lang)

err

x.Check(err)

tok/langbase.go, line 57 at r3 (raw file):

	tag, _ := language.Parse(lang)
	if tag != language.Und { // something like "x-klingon" should retag to "en"
		if base, conf := tag.Base(); conf > language.No {

Add some comments explaining what's going on.

tok/tok.go, line 84 at r2 (raw file):

	registerTokenizer(MonthTokenizer{})
	registerTokenizer(DayTokenizer{})
	registerTokenizer(TermTokenizer{})

Where did TermTokenizer go?

tok/tok.go, line 119 at r2 (raw file):

func registerTokenizer(t Tokenizer) {
	if _, found := tokenizers[t.Name()]; found {
		glog.V(3).Infof("Duplicate tokenizer: %s", t.Name())

This should be an assert failure.

tok/tok.go, line 123 at r2 (raw file):

	}
	if _, ok := types.TypeForName(t.Type()); !ok {
		glog.V(3).Infof("Invalid type %q for tokenizer %s", t.Type(), t.Name())

Assert failure again.

tok/tok.go, line 136 at r2 (raw file):

	return types.IndexGeoTokens(v.(geom.T))
}
func (t GeoTokenizer) SetLang(string) Tokenizer { return t }

This func seems very specific to FTS. Doesn't seem to make sense to have it in every tokenizer.

worker/stringfilter.go, line 110 at r3 (raw file):

	tokens, err := tok.BuildTokens(value.Value, tokenizer.SetLang(filter.lang))
	if err != nil {
		glog.V(3).Infof("error while building tokens: %s", err)

glog.Errorf(...)

worker/task.go, line 1271 at r2 (raw file):

			return nil, x.Errorf("Could not find tokenizer with name %q", tokerName)
		}
		fc.tokens, err = tok.BuildTokens(valToTok.Value, tokenizer.SetLang(langForFunc(q.Langs)))

getIfLang(...)

…k pkg. added caching to langbase and removed the top 20 lang list. removed SetLang func in Tokenizer interface and created GetLangTokernizer helper func. updated docs and tests.

…up steps. TermTokernizer and FullTextTokenizer have all the Tokens code within. added small lookup table to langBase to cache known tags. removed Bleve filter pkgs stemmerx and stopx, using only funcs. changed schema.TokenizerNames to use existing code in schema.Tokenizer. fixed and added new tests and comments.

srfrog

Reviewable status: 0 of 167 files reviewed, 12 unresolved discussions (waiting on @danielmai, @golangcibot, and @manishrjain)

posting/index.go, line 40 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

maxBatchSize is unused

Done.

posting/index.go, line 65 at r3 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

getLangTokenizer(t, lang) a helper func.

Done.

tok/bleve.go, line 91 at r3 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Make this part of the fulltext tokenizer struct. So, we can keep the objects we're receiving from bleveCache in that struct, instead of creating new stopx and stemmerx structs.

Done.

tok/langbase.go, line 55 at r3 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

err

x.Check(err)

I checked and we can't use that because a syntax error is possible. And I think language syntax errors shouldnt terminate Dgraph, but instead use the default lang "en".

tok/langbase.go, line 57 at r3 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Add some comments explaining what's going on.

Done.

tok/tok.go, line 84 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Where did TermTokenizer go?

I've put them back in tok.go, and do the bleve setup separately in bleve.go.

tok/tok.go, line 119 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

This should be an assert failure.

Done.

tok/tok.go, line 123 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Assert failure again.

Done.

tok/tok.go, line 136 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

This func seems very specific to FTS. Doesn't seem to make sense to have it in every tokenizer.

Done.

wiki/content/query-language/index.md, line 371 at r3 (raw file):

Previously, danielmai (Daniel Mai) wrote…

Mention "stemming" before "stop words" since that's the order they're presented in the table below.

Done.

worker/stringfilter.go, line 110 at r3 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

glog.Errorf(...)

Done.

worker/task.go, line 1271 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

getIfLang(...)

Done.

…ue-2622_update_bleve_language_support

…hub.com:/dgraph-io/dgraph into srfrog/issue-2622_update_bleve_language_support

* removed old bleve deps. removed snowball. removed language-specific fulltext indexes and using single one "fulltext". added proxy for language stop tokens that supports all bleve languages. added proxy for language stemmers that supports all bleve languages. saving state, still need to update index code and update vendor. * fixed issue with stemmer overwrites. added some debugging logs. * changing index to use new fulltext tokenizer. * fixed wrong name for porter stemmer * fixed name of Danish stemmer. * fixed copy-n-paste typos. * refactored term and fulltext tokenizers. updates tokenizers to use language tags. reduced bleve code use significantly. improved stop tokens and stemmer proxies. replaced lang aliases with language bases for better matching. * fixed some tests. fixed bug with term tokenizer. * restored Tokenizer Tokens signature to not break tokernizer plugins. added tokenizer language context. other minor cleanups and tweaks. * updated docs to reflect the new language support for fulltext tokens. * simplified stopwords and stemmers proxies and folded them into the tok pkg. added caching to langbase and removed the top 20 lang list. removed SetLang func in Tokenizer interface and created GetLangTokernizer helper func. updated docs and tests. * moved term and fulltext analyzers to global vars to remove extra lookup steps. TermTokernizer and FullTextTokenizer have all the Tokens code within. added small lookup table to langBase to cache known tags. removed Bleve filter pkgs stemmerx and stopx, using only funcs. changed schema.TokenizerNames to use existing code in schema.Tokenizer. fixed and added new tests and comments. * deprecating stopwords scrapping script as it is not needed. * updating vendor deps * fixed test * Manish Review * fixed dep issue * fixed test for new logic

srfrog added 7 commits November 6, 2018 23:47

fixed issue with stemmer overwrites.

06aa124

added some debugging logs.

changing index to use new fulltext tokenizer.

e9bb48a

fixed wrong name for porter stemmer

a575b57

fixed name of Danish stemmer.

137a569

fixed copy-n-paste typos.

8ab9e73

refactored term and fulltext tokenizers.

85e4779

updates tokenizers to use language tags. reduced bleve code use significantly. improved stop tokens and stemmer proxies. replaced lang aliases with language bases for better matching.

golangcibot reviewed Nov 8, 2018

View reviewed changes

posting/index.go Outdated Show resolved Hide resolved

srfrog added breaking_change in progress labels Nov 8, 2018

fixed some tests.

fefbe8d

fixed bug with term tokenizer.

srfrog requested a review from manishrjain November 8, 2018 05:17

restored Tokenizer Tokens signature to not break tokernizer plugins.

7bb4cb2

added tokenizer language context. other minor cleanups and tweaks.

srfrog removed breaking_change in progress labels Nov 8, 2018

updated docs to reflect the new language support for fulltext tokens.

8b5d34e

danielmai suggested changes Nov 9, 2018

View reviewed changes

manishrjain suggested changes Nov 9, 2018

View reviewed changes

srfrog added 2 commits November 9, 2018 10:28

simplified stopwords and stemmers proxies and folded them into the to…

e6c8e47

…k pkg. added caching to langbase and removed the top 20 lang list. removed SetLang func in Tokenizer interface and created GetLangTokernizer helper func. updated docs and tests.

srfrog commented Nov 10, 2018

View reviewed changes

srfrog and others added 8 commits November 9, 2018 18:14

deprecating stopwords scrapping script as it is not needed.

3dee9f5

updating vendor deps

b882b0b

Merge branch 'master' of github.com:/dgraph-io/dgraph into srfrog/iss…

3a254ae

…ue-2622_update_bleve_language_support

fixed test

d29a05c

Manish Review

71a223e

fixed dep issue

af64067

Merge branch 'srfrog/issue-2622_update_bleve_language_support' of git…

247e4de

…hub.com:/dgraph-io/dgraph into srfrog/issue-2622_update_bleve_language_support

fixed test for new logic

5a20611

srfrog merged commit c43c050 into master Nov 16, 2018

srfrog deleted the srfrog/issue-2622_update_bleve_language_support branch November 16, 2018 23:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor bleve tokenizer usage #2738

Refactor bleve tokenizer usage #2738

srfrog commented Nov 8, 2018 •

edited by manishrjain

Loading

danielmai left a comment

manishrjain left a comment

srfrog left a comment

Refactor bleve tokenizer usage #2738

Refactor bleve tokenizer usage #2738

Conversation

srfrog commented Nov 8, 2018 • edited by manishrjain Loading

danielmai left a comment

Choose a reason for hiding this comment

manishrjain left a comment

Choose a reason for hiding this comment

srfrog left a comment

Choose a reason for hiding this comment

srfrog commented Nov 8, 2018 •

edited by manishrjain

Loading