Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor bleve tokenizer usage #2738

Merged
merged 20 commits into from
Nov 16, 2018

Conversation

srfrog
Copy link
Contributor

@srfrog srfrog commented Nov 8, 2018

This PR refactors bleve term and fulltext tokenizers used in predicate indexes.

  • Adds support for base language stop tokens and stemmer filters of all bleve language tokenizers.
  • Deprecates the language index error for languages that don't have fulltext tokenizer.
  • Removes snowball dependency
  • Deprecates contrib script that generates the stopword dictionaries
  • Update bleve to the current stable version, while reducing its footprint in alpha.
  • Moderate performance gains for fulltext tokenizing.

Closes #2622
Closes #2601

Replaces #2602


This change is Reviewable

srfrog added 7 commits November 6, 2018 23:47
removed language-specific fulltext indexes and using single one "fulltext".
added proxy for language stop tokens that supports all bleve languages.
added proxy for language stemmers that supports all bleve languages.
saving state, still need to update index code and update vendor.
added some debugging logs.
updates tokenizers to use language tags.
reduced bleve code use significantly.
improved stop tokens and stemmer proxies.
replaced lang aliases with language bases for better matching.
posting/index.go Outdated Show resolved Hide resolved
fixed bug with term tokenizer.
@srfrog srfrog requested a review from manishrjain November 8, 2018 05:17
added tokenizer language context.
other minor cleanups and tweaks.
Copy link
Contributor

@danielmai danielmai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r3.
Reviewable status: 1 of 165 files reviewed, 2 unresolved discussions (waiting on @golangcibot, @srfrog, and @manishrjain)


wiki/content/query-language/index.md, line 371 at r3 (raw file):

Dgraph uses [bleve](https://github.com/blevesearch/bleve) for its full text search indexing. See also the bleve language specific [stop word lists](https://github.com/blevesearch/bleve/tree/master/analysis/lang).

Following table contains all supported languages, corresponding country-codes, stop words and stemming filtering support.

Mention "stemming" before "stop words" since that's the order they're presented in the table below.

Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 165 files reviewed, 12 unresolved discussions (waiting on @golangcibot, @srfrog, and @manishrjain)


posting/index.go, line 65 at r3 (raw file):

				"Consider switching to hash for better performance.\n", attr)
		}
		toks, err := tok.BuildTokens(sv.Value, it.SetLang(lang))

getLangTokenizer(t, lang) a helper func.


tok/bleve.go, line 91 at r3 (raw file):

}

func getFullTextTokens(s, lang string) ([]string, error) {

Make this part of the fulltext tokenizer struct. So, we can keep the objects we're receiving from bleveCache in that struct, instead of creating new stopx and stemmerx structs.


tok/langbase.go, line 55 at r3 (raw file):

		return lang
	}
	tag, _ := language.Parse(lang)

err

x.Check(err)


tok/langbase.go, line 57 at r3 (raw file):

	tag, _ := language.Parse(lang)
	if tag != language.Und { // something like "x-klingon" should retag to "en"
		if base, conf := tag.Base(); conf > language.No {

Add some comments explaining what's going on.


tok/tok.go, line 84 at r2 (raw file):

	registerTokenizer(MonthTokenizer{})
	registerTokenizer(DayTokenizer{})
	registerTokenizer(TermTokenizer{})

Where did TermTokenizer go?


tok/tok.go, line 119 at r2 (raw file):

func registerTokenizer(t Tokenizer) {
	if _, found := tokenizers[t.Name()]; found {
		glog.V(3).Infof("Duplicate tokenizer: %s", t.Name())

This should be an assert failure.


tok/tok.go, line 123 at r2 (raw file):

	}
	if _, ok := types.TypeForName(t.Type()); !ok {
		glog.V(3).Infof("Invalid type %q for tokenizer %s", t.Type(), t.Name())

Assert failure again.


tok/tok.go, line 136 at r2 (raw file):

	return types.IndexGeoTokens(v.(geom.T))
}
func (t GeoTokenizer) SetLang(string) Tokenizer { return t }

This func seems very specific to FTS. Doesn't seem to make sense to have it in every tokenizer.


worker/stringfilter.go, line 110 at r3 (raw file):

	tokens, err := tok.BuildTokens(value.Value, tokenizer.SetLang(filter.lang))
	if err != nil {
		glog.V(3).Infof("error while building tokens: %s", err)

glog.Errorf(...)


worker/task.go, line 1271 at r2 (raw file):

			return nil, x.Errorf("Could not find tokenizer with name %q", tokerName)
		}
		fc.tokens, err = tok.BuildTokens(valToTok.Value, tokenizer.SetLang(langForFunc(q.Langs)))

getIfLang(...)

srfrog added 2 commits November 9, 2018 10:28
…k pkg.

added caching to langbase and removed the top 20 lang list.
removed SetLang func in Tokenizer interface and created GetLangTokernizer helper func.
updated docs and tests.
…up steps.

TermTokernizer and FullTextTokenizer have all the Tokens code within.
added small lookup table to langBase to cache known tags.
removed Bleve filter pkgs stemmerx and stopx, using only funcs.
changed schema.TokenizerNames to use existing code in schema.Tokenizer.
fixed and added new tests and comments.
Copy link
Contributor Author

@srfrog srfrog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 167 files reviewed, 12 unresolved discussions (waiting on @danielmai, @golangcibot, and @manishrjain)


posting/index.go, line 40 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

maxBatchSize is unused

Done.


posting/index.go, line 65 at r3 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

getLangTokenizer(t, lang) a helper func.

Done.


tok/bleve.go, line 91 at r3 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Make this part of the fulltext tokenizer struct. So, we can keep the objects we're receiving from bleveCache in that struct, instead of creating new stopx and stemmerx structs.

Done.


tok/langbase.go, line 55 at r3 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

err

x.Check(err)

I checked and we can't use that because a syntax error is possible. And I think language syntax errors shouldnt terminate Dgraph, but instead use the default lang "en".


tok/langbase.go, line 57 at r3 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Add some comments explaining what's going on.

Done.


tok/tok.go, line 84 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Where did TermTokenizer go?

I've put them back in tok.go, and do the bleve setup separately in bleve.go.


tok/tok.go, line 119 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

This should be an assert failure.

Done.


tok/tok.go, line 123 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Assert failure again.

Done.


tok/tok.go, line 136 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

This func seems very specific to FTS. Doesn't seem to make sense to have it in every tokenizer.

Done.


wiki/content/query-language/index.md, line 371 at r3 (raw file):

Previously, danielmai (Daniel Mai) wrote…

Mention "stemming" before "stop words" since that's the order they're presented in the table below.

Done.


worker/stringfilter.go, line 110 at r3 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

glog.Errorf(...)

Done.


worker/task.go, line 1271 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

getIfLang(...)

Done.

@srfrog srfrog merged commit c43c050 into master Nov 16, 2018
@srfrog srfrog deleted the srfrog/issue-2622_update_bleve_language_support branch November 16, 2018 23:20
dna2github pushed a commit to dna2fork/dgraph that referenced this pull request Jul 19, 2019
* removed old bleve deps. removed snowball.
removed language-specific fulltext indexes and using single one "fulltext".
added proxy for language stop tokens that supports all bleve languages.
added proxy for language stemmers that supports all bleve languages.
saving state, still need to update index code and update vendor.

* fixed issue with stemmer overwrites.
added some debugging logs.

* changing index to use new fulltext tokenizer.

* fixed wrong name for porter stemmer

* fixed name of Danish stemmer.

* fixed copy-n-paste typos.

* refactored term and fulltext tokenizers.
updates tokenizers to use language tags.
reduced bleve code use significantly.
improved stop tokens and stemmer proxies.
replaced lang aliases with language bases for better matching.

* fixed some tests.
fixed bug with term tokenizer.

* restored Tokenizer Tokens signature to not break tokernizer plugins.
added tokenizer language context.
other minor cleanups and tweaks.

* updated docs to reflect the new language support for fulltext tokens.

* simplified stopwords and stemmers proxies and folded them into the tok pkg.
added caching to langbase and removed the top 20 lang list.
removed SetLang func in Tokenizer interface and created GetLangTokernizer helper func.
updated docs and tests.

* moved term and fulltext analyzers to global vars to remove extra lookup steps.
TermTokernizer and FullTextTokenizer have all the Tokens code within.
added small lookup table to langBase to cache known tags.
removed Bleve filter pkgs stemmerx and stopx, using only funcs.
changed schema.TokenizerNames to use existing code in schema.Tokenizer.
fixed and added new tests and comments.

* deprecating stopwords scrapping script as it is not needed.

* updating vendor deps

* fixed test

* Manish Review

* fixed dep issue

* fixed test for new logic
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants