-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor bleve tokenizer usage #2738
Refactor bleve tokenizer usage #2738
Conversation
removed language-specific fulltext indexes and using single one "fulltext". added proxy for language stop tokens that supports all bleve languages. added proxy for language stemmers that supports all bleve languages. saving state, still need to update index code and update vendor.
added some debugging logs.
updates tokenizers to use language tags. reduced bleve code use significantly. improved stop tokens and stemmer proxies. replaced lang aliases with language bases for better matching.
fixed bug with term tokenizer.
added tokenizer language context. other minor cleanups and tweaks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r3.
Reviewable status: 1 of 165 files reviewed, 2 unresolved discussions (waiting on @golangcibot, @srfrog, and @manishrjain)
wiki/content/query-language/index.md, line 371 at r3 (raw file):
Dgraph uses [bleve](https://github.com/blevesearch/bleve) for its full text search indexing. See also the bleve language specific [stop word lists](https://github.com/blevesearch/bleve/tree/master/analysis/lang). Following table contains all supported languages, corresponding country-codes, stop words and stemming filtering support.
Mention "stemming" before "stop words" since that's the order they're presented in the table below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 165 files reviewed, 12 unresolved discussions (waiting on @golangcibot, @srfrog, and @manishrjain)
posting/index.go, line 65 at r3 (raw file):
"Consider switching to hash for better performance.\n", attr) } toks, err := tok.BuildTokens(sv.Value, it.SetLang(lang))
getLangTokenizer(t, lang)
a helper func.
tok/bleve.go, line 91 at r3 (raw file):
} func getFullTextTokens(s, lang string) ([]string, error) {
Make this part of the fulltext tokenizer struct. So, we can keep the objects we're receiving from bleveCache in that struct, instead of creating new stopx and stemmerx structs.
tok/langbase.go, line 55 at r3 (raw file):
return lang } tag, _ := language.Parse(lang)
err
x.Check(err)
tok/langbase.go, line 57 at r3 (raw file):
tag, _ := language.Parse(lang) if tag != language.Und { // something like "x-klingon" should retag to "en" if base, conf := tag.Base(); conf > language.No {
Add some comments explaining what's going on.
tok/tok.go, line 84 at r2 (raw file):
registerTokenizer(MonthTokenizer{}) registerTokenizer(DayTokenizer{}) registerTokenizer(TermTokenizer{})
Where did TermTokenizer go?
tok/tok.go, line 119 at r2 (raw file):
func registerTokenizer(t Tokenizer) { if _, found := tokenizers[t.Name()]; found { glog.V(3).Infof("Duplicate tokenizer: %s", t.Name())
This should be an assert failure.
tok/tok.go, line 123 at r2 (raw file):
} if _, ok := types.TypeForName(t.Type()); !ok { glog.V(3).Infof("Invalid type %q for tokenizer %s", t.Type(), t.Name())
Assert failure again.
tok/tok.go, line 136 at r2 (raw file):
return types.IndexGeoTokens(v.(geom.T)) } func (t GeoTokenizer) SetLang(string) Tokenizer { return t }
This func seems very specific to FTS. Doesn't seem to make sense to have it in every tokenizer.
worker/stringfilter.go, line 110 at r3 (raw file):
tokens, err := tok.BuildTokens(value.Value, tokenizer.SetLang(filter.lang)) if err != nil { glog.V(3).Infof("error while building tokens: %s", err)
glog.Errorf(...)
worker/task.go, line 1271 at r2 (raw file):
return nil, x.Errorf("Could not find tokenizer with name %q", tokerName) } fc.tokens, err = tok.BuildTokens(valToTok.Value, tokenizer.SetLang(langForFunc(q.Langs)))
getIfLang(...)
…k pkg. added caching to langbase and removed the top 20 lang list. removed SetLang func in Tokenizer interface and created GetLangTokernizer helper func. updated docs and tests.
…up steps. TermTokernizer and FullTextTokenizer have all the Tokens code within. added small lookup table to langBase to cache known tags. removed Bleve filter pkgs stemmerx and stopx, using only funcs. changed schema.TokenizerNames to use existing code in schema.Tokenizer. fixed and added new tests and comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 167 files reviewed, 12 unresolved discussions (waiting on @danielmai, @golangcibot, and @manishrjain)
posting/index.go, line 40 at r1 (raw file):
Previously, golangcibot (Bot from GolangCI) wrote…
maxBatchSize
is unused
Done.
posting/index.go, line 65 at r3 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
getLangTokenizer(t, lang)
a helper func.
Done.
tok/bleve.go, line 91 at r3 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Make this part of the fulltext tokenizer struct. So, we can keep the objects we're receiving from bleveCache in that struct, instead of creating new stopx and stemmerx structs.
Done.
tok/langbase.go, line 55 at r3 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
err
x.Check(err)
I checked and we can't use that because a syntax error is possible. And I think language syntax errors shouldnt terminate Dgraph, but instead use the default lang "en".
tok/langbase.go, line 57 at r3 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Add some comments explaining what's going on.
Done.
tok/tok.go, line 84 at r2 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Where did TermTokenizer go?
I've put them back in tok.go
, and do the bleve setup separately in bleve.go
.
tok/tok.go, line 119 at r2 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
This should be an assert failure.
Done.
tok/tok.go, line 123 at r2 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Assert failure again.
Done.
tok/tok.go, line 136 at r2 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
This func seems very specific to FTS. Doesn't seem to make sense to have it in every tokenizer.
Done.
wiki/content/query-language/index.md, line 371 at r3 (raw file):
Previously, danielmai (Daniel Mai) wrote…
Mention "stemming" before "stop words" since that's the order they're presented in the table below.
Done.
worker/stringfilter.go, line 110 at r3 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
glog.Errorf(...)
Done.
worker/task.go, line 1271 at r2 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
getIfLang(...)
Done.
…ue-2622_update_bleve_language_support
…hub.com:/dgraph-io/dgraph into srfrog/issue-2622_update_bleve_language_support
* removed old bleve deps. removed snowball. removed language-specific fulltext indexes and using single one "fulltext". added proxy for language stop tokens that supports all bleve languages. added proxy for language stemmers that supports all bleve languages. saving state, still need to update index code and update vendor. * fixed issue with stemmer overwrites. added some debugging logs. * changing index to use new fulltext tokenizer. * fixed wrong name for porter stemmer * fixed name of Danish stemmer. * fixed copy-n-paste typos. * refactored term and fulltext tokenizers. updates tokenizers to use language tags. reduced bleve code use significantly. improved stop tokens and stemmer proxies. replaced lang aliases with language bases for better matching. * fixed some tests. fixed bug with term tokenizer. * restored Tokenizer Tokens signature to not break tokernizer plugins. added tokenizer language context. other minor cleanups and tweaks. * updated docs to reflect the new language support for fulltext tokens. * simplified stopwords and stemmers proxies and folded them into the tok pkg. added caching to langbase and removed the top 20 lang list. removed SetLang func in Tokenizer interface and created GetLangTokernizer helper func. updated docs and tests. * moved term and fulltext analyzers to global vars to remove extra lookup steps. TermTokernizer and FullTextTokenizer have all the Tokens code within. added small lookup table to langBase to cache known tags. removed Bleve filter pkgs stemmerx and stopx, using only funcs. changed schema.TokenizerNames to use existing code in schema.Tokenizer. fixed and added new tests and comments. * deprecating stopwords scrapping script as it is not needed. * updating vendor deps * fixed test * Manish Review * fixed dep issue * fixed test for new logic
This PR refactors bleve term and fulltext tokenizers used in predicate indexes.
Closes #2622
Closes #2601
Replaces #2602
This change is