External Tokenizers/Pre-tokenized text #642
This seems like a good request. Ideally we would like to handle the structured version of it too, so that people can support proper synonyms and keep track of the original offsets. Also, assuming you (or someone else) are facing this use case today and want a workaround:
|
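To make the "structured version" concrete: each token would carry its original character offsets, and a synonym could be injected at the same position as the surface token it expands (the usual position-increment-0 trick from Lucene-style analysis). A minimal illustration, with field names chosen only for this example and not taken from any existing tantivy API:

```python
# Tokens for the text "The quick brown fox"; illustrative field names only.
# The synonym "fast" reuses the position and offsets of "quick", so phrase
# queries and highlighting still line up with the original text.
tokens = [
    {"token": "the",   "start": 0,  "end": 3,  "position": 0},
    {"token": "quick", "start": 4,  "end": 9,  "position": 1},
    {"token": "fast",  "start": 4,  "end": 9,  "position": 1},  # synonym of "quick"
    {"token": "brown", "start": 10, "end": 15, "position": 2},
    {"token": "fox",   "start": 16, "end": 19, "position": 3},
]
```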
@petr-tik Would you like to mentor someone on that one? |
Would be happy to. To clarify 2 things:
|
The use case I have in mind is using a Python linguistics library for lemmatization & tokenization, e.g. spaCy (https://spacy.io/api/token / https://spacy.io/usage/linguistic-features#tokenization). Simple use of spaCy might look like:

import spacy

nlp = spacy.load("en_core_web_sm")

def tokenize(text):
    # (lemma, start character offset, token length); spaCy exposes the offset as .idx
    return [(x.lemma_, x.idx, len(x)) for x in nlp(text)]
|
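As a sketch of how that spaCy output could be packaged for the structured version discussed above (the raw text plus tokens with their original offsets and positions), assuming a plain dict payload; the field names below are illustrative, not an existing tantivy or tantivy-py API:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def pre_tokenize(text):
    """Return the raw text plus a token list carrying offsets and positions."""
    doc = nlp(text)
    tokens = [
        {
            "token": t.lemma_,       # indexed form (lemma)
            "start": t.idx,          # character offset where the token begins
            "end": t.idx + len(t),   # character offset where the token ends
            "position": i,           # token position, needed for phrase queries
        }
        for i, t in enumerate(doc)
    ]
    return {"text": text, "tokens": tokens}

payload = pre_tokenize("External tokenizers would be nice.")
```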
* Added handling of pre-tokenized text fields (#642).
* Updated changelog and examples concerning #642.
* Added tokenized_text method to Value implementation.
* Implemented From<TokenizedString> for TokenizedStream.
* Removed tokenized flag from TextOptions and code reliance on the flag.
* Changed naming to use the word "pre-tokenized" instead of "tokenized".
* Updated example code.
* Fixed comments.
* Minor code refactoring. Test improvements.
It'd be nice to have an easy way to load pre-tokenized text with (optional?) offset data.
e.g. Solr has PreAnalyzedField.
Would also be nice if it were accessible from the Python API, since it's easy to get really good tokenization in Python.
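For comparison, Solr's PreAnalyzedField takes a JSON value that bundles the stored text with the analyzed tokens and their offsets. The sketch below builds such a value in Python; the short field names are quoted from memory of the Solr format, so treat them as approximate and check the Solr reference guide:

```python
import json

# Rough shape of a PreAnalyzedField value: "v" = format version, "str" = stored
# text, "tokens" = analyzed tokens with "t" = token text, "s"/"e" = start/end
# character offsets, "i" = position increment. Field names from memory.
pre_analyzed = {
    "v": "1",
    "str": "External tokenizers are nice.",
    "tokens": [
        {"t": "external",  "s": 0,  "e": 8,  "i": 1},
        {"t": "tokenizer", "s": 9,  "e": 19, "i": 1},
        {"t": "be",        "s": 20, "e": 23, "i": 1},
        {"t": "nice",      "s": 24, "e": 28, "i": 1},
    ],
}
field_value = json.dumps(pre_analyzed)
```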
EDIT: the problem with this is queries.
Solr lets you either specify an analyzer chain or give it a token stream (determined at query time) for queries: https://github.com/apache/lucene-solr/blob/master/solr/core/src/test/org/apache/solr/schema/PreAnalyzedFieldTest.java
But that's pretty unsatisfying, since you either can't use the Solr query parsers as well (you need to implement your own query parser to determine which parts of the query to analyze) or get sub-optimal results.
Maybe this is a bad idea, and the right thing is just to register tokenizers from python?