External Tokenizers/Pre-tokenized text #642

Closed
acertain opened this issue Aug 23, 2019 · 4 comments · Fixed by #669

Comments

@acertain

acertain commented Aug 23, 2019

It'd be nice to have an easy way to load pre-tokenized text with (optional?) offset data.

e.g. Solr has PreAnalyzedField.

Would also be nice if it were accessible from the Python API, since it's easy to get really good tokenization in Python.

EDIT: the problem with this is queries.
Solr lets you either specify an analyzer chain or hand it a token stream at query time: https://github.com/apache/lucene-solr/blob/master/solr/core/src/test/org/apache/solr/schema/PreAnalyzedFieldTest.java

But that's pretty unsatisfying: you either can't use the Solr query parsers (you have to implement your own query parser to decide which parts of the query to analyze) or you get sub-optimal results.

Maybe this is a bad idea, and the right thing is just to register tokenizers from Python?

@fulmicoton
Collaborator

This seems like a good request.

Ideally we would like to handle the structured version of it too, so that people can support proper synonyms and keep track of the original offsets.

Also, assuming you (or someone else) are facing this use case today and want a workaround:

  • If you do not care about positions, tantivy handles multi-valued fields: using the raw tokenizer and passing a list of tokens ["hello", "happy", "tax", "payer"] should work (see the sketch after this list).
  • You could create your own DSL to represent your token list, and implement a custom tokenizer to parse it.
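
A minimal sketch of the first workaround, assuming the tantivy schema API from around this time (Schema::builder, TextFieldIndexing, the built-in "raw" tokenizer); the field name and writer heap size are illustrative:

use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};
use tantivy::{Document, Index};

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    // "raw" emits each field value as a single token, so no further tokenization happens
    let indexing = TextFieldIndexing::default()
        .set_tokenizer("raw")
        .set_index_option(IndexRecordOption::Basic);
    let tokens_field = schema_builder
        .add_text_field("tokens", TextOptions::default().set_indexing_options(indexing));
    let index = Index::create_in_ram(schema_builder.build());
    let mut writer = index.writer(50_000_000)?;

    // one add_text call per pre-computed token makes the field multi-valued
    let mut doc = Document::default();
    for token in &["hello", "happy", "tax", "payer"] {
        doc.add_text(tokens_field, *token);
    }
    writer.add_document(doc);
    writer.commit()?;
    Ok(())
}

As the bullets note, this loses positions (and offsets), so phrase queries won't work; it only gives you exact-token matching.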

@fulmicoton
Collaborator

@petr-tik Would you like to mentor someone on that one?

@petr-tik
Contributor

petr-tik commented Sep 8, 2019

Would be happy to. To clarify two things:

  1. We need to add a Rust tantivy API, which another PR to tantivy-py will then expose to users?
  2. What is a typical use case for pre-tokenising outside the main search/indexing library? It would be great to either see some code or read a description of such a pipeline.

@acertain
Author

The use case I have in mind is using a Python linguistics library for lemmatization & tokenization, e.g. spacy (https://spacy.io/api/token / https://spacy.io/usage/linguistic-features#tokenization). Simple use of spacy might look like:

import spacy

nlp = spacy.load("en_core_web_sm")
def tokenize(text):
    # (lemma, character offset into the original text, token length)
    return [(x.lemma_, x.idx, len(x)) for x in nlp(text)]

kkoziara added a commit to kkoziara/tantivy that referenced this issue Oct 21, 2019
kkoziara added a commit to kkoziara/tantivy that referenced this issue Oct 23, 2019
* Added tokenized_text method to Value implementation.
* Implemented From<TokenizedString> for TokenizedStream.
kkoziara added a commit to kkoziara/tantivy that referenced this issue Oct 26, 2019
fulmicoton pushed a commit that referenced this issue Nov 7, 2019

* Added handling of pre-tokenized text fields (#642).
* Updated changelog and examples concerning #642.
* Added tokenized_text method to Value implementation.
* Implemented From<TokenizedString> for TokenizedStream.
* Removed tokenized flag from TextOptions and code reliance on the flag.
* Changed naming to use the word "pre-tokenized" instead of "tokenized".
* Updated example code.
* Fixed comments.
* Minor code refactoring. Test improvements.
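
For reference, a sketch of what indexing pre-tokenized text looks like once this was merged. It assumes the PreTokenizedString and Token types in tantivy::tokenizer and the Document::add_pre_tokenized_text method as they shipped in later tantivy releases, so check the current docs for exact signatures:

use tantivy::schema::{Schema, TEXT};
use tantivy::tokenizer::{PreTokenizedString, Token};
use tantivy::{Document, Index};

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    let body = schema_builder.add_text_field("body", TEXT);
    let index = Index::create_in_ram(schema_builder.build());
    let mut writer = index.writer(50_000_000)?;

    // tokens produced externally (e.g. by the spacy snippet above),
    // keeping their original character offsets and token positions
    let pre_tokenized = PreTokenizedString {
        text: "hello happy tax payer".to_string(),
        tokens: vec![
            Token { offset_from: 0, offset_to: 5, position: 0, text: "hello".to_string(), position_length: 1 },
            Token { offset_from: 6, offset_to: 11, position: 1, text: "happy".to_string(), position_length: 1 },
            Token { offset_from: 12, offset_to: 15, position: 2, text: "tax".to_string(), position_length: 1 },
            Token { offset_from: 16, offset_to: 21, position: 3, text: "payer".to_string(), position_length: 1 },
        ],
    };

    let mut doc = Document::default();
    doc.add_pre_tokenized_text(body, pre_tokenized);
    writer.add_document(doc)?;
    writer.commit()?;
    Ok(())
}

Unlike the raw-tokenizer workaround, this keeps positions and offsets, so phrase queries and highlighting work against the externally produced tokens.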