External Tokenizers/Pre-tokenized text #642
This seems like a good request. Ideally we would like to handle the structured version of it too, so that people can support proper synonyms and keep track of the original offsets. Also, assuming you (or someone else) are facing this use case today and want a workaround:
|
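To make the "structured version" concrete: each token would carry its original character offsets, and a synonym could be injected at the same position as the surface token it expands (the usual position-increment-0 trick from Lucene-style analysis). A minimal illustration, with field names chosen only for this example and not taken from any existing tantivy API:

```python
# Tokens for the text "The quick brown fox"; illustrative field names only.
# The synonym "fast" reuses the position and offsets of "quick", so phrase
# queries and highlighting still line up with the original text.
tokens = [
    {"token": "the",   "start": 0,  "end": 3,  "position": 0},
    {"token": "quick", "start": 4,  "end": 9,  "position": 1},
    {"token": "fast",  "start": 4,  "end": 9,  "position": 1},  # synonym of "quick"
    {"token": "brown", "start": 10, "end": 15, "position": 2},
    {"token": "fox",   "start": 16, "end": 19, "position": 3},
]
```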
@petr-tik Would you like to mentor someone on that one? |
Would be happy to. To clarify 2 things:
|
The use case I have in mind is using a Python linguistics library for lemmatization & tokenization, e.g. spaCy (https://spacy.io/api/token / https://spacy.io/usage/linguistic-features#tokenization). Simple use of spaCy might look like:

import spacy

nlp = spacy.load("en_core_web_sm")

def tokenize(text):
    # (lemma, start character offset, token length); spaCy exposes the offset as .idx
    return [(x.lemma_, x.idx, len(x)) for x in nlp(text)]
|
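As a sketch of how that spaCy output could be packaged for the structured version discussed above (the raw text plus tokens with their original offsets and positions), assuming a plain dict payload; the field names below are illustrative, not an existing tantivy or tantivy-py API:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def pre_tokenize(text):
    """Return the raw text plus a token list carrying offsets and positions."""
    doc = nlp(text)
    tokens = [
        {
            "token": t.lemma_,       # indexed form (lemma)
            "start": t.idx,          # character offset where the token begins
            "end": t.idx + len(t),   # character offset where the token ends
            "position": i,           # token position, needed for phrase queries
        }
        for i, t in enumerate(doc)
    ]
    return {"text": text, "tokens": tokens}

payload = pre_tokenize("External tokenizers would be nice.")
```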
* Added handling of pre-tokenized text fields (#642).
* Updated changelog and examples concerning #642.
* Added tokenized_text method to Value implementation.
* Implemented From<TokenizedString> for TokenizedStream.
* Removed tokenized flag from TextOptions and code reliance on the flag.
* Changed naming to use the word "pre-tokenized" instead of "tokenized".
* Updated example code.
* Fixed comments.
* Minor code refactoring. Test improvements.
It'd be nice to have an easy way to load pre-tokenized text with (optional?) offset data.
e.g. Solr has PreAnalyzedField.
Would also be nice if it were accessible from the Python API, since it's easy to get really good tokenization in Python.
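For comparison, Solr's PreAnalyzedField takes a JSON value that bundles the stored text with the analyzed tokens and their offsets. The sketch below builds such a value in Python; the short field names are quoted from memory of the Solr format, so treat them as approximate and check the Solr reference guide:

```python
import json

# Rough shape of a PreAnalyzedField value: "v" = format version, "str" = stored
# text, "tokens" = analyzed tokens with "t" = token text, "s"/"e" = start/end
# character offsets, "i" = position increment. Field names from memory.
pre_analyzed = {
    "v": "1",
    "str": "External tokenizers are nice.",
    "tokens": [
        {"t": "external",  "s": 0,  "e": 8,  "i": 1},
        {"t": "tokenizer", "s": 9,  "e": 19, "i": 1},
        {"t": "be",        "s": 20, "e": 23, "i": 1},
        {"t": "nice",      "s": 24, "e": 28, "i": 1},
    ],
}
field_value = json.dumps(pre_analyzed)
```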
EDIT: the problem with this is queries.
Solr lets you either specify an analyzer chain or give it a token stream (determined at query time) for queries: https://github.com/apache/lucene-solr/blob/master/solr/core/src/test/org/apache/solr/schema/PreAnalyzedFieldTest.java
But that's pretty unsatisfying, since you either can't use the Solr query parsers as well (you need to implement your own query parser to determine which parts of the query to analyze) or get sub-optimal results.
Maybe this is a bad idea, and the right thing is just to register tokenizers from python?