Elements widget uses two different tokenizers #157

eriktks · 2025-01-21T09:59:34Z

The Elements widget uses two different tokenizers: nltk.tokenize.RegexpTokenizer.span_tokenize in tagger.py and spacy in util.py. Because the two can split the same text differently, the tokens and the part-of-speech tags can end up misaligned. The mismatch becomes visible when strict=True is added to the zip() calls in tagger.py. The two token sets should be harmonized.
