verbs ending in -issions tokenized incorrectly #1
Comments
After reading through the TokenClassificationPipeline code a bit, it looks like the grouping strategy is based on tokens that include "I-" and "B-" prefixes: https://github.com/huggingface/transformers/blob/c85510f958e6955d88ea1bafb4f320074bfbd0c1/src/transformers/pipelines/token_classification.py#L550. I'm not familiar enough to know whether I should expect those prefixes to appear in this kind of model. However, without detecting word boundaries, adjacent nouns and adjectives will be grouped into single tokens. Also, other implementations might lag behind in supporting strategies like this, e.g. huggingface/transformers.js#633. My first hunch is that the solution is to modify the model's tokenizer to add the extra label prefixes so token groups merge correctly, but I'm not sure whether there's actually an earlier issue, such as lemmatization incorrectly splitting the verb into the root 'fini' and its suffix.
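Roughly the behaviour I'm describing, as an untested sketch (the example sentence and the exact outputs are assumptions on my part):

from transformers import pipeline

# Untested sketch of the trade-off described above. The model emits plain
# POS labels ("VERB", "NOUN", ...) rather than BIO-prefixed ones
# ("B-VERB"/"I-VERB"), which is what the pipeline's grouping logic keys on.
nlp = pipeline("token-classification", model="qanastek/pos-french-camembert")

# Without an aggregation strategy the subword pieces stay separate,
# e.g. "fini" VERB and "ssions" VERB for "finissions".
print(nlp("Nous finissions le travail."))

# With aggregation enabled, adjacent words that share a label can be
# merged into a single group, which is also wrong for POS tagging.
nlp_grouped = pipeline(
    "token-classification",
    model="qanastek/pos-french-camembert",
    aggregation_strategy="simple",
)
print(nlp_grouped("Nous finissions le travail."))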
I just found this article, https://medium.com/thecyphy/training-custom-ner-model-using-flair-df1f9ea9c762, which clarifies the use of the prefixes in NER tagging. It makes sense now why the flair model, which uses a sequence tagger, can handle this case.
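To spell out why the prefixes matter, here is a toy illustration (my own sketch, not the pipeline's actual code): the B-/I- scheme marks where a span starts and continues, so a decoder can stitch sub-token predictions back into one word.

# Toy sketch: how B-/I- prefixes let a decoder merge sub-token
# predictions back into word-level spans.
def merge_bio(tokens, tags):
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A "B-" tag opens a new span.
            if current:
                spans.append(current)
            current = (tok, tag[2:])
        elif tag.startswith("I-") and current and tag[2:] == current[1]:
            # An "I-" tag with the same label extends the current span.
            current = (current[0] + tok, current[1])
        else:
            # Any other tag closes the current span.
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans

# With BIO tags the two pieces of "finissions" merge into one VERB span:
print(merge_bio(["fini", "ssions"], ["B-VERB", "I-VERB"]))
# [('finissions', 'VERB')]
# With plain tags ("VERB", "VERB") there is no boundary signal to exploit.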
When using a model like qanastek/pos-french-camembert, a verb such as finissions results in multiple tokens with VERB entities, like ["fini" VERB, "ssions" VERB]. This does not happen with the flair-based model, but unfortunately I can't figure out how to export that one to ONNX, so I'm unable to integrate it currently.