
verbs ending in -issions tokenized incorrectly #1

Open
joprice opened this issue Jul 23, 2024 · 3 comments

Comments


joprice commented Jul 23, 2024

When using a model like qanastek/pos-french-camembert, a verb such as finissions produces multiple tokens with VERB entities, e.g. ["fini" VERB, "ssions" VERB]. This does not happen with the flair-based model, but unfortunately I can't figure out how to export that one to ONNX, so I'm unable to integrate it at the moment.
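
For context, this is roughly how I'm reproducing it (a minimal sketch; the example sentence and printed output are my own illustration, assuming the checkpoint loads with the standard transformers pipeline):

```python
from transformers import pipeline

# Default aggregation ("none"): each subword piece is returned as its own entity.
pos = pipeline("token-classification", model="qanastek/pos-french-camembert")

for item in pos("Nous finissions le travail."):
    # Prints something like:
    #   VERB fini
    #   VERB ssions
    # i.e. the verb "finissions" is split across two VERB entries.
    print(item["entity"], item["word"])
```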


joprice commented Jul 23, 2024

After reading through the TokenClassificationPipeline code a bit, it seems aggregation_strategy="simple" solves this, producing a single item in the result set for the verb, with an entity_group field instead of entity.
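
A sketch of what that looks like (the sentence and output are again illustrative assumptions):

```python
from transformers import pipeline

pos = pipeline(
    "token-classification",
    model="qanastek/pos-french-camembert",
    aggregation_strategy="simple",  # merge adjacent subword pieces into one span
)

for item in pos("Nous finissions le travail."):
    # With aggregation enabled the key is "entity_group" instead of "entity",
    # and "finissions" should come back as a single VERB span.
    print(item["entity_group"], item["word"])
```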


joprice commented Jul 24, 2024

It looks like this strategy relies on labels that include "I-" and "B-" prefixes https://github.com/huggingface/transformers/blob/c85510f958e6955d88ea1bafb4f320074bfbd0c1/src/transformers/pipelines/token_classification.py#L550. I'm not familiar enough with this kind of model to know whether I should expect them in its output. However, without those prefixes to mark token boundaries, adjacent nouns or adjectives that share a label would be grouped into a single span. Also, other implementations might lag behind in supporting strategies like this, e.g. huggingface/transformers.js#633.
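
To illustrate why the prefixes matter, here is a rough, self-contained sketch of BIO-style grouping (not the pipeline's actual code; the tokens and tags are made-up examples):

```python
def group_bio(tagged):
    """Merge (subword, tag) pairs into (word, label) spans using B-/I- prefixes."""
    spans = []
    for piece, tag in tagged:
        label = tag.split("-", 1)[-1]  # "B-VERB" -> "VERB"
        if tag.startswith("I-") and spans and spans[-1][1] == label:
            spans[-1][0] += piece          # "I-" continues the current span: merge
        else:
            spans.append([piece, label])   # anything else starts a new span
    return [tuple(s) for s in spans]

# Subword pieces of one verb merge because the continuation is tagged "I-":
print(group_bio([("fini", "B-VERB"), ("ssions", "I-VERB")]))
# -> [('finissions', 'VERB')]

# Adjacent words with the same label stay separate because each starts with "B-";
# without the prefixes these two cases would be indistinguishable:
print(group_bio([("Jean", "B-NOUN"), ("Dupont", "B-NOUN")]))
# -> [('Jean', 'NOUN'), ('Dupont', 'NOUN')]
```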

My first hunch is that the solution is to modify the model's tokenizer to add the extra token prefixes so that token groups are merged correctly, but I'm not sure whether there's actually an earlier issue, something like lemmatization, where the verb is being incorrectly split into the root 'fini' and its suffix.


joprice commented Jul 24, 2024

I just found this article https://medium.com/thecyphy/training-custom-ner-model-using-flair-df1f9ea9c762, which clarifies the use of these prefixes in NER tagging; it now makes sense that the flair model, which uses a sequence tagger, handles this case.
