
verbs ending in -issions tokenized incorrectly #1

Open
joprice opened this issue Jul 23, 2024 · 3 comments

Comments


joprice commented Jul 23, 2024

When using a model like qanastek/pos-french-camembert, a verb such as finissions produces multiple tokens with VERB entities, e.g. ["fini" VERB, "ssions" VERB]. This does not happen with the flair-based model, but unfortunately I can't figure out how to export that one to ONNX, so I'm unable to integrate it at the moment.
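
For context, this is roughly how I'm reproducing it (a minimal sketch; the example sentence and printed output are my own illustration, assuming the checkpoint loads with the standard transformers pipeline):

```python
from transformers import pipeline

# Default aggregation ("none"): each subword piece is returned as its own entity.
pos = pipeline("token-classification", model="qanastek/pos-french-camembert")

for item in pos("Nous finissions le travail."):
    # Prints something like:
    #   VERB fini
    #   VERB ssions
    # i.e. the verb "finissions" is split across two VERB entries.
    print(item["entity"], item["word"])
```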


joprice commented Jul 23, 2024

After reading through the TokenClassificationPipeline code a bit, it seems aggregation_strategy="simple" solves this, producing a single item in the result set for the verb, with an entity_group field instead of entity.
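
A sketch of what that looks like (the sentence and output are again illustrative assumptions):

```python
from transformers import pipeline

pos = pipeline(
    "token-classification",
    model="qanastek/pos-french-camembert",
    aggregation_strategy="simple",  # merge adjacent subword pieces into one span
)

for item in pos("Nous finissions le travail."):
    # With aggregation enabled the key is "entity_group" instead of "entity",
    # and "finissions" should come back as a single VERB span.
    print(item["entity_group"], item["word"])
```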


joprice commented Jul 24, 2024

It looks like this strategy relies on labels that include "I-" and "B-" prefixes https://github.com/huggingface/transformers/blob/c85510f958e6955d88ea1bafb4f320074bfbd0c1/src/transformers/pipelines/token_classification.py#L550. I'm not familiar enough with this kind of model to know whether I should expect them in its output. However, without those prefixes to mark token boundaries, adjacent nouns or adjectives that share a label would be grouped into a single span. Also, other implementations might lag behind in supporting strategies like this, e.g. huggingface/transformers.js#633.
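
To illustrate why the prefixes matter, here is a rough, self-contained sketch of BIO-style grouping (not the pipeline's actual code; the tokens and tags are made-up examples):

```python
def group_bio(tagged):
    """Merge (subword, tag) pairs into (word, label) spans using B-/I- prefixes."""
    spans = []
    for piece, tag in tagged:
        label = tag.split("-", 1)[-1]  # "B-VERB" -> "VERB"
        if tag.startswith("I-") and spans and spans[-1][1] == label:
            spans[-1][0] += piece          # "I-" continues the current span: merge
        else:
            spans.append([piece, label])   # anything else starts a new span
    return [tuple(s) for s in spans]

# Subword pieces of one verb merge because the continuation is tagged "I-":
print(group_bio([("fini", "B-VERB"), ("ssions", "I-VERB")]))
# -> [('finissions', 'VERB')]

# Adjacent words with the same label stay separate because each starts with "B-";
# without the prefixes these two cases would be indistinguishable:
print(group_bio([("Jean", "B-NOUN"), ("Dupont", "B-NOUN")]))
# -> [('Jean', 'NOUN'), ('Dupont', 'NOUN')]
```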

My first hunch is that the solution is to modify the model's tokenizer to add the extra token prefixes so that token groups are merged correctly, but I'm not sure whether there's actually an earlier issue, something like lemmatization, where the verb is being incorrectly split into the root 'fini' and its suffix.


joprice commented Jul 24, 2024

I just found this article https://medium.com/thecyphy/training-custom-ner-model-using-flair-df1f9ea9c762, which clarifies the use of these prefixes in NER tagging; it now makes sense that the flair model, which uses a sequence tagger, handles this case.
