feat: use tokenizers to split words before doing entity detection #1219
Pull Request Template
PR Checklist
I have run `npm test` locally and all tests are passing.

PR Description
This PR fixes an issue reported in the repository (I need to check if I can find it again). The problem is that entity extraction has so far split text on plain whitespace, which does not work for languages like Japanese that do not use whitespace between words.
So this PR introduces the tokenizer to split the text into words and then runs the entity extraction on those tokens. It also adds the logic needed to map the positions back to the original text, so that the positions contained in the entity details are correct.
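The position-mapping idea can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: `tokenizeWithPositions` and the pass-in tokenizer are hypothetical names, and a naive whitespace splitter stands in for the language-specific tokenizer the PR wires up.

```javascript
// Hypothetical sketch: tokenize the text, then record each token's
// start/end offset in the ORIGINAL string, so entity positions found on
// tokens can be mapped back even when tokenization inserted boundaries
// that do not correspond to whitespace in the source text.
function tokenizeWithPositions(text, tokenize) {
  const tokens = tokenize(text);
  const positions = [];
  let cursor = 0; // search forward only, so repeated tokens map correctly
  for (const token of tokens) {
    const start = text.indexOf(token, cursor);
    const end = start + token.length;
    positions.push({ token, start, end });
    cursor = end;
  }
  return positions;
}

// Naive whitespace tokenizer stands in for a language-specific one
const result = tokenizeWithPositions('order sushi tomorrow', (s) => s.split(/\s+/));
// result[1] → { token: 'sushi', start: 6, end: 11 }
```

An entity detected on the token stream can then report `start`/`end` offsets that index into the original text, which is what the entity details expect.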