feat: use tokenizers to split words before doing entity detection #1219

Merged
merged 1 commit into from
Jan 16, 2023

Conversation

Apollon77
Contributor

Pull Request Template

PR Checklist

  • I have run npm test locally and all tests are passing.
  • I have added/updated tests for any new behavior.
  • If this is a significant change, an issue has already been created where the problem / solution was discussed: [N/A, or add link to issue here]

PR Description

This PR fixes an issue reported in the repository (I need to check whether I can find it again). The problem is that entity extraction has so far split the text on plain whitespace, which does not work for languages such as Japanese that do not separate words with whitespace.

So this PR introduces the use of the tokenizer to split the text into words and then does the entity extraction based on those words. It also contains the logic needed to map the "positions" back to the original text, so that the positions contained in the entity details are correct.
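The position mapping described above could look roughly like the following. This is only a minimal sketch and not the code from this PR: `locateTokens` and `findEntity` are hypothetical helper names, the token list is assumed to come from whatever tokenizer is configured for the language, and the inclusive `end` index is an assumption of the sketch.

```javascript
// Hypothetical helpers for illustration; not the implementation merged in this PR.

// Find each token in the original text, scanning left to right, so that
// every token carries its character offsets in the ORIGINAL string.
function locateTokens(text, tokens) {
  const located = [];
  let cursor = 0;
  for (const token of tokens) {
    const start = text.indexOf(token, cursor);
    if (start === -1) {
      // The tokenizer changed this token (e.g. normalization); skip it.
      continue;
    }
    const end = start + token.length - 1; // inclusive end (assumption)
    located.push({ token, start, end });
    cursor = end + 1;
  }
  return located;
}

// Match a sequence of entity words against the tokens and report the
// position of the match in the original text.
function findEntity(text, tokens, entityWords) {
  if (entityWords.length === 0) {
    return undefined;
  }
  const located = locateTokens(text, tokens);
  for (let i = 0; i + entityWords.length <= located.length; i += 1) {
    const matches = entityWords.every((word, j) => located[i + j].token === word);
    if (matches) {
      const first = located[i];
      const last = located[i + entityWords.length - 1];
      return {
        start: first.start,
        end: last.end,
        sourceText: text.slice(first.start, last.end + 1),
      };
    }
  }
  return undefined;
}

// Japanese text has no whitespace; the tokens here are supplied by hand
// to stand in for a real tokenizer's output.
const japanese = findEntity('東京に行きます', ['東京', 'に', '行き', 'ます'], ['東京']);

// Irregular whitespace: positions still refer to the original string.
const english = findEntity('I  live in New   York', ['I', 'live', 'in', 'New', 'York'], ['New', 'York']);
```

Because the offsets are always looked up in the original string, the reported `start` and `end` stay valid regardless of how the tokenizer splits the text or how much whitespace sits between words.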

@sonarcloud

sonarcloud bot commented Nov 29, 2022

SonarCloud Quality Gate failed.

  • Bugs: 1 (rating C)
  • Vulnerabilities: 0 (rating A)
  • Security Hotspots: 1 (rating E)
  • Code Smells: 5 (rating A)

No coverage information
0.0% duplication

@Apollon77
Contributor Author

@eric-lara Regarding the SonarCloud "security hotspot" and "bug": both refer to a regex that is also used in other places. Yes, it should be checked, but I'm not really a regex expert. Do you maybe have a proposal for how to handle it?

BTW: Ready for review

@Apollon77
Contributor Author

@ericzon All PRs ready for review

@ericzon ericzon merged commit 0fb309a into axa-group:master Jan 16, 2023

2 participants