feat: use tokenizers to split words before doing entity detection #1219
Pull Request Template
PR Checklist
I have run `npm test` locally and all tests are passing.

PR Description
This PR fixes an issue reported in the repository (I need to check if I can find it again). The problem is that entity extraction has so far split text on plain whitespace, which does not work for languages like Japanese that do not use whitespace between words.
So this PR introduces the tokenizer to split the text into words and then runs the entity extraction on those tokens. It also adds the logic needed to map the positions back to the original text, so that the positions contained in the entity details are correct.
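The position-mapping idea can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: `tokenizeWithPositions` and the pass-in tokenizer are hypothetical names, and a naive whitespace splitter stands in for the language-specific tokenizer the PR wires up.

```javascript
// Hypothetical sketch: tokenize the text, then record each token's
// start/end offset in the ORIGINAL string, so entity positions found on
// tokens can be mapped back even when tokenization inserted boundaries
// that do not correspond to whitespace in the source text.
function tokenizeWithPositions(text, tokenize) {
  const tokens = tokenize(text);
  const positions = [];
  let cursor = 0; // search forward only, so repeated tokens map correctly
  for (const token of tokens) {
    const start = text.indexOf(token, cursor);
    const end = start + token.length;
    positions.push({ token, start, end });
    cursor = end;
  }
  return positions;
}

// Naive whitespace tokenizer stands in for a language-specific one
const result = tokenizeWithPositions('order sushi tomorrow', (s) => s.split(/\s+/));
// result[1] → { token: 'sushi', start: 6, end: 11 }
```

An entity detected on the token stream can then report `start`/`end` offsets that index into the original text, which is what the entity details expect.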