(Japanese)(NamedEntities) Entities are not recognized if there are no spaces within the sentence #1177
Does that also happen with v4?
Yeah, the version I'm currently using is v4.24.0.
Ahh yes, I see ;-) I did some stuff on entities and can look into it, but I need a "minimum reproduction example". So can you provide one defined entity which should be matched multiple times?
Here's a minimum reproduction example.
I expected unspacedResponse and spacedResponse to be the same, but they weren't (due to the lack of spaces), so I was wondering if unspaced sentences are unsupported...
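The comparison looks roughly like this (a sketch reconstructed from the variable names above; the actual reproduction code isn't quoted here):

```js
// Sketch (inside an async function; `manager` is an NlpManager already
// seeded with the 山田 / 田中 / 穴子 entities, as in the issue body below):
const spacedResponse = await manager.extractEntities(
  'ja',
  '山田さん と 田中さん は 穴子寿司を食べました'
);
const unspacedResponse = await manager.extractEntities(
  'ja',
  '山田さんと田中さんは穴子寿司を食べました'
);
// spacedResponse contains 山田, 田中, and 穴子; unspacedResponse only 山田.
```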
Ok, the reason is that the code that tries to "tokenize" the input when using extractEntities does not correctly handle Japanese input without spaces. That's why the entity matching then cannot find anything. The code there uses the "lang-min" package and analyzes characters (https://github.com/axa-group/nlp.js/blob/master/packages/ner/src/extractor-enum.js#L48 in fact uses https://github.com/axa-group/nlp.js/blob/master/packages/language-min/src/language.js). The question is why this code is not using the same tokenizers as the normal NLU processing. In your example, the "tokenize" run would result in roughly the following token list.
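An illustrative guess (the exact token boundaries depend on the Japanese tokenizer implementation):

```js
// Hypothetical tokenization of the unspaced sentence by a Japanese-aware tokenizer
['山田', 'さん', 'と', '田中', 'さん', 'は', '穴子', '寿司', 'を', '食べ', 'まし', 'た']
```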
With this token list as input, the result of extractEntities is as expected: all three entities (山田, 田中, and 穴子) are found.
So one idea could be to enhance the code to use the language-specific tokenization for extractEntities too, and not the minimalistic one - if available! Here is adjusted code to achieve that manually for now ... maybe you can experiment around a bit with it ...
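A sketch of such a manual workaround (the container key 'tokenizer-ja' and the entity/option names are assumptions about how nlp.js registers its language plugins):

```js
const { NlpManager } = require('node-nlp');

async function main() {
  const manager = new NlpManager({ languages: ['ja'] });
  manager.addNamedEntityText('person', '山田', ['ja'], ['山田']);
  manager.addNamedEntityText('person', '田中', ['ja'], ['田中']);
  manager.addNamedEntityText('food', '穴子', ['ja'], ['穴子']);

  // Workaround: run the language-specific tokenizer first, then re-join the
  // tokens with spaces so the enum extractor can see word boundaries.
  const tokenizer = manager.container.get('tokenizer-ja'); // key assumed
  const tokens = tokenizer.tokenize('山田さんと田中さんは穴子寿司を食べました');
  const response = await manager.extractEntities('ja', tokens.join(' '));
  console.log(response); // should now contain 山田, 田中, and 穴子
}

main();
```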
PS: The interesting question is if the entity extraction works when you use "nlp.process" with that sentence ... EDIT: To answer my own question ... yes, the normal nlp process has the same issue ... so yes, it could be an idea, as described above, to use better tokenizers before doing enum entity matching.
Ok, I did a change locally to add the tokenizer to the enum extractor logic ... I still need to do tests and need to see if I can make a PR or if it conflicts with my other 4 unmerged PRs ... Edit: ok, tests done, but the PR is blocked by the unmerged PRs, sorry. But the workaround above should work for now.
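The idea behind that local change, roughly (a hypothetical sketch, not the actual patch; the function and parameter names are invented for illustration):

```js
// Hypothetical shape of the change inside the enum extractor: prefer a
// locale-specific tokenizer from the container over the minimal splitter.
function getTokens(container, locale, utterance) {
  const tokenizer = container.get(`tokenizer-${locale}`);
  if (tokenizer) {
    // Language-aware tokenization (handles unspaced Japanese, etc.)
    return tokenizer.tokenize(utterance);
  }
  // Fall back to minimalistic whitespace splitting, as language-min does
  return utterance.split(/\s+/).filter((token) => token.length > 0);
}
```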
Really sorry for the late reply! I've been very busy these past couple of days... It's a little odd that the tokenizer doesn't work when handling unspaced Japanese sentences, but I'm glad that the language-specific tokenization works. I've confirmed that the workaround works, so I'm gonna close this issue now.
@kieferyap please reopen, because I would like to do the PR to do that automatically in the extractEntities call, so that it also works for entity extraction during nlp processing and such.
I see! Okay, I've reopened it.
Should be addressed by #1219.
Describe the bug
Extracting entities from a Japanese utterance when there are no spaces results in only the first entity being recognized; the rest are ignored.
To Reproduce
So I feed the manager some named entities, and I run the following code:
const response = await manager.extractEntities('ja', '山田さん と 田中さん は 穴子寿司を食べました')
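For reference, the setup behind this call would look roughly like the following (a sketch; the entity and option names are assumptions, since the actual setup code isn't quoted here):

```js
const { NlpManager } = require('node-nlp');

const manager = new NlpManager({ languages: ['ja'] });
// Register the named entities from the report (entity/option names assumed)
manager.addNamedEntityText('person', '山田', ['ja'], ['山田']);
manager.addNamedEntityText('person', '田中', ['ja'], ['田中']);
manager.addNamedEntityText('food', '穴子', ['ja'], ['穴子']);
```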
The code correctly recognizes the entities 山田, 田中, and 穴子.
However, if I run the following code:
const response = await manager.extractEntities('ja', '山田さんと田中さんは穴子寿司を食べました')
(Note the lack of spaces in the utterance.) Only the first entity, 山田, is recognized.
Japanese as a whole does not use spaces, so I am wondering if it is possible to recognize the entities when there are no spaces in the utterance?
Expected behavior
I expected both cases to return the same thing: three entities recognized.
Desktop (please complete the following information):
- node-nlp version: v4.24.0
Additional context
This is pretty much my code. If there's anything I did wrong, please feel free to tell me...!