(Japanese)(NamedEntities) Entities are not recognized if there are no spaces within the sentence #1177

kieferyap · 2022-08-24T07:26:33Z

Describe the bug
Extracting entities from Japanese utterance when there are no spaces results in only the first entity being recognized. The rest are ignored.

To Reproduce
I feed the manager some named entities and run the following code:
const response = await manager.extractEntities('ja', '山田さん と 田中さん は 穴子寿司を食べました')
The code correctly recognizes the entities 山田, 田中, and 穴子:

{
  locale: 'ja',
  utterance: '山田さん と 田中さん は 穴子寿司を食べました',
  context: undefined,
  settings: { builtins: [], threshold: 0.7 },
  sourceEntities: [],
  entities: [
    {
      start: 0,
      end: 1,
      len: 2,
      levenshtein: 0,
      accuracy: 1,
      entity: '社員',
      type: 'enum',
      option: '開発課の社員',
      sourceText: '山田',
      utteranceText: '山田'
    },
    {
      start: 7,
      end: 8,
      len: 2,
      levenshtein: 0,
      accuracy: 1,
      entity: '社員',
      type: 'enum',
      option: '部長と課長',
      sourceText: '田中',
      utteranceText: '田中'
    },
    {
      start: 14,
      end: 15,
      len: 2,
      levenshtein: 0,
      accuracy: 1,
      entity: '食べ物',
      type: 'enum',
      option: '寿司',
      sourceText: '穴子',
      utteranceText: '穴子'
    }
  ]
}

However, if I run the following code
const response = await manager.extractEntities('ja', '山田さんと田中さんは穴子寿司を食べました')
(Note the lack of spaces in the utterance)

Only the first entity is recognized:

{
  locale: 'ja',
  utterance: '山田さんと田中さんは穴子寿司を食べました',
  context: undefined,
  settings: { builtins: [], threshold: 0.7 },
  sourceEntities: [],
  entities: [
    {
      start: 0,
      end: 1,
      len: 2,
      levenshtein: 0,
      accuracy: 1,
      entity: '社員',
      type: 'enum',
      option: '開発課の社員',
      sourceText: '山田',
      utteranceText: '山田'
    }
  ]
}

Japanese as a whole does not use spaces, so I am wondering whether it is possible to recognize the entities when there are no spaces in the utterance.

Expected behavior
I expected both cases to return the same thing: three entities recognized.

Desktop (please complete the following information):

OS: macOS
node-nlp 4.24.0
Node version 16.13.0

Additional context
This is pretty much my code. If there's anything I did wrong, please feel free to tell me...!

const { NlpManager } = require('node-nlp')
const language = "ja"
const manager = new NlpManager({
  languages: [language],
  ner: {
    builtins: [],
    threshold: 0.7
  }
})
const namedEntities = require("./named-entities.json")

// Add the named entities
for (var key in namedEntities) {
  console.log(namedEntities[key])
  manager.addNamedEntityText(
    namedEntities[key].noun,
    namedEntities[key].type,
    [language],
    namedEntities[key].entities,
  )
}

(async () => {
  await manager.train()
  manager.save("./ner-model.nlp")
  const response = await manager.extractEntities('ja', '山田さんと田中さんは穴子寿司を食べました')
  console.log(response)
  console.log('===DONE===')
})()


Apollon77 · 2022-08-24T08:56:55Z

Does that also happen with v4?

kieferyap · 2022-08-24T08:59:13Z

Does that also happen with v4?

Yeah, the version I'm currently using right now is v4.24.0

Apollon77 · 2022-08-24T09:10:16Z

Ahh yes, I see ;-)

I did some work on entities and can look into it, but I need a "minimum reproduction example". So can you provide one defined entity which should be matched multiple times?

kieferyap · 2022-08-24T09:22:14Z

Here's a minimum reproduction example.
Hope it helps!

const { NlpManager } = require('node-nlp');
const language = "ja";
const manager = new NlpManager({
  languages: [language],
  ner: {
    builtins: [],
    threshold: 0.7
  }
});
manager.addNamedEntityText(
  'person',
  'employee',
  [language],
  ['山田', '佐藤'],
);
manager.addNamedEntityText(
  'person',
  'superior',
  [language],
  ['田中', '嶋田'],
);
manager.addNamedEntityText(
  'food',
  'sushi',
  [language],
  ['穴子', 'マグロ'],
);

(async () => {
  await manager.train();
  manager.save('./ner-model.nlp');

  const unspacedResponse = await manager.extractEntities('ja', '山田さんと田中さんは穴子寿司を食べました')
  console.log('unspacedResponse', unspacedResponse);

  const spacedResponse = await manager.extractEntities('ja', '山田さん と 田中さん は 穴子寿司を 食べました')
  console.log('spacedResponse', spacedResponse);
})();

I expected unspacedResponse and spacedResponse to be the same, but they weren't (due to the lack of spaces), so I was wondering if unspaced sentences are unsupported...
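For anyone reproducing this, a small helper along these lines makes the mismatch easy to check. It is only a sketch: `entitySignature` and `sameEntities` are hypothetical names, not part of node-nlp, and it assumes the response shape shown in the outputs above (an `entities` array whose items carry `entity`, `option`, and `sourceText`).

```javascript
// Hypothetical helper, not part of node-nlp.
// Reduce a response to the fields that identify each match. Offsets
// (start/end) are ignored on purpose, because they differ between the
// spaced and the unspaced utterance even when the same entities are found.
function entitySignature(response) {
  return response.entities
    .map((e) => `${e.entity}/${e.option}/${e.sourceText}`)
    .sort();
}

// True when both responses contain the same set of entity matches.
function sameEntities(a, b) {
  const sigA = entitySignature(a);
  const sigB = entitySignature(b);
  return sigA.length === sigB.length && sigA.every((s, i) => s === sigB[i]);
}
```

With the reproduction above, `sameEntities(unspacedResponse, spacedResponse)` would be expected to return `true` once unspaced sentences are handled.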

Apollon77 · 2022-08-25T20:42:09Z

OK, the reason is that the code that tries to "tokenize" the input when using extractEntities does not seem to handle Japanese input without spaces correctly. That's why the entity matching then cannot find anything.

The code there uses the "language-min" package and analyzes characters: https://github.com/axa-group/nlp.js/blob/master/packages/ner/src/extractor-enum.js#L48 in fact uses https://github.com/axa-group/nlp.js/blob/master/packages/language-min/src/language.js ...

The question is why the code is not using the same tokenizers as the normal nlu processing?

In your example the "tokenize" run would result in

山田 さん と 田中 さん は 穴子 寿司 を 食べ まし た

and with this as input, the result of extractEntities is as expected:

unspacedResponse-tokenized {
  locale: 'ja',
  utterance: '山田 さん と 田中 さん は 穴子 寿司 を 食べ まし た',
  context: undefined,
  settings: {},
  sourceEntities: [],
  entities: [
    {
      start: 0,
      end: 1,
      len: 2,
      levenshtein: 0,
      accuracy: 1,
      entity: 'person',
      type: 'enum',
      option: 'employee',
      sourceText: '山田',
      utteranceText: '山田'
    },
    {
      start: 8,
      end: 9,
      len: 2,
      levenshtein: 0,
      accuracy: 1,
      entity: 'person',
      type: 'enum',
      option: 'superior',
      sourceText: '田中',
      utteranceText: '田中'
    },
    {
      start: 16,
      end: 17,
      len: 2,
      levenshtein: 0,
      accuracy: 1,
      entity: 'food',
      type: 'enum',
      option: 'sushi',
      sourceText: '穴子',
      utteranceText: '穴子'
    }
  ]
}
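Note that start and end in this output refer to the space-joined string, not to the original unspaced sentence (田中 is at 8-9 here, but at 5-6 in the original). If the original offsets are needed, they could be recovered roughly like this. This is a sketch under stated assumptions: `toOriginalIndex` and `remapEntity` are hypothetical names, and it assumes the original text contains no whitespace and that the tokenizer neither drops nor rewrites characters, so the inserted spaces are the only difference.

```javascript
// Hypothetical helpers, not part of nlp.js.
// Map an index in the space-joined token string back to the original text
// by subtracting the number of spaces that precede it.
function toOriginalIndex(joined, index) {
  let spaces = 0;
  for (let i = 0; i < index; i += 1) {
    if (joined[i] === ' ') spaces += 1;
  }
  return index - spaces;
}

// Return a copy of an entity with start/end expressed in original offsets.
function remapEntity(joined, entity) {
  return {
    ...entity,
    start: toOriginalIndex(joined, entity.start),
    end: toOriginalIndex(joined, entity.end),
  };
}
```

For the utterance above, `toOriginalIndex('山田 さん と 田中 さん は 穴子 寿司 を 食べ まし た', 8)` returns 5.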

So one idea could be to enhance the code to use the language specific tokenization also for extractEntities and not the minimalistic one - if available!

Here is adjusted code to achieve that manually for now ... maybe you can experiment around a bit with it ...

const { dockStart } = require('@nlpjs/basic');

(async () => {
  const language = "ja";
  const dock = await dockStart({
    settings: {
      nlp: {
        languages: [language],
      },
    },
    use: ['Basic', 'LangJa'],
  });
  const container = dock.getContainer();

  const manager = dock.get('nlp');

  manager.addNerRuleOptionTexts(
    [language],
    'person',
    'employee',
    ['山田', '佐藤'],
  );
  manager.addNerRuleOptionTexts(
    [language],
    'person',
    'superior',
    ['田中', '嶋田'],
  );
  manager.addNerRuleOptionTexts(
    [language],
    'food',
    'sushi',
    ['穴子', 'マグロ'],
  );

  const unspacedResponse = await manager.extractEntities('ja', '山田さんと田中さんは穴子寿司を食べました')
  console.log('unspacedResponse', unspacedResponse);

  const spacedResponse = await manager.extractEntities('ja', '山田さん と 田中さん は 穴子寿司を 食べました')
  console.log('spacedResponse', spacedResponse);

  const tokenizer = dock.get('tokenize');
  const res = await tokenizer.run({locale: 'ja', text: '山田さんと田中さんは穴子寿司を食べました'});
  const unspacedResponseTokenized = await manager.extractEntities('ja', res.tokens.join(' '));
  console.log('unspacedResponse-tokenized', unspacedResponseTokenized);
})();
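The last three lines of the snippet above could be wrapped in a small helper so the tokenize-then-extract step is reusable. This is just a sketch: `extractEntitiesPreTokenized` is a hypothetical name, and it assumes `tokenizer.run` resolves to an object with a `tokens` array and `manager.extractEntities` accepts a locale and a string, as in the snippet above.

```javascript
// Hypothetical helper, not part of nlp.js.
// `manager` is what dock.get('nlp') returns and `tokenizer` is what
// dock.get('tokenize') returns in the snippet above.
async function extractEntitiesPreTokenized(manager, tokenizer, locale, text) {
  // Use the language-specific tokenizer first ...
  const { tokens } = await tokenizer.run({ locale, text });
  // ... then run entity extraction on the space-joined tokens.
  return manager.extractEntities(locale, tokens.join(' '));
}
```

Usage would then be `await extractEntitiesPreTokenized(manager, tokenizer, 'ja', '山田さんと田中さんは穴子寿司を食べました')`. Keep in mind that the returned utterance and offsets refer to the space-joined string.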

Apollon77 · 2022-08-25T20:52:51Z

PS: The interesting question is whether the entity extraction works when you use "nlp.process" with that sentence ...

EDIT: To answer my own question ... yes, the normal nlp process has the same issue ... so yes, it could be an idea, as described above, to use better tokenizers before doing enum entity matching.

Apollon77 · 2022-08-26T09:45:23Z

OK, I made a change locally to add the tokenizer to the enum extractor logic ... I still need to do tests and need to see if I can make a PR or if it conflicts with my other 4 unmerged PRs ...

Edit: OK, tests done, but the PR is blocked by unmerged PRs, sorry. The workaround above should work for now.

kieferyap · 2022-08-29T01:45:42Z

Really sorry for the late reply! I've been very busy these past couple of days...

It's a little odd that the default tokenizer doesn't work on unspaced Japanese sentences, but I'm glad that the language-specific tokenization works.

I've confirmed that the workaround works, so I'm gonna close this issue now.
Thanks again for the help!! I really appreciate it.

Apollon77 · 2022-08-29T06:55:37Z

@kieferyap please reopen, because I would like to do the PR that does this automatically in the extractEntities call, so that it also works for entity extraction during nlp processing and such.

kieferyap · 2022-08-29T07:29:27Z

I see! Okay, I've reopened it.

Apollon77 · 2022-11-30T08:04:07Z

Should be addressed by #1219

kieferyap closed this as completed Aug 29, 2022

kieferyap reopened this Aug 29, 2022
