(Japanese)(NamedEntities) Entities are not recognized if there are no spaces within the sentence #1177

Open
kieferyap opened this issue Aug 24, 2022 · 11 comments

Comments

@kieferyap

Describe the bug
Extracting entities from a Japanese utterance that contains no spaces results in only the first entity being recognized. The rest are ignored.

To Reproduce
So I feed the manager some named entities, and I run the following code:
const response = await manager.extractEntities('ja', '山田さん と 田中さん は 穴子寿司を食べました')
The code correctly recognizes the entities 山田、田中、and 穴子

{
  locale: 'ja',
  utterance: '山田さん と 田中さん は 穴子寿司を食べました',
  context: undefined,
  settings: { builtins: [], threshold: 0.7 },
  sourceEntities: [],
  entities: [
    {
      start: 0,
      end: 1,
      len: 2,
      levenshtein: 0,
      accuracy: 1,
      entity: '社員',
      type: 'enum',
      option: '開発課の社員',
      sourceText: '山田',
      utteranceText: '山田'
    },
    {
      start: 7,
      end: 8,
      len: 2,
      levenshtein: 0,
      accuracy: 1,
      entity: '社員',
      type: 'enum',
      option: '部長と課長',
      sourceText: '田中',
      utteranceText: '田中'
    },
    {
      start: 14,
      end: 15,
      len: 2,
      levenshtein: 0,
      accuracy: 1,
      entity: '食べ物',
      type: 'enum',
      option: '寿司',
      sourceText: '穴子',
      utteranceText: '穴子'
    }
  ]
}

However, if I run the following code
const response = await manager.extractEntities('ja', '山田さんと田中さんは穴子寿司を食べました')
(Note the lack of spaces in the utterance)

Only the first entity is recognized:

{
  locale: 'ja',
  utterance: '山田さんと田中さんは穴子寿司を食べました',
  context: undefined,
  settings: { builtins: [], threshold: 0.7 },
  sourceEntities: [],
  entities: [
    {
      start: 0,
      end: 1,
      len: 2,
      levenshtein: 0,
      accuracy: 1,
      entity: '社員',
      type: 'enum',
      option: '開発課の社員',
      sourceText: '山田',
      utteranceText: '山田'
    }
  ]
}

Japanese text generally does not use spaces, so I am wondering whether it is possible to recognize the entities when there are no spaces in the utterance.

Expected behavior
I expected both cases to return the same thing: three entities recognized.

Desktop (please complete the following information):

  • OS: macOS
  • node-nlp 4.24.0
  • Node version 16.13.0

Additional context
This is pretty much my code. If there's anything I did wrong, please feel free to tell me...!

const { NlpManager } = require('node-nlp')
const language = "ja"
const manager = new NlpManager({
  languages: [language],
  ner: {
    builtins: [],
    threshold: 0.7
  }
})
const namedEntities = require("./named-entities.json")

// Add the named entities
for (var key in namedEntities) {
  console.log(namedEntities[key])
  manager.addNamedEntityText(
    namedEntities[key].noun,
    namedEntities[key].type,
    [language],
    namedEntities[key].entities,
  )
}

(async () => {
  await manager.train()
  manager.save("./ner-model.nlp")
  const response = await manager.extractEntities('ja', '山田さんと田中さんは穴子寿司を食べました')
  console.log(response)
  console.log('===DONE===')
})()
@Apollon77
Contributor

Does that also happen with v4?

@kieferyap
Author

Does that also happen with v4?

Yeah, the version I'm currently using right now is v4.24.0

@Apollon77
Contributor

Ahh yes, I see ;-)

I did some work on entities and can look into it, but I need a "minimum reproduction example". Can you provide one with a defined entity that should be matched multiple times?

@kieferyap
Author

Here's a minimum reproduction example.
Hope it helps!

const { NlpManager } = require('node-nlp');
const language = "ja";
const manager = new NlpManager({
  languages: [language],
  ner: {
    builtins: [],
    threshold: 0.7
  }
});
manager.addNamedEntityText(
  'person',
  'employee',
  [language],
  ['山田', '佐藤'],
);
manager.addNamedEntityText(
  'person',
  'superior',
  [language],
  ['田中', '嶋田'],
);
manager.addNamedEntityText(
  'food',
  'sushi',
  [language],
  ['穴子', 'マグロ'],
);

(async () => {
  await manager.train();
  manager.save('./ner-model.nlp');

  const unspacedResponse = await manager.extractEntities('ja', '山田さんと田中さんは穴子寿司を食べました')
  console.log('unspacedResponse', unspacedResponse);

  const spacedResponse = await manager.extractEntities('ja', '山田さん と 田中さん は 穴子寿司を 食べました')
  console.log('spacedResponse', spacedResponse);
})();

I expected unspacedResponse and spacedResponse to be the same, but they weren't (due to the lack of spaces), so I was wondering if unspaced sentences are unsupported...

@Apollon77
Contributor

OK, the reason is that the code that tries to "tokenize" the input when using extractEntities does not seem to handle Japanese input without spaces correctly. That's why the entity matching then cannot find anything.

The code there uses the "lang-min" package and analyzes characters: https://github.com/axa-group/nlp.js/blob/master/packages/ner/src/extractor-enum.js#L48 in fact uses https://github.com/axa-group/nlp.js/blob/master/packages/language-min/src/language.js ...

The question is why the code is not using the same tokenizers as the normal nlu processing?
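
To illustrate the effect, here is a minimal sketch of a matcher that only looks for entities at positions where a space-delimited word begins. It is a simplified model that happens to reproduce the offsets reported in this issue, not the actual nlp.js extractor code; `wordStarts` and `naiveExtract` are hypothetical names made up for this example.

```javascript
// Hypothetical helper: indices where a "word" starts when only spaces
// are treated as word separators.
function wordStarts(text) {
  const starts = [];
  for (let i = 0; i < text.length; i += 1) {
    if (text[i] !== ' ' && (i === 0 || text[i - 1] === ' ')) {
      starts.push(i);
    }
  }
  return starts;
}

// Hypothetical matcher: an entity text is found only if it begins
// exactly at a word start. This is an assumption for illustration.
function naiveExtract(text, entityTexts) {
  const found = [];
  for (const start of wordStarts(text)) {
    for (const entity of entityTexts) {
      if (text.startsWith(entity, start)) {
        found.push({
          start,
          end: start + entity.length - 1,
          sourceText: entity,
        });
      }
    }
  }
  return found;
}

const entityTexts = ['山田', '佐藤', '田中', '嶋田', '穴子', 'マグロ'];
const spaced = naiveExtract(
  '山田さん と 田中さん は 穴子寿司を食べました',
  entityTexts
);
const unspaced = naiveExtract(
  '山田さんと田中さんは穴子寿司を食べました',
  entityTexts
);

console.log(spaced.map((e) => e.sourceText)); // [ '山田', '田中', '穴子' ]
console.log(unspaced.map((e) => e.sourceText)); // [ '山田' ]
```

Without spaces, the only word start is index 0, so only an entity at the very beginning of the sentence can match. That would explain why 山田 is still found in the unspaced utterance.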

In your example the "tokenize" run would result in

山田 さん と 田中 さん は 穴子 寿司 を 食べ まし た

as the result, and with this as input the result of extractEntities is as expected:

unspacedResponse-tokenized {
  locale: 'ja',
  utterance: '山田 さん と 田中 さん は 穴子 寿司 を 食べ まし た',
  context: undefined,
  settings: {},
  sourceEntities: [],
  entities: [
    {
      start: 0,
      end: 1,
      len: 2,
      levenshtein: 0,
      accuracy: 1,
      entity: 'person',
      type: 'enum',
      option: 'employee',
      sourceText: '山田',
      utteranceText: '山田'
    },
    {
      start: 8,
      end: 9,
      len: 2,
      levenshtein: 0,
      accuracy: 1,
      entity: 'person',
      type: 'enum',
      option: 'superior',
      sourceText: '田中',
      utteranceText: '田中'
    },
    {
      start: 16,
      end: 17,
      len: 2,
      levenshtein: 0,
      accuracy: 1,
      entity: 'food',
      type: 'enum',
      option: 'sushi',
      sourceText: '穴子',
      utteranceText: '穴子'
    }
  ]
}

So one idea could be to enhance the code so that extractEntities also uses the language-specific tokenization, if available, instead of the minimalistic one.

Here is adjusted code that achieves this manually for now. Maybe you can experiment with it a bit:

const { dockStart } = require('@nlpjs/basic');

(async () => {
  const language = "ja";
  const dock = await dockStart({
    settings: {
       nlp: {
           languages: [language],
       }
    },
    use: ['Basic', 'LangJa'],
  });
  const container = dock.getContainer();

  const manager = dock.get('nlp');

  // Register the enum entities: (locales, entity name, option name, texts)
  manager.addNerRuleOptionTexts(
    [language],
    'person',
    'employee',
    ['山田', '佐藤'],
  );
  manager.addNerRuleOptionTexts(
    [language],
    'person',
    'superior',
    ['田中', '嶋田'],
  );
  manager.addNerRuleOptionTexts(
    [language],
    'food',
    'sushi',
    ['穴子', 'マグロ'],
  );

  const unspacedResponse = await manager.extractEntities('ja', '山田さんと田中さんは穴子寿司を食べました')
  console.log('unspacedResponse', unspacedResponse);

  const spacedResponse = await manager.extractEntities('ja', '山田さん と 田中さん は 穴子寿司を 食べました')
  console.log('spacedResponse', spacedResponse);

  const tokenizer = dock.get('tokenize');
  const res = await tokenizer.run({locale: 'ja', text: '山田さんと田中さんは穴子寿司を食べました'});
  const unspacedResponseTokenized = await manager.extractEntities('ja', res.tokens.join(' '));
  console.log('unspacedResponse-tokenized', unspacedResponseTokenized);
})();
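
One side effect of this workaround is that `start` and `end` then refer to the space-joined string (for example 田中 is reported at 8 instead of 5). The sketch below wraps the workaround in a helper and maps the offsets back to the original text. `extractEntitiesTokenized` is a hypothetical name, and the remapping assumes the tokens concatenate back to exactly the original text, which holds for the example sentence but is not guaranteed for every tokenizer or input. The only library calls it relies on are the `tokenizer.run({ locale, text })` and `manager.extractEntities(locale, text)` calls used in the code above.

```javascript
// Hypothetical wrapper around the manual workaround. `tokenizer` and
// `manager` are the objects obtained via dock.get('tokenize') and
// dock.get('nlp') in the snippet above.
async function extractEntitiesTokenized(tokenizer, manager, locale, text) {
  const { tokens } = await tokenizer.run({ locale, text });
  const joined = tokens.join(' ');
  const result = await manager.extractEntities(locale, joined);

  // Assumption: remapping is only valid if the tokens are verbatim,
  // consecutive pieces of the original text. Otherwise return as-is.
  if (tokens.join('') !== text) {
    return result;
  }

  // Each inserted space before an offset shifts it by one character.
  const toOriginal = (offset) =>
    offset - (joined.slice(0, offset).split(' ').length - 1);

  return {
    ...result,
    utterance: text,
    entities: result.entities.map((entity) => ({
      ...entity,
      start: toOriginal(entity.start),
      end: toOriginal(entity.end),
    })),
  };
}
```

It would be called as `await extractEntitiesTokenized(dock.get('tokenize'), manager, 'ja', '山田さんと田中さんは穴子寿司を食べました')`.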

@Apollon77
Contributor

Apollon77 commented Aug 25, 2022

PS: The interesting question is whether the entity extraction works when you use "nlp.process" with that sentence ...

EDIT: To answer my own question: yes, the normal nlp process has the same issue. So, as described above, it could be an idea to use better tokenizers before doing enum entity matching.

@Apollon77
Contributor

Apollon77 commented Aug 26, 2022

OK, I made a change locally to add the tokenizer to the enum extractor logic. I still need to run tests and see whether I can make a PR or whether it conflicts with my other 4 unmerged PRs ...

Edit: OK, tests are done, but the PR is blocked by the unmerged PRs, sorry. The workaround above should work for now.

@kieferyap
Author

Really sorry for the late reply! I've been very busy these past couple of days...

It's a little odd that the default tokenizer doesn't handle unspaced Japanese sentences, but I'm glad that the language-specific tokenization works.

I've confirmed that the workaround works, so I'm gonna close this issue now.
Thanks again for the help!! I really appreciate it.

@Apollon77
Contributor

@kieferyap please reopen, because I would like to make the PR that does this automatically in the extractEntities call, so that it also works for entity extraction during nlp processing and the like.

@kieferyap
Author

I see! Okay, I've reopened it.

@kieferyap kieferyap reopened this Aug 29, 2022
@Apollon77
Contributor

This should be addressed by #1219
