Support empty alphabet, for simple CJK word segmentation #75

Open
unhammer opened this issue Oct 28, 2019 · 11 comments

@unhammer
Member

Before 944ed25 / #52, it was possible to use monodix files with an empty <alphabet> in order to segment input into all known analyses (presumably symbols without analyses were output as blanks). But after that change, this is no longer possible.

See 944ed25#commitcomment-35679780 for test cases for Chinese/Japanese/Korean.

Maybe the iswalnum test could be turned off by a flag, e.g. lt-proc --no-implicit-alphabet?

unhammer referenced this issue Oct 28, 2019:
Solves #45 – Consider alphanumeric characters to be part of the vocabulary.
@TinoDidriksen
Member

TinoDidriksen commented Oct 28, 2019

Surely this is as trivial as adding an alphabetic_chars.empty() check to the condition.
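
A minimal sketch of that idea (hypothetical code, not the actual lttoolbox source; alphabetic_chars stands in for the compiled alphabet set): the iswalnum fallback only applies when the dictionary actually declared an alphabet, so an empty <alphabet> keeps its old meaning.

#include <cwctype>
#include <set>

// Hypothetical helper, not the real FSTProcessor code: decide whether a
// character counts as alphabetic for tokenisation.
bool isAlphabetic(wchar_t c, const std::set<wchar_t>& alphabetic_chars)
{
  if (alphabetic_chars.count(c) != 0) {
    return true;  // declared in <alphabet>
  }
  // Only fall back to the locale's notion of "alphanumeric" when the
  // dictionary actually declared an alphabet; an empty <alphabet> opts out.
  return !alphabetic_chars.empty() && std::iswalnum(c);
}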

@unhammer
Member Author

unhammer commented Oct 28, 2019

What if someone wants only some chars to be unknown-tokenizable?

@TinoDidriksen
Member

I guess. I'd say this should be an opt-out, then. The default should be to have as much as possible in the alphabet, and people can then opt out with something like <alphabet verbatim="true">.

@unhammer
Member Author

Definitely opt-out, which is why I suggested --no-implicit-alphabet, though an attribute would be great too. However, an attribute would require a change to the binary format, wouldn't it? (Assuming the iswalnum check is in lt-proc, not lt-comp.)

@TinoDidriksen
Member

The last binary break prepared for this eventuality: https://github.com/apertium/lttoolbox/blob/master/lttoolbox/compression.h#L29 - we can add features without breaking existing files. But yeah, a cmdline flag for now would work.
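
For illustration only (made-up names, not the actual contents of compression.h): the general pattern behind that eventuality is a magic header plus a feature bitfield, so an old reader can cleanly reject files that use features it does not know about, while new features no longer force a format break.

#include <cstdint>
#include <istream>
#include <stdexcept>

// Illustrative sketch only -- not the real lttoolbox format or names.
enum Features : uint64_t {
  F_VERBATIM_ALPHABET = 1ull << 0,  // hypothetical "verbatim/empty alphabet" bit
  F_UNKNOWN           = 1ull << 1,  // any bit at or above this is unsupported
};

void check_header(std::istream& in)
{
  char magic[4];
  uint64_t features = 0;
  in.read(magic, sizeof(magic));
  in.read(reinterpret_cast<char*>(&features), sizeof(features));
  if (features >= F_UNKNOWN) {
    throw std::runtime_error("binary uses features this reader does not support");
  }
  // ...otherwise individual bits such as F_VERBATIM_ALPHABET can be tested.
}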

@ftyers
Member

ftyers commented Oct 28, 2019

Regarding #52, isn't this what the inconditional section is for?

@unhammer
Member Author

oh yeah :) @Fred-Git-Hub ↑ would this cover your use case? With

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>

   <alphabet>
   </alphabet>

   <sdefs>
      <sdef n="noun"/>
      <sdef n="verb"/>
   </sdefs>

   <section id="main" type="inconditional">
      <e><p><l>我</l><r>我<s n="noun"/></r></p></e>
      <e><p><l>爱</l><r>爱<s n="verb"/></r></p></e>
      <e><p><l>你</l><r>你<s n="noun"/></r></p></e>
   </section>

</dictionary>

I get

$ echo "我爱你" | lt-proc test.bin
^我/我<noun>$^爱/爱<verb>$^你/你<noun>$

(See http://wiki.apertium.org/wiki/Inconditional#inconditional for more info.)

@unhammer
Member Author

unhammer commented Oct 28, 2019

well, the problem is that anything without an analysis in inconditional would turn what follows into one big unknown:

$ echo "熊猫 爱你" |lt-proc test.bin   # space after the bear:
^熊猫/*熊猫$ ^爱/爱<verb>$^你/你<noun>$
$ echo "熊猫爱你" |lt-proc test.bin    # no space, big unknown:
^熊猫爱你/*熊猫爱你$

So then you'd have to make sure to put every symbol that might appear before other symbols into the inconditional section, including foreign ones like a and b.

@ftyers
Member

ftyers commented Oct 28, 2019

Aha, got it @unhammer, that makes sense. In general I think that in order to deal with this properly we need (1) weights in the lexicon, and (2) a special function of lttoolbox that does segmentation... maybe something like the compounding functionality.
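
A toy sketch of what (2) could look like (hypothetical code, not an existing lttoolbox feature; the lexicon and its weights are stand-ins for what a weighted monodix might provide): a dynamic-programming segmenter that picks the lowest-weight split of the input into known words.

#include <limits>
#include <map>
#include <string>
#include <vector>

// best[i] is the cheapest segmentation of the first i characters;
// back[i] remembers where the last word of that segmentation starts.
std::vector<std::wstring> segment(const std::wstring& text,
                                  const std::map<std::wstring, double>& lexicon)
{
  const double INF = std::numeric_limits<double>::infinity();
  const size_t n = text.size();
  std::vector<double> best(n + 1, INF);
  std::vector<size_t> back(n + 1, 0);
  best[0] = 0.0;
  for (size_t i = 1; i <= n; ++i) {
    for (size_t j = 0; j < i; ++j) {
      auto it = lexicon.find(text.substr(j, i - j));
      if (it != lexicon.end() && best[j] + it->second < best[i]) {
        best[i] = best[j] + it->second;  // lower weight = preferred reading
        back[i] = j;
      }
    }
  }
  std::vector<std::wstring> words;
  if (best[n] == INF) return words;  // no segmentation covers the whole input
  for (size_t i = n; i > 0; i = back[i]) {
    words.insert(words.begin(), text.substr(back[i], i - back[i]));
  }
  return words;
}

With weights for e.g. 熊猫, 爱 and 你, this would prefer 熊猫 + 爱 + 你 over one big unknown; but as the comments below note, unigram weights alone still ignore context.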

@unhammer
Member Author

unhammer commented Oct 28, 2019

Yeah, I do have the feeling plain LRLM should eventually hit something it can't handle, but I wonder how far you can get with what @Fred-Git-Hub had going (if the language was mostly single-character words, it should be possible without any new features).

Languages like Thai would need something more, but the current weights and compounding features don't look at context – wouldn't context be needed? Even the simple Norwegian case of ^3./3<adj><ord>/3<num>+.<sent>$ can't be solved without looking at words that are not part of the longest match of any of the analyses.

@ftyers
Member

ftyers commented Oct 28, 2019

Yeah, either you'd be stuck with a unigram model or you'd need to incorporate n-gram information somehow.
