better handling for cases of multiple readings #17

Open
thatbudakguy opened this issue Jun 8, 2022 · 1 comment
Labels
enhancement (New feature or request), question (Further information is requested)

Comments

@thatbudakguy
Member

thatbudakguy commented Jun 8, 2022

right now there are a variety of cases where Reconstruction can raise a MultipleReadingsError: when fetching an initial, a rime, or an entire reading. for at least some of these cases, I think we could do something a little smarter. an example:

  1. we see a 長 in the text without an annotation from LDM, and go to the guangyun looking for a reading.
  2. we see that 長 has three available readings: drjangH, drjang, trjangX.
  3. we divide each of these into initial, rime, and tone (see #6, "shift output representation to separate initials, finals, and tones").
  4. for the initial, we have two options: dr and tr.
  5. for the rime, we have only one option, which we can confidently annotate: jang.
  6. for the tone, we have three options: level, rising, and departing.
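The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `INITIALS` is a deliberately partial, hypothetical inventory, `split_reading` and `collect_options` are made-up names, and `_` stands for an unmarked tone, as in the notation proposed in this issue.

```python
from typing import NamedTuple

# Partial, illustrative list of initials (an assumption, not the project's table).
INITIALS = ("trh", "tr", "dr", "nr", "th", "t", "d", "n")
# X marks the rising tone and H the departing tone; unmarked readings get "_".
TONE_MARKS = ("X", "H")


class Syllable(NamedTuple):
    initial: str
    rime: str
    tone: str


def split_reading(reading: str) -> Syllable:
    """Split a reading such as 'trjangX' into initial, rime, and tone."""
    tone = "_"
    if reading.endswith(TONE_MARKS):
        tone = reading[-1]
        reading = reading[:-1]
    # Take the longest known initial that prefixes the reading ("dr" before "d").
    initial = max(
        (i for i in INITIALS if reading.startswith(i)), key=len, default=""
    )
    return Syllable(initial, reading[len(initial):], tone)


def collect_options(readings):
    """Collect the distinct options per component, in order of first appearance."""
    parts = [split_reading(r) for r in readings]
    return {
        field: list(dict.fromkeys(getattr(p, field) for p in parts))
        for field in Syllable._fields
    }


options = collect_options(["drjangH", "drjang", "trjangX"])
```

For the three readings of 長, `options["rime"]` has a single entry, while `options["initial"]` and `options["tone"]` keep every candidate.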

there's still ambiguity here, but much less than if we simply gave up and didn't assign a reading! if we can come up with a systematic way of noting the ambiguity, as B&S do for their OC reconstruction (using things like brackets), we might still salvage some information that would help an algorithm or a human manually correcting the data. for example:

[dr|tr]jang[X|H|_]

and if we annotate each part in a separate field, this might make it into the CoNLL-U as:

MCInitial=[dr/tr]|MCRime=jang|MCTone=[X/H/_]

(using the / instead of | since that character is reserved to separate annotations in CoNLL-U MISC and FEATS fields.)
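One way to produce both notations from per-component option lists is sketched below. The function names (`bracket`, `inline_form`, `misc_field`) are hypothetical, and the sketch assumes the options have already been collected for each component.

```python
def bracket(options, sep="/"):
    """Return a lone option as-is; wrap several options in brackets."""
    if len(options) == 1:
        return options[0]
    return "[" + sep.join(options) + "]"


def inline_form(initials, rimes, tones):
    """Single-string notation, using | between alternatives."""
    return "".join(bracket(o, sep="|") for o in (initials, rimes, tones))


def misc_field(initials, rimes, tones):
    """CoNLL-U style notation: | separates annotations, so alternatives use /."""
    pairs = (("MCInitial", initials), ("MCRime", rimes), ("MCTone", tones))
    return "|".join(f"{key}={bracket(value)}" for key, value in pairs)


initials, rimes, tones = ["dr", "tr"], ["jang"], ["X", "H", "_"]
print(inline_form(initials, rimes, tones))  # [dr|tr]jang[X|H|_]
print(misc_field(initials, rimes, tones))   # MCInitial=[dr/tr]|MCRime=jang|MCTone=[X/H/_]
```

Leaving unambiguous components unbracketed keeps the common case identical to the current output, so only the ambiguous parts stand out.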

this also helps in the (unfortunately many) cases where LDM did provide an annotation, but one or both of the characters in his fanqie happen to be polyphones.

@thatbudakguy added the enhancement and question labels Jun 8, 2022
@GDRom
Member

GDRom commented Jun 8, 2022

This sounds like a brilliant solution compared to our previous approach. And you are right, this makes things much clearer for a human reader, as the structure you are proposing inherently draws attention to what's unclear.

Also, to follow up on LDM: logically, this would result in one of two outcomes:

  1. the character LDM provided is still ambiguous, because the syllable segment we need from it (for example MCInitial=[dr/tr]) is itself ambiguous
  2. the character is no longer ambiguous, because LDM's annotation relies on a syllable segment that is clear (for example MCRime=jang|MCTone=X)
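The two outcomes can be illustrated with a small sketch. It relies on the standard fanqie convention that the upper speller supplies the initial and the lower speller supplies the rime and tone. `fanqie_options` is a hypothetical name, and the single-reading speller `("l", "jang", "X")` is made-up illustration data.

```python
def fanqie_options(upper, lower):
    """Combine candidate readings of the two fanqie spellers.

    upper, lower: lists of (initial, rime, tone) tuples, one per candidate
    reading. Only the upper speller's initials and the lower speller's
    rimes and tones matter for the result.
    """
    def unique(values):
        return list(dict.fromkeys(values))

    return {
        "MCInitial": unique(initial for initial, _, _ in upper),
        "MCRime": unique(rime for _, rime, _ in lower),
        "MCTone": unique(tone for _, _, tone in lower),
    }


# The three readings of 長 from the issue, already split into components.
chang = [("dr", "jang", "H"), ("dr", "jang", "_"), ("tr", "jang", "X")]
# Hypothetical speller with exactly one reading (illustration only).
single = [("l", "jang", "X")]

still_ambiguous = fanqie_options(upper=chang, lower=single)
initial_resolved = fanqie_options(upper=single, lower=chang)
```

With the polyphone as upper speller the initial stays ambiguous (`["dr", "tr"]`); with it as lower speller the initial is settled, and only the components the polyphone actually varies in (here the tone) remain open. Note that this sketch treats rime and tone independently, so it could list combinations that no single reading of the lower speller has.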
