better handling for cases of multiple readings #17

thatbudakguy · 2022-06-08T17:35:24Z

right now there are a variety of cases where Reconstruction can raise a MultipleReadingsError: when fetching an initial, a rime, or an entire reading. for at least some of these cases, I think we could do something a little smarter. an example:

1. we see a 長 in the text without an annotation from LDM, and go to the guangyun looking for a reading.
2. we see that 長 has three available readings: drjangH, drjang, trjangX.
3. we divide each of these into initial, rime, and tone (see #6, "shift output representation to separate initials, finals, and tones").
4. for the initial, we have two options: dr and tr.
5. for the rime, we have only one option, which we can confidently annotate: jang.
6. for the tone, we have three options: level, rising, and departing.
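the steps above could be sketched roughly like this. everything here is an assumption for illustration: `split_reading`, `collect_options`, and the tiny `INITIALS` table are hypothetical names, not part of the existing code, and a real implementation would need the full inventory of initials.

```python
# Hypothetical sketch: split each reading into initial, rime, and tone,
# then collect the distinct options for each part.
# INITIALS is deliberately incomplete; it only covers this example.
INITIALS = ("dr", "tr", "d", "t")  # longer initials listed first


def split_reading(reading):
    """Split a transcription such as 'drjangH' into (initial, rime, tone)."""
    # A trailing X marks rising tone and H marks departing tone;
    # level tone is unmarked, written here as '_'.
    tone = reading[-1] if reading[-1] in "XH" else "_"
    body = reading[:-1] if tone != "_" else reading
    initial = next((i for i in INITIALS if body.startswith(i)), "")
    return initial, body[len(initial):], tone


def collect_options(readings):
    """Return the distinct initials, rimes, and tones, in order of appearance."""
    initials, rimes, tones = [], [], []
    for reading in readings:
        for value, seen in zip(split_reading(reading), (initials, rimes, tones)):
            if value not in seen:
                seen.append(value)
    return {"initial": initials, "rime": rimes, "tone": tones}


options = collect_options(["drjangH", "drjang", "trjangX"])
print(options)
```

any part with exactly one option can be annotated with confidence; the others stay ambiguous.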

there's still ambiguity here, but far less than if we simply give up and assign no reading! if we can come up with a systematic way of noting the ambiguity, as B&S do for their OC reconstruction (using things like brackets), we might still salvage some information that would help an algorithm or a human manually correcting the data. for example:

[dr|tr]jang[X|H|_]

and if we annotate each part in a separate field, this might make it into the CoNLL-U as:

MCInitial=[dr/tr]|MCRime=jang|MCTone=[X/H/_]

(using / instead of | inside the brackets, since | is reserved to separate annotations in CoNLL-U MISC and FEATS fields.)
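a minimal sketch of how both notations could be produced from the same per-part options. `render` and `misc_field` are hypothetical helper names; only the `MCInitial`, `MCRime`, and `MCTone` keys come from the proposal above.

```python
# Hypothetical sketch: render ambiguous parts in brackets, either as a
# single string or as CoNLL-U style key=value annotations.

def render(values, sep):
    """Return the bare value if unambiguous, else a bracketed list of options."""
    return values[0] if len(values) == 1 else "[" + sep.join(values) + "]"


def misc_field(options):
    """Join key=value pairs with '|', using '/' between alternatives."""
    return "|".join(f"{key}={render(values, '/')}" for key, values in options.items())


options = {"MCInitial": ["dr", "tr"], "MCRime": ["jang"], "MCTone": ["X", "H", "_"]}

inline = "".join(render(values, "|") for values in options.values())
misc = misc_field(options)

print(inline)  # [dr|tr]jang[X|H|_]
print(misc)    # MCInitial=[dr/tr]|MCRime=jang|MCTone=[X/H/_]
```

keeping the separator as a parameter means the same option lists can feed both the human-readable string and the CoNLL-U field.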

this also helps in the (unfortunately many) cases where LDM did provide an annotation, but one or both of the characters in his fanqie happen to be polyphones.


GDRom · 2022-06-08T19:42:41Z

This sounds like a brilliant solution compared to our previous approach. And you are right, this makes things much clearer for a human reader, as the structure you are proposing inherently draws attention to what's unclear.

Also, to follow up on LDM: logically, this would result in one of two things:

- the character LDM provided is still ambiguous, because the syllable segment it is meant to supply is itself ambiguous, for example MCInitial=[dr/tr]
- the character is no longer ambiguous, because LDM refers to a syllable segment that is clear, for example MCRime=jang|MCTone=X
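As a rough illustration of the two outcomes, here is a sketch that assumes the usual fanqie convention: the upper character supplies the initial, and the lower character supplies the rime and tone. `fanqie_options` is a hypothetical helper, readings are assumed to be pre-split into (initial, rime, tone) tuples, and the second character's reading is a made-up placeholder rather than real data.

```python
# Hypothetical sketch: which segments stay ambiguous depends on whether the
# polyphone is the upper or the lower fanqie character.

def fanqie_options(upper_readings, lower_readings):
    """Each argument is a list of (initial, rime, tone) tuples for one character."""
    def unique(values):
        # dict.fromkeys removes duplicates while keeping first-seen order
        return list(dict.fromkeys(values))

    return {
        "MCInitial": unique(initial for initial, _, _ in upper_readings),
        "MCRime": unique(rime for _, rime, _ in lower_readings),
        "MCTone": unique(tone for _, _, tone in lower_readings),
    }


# The three readings of 長 from the example in this issue.
chang = [("dr", "jang", "H"), ("dr", "jang", "_"), ("tr", "jang", "X")]
# Placeholder for some unambiguous speller (not a real guangyun entry).
other = [("k", "ang", "_")]

as_upper = fanqie_options(chang, other)  # initial remains ambiguous
as_lower = fanqie_options(other, chang)  # rime is clear, tone remains ambiguous
print(as_upper)
print(as_lower)
```

With 長 as the upper character, only the initial is ambiguous; with 長 as the lower character, the rime is certain and only the tone is ambiguous.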

thatbudakguy added the enhancement (new feature or request) and question (further information is requested) labels · Jun 8, 2022
