Attachment of list item enumerators #518
Comments
Actually, there are 4 cases in EWT where it has a following ")" but no |
Sounds good, but would you add a few words on the appropriate UPOS tagging? In EWT we get the tags
|
also the tag on
|
Chris said he would keep |
Oh I see you asked about "(1)" as well. I would definitely advocate NUM for that. |
Should be tokenized though, no? That's the standard in PTB and EWT.
On Sat, Apr 27, 2024, 5:52 PM Nathan Schneider wrote:
Oh I see you asked about "(1)" as well. I would definitely advocate NUM for that.
|
GUM doesn't. The tokenization varies by treebank and it would be impractical to change the tokenization IMO. |
Because of the coref misc annotations or because it's part of multiple
annotation layers outside UD? I don't think it would be an impossible task
to retokenize and it would make things more consistent
On Sat, Apr 27, 2024, 6:17 PM Nathan Schneider wrote:
GUM doesn't. The tokenization varies by treebank and it would be impractical to change the tokenization IMO.
|
I think @amir-zeldes is happy with the GUM tokenization of list item markers as it is easier on annotators (who do it manually and then don't have to go through the effort of attaching punctuation in the tree). For EWT I don't want to mess with LDC tokenization as it will break compatibility with Penn trees. |
Yeah, I basically think the decision to tokenize parts of a marker like "a.)" is wrong: it's confusing to me, leads to unmatched brackets and ambiguous period tokens, and you only end up reattaching them as punct for no real gain that I can see. Are there stats on what other corpora/languages do with these? As for POS, I'd be willing to consider NUM for things that are numeric. Is there a regex you have in mind for what to include in that? I'm pretty strongly against PUNCT for non-ordinal marker tokens, like asterisks, little pointing hands and the like; I think those should be SYM (X was a legacy thing, can't remember what we were imitating there). |
You mean bullets? The guidelines say PUNCT for those as they are not pronounced, and given the disagreement about this among the core group the conclusion was to stick with the status quo. |
upos is not super important for me so I wouldn't fight for that too much, but I think "discourse" for numerical LS but "punct" for symbols is wrong/potentially confusing for parsing models, so I don't want to implement that. I don't suppose anyone wants to use PUNCT/discourse for bullets? |
Nope. There's no perfect solution that everybody likes but it's better to have a solution. |
No question, just still seems wrong to me. So are we doing discourse for this release already? |
Yes |
OK, so if upos for non-bullets is NUM a la Chris, what is the NumForm for things like (a)? I assume NumType is Ord right? |
Ord seems odd because it usually corresponds to a suffix, but in any case I'm not going to mess with whatever is in EWT. |
Looks like EWT has it as upos NUM with Card + Digit for numerical ones like "(1)", and upos NUM also for "(A)", but with no NumType or NumForm... That doesn't seem right/would ruin the current state in GUM where NUM guarantees that we have a NumType and NumForm. I'm happy to change all LS that are not bullets and have some kind of ordering meaning to NUM, but then I think they should have a NumType and NumForm - would you be OK adding that to EWT? |
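To make the "NUM guarantees a NumType and NumForm" invariant concrete, here is a minimal sketch of a check over CoNLL-U text. The function names and the sample rows are hypothetical and made up for illustration; they are not taken from EWT or GUM, and the check only looks at basic word lines.

```python
REQUIRED = ("NumForm", "NumType")  # features expected on every NUM token


def parse_feats(feats):
    """Turn a CoNLL-U FEATS string like 'A=x|B=y' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def num_tokens_missing_feats(conllu_text):
    """Return (form, missing_features) for NUM tokens lacking a required feature."""
    problems = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Keep only basic word lines: 10 columns and a plain integer ID
        # (this skips comments, multiword ranges, and empty nodes).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if cols[3] != "NUM":
            continue
        feats = parse_feats(cols[5])
        missing = [name for name in REQUIRED if name not in feats]
        if missing:
            problems.append((cols[1], missing))
    return problems


# Made-up rows: a numeric enumerator with features, a letter enumerator without.
SAMPLE = "\n".join([
    "1\t1\t1\tNUM\tLS\tNumForm=Digit|NumType=Card\t3\tdiscourse\t_\t_",
    "1\tA\tA\tNUM\tLS\t_\t3\tdiscourse\t_\t_",
])

print(num_tokens_missing_feats(SAMPLE))
```

Run over a whole treebank file, a non-empty result would list the NUM tokens that break the invariant, such as letter enumerators tagged NUM with no features.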
Looks like this is #465, and we were waiting to try to figure out a complete solution to LS. :) I'll comment there. |
It is interesting that ordinal numerals (their function is to indicate sequential order) are tagged as ADJ. Hence, I see no reason that sequence indicators should be labeled as quantifiers, but, as usual, I probably am not seeing the whole picture. Can you elaborate? |
True, but we generally try to keep a uniform UPOS even where a word has a slightly different function (at least if the form and meaning of the word itself is the same). "3 books" and "3) books" show different functions but they draw on a shared concept of 'three'. Ordinals are actually spelled differently ("third", "3rd") so there is less pressure to keep the same UPOS I suppose. |
Sequential markers like "1.", "(a)", and so forth lacked a good policy for how they should attach, but this was just clarified as discourse: UniversalDependencies/docs#1027
I will update EWT, where they are currently nummod. I tried several approaches to query these: sentence-initial nummods, nummods modifying a non-nominal, etc. The approach that worked best was to query for nummods with ".", ")", "]" immediately after the number:
This excludes NUM-headed nummods, which are area codes in telephone numbers (this should be fixed separately).
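The original query is not reproduced in this thread. As a rough stand-in, the same heuristic (a nummod immediately followed by ".", ")" or "]", skipping nummods whose head is itself NUM) can be sketched in plain Python over CoNLL-U text. The function names and the sample sentences below are hypothetical, and this is not the query tool or syntax that was actually used.

```python
CLOSERS = {".", ")", "]"}  # punctuation that may directly follow an enumerator


def parse_conllu(text):
    """Yield sentences as lists of token dicts (basic word lines only)."""
    sent = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if sent:
                yield sent
                sent = []
            continue
        cols = line.split("\t")
        # Skip comments, multiword ranges (1-2), and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        sent.append({"id": int(cols[0]), "form": cols[1], "upos": cols[3],
                     "head": int(cols[6]), "deprel": cols[7]})
    if sent:
        yield sent


def enumerator_candidates(sent):
    """Return nummod tokens directly followed by a closer, unless NUM-headed."""
    by_id = {tok["id"]: tok for tok in sent}
    hits = []
    for tok, nxt in zip(sent, sent[1:]):
        if tok["deprel"] != "nummod" or nxt["form"] not in CLOSERS:
            continue
        head = by_id.get(tok["head"])
        if head is not None and head["upos"] == "NUM":
            continue  # e.g. an area code attached to the rest of a phone number
        hits.append(tok)
    return hits


# Made-up sentences: a list enumerator, then an area code inside a phone number.
SAMPLE = "\n".join([
    "1\t1\t1\tNUM\tLS\t_\t3\tnummod\t_\t_",
    "2\t.\t.\tPUNCT\t.\t_\t1\tpunct\t_\t_",
    "3\tCall\tcall\tVERB\tVB\t_\t0\troot\t_\t_",
    "",
    "1\t(\t(\tPUNCT\t-LRB-\t_\t2\tpunct\t_\t_",
    "2\t555\t555\tNUM\tCD\t_\t4\tnummod\t_\t_",
    "3\t)\t)\tPUNCT\t-RRB-\t_\t2\tpunct\t_\t_",
    "4\t1234\t1234\tNUM\tCD\t_\t0\troot\t_\t_",
])

for sentence in parse_conllu(SAMPLE):
    for tok in enumerator_candidates(sentence):
        print(tok["id"], tok["form"], tok["deprel"])
```

The candidates it returns are only a starting list for manual review before relabeling anything as discourse.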
In GUM they are dep. Because GUM has more genres than EWT I would guess the punctuation associated with enumerators (if any) will be more varied. But the LS tag can also help identify them.