Suggestion: DET table for English #971
For words like another, there exists the feature For both I can suggest a treatment as I think that the label |
I believe it is generally true, but sometimes in some languages people treat one string as homonymous even within the |
Will this be added to the universal documentation as a possible value? If most treebanks don't use it it may be easier to stick with |
Are we any closer to a definitive answer about what to do with |
It appears that I'd rather not start using it for English unless other languages are planning to start using it. My familiarity with other languages is limited but it appears that some French treebanks use |
Makes sense. @amir-zeldes ? |
Fine by me, I can add it in GUM/GENTLE. So just |
Several others marked with ??? above are labeled differently between GUM and EWT: any, both, every, no, some, all, either, neither. But if we're happy with the current labels in GUM, then I suppose the change needed would be to add those features to EWT, not alter them in GUM |
Art should just be a/an/the. The rest should be Tot, Ind, or Rcp I think.
Plus PronType=Neg or NumType=Card as appropriate.
…On Mon, Sep 18, 2023, 3:52 PM John Bauer ***@***.***> wrote:
Several others marked with ??? above are labeled differently between GUM
and EWT: any, both, every, no, some, all, either, neither
but if we're happy with the current labels in GUM, then I suppose the
change needed would be to add those features to EWT, not alter them in GUM
|
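The suggestion above that Art be reserved for a/an/the could be checked mechanically. What follows is only a minimal sketch, not anything used by the treebank maintainers: the function names `parse_feats` and `non_article_art` are hypothetical, the sample rows are invented for illustration, and it assumes plain 10-column CoNLL-U input.

```python
# Hypothetical check: find DET tokens tagged PronType=Art whose lemma is not a/an/the.
ARTICLE_LEMMAS = {"a", "an", "the"}


def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'Definite=Def|PronType=Art' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def non_article_art(conllu_text):
    """Return (form, lemma, feats) for DET tokens with PronType=Art outside a/an/the."""
    hits = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        form, lemma, upos, feats = cols[1], cols[2], cols[3], cols[5]
        if upos != "DET":
            continue
        if parse_feats(feats).get("PronType") == "Art" and lemma.lower() not in ARTICLE_LEMMAS:
            hits.append((form, lemma, feats))
    return hits


# Invented sample rows (not taken from any treebank file).
SAMPLE = "\n".join([
    "# text = the idea",
    "1\tthe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_",
    "# text = another idea",
    "1\tanother\tanother\tDET\tDT\tPronType=Art\t2\tdet\t_\t_",
])

print(non_article_art(SAMPLE))
```

On the invented sample, only the another row is reported, since its lemma falls outside a/an/the.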
How about the following as the features for DET (xpos DT or PDT or WDT):
* Only DET as a predeterminer (N.B. I am not thrilled with the current practice of tagging "such", "quite", and "many" as DET in their predeterminer uses as it seems like unnecessary multiplication of tag ambiguity, but it follows from treating every PDT as DET: UniversalDependencies/UD_English-EWT#412)
@amir-zeldes I think this one is an error - should be JJ/ADJ |
I've written a Grew-match request which "implements" @nschneid's table.
|
Thanks @bguil. Looking at the queries made me realize there were issues with the last 2 rows of the table. |
Because no indicates a negation of all items in a set, should it be |
Actually, I've changed my mind: let's go with plain |
That would be confusing. |
I find
Why not start using |
At the end of the day, at least for English, the From my perspective the best we can do now is converge on a table for English that aligns with https://universaldependencies.org/u/feat/PronType.html. A finer-grained universal theory of these features might be worth developing in the future, but I don't see English morphology as providing much guidance beyond what is already in the universal guidelines (the main opportunity I do see would be to add a feature to group together the -ever items). |
Well, of course it is. All
Probably we will always have some kind of indeterminacy: else, at one extreme, we should have a specific tag for each form. But probably we can already do much with what we already have: for example, I was considering that every might get a With regard to another, has a multiword treatment as an+other never been considered? It seems rather straightforward, it is just a contrastive element with a mark for indefiniteness (an). Maybe this would avoid being forced to choose a single label which will always be a "short blanket". |
Historically yes, synchronically no. It would be an error to say/write "I have an other idea". Note the syllabification of /əˈnʌð.ɚ/ (not /ən ˈʌð.ɚ/). |
The orthography might mirror the phonetic coalescence of the two elements, but couldn't they still be considered "syntactically active"? Besides, how semantically different is another from an+other? Now this might be a Pindaric flight, but if I think of Italian un altro / un'altra, it is written separately though phonetically it is just one unit, and there is absolutely no way to distinguish any different use as supposedly an other vs. another. Also, please correct me if I am wrong, but is it at all possible to use an other instead of the univerbation? My point is that we are probably (once again) misled by orthographic conventions here. |
I'm not sure how to test for a notion of "syntactically active" but my sense is that "another" is extremely well established as a word of English. It is listed in dictionaries (and not as a spelling variant of "an other"). As far as I am aware there is no tradition of tokenizing it as two words. (Just like we don't split "without" or "spreadsheet" or what have you.) So splitting it in UD would cause confusion IMO. (There are probably more syntactic arguments that can be made here: e.g. stranding is possible, unlike semantically similar adjectives: I ate one muffin and now I want another/*an additional. But the bottom line is that another is normally regarded/tokenized as one word so it would require an extremely compelling reason to change that.) Similar expressions in other languages may not be as far along in grammaticalization; we cannot expect the UD tree to be exactly parallel across languages. |
Here we are getting back to the issue of "what is a word"... I am just perplexed that a string which is by all means transparent in all its components and behaviour is kept together just for (motivated) orthographic conventions (as a note: in my opinion other should be considered a determiner rather than an adjective). The same cannot be said for without, I think. It may be just me, but I see much more confusion in keeping another together, while having an as a separate element in all other cases. But I fear the discussion would grow much more over the topic of |
Splitting "another" would also lead to inconsistency with the tokenization in non-UD corpora, which I am very happy to say we have not diverged from so far. It's quite nice that tokenizers trained on the ~3M tokens in OntoNotes work quite well for UD data, since it's the same standard for what counts as a token. |
In fact, after some thought I recognise that splitting another might not be the ideal choice, principally because both split elements would be functional, and since it is ultimately irrelevant if we just consider that features like |
@amir-zeldes are you OK with the table in #971 (comment)? |
Note that "lemma=other" exists with various DETs. |
Sorry, probably, I need to find a free moment to go over it in detail - will post here again once I've had a chance |
OK, I've had a closer look now - the only thing I would change there is not treating "both" as a cardinal number. I agree it normally implies that there are two things, but it's still different from a cardinal number which gets a NumType IMO. We also don't assign these features to words like "pair" or "decade", even though they imply a count. Half seems OK though, since that's actually the English name of that number. |
OK, removed |
I'd just like to note that As for considering a |
OK, the table is now implemented in GUM as well (see the UD dev branch for results). Let me know if you notice anything off! |
As discussed here, I suggest we create a DET table with the known determiners and their features for English. That way, we can unify that across treebanks.
UniversalDependencies/UD_English-EWT#416 (comment)
As part of this process, I also suggest including features for words such as another, possibly by inventing a new feature which matches. Currently EWT has no features on another, whereas GUM has PronType=Art.
So, for example, we have
furthermore, there are a couple instances of a typo not getting the proper features (in EWT, I believe): his/this, Thi$/this
and then there's PUD which is labeling "those" with the lemma "those" as opposed to "these"
there are also other DETs with the xpos PDT, such as all, quite, half, both, such, nary, many
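One way to start on such a table is to collect, for each treebank, the FEATS values seen on every DET lemma and then list the lemmas where the treebanks disagree. The sketch below is an assumption-laden illustration rather than an existing UD tool: `det_table` and `diff_tables` are hypothetical names, the two sample strings are invented rows that only mirror the another difference described above, and the input is assumed to be plain 10-column CoNLL-U.

```python
from collections import defaultdict


def det_table(conllu_text):
    """Map (lemma, xpos) -> set of FEATS strings seen on DET tokens."""
    table = defaultdict(set)
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or cols[3] != "DET":
            continue  # keep only well-formed DET token lines
        lemma, xpos, feats = cols[2].lower(), cols[4], cols[5]
        table[(lemma, xpos)].add(feats)
    return dict(table)


def diff_tables(first, second):
    """Return entries whose FEATS sets differ between the two tables."""
    diffs = {}
    for key in sorted(set(first) | set(second)):
        a, b = first.get(key, set()), second.get(key, set())
        if a != b:
            diffs[key] = (a, b)
    return diffs


# Invented rows for illustration only; "_" means no features in CoNLL-U.
EWT_SAMPLE = "\n".join([
    "1\tthe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_",
    "1\tanother\tanother\tDET\tDT\t_\t2\tdet\t_\t_",
])
GUM_SAMPLE = "\n".join([
    "1\tthe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_",
    "1\tanother\tanother\tDET\tDT\tPronType=Art\t2\tdet\t_\t_",
])

for key, (ewt_feats, gum_feats) in diff_tables(det_table(EWT_SAMPLE), det_table(GUM_SAMPLE)).items():
    print(key, ewt_feats, gum_feats)
```

Keying on (lemma, xpos) keeps predeterminer (PDT) uses of a word such as all separate from its DT uses, which matters if the agreed table assigns them different features.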