Add `SpaceAfter=No` feature. #4

foxik · 2015-10-16T14:53:33Z

I am currently working on a tokenizer and I would like to be able to reconstruct the original (untokenized) text of UD_English. The CoNLL-U format allows this using the SpaceAfter=No feature in the MISC column. It would help me to both train the tokenizer on training data, and evaluate it on the testing data.

I have created a script which merges the CoNLL-U files with the original English Web Treebank, so I have the SpaceAfter=No feature in my copy of the data. Nevertheless, it would be great if the official UD_English contained the SpaceAfter=No feature, so that I could train and evaluate the tokenizer on public data.

Would you be willing to add the SpaceAfter=No features? I am happy to help in any way, but @sebschu told me in #1 that you do not use pull requests, so I am not sure how.
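For readers unfamiliar with the feature, here is a minimal sketch of how untokenized text can be reconstructed from a CoNLL-U sentence once SpaceAfter=No is present in the MISC column. The function name `detokenize` is hypothetical, and the sketch assumes ordinary word lines only: multiword-token ranges and empty nodes are skipped for simplicity, which is an assumption of this example and not something stated in the thread.

```python
def detokenize(conllu_sentence: str) -> str:
    """Rebuild surface text from one CoNLL-U sentence (hypothetical helper)."""
    pieces = []
    for line in conllu_sentence.splitlines():
        # Skip blank lines and sentence-level comments.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Assumption: ignore multiword-token ranges ("1-2") and empty nodes ("1.1").
        if not cols[0].isdigit():
            continue
        form, misc = cols[1], cols[9]
        pieces.append(form)
        # A space follows the token unless MISC carries SpaceAfter=No.
        if "SpaceAfter=No" not in misc.split("|"):
            pieces.append(" ")
    return "".join(pieces).rstrip()


sample = "\n".join([
    "1\tHello\thello\tINTJ\tUH\t_\t0\troot\t_\tSpaceAfter=No",
    "2\t,\t,\tPUNCT\t,\t_\t1\tpunct\t_\t_",
    "3\tworld\tworld\tNOUN\tNN\t_\t1\tvocative\t_\tSpaceAfter=No",
    "4\t!\t!\tPUNCT\t.\t_\t1\tpunct\t_\t_",
])
print(detokenize(sample))
```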

manning · 2015-10-16T19:53:07Z

This would be great to have, and we'd be happy to have your help on producing this. Since you already have the original web treebank, what would be useful would be to have that output for each of the original files of the web treebank. E.g., if there was something like a simple one-token-per-line

word TAB SpaceAfter=No/blank

two column format for each file, then that would be very easy to merge.
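The merge suggested above could look roughly like the following sketch. It is not the maintainers' actual procedure; `merge_space_after` is a hypothetical name, and the sketch assumes the two-column file lists tokens in the same order as the CoNLL-U word lines, with the second column either `SpaceAfter=No` or blank.

```python
def merge_space_after(conllu_lines, two_column_lines):
    """Copy SpaceAfter=No flags from a word<TAB>flag file into the MISC column."""
    # One [word, flag] pair per non-empty line of the two-column file.
    pairs = iter(
        l.rstrip("\n").split("\t") for l in two_column_lines if l.strip()
    )
    out = []
    for line in conllu_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        # Pass through comments, blank lines, and anything that is not a plain word line.
        if len(cols) != 10 or not cols[0].isdigit():
            out.append(line)
            continue
        # Pad so that a missing second column is treated as a blank flag.
        word, flag = (next(pairs) + [""])[:2]
        if word != cols[1]:
            raise ValueError(f"token mismatch: {word!r} vs {cols[1]!r}")
        if flag == "SpaceAfter=No":
            misc = [] if cols[9] == "_" else cols[9].split("|")
            if flag not in misc:
                misc.append(flag)
            cols[9] = "|".join(sorted(misc))
        out.append("\t".join(cols))
    return out


conllu = [
    "# sent_id = 1",
    "1\tHi\thi\tINTJ\tUH\t_\t0\troot\t_\t_",
    "2\t!\t!\tPUNCT\t.\t_\t1\tpunct\t_\t_",
    "",
]
flags = ["Hi\tSpaceAfter=No", "!\t"]
merged = merge_space_after(conllu, flags)
```

Raising on a token mismatch keeps the two files from silently drifting out of alignment.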

foxik · 2015-10-19T10:09:18Z

Great that you are interested. I will post these files later today.

Note that the words in the articles and in the UD_English corpus sometimes differ slightly -- various Unicode characters are transliterated, one full stop included in UD_English is not present in the original corpus, and four errors were introduced in UD_English (UD_English has the words "fin", "gam", "fin" and "compan", where the articles have "fine", "game", "fine" and "company"). Also, part of one article is not present in UD_English (which is correct, as it is a long list of various hyperlinks).

I will use the words found in the UD_English corpus instead of the original treebank, as I assume you use those in your annotation files.

Although I understand that the deadline for version 1.2 is quite close, it would be extremely useful for me if the SpaceAfter=No feature were present in the 1.2 release. I would therefore like to ask you to consider adding it already in version 1.2, please.

foxik · 2015-10-20T18:25:41Z

Sorry for the delay, it took me longer to generate the individual files.

The files of the original corpus in the described format are available at http://ufallab.ms.mff.cuni.cz/~straka/eng_web_tbk.spaces.tar.xz . When the words in the original corpus and in CoNLL-U differ, the words from CoNLL-U are used. The parentheses are encoded using ( and ). If you would like anything done differently, just tell me.

Thanks,
Milan

PS: If you are interested, the files were generated using the original corpus, 1.1 English CoNLL-U and the following script https://github.com/foxik/UD_English/blob/space_after/merge_to_anot.pl , which finds a pairing between the CoNLL-U sentences and original corpus tokens.

ngiordani · 2015-10-27T21:02:18Z

Hi @foxik -- to keep you posted, we've integrated the feature you created. It's not on the UD repo yet, but it's in our internal files and will make it into v1.2.

foxik · 2015-10-28T15:20:12Z

Great news, thanks a lot!

foxik changed the title ~~Add SpaceAfter=No annotation.~~ Add SpaceAfter=No feature. Oct 16, 2015

manning added the enhancement label Oct 16, 2015

ngiordani closed this as completed Oct 27, 2015

manning added this to the Release 1.2 internal data freeze milestone Nov 14, 2015

manning assigned sebschu Nov 14, 2015

nschneid mentioned this issue Dec 30, 2024

Implement nmod:desc for honorific pre-nominal titles: Mr., Dr., etc. #561
