Add SpaceAfter=No feature. #4

Closed
foxik opened this issue Oct 16, 2015 · 5 comments
@foxik
Member

foxik commented Oct 16, 2015

I am currently working on a tokenizer and I would like to be able to reconstruct the original (untokenized) text of UD_English. The CoNLL-U format allows this using the SpaceAfter=No feature in the MISC column. It would help me to both train the tokenizer on training data, and evaluate it on the testing data.
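To illustrate what the feature buys, here is a minimal sketch of rebuilding the surface text of one sentence from CoNLL-U once `SpaceAfter=No` is present in the MISC (10th) column. The function name `detokenize` and the helper `row` are made up for this example, and the sketch simply skips multiword-token range lines and empty nodes, which is a simplification rather than full CoNLL-U handling.

```python
def detokenize(conllu_sentence: str) -> str:
    """Rebuild surface text from one CoNLL-U sentence using SpaceAfter=No."""
    pieces = []
    for line in conllu_sentence.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # simplification: ignore ranges (1-2) and empty nodes (1.1)
        form, misc = cols[1], cols[9]
        pieces.append(form)
        # A token is followed by a space unless MISC says otherwise.
        if "SpaceAfter=No" not in misc.split("|"):
            pieces.append(" ")
    return "".join(pieces).rstrip()


def row(idx, form, misc="_"):
    """Hypothetical helper: a 10-column CoNLL-U line with only ID, FORM, MISC set."""
    return "\t".join([str(idx), form] + ["_"] * 7 + [misc])


sample = "\n".join([
    "# sent_id = demo",
    row(1, "Do", "SpaceAfter=No"),
    row(2, "n't"),
    row(3, "go", "SpaceAfter=No"),
    row(4, "."),
])
print(detokenize(sample))
```

With the flags above, the four tokens `Do`, `n't`, `go`, `.` come back as `Don't go.`, which is the text a tokenizer would be trained and evaluated against.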

I have created a script which merges the CoNLL-U and the original English Web Treebank, so I have the SpaceAfter=No feature in my copy of the data. Nevertheless, it would be great if the official UD_English contained the SpaceAfter=No feature, so that I could train/evaluate the tokenizer on public data.

Would you be willing to add the SpaceAfter=No features? I am happy to help in any way, but @sebschu told me in #1 that you do not use pull requests, so I am not sure how.

foxik changed the title from "Add SpaceAfter=No annotation." to "Add SpaceAfter=No feature." on Oct 16, 2015
@manning
Contributor

manning commented Oct 16, 2015

This would be great to have, and we'd be happy to have your help on producing this. Since you already have the original web treebank, what would be useful would be to have that output for each of the original files of the web treebank. E.g., if there was something like a simple one-token-per-line

word TAB SpaceAfter=No/blank

two-column format for each file, then that would be very easy to merge.
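As a rough sketch of how such a two-column file could be merged into CoNLL-U (this is not the maintainers' actual tooling; the function name `merge_space_after` is hypothetical, and the sketch assumes the flag file lists exactly the same words in the same order as the token lines):

```python
def merge_space_after(conllu_lines, flag_lines):
    """Copy SpaceAfter=No from 'word TAB flag' lines into the CoNLL-U MISC column."""
    flags = iter(l.rstrip("\n").split("\t") for l in flag_lines if l.strip())
    out = []
    for line in conllu_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        # Pass through comments, blank lines, and anything that is not a plain token line.
        if line.startswith("#") or len(cols) != 10 or not cols[0].isdigit():
            out.append(line)
            continue
        word, *rest = next(flags)
        if word != cols[1]:
            raise ValueError(f"word mismatch: {word!r} vs {cols[1]!r}")
        if rest and rest[0] == "SpaceAfter=No":
            # Keep any existing MISC attributes and append the new one.
            cols[9] = "SpaceAfter=No" if cols[9] == "_" else cols[9] + "|SpaceAfter=No"
        out.append("\t".join(cols))
    return out


def token_line(idx, form):
    """Hypothetical helper: a 10-column CoNLL-U line with empty annotations."""
    return "\t".join([str(idx), form] + ["_"] * 8)


conllu = ["# sent_id = demo", token_line(1, "Hello"), token_line(2, ","),
          token_line(3, "world"), ""]
flags = ["Hello\tSpaceAfter=No", ",\t", "world\t"]
merged = merge_space_after(conllu, flags)
print("\n".join(merged))
```

Checking each word against the FORM column makes the merge fail loudly if the two files drift out of step, instead of silently attaching flags to the wrong tokens.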

@foxik
Member Author

foxik commented Oct 19, 2015

Great that you are interested. I will post these files later today.

Note that the words in the articles and in the UD_English corpus sometimes differ slightly: various Unicode characters are transliterated, UD_English includes one full stop that is not present in the original corpus, and four errors were introduced in UD_English (it contains the words "fin", "gam", "fin" and "compan", where the articles have "fine", "game", "fine" and "company"). Also, part of one article is not present in UD_English (which is correct, since it is a long list of various hyperlinks).

I will use the words found in the UD_English corpus instead of the original treebank, as I assume you use those in your annotation files.

Although I understand that the deadline for version 1.2 is quite close, it would be extremely useful for me if the SpaceAfter=No feature were present in the 1.2 release. I would therefore like to ask you to consider adding it already in version 1.2, please.

@foxik
Member Author

foxik commented Oct 20, 2015

Sorry for the delay, it took me longer than expected to generate the individual files.

The files of the original corpus in the described format are available at http://ufallab.ms.mff.cuni.cz/~straka/eng_web_tbk.spaces.tar.xz . When the words in the original corpus and in CoNLL-U differ, the words from CoNLL-U are used. The parentheses are encoded using ( and ). If you would like anything done differently, just tell me.

Thanks,
Milan

PS: If you are interested, the files were generated using the original corpus, 1.1 English CoNLL-U and the following script https://github.com/foxik/UD_English/blob/space_after/merge_to_anot.pl , which finds a pairing between the CoNLL-U sentences and original corpus tokens.
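For readers who want the general idea without reading the Perl script, the following is a simplified sketch and not a port of `merge_to_anot.pl`: it assumes every token appears verbatim in the raw text, whereas the real pairing has to cope with the transliterations and word differences mentioned earlier in this thread. The function name `space_after_flags` is hypothetical.

```python
def space_after_flags(raw_text, tokens):
    """Walk the raw text token by token and flag tokens not followed by whitespace."""
    flags = []
    pos = 0
    for tok in tokens:
        # Skip whitespace separating this token from the previous one.
        while pos < len(raw_text) and raw_text[pos].isspace():
            pos += 1
        if not raw_text.startswith(tok, pos):
            # Assumption violated: token and raw text differ at this offset.
            raise ValueError(f"token {tok!r} does not match raw text at offset {pos}")
        pos += len(tok)
        glued = pos < len(raw_text) and not raw_text[pos].isspace()
        flags.append("SpaceAfter=No" if glued else "")
    return flags


raw = "Don't go (now)."
toks = ["Do", "n't", "go", "(", "now", ")", "."]
for tok, flag in zip(toks, space_after_flags(raw, toks)):
    # Emit the 'word TAB SpaceAfter=No/blank' two-column format.
    print(f"{tok}\t{flag}")
```

The last token of the text gets a blank flag here; whether that is the right choice at document or paragraph boundaries is a convention this sketch does not try to settle.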

@ngiordani
Contributor

Hi @foxik -- to keep you posted, we've integrated the feature you created. It's not in the UD repo yet, but it's in our internal files and will make it into v1.2.

@foxik
Member Author

foxik commented Oct 28, 2015

Great news, thanks a lot!
