Add SpaceAfter=No feature. #4
Comments
This would be great to have, and we'd be happy to have your help in producing it. Since you already have the original web treebank, the most useful thing would be to have that output for each of the original files of the web treebank. E.g., if there were a simple one-token-per-line, two-column format for each file (word, TAB, then SpaceAfter=No or blank), that would be very easy to merge.
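A minimal sketch of how such a two-column file could be merged into the MISC column of a CoNLL-U file. The function name `merge_space_after` is hypothetical, and the sketch assumes that the tokens in both files appear in the same order, that blank lines in the two-column file can be skipped, and that only plain integer-ID token lines receive the feature (multiword-token ranges and comments are passed through untouched). It is not the script that was actually used.

```python
def merge_space_after(conllu_lines, two_col_lines):
    """Copy SpaceAfter=No from 'word<TAB>flag' lines into the CoNLL-U MISC column."""
    # Read the two-column listing; blank lines are skipped (an assumption).
    flags = []
    for line in two_col_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, _, flag = line.partition("\t")
        flags.append((word, flag.strip() == "SpaceAfter=No"))

    remaining = iter(flags)
    out = []
    for line in conllu_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        # Pass through comments, sentence breaks and non-integer IDs unchanged.
        if len(cols) != 10 or not cols[0].isdigit():
            out.append(line)
            continue
        entry = next(remaining, None)
        if entry is None:
            raise ValueError("two-column file has fewer tokens than the CoNLL-U file")
        word, no_space = entry
        if word != cols[1]:
            raise ValueError(f"token mismatch: {word!r} vs {cols[1]!r}")
        if no_space:
            # MISC is '_' when empty, otherwise a '|'-separated list.
            misc = [] if cols[9] == "_" else cols[9].split("|")
            if "SpaceAfter=No" not in misc:
                misc.append("SpaceAfter=No")
            cols[9] = "|".join(misc)
        out.append("\t".join(cols))
    return out
```

Raising on a token mismatch rather than guessing keeps any divergence between the two files visible, which matters when the word forms are known to differ in places.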
Great that you are interested. I will post these files later today. Note that the words in the articles and in the UD_English corpus sometimes differ slightly: various Unicode characters are transliterated, UD_English includes one full stop that is not present in the original corpus, and four errors were introduced in UD_English (it has the words "fin", "gam", "fin" and "compan", while the articles have "fine", "game", "fine" and "company"). Also, part of one article is not present in UD_English (which is correct, as it is a long list of various hyperlinks). I will use the words found in the UD_English corpus instead of those in the original treebank, as I assume you use those in your annotation files. Although I understand that the deadline for the 1.2 version is quite close, it would be extremely useful for me if the
Sorry for the delay, it took me longer than expected to generate the individual files. The files of the original corpus in the described format are available at http://ufallab.ms.mff.cuni.cz/~straka/eng_web_tbk.spaces.tar.xz . When the words in the original corpus and in CoNLL-U differ, the words from CoNLL-U are used. The parentheses are encoded using ( and ). If you would like something different, just tell me. Thanks, PS: If you are interested, the files were generated using the original corpus, the 1.1 English CoNLL-U and the following script https://github.com/foxik/UD_English/blob/space_after/merge_to_anot.pl , which finds a pairing between the CoNLL-U sentences and the original corpus tokens.
Hi @foxik -- to keep you posted, we've integrated the feature you created. It's not on the UD repo yet, but it's in our internal files and will make it into v1.2.
Great news, thanks a lot!
I am currently working on a tokenizer and I would like to be able to reconstruct the original (untokenized) text of UD_English. The CoNLL-U format allows this using the SpaceAfter=No feature in the MISC column. It would help me both to train the tokenizer on the training data and to evaluate it on the test data.
I have created a script which merges the CoNLL-U and the original English Web Treebank, so I have the SpaceAfter=No feature in my copy of the data. Nevertheless, it would be great if the official UD_English contained SpaceAfter=No, so that I could train/evaluate the tokenizer on public data.
Would you be willing to add the SpaceAfter=No features? I am happy to help in any way, but @sebschu told me in #1 that you do not use pull requests, so I am not sure how.
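To illustrate how SpaceAfter=No makes the untokenized text recoverable, here is a minimal sketch that rebuilds one sentence from its CoNLL-U lines. The function name `detokenize` is hypothetical, and the sketch assumes that every token without the feature was followed by exactly one space, so line breaks and runs of whitespace in the original are not recovered.

```python
def detokenize(conllu_lines):
    """Rebuild surface text from CoNLL-U token lines using SpaceAfter=No in MISC."""
    pieces = []
    for line in conllu_lines:
        cols = line.rstrip("\n").split("\t")
        # Skip comments, blank lines, multiword-token ranges and empty nodes.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        pieces.append(cols[1])  # FORM column
        # MISC is a '|'-separated list; a space follows unless SpaceAfter=No is set.
        if "SpaceAfter=No" not in cols[9].split("|"):
            pieces.append(" ")
    # Drop the space emitted after the final token.
    return "".join(pieces).rstrip()


sentence = [
    "1\tHello\thello\tINTJ\tUH\t_\t0\troot\t_\tSpaceAfter=No",
    "2\t,\t,\tPUNCT\t,\t_\t1\tpunct\t_\t_",
    "3\tworld\tworld\tNOUN\tNN\t_\t1\tvocative\t_\tSpaceAfter=No",
    "4\t!\t!\tPUNCT\t.\t_\t1\tpunct\t_\t_",
]
print(detokenize(sentence))  # Hello, world!
```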