Empty string sentence in cv-corpus-5-2020-06-22/en/test.tsv #108
Comments
Strange, I thought the code prevented this, but there it is. Oh I think I see what happened. The string is not empty but the string contains two quote marks. It looks like the string originally was something like...
which contains HTML tags, a non-printable character " " ( After the HTML and other stuff was removed, it was then turned into...
which, as it contains quotes, makes it through the check here corpus.py#L55 which should remove empty strings. Have you listened to common_voice_en_16759015.mp3? Maybe that will give us more info on what the string originally was and how it got validated!? |
Just listened to it. It's someone reading back HTML tags. I guess it was originally something like
and common.py#L69 turned that into
One nice thing is that's the only occurrence of this problem in the en test set and it doesn't occur in the en dev or train set. |
I guess maybe the solution is to add a language specific preprocessor that removes strings that are just ""? |
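A minimal sketch of what such a check could look like, assuming the goal is to treat a sentence as empty when nothing remains after stripping whitespace and quote marks. The function name `is_effectively_empty` and the set of quote characters are hypothetical; this is not the code in corpus.py or common.py.

```python
def is_effectively_empty(sentence: str) -> bool:
    """Return True if nothing is left once whitespace and quote marks are stripped."""
    # Characters treated as carrying no content on their own.
    # This set is an assumption made for illustration.
    quote_chars = "\"'“”‘’"
    # Strip outer whitespace, then outer quotes, then any whitespace the quotes enclosed.
    return sentence.strip().strip(quote_chars).strip() == ""


if __name__ == "__main__":
    for candidate in ['""', '" "', "", '"Hello"', "Hello"]:
        print(repr(candidate), is_effectively_empty(candidate))
```

A check like this only flags sentences that consist of nothing but quotes and whitespace; it leaves the quotes on every other sentence untouched.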
Thanks for looking into this. Couldn't the solution be more general than removing If we could strip the beginning and ending quotes, then the preprocessor would have caught "empty" string sentences. To me, both of these lines are the same:
I see plenty of instances where transcripts start and end with quotes. Also, from a developer's perspective, it is confusing that some transcripts are quoted and others are not. Related to the quote topic, I have noticed many transcripts contain Here are examples of double quotes from test:
|
In my experience
and
are not spoken in the same manner. The first, with quotes, is spoken with a bit more inflection, with a rising tone on the first word to emphasize that the speaker is quoting someone else, whereas the second is spoken with no such effect. Stripping the quotes thus removes information. Double quotes generally indicate escaped quotes, though this may not be the case for all double quotes in the text. For example
has escaped quotes around the word "Passion". |
In that case, this needs to be made clearer (in documentation or instructions), because from the context neither the reader nor the validator would know that quoted text should be read any differently. But I agree that if the transcript contained quoted text, then it is read differently: However, I still believe most quotes surrounding the transcripts are text processing artifacts of some sort. These two transcripts, from my previous comment, for instance, contain quotes within quotes. If what you said about quoted text being read differently is true, then how should quotes within quotes be read? """Getting to play someone as unrestricted as a vampire is a thrill,"" she says." I get that, in general, as much information as possible should be preserved, but in these cases I believe readers, validators, and developers see quoted transcripts (quoted at the beginning and end) as nothing more than text surrounded by quotes. Perhaps this is not the right place to address the quote issue; it should probably be addressed somewhere upstream. If you could point me in the right direction, I'll be happy to follow up. |
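One plausible reading of the doubled quotes in the transcript quoted above is standard CSV-style quoting, where a field containing a quote mark is wrapped in quotes and each inner quote is doubled. The sketch below shows Python's `csv` module producing exactly that form with its default settings and a tab delimiter. Whether the corpus files are actually written this way is an assumption, and the `clip.mp3` filename is a placeholder.

```python
import csv
import io

# The sentence as a reader would see it: a quotation followed by an attribution.
original = '"Getting to play someone as unrestricted as a vampire is a thrill," she says.'

# Write one tab-separated row using the csv module's default quoting rules.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["clip.mp3", original])  # placeholder filename
encoded_line = buffer.getvalue().rstrip("\n")
encoded_field = encoded_line.split("\t")[1]

# Reading the row back with the same rules recovers the original sentence.
decoded_row = next(csv.reader(io.StringIO(encoded_line), delimiter="\t"))

# Reading it with quoting disabled leaves the wrapping and doubled quotes in place.
raw_row = next(
    csv.reader(io.StringIO(encoded_line), delimiter="\t", quoting=csv.QUOTE_NONE)
)

if __name__ == "__main__":
    print(encoded_field)
    print(decoded_row[1])
    print(raw_row[1])
```

Under that assumption, a consumer that parses the file with quote handling enabled sees the inner quotation only, while one that splits on tabs sees the extra layer of quotes.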
Just as we don't document that one raises the tone of one's voice when reading a question, we also don't document the change in intonation when reading a quotation. This is simply part of what's entailed in "reading aloud". However, I agree with you in that I also do not believe most quotes surrounding the various sentences are there to indicate that the sentence is a quotation. In the majority of quoted text, e.g.
the quotes simply are a means to delineate the text from its surroundings. However, there are some cases in which I think the text contains quotes, but I'd have to look in detail at the entire pipeline to really differentiate between these two cases. I think @phirework has a much better view of the entire pipeline than I do. So maybe phirework could chime in? |
Re: OP - Kelly's correct, the original sentence was On the question of too many quotes, it looks like it has to do with the settings we're using for Thanks! |
While processing entries from cv-corpus-5-2020-06-22/en/test.tsv, I have discovered an empty string sentence ("") on line #557 referencing common_voice_en_16759015.mp3. This entry also exists in validated.tsv. I haven't checked if there are more of the same type of errors.
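To check for further entries of the same type, a scan along the following lines could be used. It is a sketch only: it assumes a tab-separated file with a header row and columns named `sentence` and `path`, reads fields without quote handling so that a literal `""` stays visible, and assumes no field contains an embedded newline. The function name `find_blank_sentences` and the sample rows are hypothetical.

```python
import csv
import io


def find_blank_sentences(tsv_file, column="sentence"):
    """Return (line_number, path) for rows whose sentence is empty or only quotes."""
    # QUOTE_NONE keeps quote characters as ordinary text instead of unquoting them.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    hits = []
    # The header is line 1, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        text = (row.get(column) or "").strip().strip('"').strip()
        if not text:
            hits.append((line_number, row.get("path")))
    return hits


if __name__ == "__main__":
    # Made-up sample data, not taken from the corpus.
    sample = 'path\tsentence\na.mp3\tHello there\nb.mp3\t""\nc.mp3\t"Quoted text"\n'
    print(find_blank_sentences(io.StringIO(sample)))
```

For a real file, the same function can be given an open file handle, for example one opened with `newline=""` and `encoding="utf-8"`.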