Language detection default for 'unknown' language #86
Comments
Ok, this sounds like a feasible request. |
I am aware of the |
Ha, right, I was already wondering if there was something like that. Good idea. |
Sounds good! |
Ok, I added code to handle 'und' languages. |
Thank you! I have tested 'und' in Ucto's language detection mode. Results appear much more reliable than before! See this remarkable example: JSTOR.music.00656.p.1.s.7. I had 1477 input files for testing. However, 25 of these gave empty output and a message on stderr. I attach them for your convenience. I saw at least one that contains only non-Latin script (file: JSTOR.music.01437). I think that text should nevertheless be incorporated in the FoLiA output. I have yet to check what happens when there is mixed Latin and non-Latin text. Another case is less clear: the text seems to be plain English to me, but Ucto complains "ucto: ucto: conflicting language(s) assigned" and returns an empty file (see: JSTOR.music.00072 and 17 more files). Attachment: UCTO.FailLangDetect.20220407.MRE.tar.gz. Again: thanks! Ucto has already been greatly improved for our purposes! |
I added some code to avoid the At the moment we guess some sentence bounds, use the detected fragments to detect the language, and then tokenize the longest utterance within the same language. This works quite well, but not always, as libtextcat sometimes makes strange decisions. example: This is first split at the '.' in St., then the first part: When both parts would have been seen as English, then it would be correctly seen as one sentence. This problem is NOT resolvable by Ucto |
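The approach described in the comment above (guess sentence bounds, detect the language of each fragment, then join adjacent fragments that share a language) can be sketched roughly as follows. This is only an illustration, not Ucto's or libtextcat's actual code: `guess_fragments`, `group_by_language`, and the `detect` callback are hypothetical names, and the sample sentence is invented. It shows why a single odd detector decision on the fragment ending in "St." splits what should be one sentence.

```python
import re
from itertools import groupby


def guess_fragments(text):
    """Naively guess sentence bounds: split on whitespace that follows '.', '!' or '?'."""
    return [f for f in re.split(r"(?<=[.!?])\s+", text.strip()) if f]


def group_by_language(text, detect):
    """Label each guessed fragment with detect(fragment), then merge
    adjacent fragments that received the same language label.

    `detect` is a hypothetical stand-in for a real language detector.
    """
    labelled = [(detect(f), f) for f in guess_fragments(text)]
    return [
        (lang, " ".join(fragment for _, fragment in run))
        for lang, run in groupby(labelled, key=lambda pair: pair[0])
    ]


if __name__ == "__main__":
    sample = "We met at St. Mary's church. It was nice."

    # A detector that labels every fragment English: the pieces are rejoined.
    print(group_by_language(sample, lambda f: "eng"))

    # A detector that mislabels the short fragment ending in 'St.':
    # the utterance is now cut in two at the abbreviation.
    print(group_by_language(sample, lambda f: "deu" if f.endswith("St.") else "eng"))
```

In this sketch the tokenizer can only be as good as the per-fragment language labels, which matches the remark that the remaining failures come from the detector rather than from the splitting step.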
Thank you kosloot! Also for the further explanations about how this works. We have now run this on several thousands of files: not a single one failed. I consider this matter closed. |
Hi,
We have thousands of article abstracts. Many are in mixed languages; the actual languages present are unknown, and there are no delimiters between the segments of an abstract in different languages.
So we want to recognize the languages at the sentence level. This requires us to use Ucto for prior sentence splitting.
So we use 'detectlanguages' with a limited set of the most likely languages. Ucto parameter:
As we primarily want to recognize and retain the sentences in English, we are not helped by Ucto's behaviour of labeling everything it does not know as our first language, i.e. 'eng'. That defeats our purposes. We would be much obliged if you could also equip Ucto with the FoLiA-langcat parameter:
--lang= use 'lan' for unindentified text. (default 'nld')
In fact, I would not at all mind if the default were set to 'unk' (for: unknown).
After all, it is rather disconcerting to see a sentence written in a non-Latin script such as the Thai here being labeled: 'eng'.
Thank you!
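The requested behaviour can be illustrated with a small sketch. It is not Ucto or FoLiA-langcat code: `toy_detect` and `label_sentence` are hypothetical names, the detector is a deliberately crude stand-in, and the fallback value 'und' (the ISO 639-2 code for "undetermined") is just one possible default. The point is that a sentence the detector cannot place in the configured language set gets the fallback label instead of the first configured language.

```python
def toy_detect(sentence):
    """Hypothetical detector: claims 'eng' only when the text contains ASCII letters,
    and returns None otherwise (for example for text in a non-Latin script)."""
    has_latin = any(ch.isascii() and ch.isalpha() for ch in sentence)
    return "eng" if has_latin else None


def label_sentence(sentence, detect, allowed, fallback="und"):
    """Return the detected language if it is one of the allowed languages,
    otherwise the fallback label (never the first allowed language)."""
    guess = detect(sentence)
    return guess if guess in allowed else fallback


if __name__ == "__main__":
    allowed = ["eng", "nld", "deu"]
    for sentence in ["This is an English sentence.", "สวัสดี"]:
        print(label_sentence(sentence, toy_detect, allowed), sentence)
```

Making the fallback a parameter mirrors the `--lang` option quoted above: callers who prefer 'unk' or another label can pass it explicitly.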