Use available corpora for opensubtitles (63 languages) #79
Comments
Sounds great. Send a pull request?
Hello Sascha / @brawer,
The project also lacks meaningful documentation (#80). It would be inefficient to put a total Python newbie on Python copy-engineering. I will be more productive on other linguistic diversity issues, here on @lingua-libre projects. Given how central this CLDR/UNILEX/Unicode/Google CorpusCrawler repository is to linguistic diversity on the web, is there an email contact to which I, Wikimedia France, and/or the Wikimedia Foundation could write to ask for more solid support for CorpusCrawler? Volunteering can do a lot but is too irregular. A dedicated, versatile, paid maintainer supervising ~20² of Google's open source projects, unblocking the key bottlenecks via 4-hour coding sprints and community support, would quickly provide a positive ROI. 2020 opens access to skilled workers all around the world. There is surely a long list of open source projects that would benefit from such small yet skilled pushes past their bottlenecks. I would be interested in coordinating such an email with Wikimedia France and the US Wikimedia Foundation, to put a handful of names on that email. (If there is a reasonable chance, >5~10%, of achieving the intended goal of a skilled, paid maintainer here for 4 hrs/week over the next 2 years.) 1: see text above
Thanks for the chat @brawer. Our online chat will help me better plan the next phases of Lingualibre and the collaboration with the crawler.
Research
Gain
Closest thing to natural oral corpora.
Links
af,ar,bg,bn,br,bs,ca,cs,da,de,el,en,eo,es,et,eu,fa,fi,fr,gl,he,hi,hr,hu,hy,id,is,it,ja,ka,kk,ko,lt,lv,mk,ml,ms,nl,no,pl,pt,pt_br,ro,ru,si,sk,sl,sq,sr,sv,ta,te,th,tl,tr,uk,ur,vi,ze_en,ze_zh,zh_cn,zh_tw
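As a rough aid for whoever picks this up, the list above can be split and grouped to see which codes are regional or mixed variants. This is only a sketch: the helper name `group_by_base` is hypothetical, and treating the part before the underscore as a "base" code is an assumption about this list, not something Corpus Crawler or OpenSubtitles defines.

```python
from collections import defaultdict

# Codes copied verbatim from the list above.
CODES = (
    "af,ar,bg,bn,br,bs,ca,cs,da,de,el,en,eo,es,et,eu,fa,fi,fr,gl,"
    "he,hi,hr,hu,hy,id,is,it,ja,ka,kk,ko,lt,lv,mk,ml,ms,nl,no,pl,"
    "pt,pt_br,ro,ru,si,sk,sl,sq,sr,sv,ta,te,th,tl,tr,uk,ur,vi,"
    "ze_en,ze_zh,zh_cn,zh_tw"
).split(",")


def group_by_base(codes):
    """Group codes such as 'pt_br' under their prefix ('pt').

    Assumption: the text before the first underscore identifies the
    base code. This is a guess about the naming scheme, nothing more.
    """
    groups = defaultdict(list)
    for code in codes:
        base = code.split("_", 1)[0]
        groups[base].append(code)
    return dict(groups)


groups = group_by_base(CODES)

# Prefixes that appear with more than one code need a decision on how
# (or whether) they map onto a single target language.
variants = {base: members for base, members in groups.items() if len(members) > 1}

print(len(CODES), "codes,", len(groups), "distinct prefixes")
print(variants)
```

Running it gives a quick count to compare against the number in the issue title, and lists the prefixes that would need a mapping decision before any crawler work starts.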
There are ready-to-download open licence Wikipedia corpora available.
Parallel sentences
Monolingual sentences
br&en