Add Wikipedia crawler ? (300+ languages) #78
Discussion engaged with the Wikimedia Foundation's Dumps-Generation project.
Python processing:
One presentation is in Italian but has some interesting nuggets: here. The gist:
@hugolpz That's impressive.
@GTOqaz Google has an upcoming crawl covering 2000 languages. I hope they will make some data available, especially frequency lists.
There are ready-to-download, open-licence Wikipedia corpora available.
A quick search shows that CorpusCrawler does not crawl or use Wikipedia. I don't know Python, but it seems feasible, either from scratch on top of the Wikipedia API (1) or using existing server-side tools (2).
Assess interest
Crawling via API
Load the available list of articles for each Wikipedia, then scrape the pages. If the list is too large, the crawl could be limited to max=n articles.
Given an ISO code such as Ndonga's ng:
Wikipedia API provides text
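As a rough illustration of the idea above, here is a minimal Python sketch that builds a plain-text request for one article and pulls the text out of the JSON reply. It assumes the usual `https://<iso>.wikipedia.org/w/api.php` endpoint layout and the `prop=extracts` query with `explaintext`; the helper names `build_extract_url` and `extract_text` are hypothetical and not part of CorpusCrawler.

```python
from urllib.parse import urlencode


def build_extract_url(iso_code, title, fmt="json"):
    """Build an API URL asking for the plain text of one article.

    Assumption: the wiki lives at https://<iso_code>.wikipedia.org and
    serves plain-text extracts through prop=extracts.
    """
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,  # plain text instead of HTML
        "titles": title,
        "format": fmt,
    }
    return f"https://{iso_code}.wikipedia.org/w/api.php?{urlencode(params)}"


def extract_text(response):
    """Return the non-empty extracts found in a decoded JSON response."""
    pages = response.get("query", {}).get("pages", {})
    return [page["extract"] for page in pages.values() if page.get("extract")]
```

Keeping URL construction and response parsing separate from the network call makes both easy to check by hand before pointing a crawler at a real wiki.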
Various formats available:
format: The format of the output.
json: Output data in JSON format.
jsonfm: Output data in JSON format (pretty-print in HTML).
none: Output nothing.
php: Output data in serialised PHP format.
phpfm: Output data in serialised PHP format (pretty-print in HTML).
rawfm: Output data, including debugging elements, in JSON format (pretty-print in HTML).
xml: Output data in XML format.
xmlfm: Output data in XML format (pretty-print in HTML).
List of Wikipedias (~300)
List of articles per Wikipedia
For convenience, I use the tiny Ndonga (ng) Wikipedia (8 articles), which is easier to explore by hand.
For a larger demo, you could also inspect similar URLs with the ISO code of:
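The article listing for one Wikipedia could be sketched in Python as below. This is only a sketch under assumptions: it uses the `list=allpages` query restricted to namespace 0 (main articles), follows the API's `continue` block for paging, and caps the result at `max_articles` (the max=n idea from above). The function names and the User-Agent string are placeholders, not existing CorpusCrawler code.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_json(url):
    """Fetch a URL and decode the JSON body (User-Agent is a placeholder)."""
    request = Request(url, headers={"User-Agent": "corpus-sketch/0.1 (example)"})
    with urlopen(request) as response:
        return json.load(response)


def list_articles(iso_code, max_articles=50, fetch=fetch_json):
    """Return up to max_articles main-namespace titles for one Wikipedia."""
    base = f"https://{iso_code}.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": 0,  # 0 = main (article) namespace
        "aplimit": min(max_articles, 500),
        "format": "json",
    }
    titles = []
    while len(titles) < max_articles:
        data = fetch(f"{base}?{urlencode(params)}")
        titles.extend(page["title"] for page in data["query"]["allpages"])
        continuation = data.get("continue")
        if not continuation:
            break  # no more pages to list
        params.update(continuation)  # carry the paging token into the next request
    return titles[:max_articles]
```

Passing `fetch` as a parameter lets the paging logic be tried against canned responses before running it against a live wiki, for example `list_articles("ng", max_articles=8)`.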
Namespaces
On all wikis. See also here
0: (main)
1: Talk:
2: User:
3: User_talk:
Dumps & paths
Using Wikipedia extractors ?
Hybrid approach
util.py
, code a simple crawler which gets just that .zip, converts it back to txt content, and adds it to the corpora.
cc: @brawer