Improve readme documentation on how to provide a new crawler #80
@hugolpz Can I help by adding steps to the readme about how to add a new crawler, starting with the basics of installing Python?
Hello Aayush, thank you for jumping in.
This would help, yes. I made a large review of this project, but I'm a JS dev, so I walk quite blind here. Yet I think this project isn't that hard to contribute to: the main obstacles are (1) how to start and (2) what kind of output each crawler must provide, how, and where. @brawer, would you temporarily grant me maintainer status so I could handle the possible PRs? I would be happy to give that user right back as soon as a new, active Python dev emerges.
@hugolpz Sure, looking to add required-dependencies information to the README :)
@hugolpz I'm getting a "no module found: corpuscrawler" error when running.
JS dev here; I try to help around, but I don't know Python. I can look for Python help, but it will take at least 5 days.
The /CONTRIBUTING.md is a License Agreement / Code of Conduct to sign. As far as I can see, this very cherishable project has no actual tutorial.
I don't have the Python and coding knowledge to fix this documentation issue myself, but I can map the road so it becomes easier for the next person to do so.
Wanted
If a user wants to add a language such as Catalan from Barcelona (`ca`, `cat`: missing), what do they need to jump in quickly? What should they provide? The relevant files are listed below (see the sketch after this list):

- `util.py`: stores functions used by multiple languages' crawlers.
- `main.py`: stores the 1000+ crawler calls and runs them all.
- `crawl_{iso}.py`: stores a language-specific corpus's source URLs and processing functions, e.g. `crawl_ca_valencia.py`.
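To make the entry point concrete, here is a minimal, hypothetical sketch of a new `crawl_ca.py`, assuming it follows the same pattern as existing `crawl_{iso}.py` files: a single `crawl(crawler)` function that asks for an output handle and delegates to shared helpers from `util.py`. The `filename` value is a placeholder, not a verified identifier.

```python
# corpuscrawler/crawl_ca.py -- hypothetical sketch, modeled on existing
# crawl_{iso}.py files; not a verified, working crawler.
from __future__ import absolute_import, print_function, unicode_literals

from corpuscrawler.util import crawl_udhr


def crawl(crawler):
    # Ask the shared Crawler object for this language's output handle.
    out = crawler.get_output(language='ca')
    # Delegate to a shared multi-language helper from util.py.
    # 'udhr_cat' is a placeholder; check util.py for the real filename.
    crawl_udhr(crawler, out, filename='udhr_cat')
```

Presumably `main.py` then needs to learn about the new language code so that running with `--language=ca` dispatches to this module; someone with Python knowledge should confirm that step.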
API (to complete)
Defined functions within `util.py`, in order of appearance as of 2021/02/26. If you have some relevant knowledge, please help with a sub-section or one item. Hedged usage sketches follow each sub-section below.

Some tools
- `daterange(start, end)`: __
- `urlpath(url)`: __
- `urlencode(url)`: __
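The list above gives only signatures; as a guess from its name, `daterange(start, end)` iterates over the days between two dates. A hedged sketch of how a crawler might use it to enumerate dated archive pages (the URL pattern is invented for illustration):

```python
import datetime

from corpuscrawler.util import daterange


def find_archive_urls():
    # Assumption: daterange() yields datetime.date objects from start to end;
    # verify the exact behavior (inclusive/exclusive end) in util.py.
    urls = []
    for date in daterange(datetime.date(2020, 1, 1),
                          datetime.date(2020, 12, 31)):
        # Many news sites expose archives as /YYYY/MM/DD/ paths.
        urls.append('https://example.org/archive/%04d/%02d/%02d/' %
                    (date.year, date.month, date.day))
    return urls
```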
Main element

`class Crawler(object):`

- `__init__(self, language, output_dir, cache_dir, crawldelay)`: __
- `get_output(self, language=None)`: __
- `close(self)`: __
- `fetch(self, url, redirections=None, fetch_encoding='utf-8')`: __
- `fetch_content(self, url, allow_404=False)`: __
- `fetch_sitemap(self, url, processed=set(), subsitemap_filter=lambda x: True)`: __
- `is_fetch_allowed_by_robots_txt(self, url)`: __
- `crawl_pngscriptures_org(self, out, language)`: __
- `_find_urls_on_pngscriptures_org(self, language)`: __
- `crawl_abc_net_au(self, out, program_id)`: __
- `crawl_churchio(self, out, bible_id)`: __
- `crawl_aps_dz(self, out, prefix)`: __
- `crawl_sverigesradio(self, out, program_id)`: __
- `crawl_voice_of_america(self, out, host, ignore_ascii=False)`: __
- `set_context(self, context)`: __
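For orientation, a hedged sketch of how these methods could be combined inside a language crawler, based only on the signatures above; the URL and the `# Location:` output header are illustrative assumptions, not verified project conventions:

```python
from corpuscrawler.util import cleantext


def crawl(crawler):
    out = crawler.get_output(language='ca')
    # fetch_content() presumably returns the decoded page body, or a falsy
    # value on failure; the URL here is purely illustrative.
    html = crawler.fetch_content('https://example.cat/noticies/')
    if not html:
        return
    out.write('# Location: https://example.cat/noticies/\n')
    out.write(cleantext(html) + '\n')
```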
Some crawlers for multi-language sites
- `crawl_bbc_news(crawler, out, urlprefix)`: __
- `crawl_korero_html(crawler, out, project, genre, filepath)`: __
- `write_paragraphs(et, out)`: __
- `crawl_deutsche_welle(crawler, out, prefix, need_percent_in_url=False)`: __
- `crawl_radio_free_asia(crawler, out, edition, start_year=1998)`: __
- `crawl_sputnik_news(crawler, out, host)`: __
- `crawl_udhr(crawler, out, filename)`: __
- `crawl_voice_of_nigeria(crawler, out, urlprefix)`: __
- `crawl_bibleis(crawler, out, bible)`: __
- `crawl_tipitaka(crawler, out, script)`: __
- `find_wordpress_urls(crawler, site, **kwargs)`: __
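Judging by its name, `find_wordpress_urls` discovers article URLs on a WordPress site. A hedged sketch combining it with the `Crawler` methods above; the site address and the return type are assumptions to be verified against `util.py`:

```python
from corpuscrawler.util import cleantext, find_wordpress_urls


def crawl(crawler):
    out = crawler.get_output(language='ca')
    # Assumption: find_wordpress_urls() returns an iterable of article URLs.
    for url in find_wordpress_urls(crawler, site='https://example-blog.cat'):
        html = crawler.fetch_content(url)
        if not html:
            continue
        out.write('# Location: %s\n' % url)
        out.write(cleantext(html) + '\n')
```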
Some cleaners
- `unichar(i)`: __
- `replace_html_entities(html)`: __
- `cleantext(html)`: __
- `clean_paragraphs(html)`: __
- `extract(before, after, html)`: __
- `fixquotes(s)`: __
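A hedged sketch of a typical cleaning pipeline chaining these helpers, inferred from their names and signatures; the HTML snippet and the delimiter strings are made up, and `cleantext()` may already handle entities itself (verify in `util.py`):

```python
from corpuscrawler.util import cleantext, extract, replace_html_entities

raw = ('<html><body><div id="main">'
       '<p>Hola &amp; adeu</p>'
       '</div></body></html>')

# extract(before, after, html) presumably returns the text between two markers.
body = extract('<div id="main">', '</div>', raw)

# Resolve HTML entities, then strip markup / normalize whitespace.
text = cleantext(replace_html_entities(body))
```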
Shorter way to do so
In-code comments can do a lot. Pointing to wisely chosen sections helps too. If you have the required know-how, please add comments to a chosen existing crawler and point to it as an in-code tutorial.
@sffc, @brawer: could anyone help with that?