
Improve readme documentation on how to provide a new crawler #80

Open
hugolpz opened this issue Feb 25, 2021 · 5 comments

Comments

@hugolpz

hugolpz commented Feb 25, 2021

This /CONTRIBUTING.md is a License Agreement / Code of Conduct to sign. As far as I can see, this very valuable project has no actual tutorial.

I don't have the Python and coding knowledge to fix this documentation issue, but I can map the road so it becomes easier for the next person to do so.

Wanted

Say a user wants to add a language such as Catalan from Barcelona (ca, cat : missing). What do they need to jump in quickly? What should they provide?

  • What is the local structure:
    • util.py : stores functions used by multiple language crawlers
    • main.py : stores the 1000+ crawler calls and runs them all
    • crawl_{iso}.py : stores a language-specific corpus's source URLs and processing functions
  • What tools do I have?
  • What input(s) : a Python list of URLs?
  • What are the classic parts of a crawler function?
  • What output format : raw text? Is HTML fine because an HTML tag stripper is applied afterwards?
  • An example of easily hackable base code (see the sketch below).
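
As a starting point, here is a minimal hypothetical crawl_cat.py, assembled from the API listing below. The entry-point name, the UDHR filename, and the URL are assumptions and placeholders, not verified against the repository:

```python
# corpuscrawler/crawl_cat.py -- hypothetical sketch of a new language crawler.
# Names are inferred from the API listing in this issue, not verified.
from corpuscrawler.util import crawl_udhr, cleantext

def crawl(crawler):
    # Ask the framework for the output stream of this language.
    out = crawler.get_output(language='ca')
    # Reuse a shared multi-site helper (the filename is a placeholder).
    crawl_udhr(crawler, out, filename='udhr_cat.txt')
    # Fetch one language-specific page and write its cleaned text.
    html = crawler.fetch_content('http://example.cat/noticies.html')
    out.write(cleantext(html) + '\n')
```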

API (to complete)

Functions defined within util.py, in order of appearance as of 2021/02/26. If you have some relevant knowledge, please help with a sub-section or a single item.

Some tools

  • daterange(start, end): __
  • urlpath(url): __
  • urlencode(url): __
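
A hypothetical usage sketch, assuming daterange(start, end) yields one datetime.date per day (this semantics is an assumption): many news crawlers enumerate dated archive URLs this way. The site below is a placeholder:

```python
import datetime
from corpuscrawler.util import daterange

def archive_urls():
    # Yield one dated archive URL per day in the given range.
    for date in daterange(datetime.date(2020, 1, 1),
                          datetime.date(2020, 12, 31)):
        yield 'http://example.org/archive/%04d-%02d-%02d.html' % (
            date.year, date.month, date.day)
```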

Main element

  • class Crawler(object):
    • __init__(self, language, output_dir, cache_dir, crawldelay): __
    • get_output(self, language=None): __
    • close(self): __
    • fetch(self, url, redirections=None, fetch_encoding='utf-8'): __
    • fetch_content(self, url, allow_404=False): __
    • fetch_sitemap(self, url, processed=set(), subsitemap_filter=lambda x: True): __
    • is_fetch_allowed_by_robots_txt(self, url): __
    • crawl_pngscriptures_org(self, out, language): __
    • _find_urls_on_pngscriptures_org(self, language): __
    • crawl_abc_net_au(self, out, program_id): __
    • crawl_churchio(self, out, bible_id): __
    • crawl_aps_dz(self, out, prefix): __
    • crawl_sverigesradio(self, out, program_id): __
    • crawl_voice_of_america(self, out, host, ignore_ascii=False): __
    • set_context(self, context): __
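
To make the method list concrete, here is a hypothetical sketch of the typical call sequence inside a language crawler. The sitemap URL is a placeholder, and the return types (fetch_sitemap() yielding page URLs, fetch_content() returning the decoded page body) are assumptions inferred from the names above:

```python
from corpuscrawler.util import cleantext

def crawl(crawler):
    out = crawler.get_output(language='ca')
    # Assumption: fetch_sitemap() returns the page URLs listed by the site.
    for url in sorted(crawler.fetch_sitemap('http://example.cat/sitemap.xml')):
        if not crawler.is_fetch_allowed_by_robots_txt(url):
            continue
        html = crawler.fetch_content(url)
        out.write('# Location: %s\n' % url)
        out.write(cleantext(html) + '\n')
```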

Some crawlers for multi-languages sites

  • crawl_bbc_news(crawler, out, urlprefix): __
  • crawl_korero_html(crawler, out, project, genre, filepath): __
  • write_paragraphs(et, out): __
  • crawl_deutsche_welle(crawler, out, prefix, need_percent_in_url=False): __
  • crawl_radio_free_asia(crawler, out, edition, start_year=1998): __
  • crawl_sputnik_news(crawler, out, host): __
  • crawl_udhr(crawler, out, filename): __
  • crawl_voice_of_nigeria(crawler, out, urlprefix): __
  • crawl_bibleis(crawler, out, bible): __
  • crawl_tipitaka(crawler, out, script): __
  • find_wordpress_urls(crawler, site, **kwargs): __
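
A hypothetical wiring sketch: these shared helpers take (crawler, out, ...) and do the whole fetch/clean/write loop themselves, so a language module can reduce to a few one-line calls. The language code, filename, and URL prefix below are placeholders; only the calling convention is taken from the list above:

```python
from corpuscrawler.util import crawl_udhr, crawl_bbc_news

def crawl(crawler):
    out = crawler.get_output(language='ha')  # e.g. Hausa (placeholder)
    # Each helper fetches, cleans, and writes its site's text to `out`.
    crawl_udhr(crawler, out, filename='udhr_hau.txt')
    crawl_bbc_news(crawler, out, urlprefix='/hausa/')
```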

Some cleaners

  • unichar(i): __
  • replace_html_entities(html): __
  • cleantext(html): __
  • clean_paragraphs(html): __
  • extract(before, after, html): __
  • fixquotes(s): __
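
A hypothetical sketch of how the cleaners might chain together, with assumed semantics: extract() returning the substring between two markers, clean_paragraphs() splitting cleaned HTML into paragraph strings, fixquotes() normalizing quotation marks. The URL and the markers are placeholders:

```python
from corpuscrawler.util import extract, clean_paragraphs, fixquotes

def emit_article(crawler, out, url):
    html = crawler.fetch_content(url)
    # Keep only the article body (markers are placeholders).
    body = extract('<div class="article">', '</div>', html)
    if body:
        for paragraph in clean_paragraphs(body):
            out.write(fixquotes(paragraph) + '\n')
```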

A shorter way to do this

In-code comments can do a lot. Pointing to wisely chosen sections helps too. If you have the required know-how, please add comments to a chosen existing crawler and point to it as an in-code tutorial.

@sffc, @brawer : could anyone help with that?

@Aayush-hub

@hugolpz Can I help by adding steps to the readme on how to add a new crawler, starting with the basics of installing Python?

@hugolpz

hugolpz commented Mar 14, 2021

Hello Aayush, thank you for jumping in.
I think we can assume the ability to install Python. Readme.md should just have a "Requirements" section with the Python version and the associated pip dependencies:

### Requirements
* Python x.x+

### Dependencies
```
pip3 install {package1}
pip3 install {package2}
pip3 install {package3}
```
This would help, yes.

I made a large review of this project, but I'm a JS dev so I walk quite blind here. Yet I think this project isn't that hard to contribute to: the main obstacles are 1. how to start and 2. what kind of output each crawler must provide, how, and where.

@brawer, would you temporarily grant me maintainer status so I can handle the possible PRs? I would be happy to give that user right back as soon as a new, active Python dev emerges.

@Aayush-hub

@hugolpz Sure, I'm looking into adding the required dependencies information to the README :)

@Aayush-hub

@hugolpz I'm getting an error, "no module found : corpuscrawler", when running main.py. Can you please help debug it?

@hugolpz

hugolpz commented Mar 15, 2021

JS dev here; I try to help around, but I don't know Python. I can look for Python help, but it will take at least 5 days.
