
Pr ponta grossa #45

Merged
merged 7 commits into okfn-brasil:master on May 31, 2018
Conversation

@antoniovendramin
Contributor

Ponta Grossa's gazettes are simple pages with the date in the PDF file name.
The legislative and executive branches are in the same PDF.

Contributor

@cuducos left a comment

Hi @antoniovendramin – many thanks for your interest and your PR.

I was studying your code and I made some minor comments that might help you improve the readability of this contribution. I hope they're helpful : )

Many thanks

pdf_infos.append({ "ano" : ano, "mes" : mes, "dia":dia, "url": pdf_file_name, "is_extra_edition": is_extra })

if pdf_infos:
menor_ano_da_pagina = min(map(lambda p: p["ano"], pdf_infos))
Contributor

I think comprehensions are more readable than map in most cases. For example, this line would be easier to read as:

    menor_ano_da_pagina = min(p['ano'] for p in pdf_infos)

Collaborator

Would it be better if the variables were also in English instead of Portuguese?
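
For example, possible English equivalents for the names in this diff (just a suggestion):

    # ano -> year, mes -> month, dia -> day
    # ano_minimo -> minimum_year
    # menor_ano_da_pagina -> lowest_year_on_page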

        @url http://www.pontagrossa.pr.gov.br/diario-oficial/
        @returns requests 1
        """
        return self.scrape_page(response)
Contributor

Is it really necessary to define scrape_page? Since all parse does is call scrape_page, the body of scrape_page could be the parse method itself. There's a sketch of the merge below.
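
A minimal sketch of what that could look like, assuming the docstring stays on parse and the rest of the current scrape_page body is kept unchanged:

    def parse(self, response):
        """
        @url http://www.pontagrossa.pr.gov.br/diario-oficial/
        @returns requests 1
        """
        # the former scrape_page body moves here, unchanged
        pdf_links = response.css(".view-content .field a")
        ...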

        return self.scrape_page(response)

    def scrape_page(self, response):
        pdf_links = response.css(".view-content .field a")
Contributor

You started the file using ' to define strings, then switched style from here on. Could we have a more consistent use of quotes? I suggest always using ' except when there's a need for a single quote inside the string.
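
For instance, the append below could then read (same content, single quotes only):

    pdf_infos.append({'ano': ano, 'mes': mes, 'dia': dia,
                      'url': pdf_file_name, 'is_extra_edition': is_extra})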

    mes = pdf_link_info.group(2)
    dia = pdf_link_info.group(3)
    is_extra = "complementar" in pdf_link_text
    pdf_infos.append({ "ano" : ano, "mes" : mes, "dia":dia, "url": pdf_file_name, "is_extra_edition": is_extra })
Contributor

Why not extract this bit (lines 27-40) into a custom method? That method could be a generator (much better than appending to a list).

It would look something like this:

    def pdfs_info(self, links):
        # note: needs `import re` and `from datetime import date` at the top of the file;
        # it can't be a @staticmethod because it reads self.ano_minimo
        link_pattern = r'.*/diario-oficial/_?(\d{4})-(\d{2})-(\d{2}).*\.pdf'
        for link in links:
            file_name = link.css('::attr(href)').extract_first()
            link_text = link.css('::text').extract_first()

            if 'sem_atos' in file_name:
                continue

            info = re.search(link_pattern, file_name)
            year = int(info.group(1))
            if year < self.ano_minimo:
                continue

            month, day = int(info.group(2)), int(info.group(3))
            yield {
                'date': date(year, month, day),
                'is_extra_edition': 'complementar' in link_text,
            }
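
From parse, the generator could then be consumed like this (a sketch; Gazette is an assumed item name, and the other fields mirror the ones already in this diff):

    for pdf_info in self.pdfs_info(pdf_links):
        yield Gazette(  # hypothetical item name; use the item this spider already yields
            date=pdf_info['date'],
            is_extra_edition=pdf_info['is_extra_edition'],
            municipality_id=self.MUNICIPALITY_ID,
            power='executive_legislature',
            scraped_at=dt.datetime.utcnow(),
        )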

    if pdf_infos:
        menor_ano_da_pagina = min(map(lambda p: p["ano"], pdf_infos))
        if menor_ano_da_pagina >= self.ano_minimo:
            next_page_url = "{0}{1}".format("http://www.pontagrossa.pr.gov.br", response.css(".pager-next a::attr(href)").extract_first())
Contributor

We could increase readability here by splitting the logic and using f-strings, couldn't we?

    url_path = response.css('.pager-next a::attr(href)').extract_first()
    next_page_url = f'http://www.pontagrossa.pr.gov.br{url_path}'

However, @rennerocha might have a better strategy for getting absolute URLs from relative URLs in Scrapy ; )

Collaborator

If you have a relative URL, you should use urllib.parse.urljoin instead of doing string manipulation.

Since this is something very common in scraping, Scrapy provides this method on the response object:
https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response.urljoin

    next_page_url = response.urljoin(response.css(".pager-next a::attr(href)").extract_first())

Contributor

response.urljoin for the win 🎉

    menor_ano_da_pagina = min(map(lambda p: p["ano"], pdf_infos))
    if menor_ano_da_pagina >= self.ano_minimo:
        next_page_url = "{0}{1}".format("http://www.pontagrossa.pr.gov.br", response.css(".pager-next a::attr(href)").extract_first())
        yield scrapy.Request(next_page_url, self.scrape_page)
Contributor

If we do rename scrape_page to parse, I guess we can omit the second argument of Request here, because self.parse is the default (but I'm not 100% sure about it; would you mind testing whether that's the case?)

Collaborator

self.parse is the default callback for a Request. Get rid of self.scrape_page and put everything in self.parse :-)
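
A quick sketch of the pagination request once everything lives in parse:

    # self.parse is the default callback, so it can be omitted entirely
    yield scrapy.Request(next_page_url)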

next_page_url = "{0}{1}".format("http://www.pontagrossa.pr.gov.br", response.css(".pager-next a::attr(href)").extract_first())
yield scrapy.Request(next_page_url, self.scrape_page)
for pdf_info in pdf_infos:
date = "{0}-{1}-{2}".format(pdf_info["ano"], pdf_info["mes"], pdf_info["dia"])
Contributor

If pdf_info carries a datetime.date (as I suggested above), this could be simplified to:

    gazette_date = pdf_info['date'].strftime('%Y-%m-%d')

(I just avoid naming variables date because if you from datetime import date, the variable ends up shadowing the date object.)

        municipality_id=self.MUNICIPALITY_ID,
        power="executive_legislature",
        scraped_at=dt.datetime.utcnow()
    )
Contributor

All these lines from 50 onward have an extra indentation level. You could delete four spaces at the beginning of each of them ; )

@alfakini mentioned this pull request May 24, 2018
@giovanisleite
Contributor

Remember to update the cities.md

@antoniovendramin
Contributor (Author)

Accepted all suggestions.
Thanks!

@Irio
Contributor

Irio commented May 31, 2018

This municipality's gazette not only lacks selectable text but also has a printed PDF embedded inside another.

See the bottom left of the first page:
http://www.pontagrossa.pr.gov.br/files/diario-oficial/2018-05-01-ed2303.pdf

@Irio merged commit 23cccff into okfn-brasil:master May 31, 2018
@Irio
Contributor

Irio commented May 31, 2018

Thanks for the contribution, @antoniovendramin.
