
Pr ponta grossa #45

Merged
merged 7 commits into okfn-brasil:master on May 31, 2018
Conversation

@antoniovendramin
Contributor

Ponta Grossa's gazettes are simple pages with the date in the PDF file name.
The legislative and executive branches are in the same PDF.

Contributor

@cuducos left a comment

Hi @antoniovendramin – many thanks for your interest and your PR.

I was studying your code and I made some minor comments that might help you improve the readability of this contribution. I hope they're helpful : )

Many thanks

pdf_infos.append({ "ano" : ano, "mes" : mes, "dia":dia, "url": pdf_file_name, "is_extra_edition": is_extra })

if pdf_infos:
menor_ano_da_pagina = min(map(lambda p: p["ano"], pdf_infos))
Contributor

I think comprehensions are more readable than map in most cases. For example, this line would be easier to read as:

    menor_ano_da_pagina = min(p['ano'] for p in pdf_infos)

Collaborator

Would it be better if the variables were also in English instead of Portuguese?
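
For example, possible English equivalents for the names in this diff (just a suggestion):

    # ano -> year, mes -> month, dia -> day
    # ano_minimo -> minimum_year
    # menor_ano_da_pagina -> lowest_year_on_page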

        @url http://www.pontagrossa.pr.gov.br/diario-oficial/
        @returns requests 1
        """
        return self.scrape_page(response)
Contributor

Is it really necessary to define scrape_page? Since all parse does is call scrape_page, the body of scrape_page could be the parse method itself. There's a sketch of the merge below.
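
A minimal sketch of what that could look like, assuming the docstring stays on parse and the rest of the current scrape_page body is kept unchanged:

    def parse(self, response):
        """
        @url http://www.pontagrossa.pr.gov.br/diario-oficial/
        @returns requests 1
        """
        # the former scrape_page body moves here, unchanged
        pdf_links = response.css(".view-content .field a")
        ...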

        return self.scrape_page(response)

    def scrape_page(self, response):
        pdf_links = response.css(".view-content .field a")
Contributor

You started the file using ' to define strings, then switched style from here on. Could we have a more consistent use of quotes? I suggest always using ' except when there's a need for a single quote inside the string.
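
For instance, the append below could then read (same content, single quotes only):

    pdf_infos.append({'ano': ano, 'mes': mes, 'dia': dia,
                      'url': pdf_file_name, 'is_extra_edition': is_extra})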

    mes = pdf_link_info.group(2)
    dia = pdf_link_info.group(3)
    is_extra = "complementar" in pdf_link_text
    pdf_infos.append({ "ano" : ano, "mes" : mes, "dia":dia, "url": pdf_file_name, "is_extra_edition": is_extra })
Contributor

Why not extract this bit (lines 27-40) into a custom method? That method could be a generator (much better than appending to a list).

It would look something like this:

    def pdfs_info(self, links):
        # note: needs `import re` and `from datetime import date` at the top of the file;
        # it can't be a @staticmethod because it reads self.ano_minimo
        link_pattern = r'.*/diario-oficial/_?(\d{4})-(\d{2})-(\d{2}).*\.pdf'
        for link in links:
            file_name = link.css('::attr(href)').extract_first()
            link_text = link.css('::text').extract_first()

            if 'sem_atos' in file_name:
                continue

            info = re.search(link_pattern, file_name)
            year = int(info.group(1))
            if year < self.ano_minimo:
                continue

            month, day = int(info.group(2)), int(info.group(3))
            yield {
                'date': date(year, month, day),
                'is_extra_edition': 'complementar' in link_text,
            }
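
From parse, the generator could then be consumed like this (a sketch; Gazette is an assumed item name, and the other fields mirror the ones already in this diff):

    for pdf_info in self.pdfs_info(pdf_links):
        yield Gazette(  # hypothetical item name; use the item this spider already yields
            date=pdf_info['date'],
            is_extra_edition=pdf_info['is_extra_edition'],
            municipality_id=self.MUNICIPALITY_ID,
            power='executive_legislature',
            scraped_at=dt.datetime.utcnow(),
        )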

    if pdf_infos:
        menor_ano_da_pagina = min(map(lambda p: p["ano"], pdf_infos))
        if menor_ano_da_pagina >= self.ano_minimo:
            next_page_url = "{0}{1}".format("http://www.pontagrossa.pr.gov.br", response.css(".pager-next a::attr(href)").extract_first())
Contributor

We could increase readability here by splitting the logic and using f-strings, couldn't we?

    url_path = response.css('.pager-next a::attr(href)').extract_first()
    next_page_url = f'http://www.pontagrossa.pr.gov.br{url_path}'

However, @rennerocha might have a better strategy for getting absolute URLs from relative URLs in Scrapy ; )

Collaborator

If you have a relative URL, you should use urllib.parse.urljoin instead of doing string manipulation.

Since this is something very common in scraping, Scrapy provides this method on the response object:
https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response.urljoin

    next_page_url = response.urljoin(response.css(".pager-next a::attr(href)").extract_first())

Contributor

response.urljoin for the win 🎉

    menor_ano_da_pagina = min(map(lambda p: p["ano"], pdf_infos))
    if menor_ano_da_pagina >= self.ano_minimo:
        next_page_url = "{0}{1}".format("http://www.pontagrossa.pr.gov.br", response.css(".pager-next a::attr(href)").extract_first())
        yield scrapy.Request(next_page_url, self.scrape_page)
Contributor

If we do rename scrape_page to parse, I guess we can omit the second argument of Request here, because self.parse is the default (but I'm not 100% sure about it; would you mind testing whether that's the case?)

Collaborator

self.parse is the default callback for a Request. Get rid of self.scrape_page and put everything in self.parse :-)
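
A quick sketch of the pagination request once everything lives in parse:

    # self.parse is the default callback, so it can be omitted entirely
    yield scrapy.Request(next_page_url)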

next_page_url = "{0}{1}".format("http://www.pontagrossa.pr.gov.br", response.css(".pager-next a::attr(href)").extract_first())
yield scrapy.Request(next_page_url, self.scrape_page)
for pdf_info in pdf_infos:
date = "{0}-{1}-{2}".format(pdf_info["ano"], pdf_info["mes"], pdf_info["dia"])
Contributor

If pdf_info carries a datetime.date (as I suggested above), this could be simplified to:

    gazette_date = pdf_info['date'].strftime('%Y-%m-%d')

(I just avoid naming variables date because if you from datetime import date, the variable ends up shadowing the date object.)

        municipality_id=self.MUNICIPALITY_ID,
        power="executive_legislature",
        scraped_at=dt.datetime.utcnow()
    )
Contributor

All these lines from 50 onward have an extra indentation level. You could delete four spaces at the beginning of each of them ; )

@alfakini mentioned this pull request May 24, 2018
@giovanisleite
Contributor

Remember to update the cities.md

@antoniovendramin
Contributor (Author)

Accepted all suggestions.
Thanks!

@Irio
Contributor

Irio commented May 31, 2018

This municipality's gazette not only lacks selectable text but also has a printed PDF embedded inside another.

See the bottom left of the first page:
http://www.pontagrossa.pr.gov.br/files/diario-oficial/2018-05-01-ed2303.pdf

@Irio merged commit 23cccff into okfn-brasil:master May 31, 2018
@Irio
Contributor

Irio commented May 31, 2018

Thanks for the contribution, @antoniovendramin.
