Pr ponta grossa #45
Conversation
Hi @antoniovendramin – many thanks for your interest and your PR.
I was studying your code and I made some minor comments that might help you improve the readability of this contribution. I hope they're helpful : )
Many thanks
pdf_infos.append({ "ano" : ano, "mes" : mes, "dia":dia, "url": pdf_file_name, "is_extra_edition": is_extra })

if pdf_infos:
    menor_ano_da_pagina = min(map(lambda p: p["ano"], pdf_infos))
I think comprehensions are more readable than map in most cases. For example, this line would be easier to read as:
menor_ano_da_pagina = min(p['ano'] for p in pdf_infos)
Would it be better if the variables were also in English instead of Portuguese?
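For example (the English names below are only suggestions, and they assume the dict keys get renamed too), the line above could read:

smallest_year_on_page = min(p['year'] for p in pdf_infos)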
@url http://www.pontagrossa.pr.gov.br/diario-oficial/
@returns requests 1
"""
return self.scrape_page(response)
Is scrape_page really needed? Since all parse does is call scrape_page, the body of scrape_page could just become the parse method.
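A minimal sketch of that idea, assuming the rest of the current scrape_page body stays as it is today:

def parse(self, response):
    pdf_links = response.css(".view-content .field a")
    # ... rest of the current scrape_page body, unchanged ...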
return self.scrape_page(response)

def scrape_page(self, response):
    pdf_links = response.css(".view-content .field a")
You started the file using ' to define strings, but changed from here on. Can we have a more consistent use of quotes? I suggest always using ' except when the string itself contains a single quote.
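For example (illustrative lines only, not taken from the diff):

is_extra = 'complementar' in pdf_link_text
note = "it's an extra edition"  # double quotes only because the string contains a '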
mes = pdf_link_info.group(2)
dia = pdf_link_info.group(3)
is_extra = "complementar" in pdf_link_text
pdf_infos.append({ "ano" : ano, "mes" : mes, "dia":dia, "url": pdf_file_name, "is_extra_edition": is_extra })
Why not extract this bit (lines 27-40) into a custom method? This method could be a generator (way better than appending to a list). It would look something like:
def pdfs_info(self, links):
    # not a @staticmethod, since it needs self.ano_minimo
    link_pattern = r'.*/diario-oficial/_?(\d{4})-(\d{2})-(\d{2}).*\.pdf'
    for link in links:
        file_name = link.css('::attr(href)').extract_first()
        link_text = link.css('::text').extract_first()
        if 'sem_atos' in file_name:
            continue
        info = re.search(link_pattern, file_name)
        year = int(info.group(1))
        if year < self.ano_minimo:
            continue
        month, day = int(info.group(2)), int(info.group(3))
        yield {
            'date': date(year, month, day),  # needs `from datetime import date`
            'is_extra_edition': 'complementar' in link_text,
        }
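Hypothetical usage from parse, assuming the method keeps the name pdfs_info and receives the link selectors:

pdf_links = response.css('.view-content .field a')
for pdf_info in self.pdfs_info(pdf_links):
    # build the gazette item from pdf_info['date'] and pdf_info['is_extra_edition']
    ...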
if pdf_infos:
    menor_ano_da_pagina = min(map(lambda p: p["ano"], pdf_infos))
    if menor_ano_da_pagina >= self.ano_minimo:
        next_page_url = "{0}{1}".format("http://www.pontagrossa.pr.gov.br", response.css(".pager-next a::attr(href)").extract_first())
We could increase readability here by splitting the logic and using f-strings, couldn't we?
url_path = response.css('.pager-next a::attr(href)').extract_first()
next_page_url = f'http://www.pontagrossa.pr.gov.br{url_path}'
However @rennerocha might have a better strategy to get absolute URLs from relative URLs in Scrapy ; )
If you have a relative URL, you should use urlparse.urljoin instead of doing string manipulation.
As this comes up very often during scraping, Scrapy provides this method on the response object:
https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response.urljoin
next_page_url = response.urljoin(response.css(".pager-next a::attr(href)").extract_first())
response.urljoin for the win 🎉
menor_ano_da_pagina = min(map(lambda p: p["ano"], pdf_infos))
if menor_ano_da_pagina >= self.ano_minimo:
    next_page_url = "{0}{1}".format("http://www.pontagrossa.pr.gov.br", response.css(".pager-next a::attr(href)").extract_first())
    yield scrapy.Request(next_page_url, self.scrape_page)
If we do rename scrape_page to parse, I guess we can omit the second argument of Request here because self.parse is the default (but I'm not 100% sure about it, would you mind testing whether that's the case?)
self.parse is the default callback for a Request. Get rid of self.scrape_page and put everything in self.parse :-)
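If everything ends up in parse, the request line could then be simply (the callback defaults to the spider's parse):

yield scrapy.Request(next_page_url)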
next_page_url = "{0}{1}".format("http://www.pontagrossa.pr.gov.br", response.css(".pager-next a::attr(href)").extract_first())
yield scrapy.Request(next_page_url, self.scrape_page)
for pdf_info in pdf_infos:
    date = "{0}-{1}-{2}".format(pdf_info["ano"], pdf_info["mes"], pdf_info["dia"])
If pdf_info carries a datetime.date (as I suggested above), this could be simplified as:

gazette_date = pdf_info['date'].strftime('%Y-%m-%d')

(I just avoid calling variables date because if you from datetime import date you end up shadowing the date object.)
municipality_id=self.MUNICIPALITY_ID,
power="executive_legislature",
scraped_at=dt.datetime.utcnow()
)
All these lines from 50 on have an extra indentation level. You could delete 4 spaces at the beginning of each of them ; )
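Putting the suggestions together, the end of the method could look roughly like this sketch (the Gazette item name is assumed from the project and the field list only covers what this diff shows):

for pdf_info in self.pdfs_info(pdf_links):
    gazette_date = pdf_info['date'].strftime('%Y-%m-%d')
    yield Gazette(  # item class assumed, not visible in this diff
        date=gazette_date,
        is_extra_edition=pdf_info['is_extra_edition'],
        municipality_id=self.MUNICIPALITY_ID,
        power='executive_legislature',
        scraped_at=dt.datetime.utcnow(),
    )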
Remember to update the cities.md
33a78d1 to 23cccff
Accepted all suggestions.
This municipality's gazette not only lacks selectable text but also has a printed PDF embedded inside another, as can be seen in the bottom left of the first page.
Thanks for the contribution, @antoniovendramin.
Ponta Grossa's gazettes are simple pages with the date in the PDF file name.
The legislative and executive branches are published in the same PDF.