Curitiba #42

Conversation
""" | ||
todays_date = dt.date.today() | ||
current_year = todays_date.year | ||
for year in reversed(range(2015, current_year + 1)): |
You can use the step parameter in `range` so you don't need to reverse the list, and it is more efficient:

```
In [5]: list(range(2018, 2014, -1))
Out[5]: [2018, 2017, 2016, 2015]
```
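Applied to the loop in the diff, a minimal sketch (assuming `dt` is the `datetime` module, as in the spider):

```python
import datetime as dt

current_year = dt.date.today().year
# Descending range: start at current_year, stop before 2015, step -1;
# no reversed() call and no intermediate list needed.
for year in range(current_year, 2014, -1):
    print(year)  # e.g. 2018, 2017, 2016, 2015
```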
```python
)

def parse_month(self, response):
    #Count how many pages and iterate
```
The name of the variable (`page_count`) is self-explanatory, so I don't think this comment is necessary.
```python
    },
    callback=self.parse_page,
)
for page_number in range(2,page_count + 1):
```
PEP8 :-) `for page_number in range(2, page_count + 1):`
```python
def scrap_not_extra_edition(self, response):
    parsed_date = response.meta['parsed_date']
    id = re.findall(r'Id=(\d+)', response.text )[0]
```
Don't do that! You have no idea whether your regex will match something. You'll get a big `IndexError` when it doesn't!
You can use the `re` method on the response. See https://doc.scrapy.org/en/latest/topics/selectors.html#using-selectors-with-regular-expressions
The response for this request is not valid HTML. I believe I can't call `re` from the response because I can't get a valid return from any selector. Am I correct?
About raising exceptions: if there is no "Id=", I wouldn't want to ignore the error, because it would mean that the website structure changed. I agree that `IndexError` is a really bad exception to raise when this happens.
Do you have any suggestions on how I should deal with it in a way that makes it clear something went wrong? Should I throw a more explanatory exception, or is there another way to log this problem?
@rennerocha @cuducos @Irio We don't know what should be done when an unexpected return happens. Should we fail silently or throw an exception? We believe this is the last change needed before the PR can be accepted.
Always try to handle the problem. But this is an avoidable problem.
In this case I didn't find where in the page this `Id=(\d+)` is.
For sure it is a string, so you can use the `re_first` method: `response.selector.re_first('Id=(\d+)')`.
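As a sketch, one way to combine `re_first` with a loud failure when the pattern is missing (the exception type here is only an illustration, not the project's convention):

```python
def scrap_not_extra_edition(self, response):
    parsed_date = response.meta['parsed_date']
    # re_first returns None when nothing matches, instead of the
    # IndexError raised by re.findall(...)[0]
    pdf_id = response.selector.re_first(r'Id=(\d+)')
    if pdf_id is None:
        # Fail loudly: a missing id most likely means the page changed
        raise ValueError('No gazette id found in {}'.format(response.url))
```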
If the page changes in the future (and I hope it does, because dealing with these _VIEWSTATEs is terrible), we will notice by monitoring it and realizing that no items were scraped.
Remember to update the cities.md
Force-pushed from 4da1959 to 3be04c0
```python
numbers = row.css("td:nth-child(1) span ::text").extract()
pdf_dates = row.css("td:nth-child(2) span ::text").extract()
ids = row.css("td:nth-child(3) a ::attr(data-teste)").extract()
for i in range(len(numbers)):
```
You don't need to do that. `rows = response.css(".grid_Row")` will return a list of selectors that you can iterate over to process each element:

```python
for row in response.css('.grid_Row'):
    number = row.css("td:nth-child(1) span ::text").extract()
    pdf_date = row.css("td:nth-child(2) span ::text").extract()
    pdf_id = row.css("td:nth-child(3) a ::attr(data-teste)").extract()
    # do what you need to do from here...
```
```python
        '__EVENTTARGET' : 'ctl00$cphMasterPrincipal$gdvGrid2'
    },
    callback=self.parse_page,
)
```
Run PEP8 on this file. You need a newline here.
```python
for i in range(12):
    yield self.scrap_month(response, i)

def scrap_month(self, response, month):
```
It is scrape, not scrap.
```python
if id == '0':
    yield scrapy.FormRequest.from_response(
        response,
        headers = {'user-agent': 'Mozilla/5.0'},
```
Is it really necessary to change the `User-Agent` in this request? You are doing similar requests in other places and didn't change it there. If this is really required for this website, I suggest setting a custom user agent for all the requests made by this spider.
Include a `custom_settings` in the spider, as you can see here https://doc.scrapy.org/en/latest/topics/settings.html#settings-per-spider, and set the desired `USER_AGENT` key (https://doc.scrapy.org/en/latest/topics/settings.html#user-agent).
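A sketch of that suggestion, reusing the Mozilla/5.0 string from the diff (the spider name is assumed for illustration):

```python
import scrapy


class PrCuritibaSpider(scrapy.Spider):
    name = 'pr_curitiba'  # assumed name, for illustration only
    # Applied to every request this spider makes, so individual
    # FormRequest calls no longer need a headers= override.
    custom_settings = {'USER_AGENT': 'Mozilla/5.0'}
```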
```python
id = re.findall(r'Id=(\d+)', response.text )[0]
return Gazette(
    date = parsed_date,
    file_urls=["http://legisladocexterno.curitiba.pr.gov.br/DiarioConsultaExterna_Download.aspx?id={}".format(id)],
```
`id` is a built-in function in Python (https://docs.python.org/3.6/library/functions.html#id). Don't use it as a variable name. Give it a better name like `pdf_id` to avoid problems.
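For instance, a minimal sketch of the rename (the lookup itself is discussed in the thread above):

```python
pdf_id = re.findall(r'Id=(\d+)', response.text)[0]
return Gazette(
    date=parsed_date,
    file_urls=['http://legisladocexterno.curitiba.pr.gov.br/'
               'DiarioConsultaExterna_Download.aspx?id={}'.format(pdf_id)],
)
```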
```python
return Gazette(
    date = parsed_date,
    file_urls=["http://legisladocexterno.curitiba.pr.gov.br/DiarioConsultaExterna_Download.aspx?id={}".format(id)],
    is_extra_edition= False,
```
```python
from gazette.items import Gazette

class PrCuritibaSpider(scrapy.Spider):
```
Inherit from `BaseGazetteSpider` (https://github.com/okfn-brasil/diario-oficial/blob/master/processing/data_collection/gazette/spiders/base.py#L6) to allow date filtering by item pipelines.
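A minimal sketch of the suggested change; the import path is inferred from the linked file and may differ:

```python
from gazette.spiders.base import BaseGazetteSpider


class PrCuritibaSpider(BaseGazetteSpider):
    name = 'pr_curitiba'  # assumed name, for illustration only
    # The rest of the spider stays the same; the base class provides
    # the hooks the item pipelines use for date filtering.
```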
```python
start_urls = ['http://legisladocexterno.curitiba.pr.gov.br/DiarioConsultaExterna_Pesquisa.aspx']


def parse(self, response):
```
You could move all this logic from `parse` and `scrap_year` into the `start_requests` method, removing `start_urls` (https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.start_requests).
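A rough sketch of that shape, under the assumption that `parse_year` absorbs the old `parse` logic (spider name is illustrative):

```python
import scrapy


class PrCuritibaSpider(scrapy.Spider):
    name = 'pr_curitiba'  # assumed name, for illustration only

    def start_requests(self):
        # Replaces start_urls: issue the initial GET explicitly and go
        # straight to the year-handling callback.
        yield scrapy.Request(
            'http://legisladocexterno.curitiba.pr.gov.br/'
            'DiarioConsultaExterna_Pesquisa.aspx',
            callback=self.parse_year,
        )
```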
Implemented all suggestions and fixes
…e a stateful ASP.NET application)
Inlining scrape year; converting the rows loop from three lists to a single list loop
Changing regex so an exception is not thrown
Force-pushed from f977e6d to fd0b0bc
```python
def parse_year(self, response):
    for i in range(12):
        yield self.scrape_month(response, i)
```
Why create a new method that is used only here? Just yield what you are returning in L45 and avoid increasing the complexity of the spider.
```python
)

def parse_year(self, response):
    for i in range(12):
```
`i` is not a good variable name. It seems you are trying to search for months, right? Maybe `month` is a better name.
Anyway, this will fail for future months (for example, considering the date of this review, December 2018). Maybe include some validation to avoid trying to get future months.
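A sketch of one such validation; passing the year explicitly and a 1-based month are assumptions (the real spider may carry the year in `response.meta` instead):

```python
import datetime as dt


def parse_year(self, response, year):
    today = dt.date.today()
    # For the current year, only iterate up to the current month;
    # earlier years get all twelve months.
    last_month = today.month if year == today.year else 12
    for month in range(1, last_month + 1):
        yield self.scrape_month(response, month)
```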
```python
    scraped_at=dt.datetime.utcnow()
)

def scrap_not_extra_edition(self, response):
```
`parse_regular_edition` is a better name for this method (it is scrape, not scrap, anyway).
@antoniovendramin's pull request (#42) was already merged when @rennerocha offered a valuable code review. This commit addresses the concerns raised there:
* Avoid non-meaningful variable names
* Reduce the complexity of the spider
* Implement a date validation to avoid attempting to scrape future gazettes

In addition:
* Sort and clean up imports
* Replace string format with f-string
* Clean up minor details

#42
Reading the official gazettes of Curitiba.
Curitiba publishes the executive and legislative gazettes in the same file.
The platform where the file is published uses ASP.NET, which makes the site stateful and forces us to resend the state parameters in every request.