
Curitiba #42

Merged
merged 18 commits into okfn-brasil:master on Jul 10, 2018
Conversation

antoniovendramin
Contributor

Reading the official gazettes of Curitiba.
Curitiba publishes the executive and legislative gazettes in the same file.
The platform where the file is published uses ASP.NET, so the site is stateful, which forces us to resend the state parameters on every request.

"""
todays_date = dt.date.today()
current_year = todays_date.year
for year in reversed(range(2015, current_year + 1)):
Collaborator

You can use the step parameter in range so you don't need to reverse the list, and it is more efficient:

In [5]: list(range(2018, 2014, -1))
Out[5]: [2018, 2017, 2016, 2015]
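
Applied to the loop quoted above, a minimal sketch of the suggested change (assuming current_year is computed from dt.date.today() as in the original code):

import datetime as dt

current_year = dt.date.today().year
# Counting down with a negative step avoids building and reversing a list
for year in range(current_year, 2014, -1):
    pass  # issue the requests for this year, as in the original spider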

)

def parse_month(self, response):
#Count how many pages and iterate
Collaborator

The name of the variable (page_count) is self-explanatory, so I don't think this comment is necessary.

},
callback=self.parse_page,
)
for page_number in range(2,page_count + 1):
Collaborator

PEP8 :-) for page_number in range(2, page_count + 1):


def scrap_not_extra_edition(self, response):
parsed_date = response.meta['parsed_date']
id = re.findall(r'Id=(\d+)', response.text )[0]
Collaborator

Don't do that! You have no idea whether your regex will match anything, and you'll get an IndexError when it doesn't!
You can use the re method in response. See https://doc.scrapy.org/en/latest/topics/selectors.html#using-selectors-with-regular-expressions

Contributor
@beothorn beothorn May 21, 2018

The response for this request is not valid HTML. I believe I can't call re from response because I can't get a valid return from any selector, am I correct?

About raising exceptions: if there is no "Id=" I wouldn't want to ignore this error, because it would mean that the website structure changed. I agree that IndexError is a really bad exception to raise when this happens.
Do you have any suggestion on how I should deal with it in a way that makes it clear that something went wrong? Should I throw a more explanatory exception, or is there another way to log this problem?

Contributor Author

@rennerocha @cuducos @Irio We don't know what must be done when an unexpected return happens. Should we fail silently or throw an exception? We guess that is the last change needed before the PR can be accepted.

Collaborator

Always try to handle the problem. But this is an avoidable problem.

In this case I didn't find where in the page this Id=(\d+) is.
For sure it is a string, so you can use the re_first method: response.selector.re_first('Id=(\d+)').
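
For reference, one way to make the failure explicit and descriptive when the pattern is missing (a sketch only, using re.search on response.text since the thread above notes the response may not be valid HTML; the exception type and message are assumptions, not the project's actual fix):

import re

def scrape_not_extra_edition(self, response):
    parsed_date = response.meta['parsed_date']
    match = re.search(r'Id=(\d+)', response.text)
    if match is None:
        # Fail loudly: a missing Id likely means the website structure changed
        raise ValueError(f"No 'Id=' found in the response for {parsed_date}")
    pdf_id = match.group(1)
    # build and return the Gazette item with pdf_id as before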

Collaborator

If the page changes in the future (and I hope it does, because dealing with these _VIEWSTATEs is terrible), we will notice by monitoring it and seeing that no items were scraped.

@alfakini alfakini mentioned this pull request May 24, 2018
@giovanisleite
Contributor

Remember to update the cities.md

numbers = row.css("td:nth-child(1) span ::text").extract()
pdf_dates = row.css("td:nth-child(2) span ::text").extract()
ids = row.css("td:nth-child(3) a ::attr(data-teste)").extract()
for i in range(len(numbers)):
Collaborator

You don't need to do that. rows = response.css(".grid_Row") will return a list of selectors that you can iterate over and process each element:

for row in response.css('.grid_Row'):
    number = row.css("td:nth-child(1) span ::text").extract()
    pdf_date = row.css("td:nth-child(2) span ::text").extract()
    pdf_id = row.css("td:nth-child(3) a ::attr(data-teste)").extract()
    # do what you need to do from here...

'__EVENTTARGET' : 'ctl00$cphMasterPrincipal$gdvGrid2'
},
callback=self.parse_page,
)
Collaborator

Run PEP8 in this file. You need a newline here.

for i in range(12):
yield self.scrap_month(response, i)

def scrap_month(self, response, month):
Collaborator

It is scrape, not scrap.

if id == '0':
yield scrapy.FormRequest.from_response(
response,
headers = {'user-agent': 'Mozilla/5.0'},
Collaborator

Is it really necessary to change the User-Agent in this request? You are doing similar requests in other places and didn't change it. If this is really required for this website, I suggest setting a custom user agent for all the requests made by this spider.

Include a custom_settings attribute in the spider, as you can see here https://doc.scrapy.org/en/latest/topics/settings.html#settings-per-spider, and set the desired USER_AGENT key (https://doc.scrapy.org/en/latest/topics/settings.html#user-agent).
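
A minimal sketch of what that could look like (the user-agent string is the one already hard-coded in the request above; the spider name is an assumption for illustration):

import scrapy

class PrCuritibaSpider(scrapy.Spider):
    name = 'pr_curitiba'  # assumed name, for illustration only
    custom_settings = {
        # Applied to every request made by this spider
        'USER_AGENT': 'Mozilla/5.0',
    }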


id = re.findall(r'Id=(\d+)', response.text )[0]
return Gazette(
date = parsed_date,
file_urls=["http://legisladocexterno.curitiba.pr.gov.br/DiarioConsultaExterna_Download.aspx?id={}".format(id)],
Collaborator

id is a built-in function in Python (https://docs.python.org/3.6/library/functions.html#id). Don't use it as a variable name. Give a better name like pdf_id to avoid problems.
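
For illustration, the rename applied together with the re_first suggestion from the earlier thread could look like this (a sketch; the f-string is an assumption, not part of the original diff):

pdf_id = response.selector.re_first(r'Id=(\d+)')
file_urls = [f"http://legisladocexterno.curitiba.pr.gov.br/DiarioConsultaExterna_Download.aspx?id={pdf_id}"]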


return Gazette(
date = parsed_date,
file_urls=["http://legisladocexterno.curitiba.pr.gov.br/DiarioConsultaExterna_Download.aspx?id={}".format(id)],
is_extra_edition= False,
Collaborator

I noticed that we have gazettes like 92 Supl 1 and 92, but some days don't have the Supl version.
Shouldn't it be considered an extra edition? @Irio @cuducos any idea?


from gazette.items import Gazette

class PrCuritibaSpider(scrapy.Spider):

start_urls = ['http://legisladocexterno.curitiba.pr.gov.br/DiarioConsultaExterna_Pesquisa.aspx']


def parse(self, response):
Collaborator

You could move all this logic from parse and scrap_year into the start_requests method (https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.start_requests), removing the start_urls.
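
A minimal sketch of the start_requests shape being suggested (the spider name is assumed for illustration; how much of the year logic can move in here depends on the ASP.NET form state, which only arrives with the first response):

import scrapy

class PrCuritibaSpider(scrapy.Spider):
    name = 'pr_curitiba'  # assumed name, for illustration only

    def start_requests(self):
        url = ('http://legisladocexterno.curitiba.pr.gov.br/'
               'DiarioConsultaExterna_Pesquisa.aspx')
        # Replaces start_urls: the first request is built explicitly here
        yield scrapy.Request(url, callback=self.parse)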

@beothorn
Contributor

beothorn commented Jun 5, 2018

Implemented all suggestions and fixes.
Unfortunately, we do need to set the user agent, or else regular editions will not work (yes, only regular editions 🤷).

@cuducos cuducos merged commit fd0b0bc into okfn-brasil:master Jul 10, 2018

def parse_year(self, response):
for i in range(12):
yield self.scrape_month(response, i)
Collaborator

Why create a new method that is used only here? Just yield what you are returning in L45 and avoid increasing the complexity of the spider.

)

def parse_year(self, response):
for i in range(12):
Collaborator

i is not a good variable name. It seems you are trying to search for months, right? Maybe month is a better name.
Anyway, this will fail in future months (for example, considering the date of this review, December/2018). Maybe include some validation to avoid trying to fetch future months.
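
A minimal sketch of the kind of validation meant here (the year being available in response.meta is an assumption for illustration, not something present in the diff):

import datetime as dt

def parse_year(self, response):
    today = dt.date.today()
    year = response.meta['year']  # assumed to be passed along with the request
    # Only request months that have already started
    last_month = today.month if year == today.year else 12
    for month in range(1, last_month + 1):
        yield self.scrape_month(response, month)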

scraped_at=dt.datetime.utcnow()
)

def scrap_not_extra_edition(self, response):
Collaborator

parse_regular_edition is a better name for this method (it is scrape, not scrap anyway).

cuducos added a commit that referenced this pull request Jul 24, 2018
@antoniovendramin's pull request (#42) was already merged when @rennerocha
offered a valuable code review. This commit addresses the concerns
raised there:

* Avoid non-meaningful variable names
* Reduce complexity of the spider
* Implement a date validation to avoid attempting to scrape future
gazettes

In addition:

* Sort and clean up imports
* Replace string formatting with f-strings
* Clean up minor details

#42
@cuducos cuducos mentioned this pull request Jul 24, 2018
@trevineju trevineju added this to the Capitais | Capital Cities milestone Oct 10, 2022
@trevineju trevineju added the spider Adiciona robô raspador para município(s) label Oct 24, 2022