
Curitiba #42

Merged
merged 18 commits into okfn-brasil:master on Jul 10, 2018
Conversation

antoniovendramin
Contributor

Reading the official gazettes of Curitiba.
Curitiba publishes the executive and legislative gazettes in the same file.
The platform where the file is published uses ASP.NET, so the site is stateful, which forces us to resend the state parameters on every request.

"""
todays_date = dt.date.today()
current_year = todays_date.year
for year in reversed(range(2015, current_year + 1)):
Collaborator

You can use the step parameter in range so you don't need to reverse the list, and it is more efficient:

In [5]: list(range(2018, 2014, -1))
Out[5]: [2018, 2017, 2016, 2015]
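
Applied to the loop quoted above, a minimal sketch of the suggested change (assuming current_year is computed from dt.date.today() as in the original code):

import datetime as dt

current_year = dt.date.today().year
# Counting down with a negative step avoids building and reversing a list
for year in range(current_year, 2014, -1):
    pass  # issue the requests for this year, as in the original spider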

)

def parse_month(self, response):
#Count how many pages and iterate
Collaborator

The name of the variable (page_count) is self-explanatory, so I don't think this comment is necessary.

},
callback=self.parse_page,
)
for page_number in range(2,page_count + 1):
Collaborator

PEP8 :-) for page_number in range(2, page_count + 1):


def scrap_not_extra_edition(self, response):
parsed_date = response.meta['parsed_date']
id = re.findall(r'Id=(\d+)', response.text )[0]
Collaborator

Don't do that! You have no idea whether your regex will match anything, and you'll get an IndexError when it doesn't!
You can use the re method in response. See https://doc.scrapy.org/en/latest/topics/selectors.html#using-selectors-with-regular-expressions

Contributor
@beothorn beothorn May 21, 2018

The response for this request is not valid HTML. I believe I can't call re from response because I can't get a valid return from any selector, am I correct?

About raising exceptions: if there is no "Id=" I wouldn't want to ignore this error, because it would mean that the website structure changed. I agree that IndexError is a really bad exception to raise when this happens.
Do you have any suggestion on how I should deal with it in a way that makes it clear that something went wrong? Should I throw a more explanatory exception, or is there another way to log this problem?

Contributor Author

@rennerocha @cuducos @Irio We don't know what must be done when an unexpected return happens. Should we fail silently or throw an exception? We guess that is the last change needed before the PR can be accepted.

Collaborator

Always try to handle the problem. But this is an avoidable problem.

In this case I didn't find where in the page this Id=(\d+) is.
For sure it is a string, so you can use the re_first method: response.selector.re_first('Id=(\d+)').
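
For reference, one way to make the failure explicit and descriptive when the pattern is missing (a sketch only, using re.search on response.text since the thread above notes the response may not be valid HTML; the exception type and message are assumptions, not the project's actual fix):

import re

def scrape_not_extra_edition(self, response):
    parsed_date = response.meta['parsed_date']
    match = re.search(r'Id=(\d+)', response.text)
    if match is None:
        # Fail loudly: a missing Id likely means the website structure changed
        raise ValueError(f"No 'Id=' found in the response for {parsed_date}")
    pdf_id = match.group(1)
    # build and return the Gazette item with pdf_id as before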

Collaborator

If the page changes in the future (and I hope it does, because dealing with these _VIEWSTATEs is terrible), we will notice by monitoring it and seeing that no items were scraped.

@alfakini alfakini mentioned this pull request May 24, 2018
@giovanisleite
Contributor

Remember to update the cities.md

numbers = row.css("td:nth-child(1) span ::text").extract()
pdf_dates = row.css("td:nth-child(2) span ::text").extract()
ids = row.css("td:nth-child(3) a ::attr(data-teste)").extract()
for i in range(len(numbers)):
Collaborator

You don't need to do that. rows = response.css(".grid_Row") will return a list of selectors that you can iterate over and process each element:

for row in response.css('.grid_Row'):
    number = row.css("td:nth-child(1) span ::text").extract()
    pdf_date = row.css("td:nth-child(2) span ::text").extract()
    pdf_id = row.css("td:nth-child(3) a ::attr(data-teste)").extract()
    # do what you need to do from here...

'__EVENTTARGET' : 'ctl00$cphMasterPrincipal$gdvGrid2'
},
callback=self.parse_page,
)
Collaborator

Run PEP8 in this file. You need a newline here.

for i in range(12):
yield self.scrap_month(response, i)

def scrap_month(self, response, month):
Collaborator

It is scrape, not scrap.

if id == '0':
yield scrapy.FormRequest.from_response(
response,
headers = {'user-agent': 'Mozilla/5.0'},
Collaborator

Is it really necessary to change the User-Agent in this request? You are doing similar requests in other places and didn't change it. If this is really required for this website, I suggest setting a custom user agent for all the requests made by this spider.

Include a custom_settings attribute in the spider, as you can see here https://doc.scrapy.org/en/latest/topics/settings.html#settings-per-spider, and set the desired USER_AGENT key (https://doc.scrapy.org/en/latest/topics/settings.html#user-agent).
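
A minimal sketch of what that could look like (the user-agent string is the one already hard-coded in the request above; the spider name is an assumption for illustration):

import scrapy

class PrCuritibaSpider(scrapy.Spider):
    name = 'pr_curitiba'  # assumed name, for illustration only
    custom_settings = {
        # Applied to every request made by this spider
        'USER_AGENT': 'Mozilla/5.0',
    }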


id = re.findall(r'Id=(\d+)', response.text )[0]
return Gazette(
date = parsed_date,
file_urls=["http://legisladocexterno.curitiba.pr.gov.br/DiarioConsultaExterna_Download.aspx?id={}".format(id)],
Collaborator

id is a built-in function in Python (https://docs.python.org/3.6/library/functions.html#id). Don't use it as a variable name. Give a better name like pdf_id to avoid problems.
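
For illustration, the rename applied together with the re_first suggestion from the earlier thread could look like this (a sketch; the f-string is an assumption, not part of the original diff):

pdf_id = response.selector.re_first(r'Id=(\d+)')
file_urls = [f"http://legisladocexterno.curitiba.pr.gov.br/DiarioConsultaExterna_Download.aspx?id={pdf_id}"]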


return Gazette(
date = parsed_date,
file_urls=["http://legisladocexterno.curitiba.pr.gov.br/DiarioConsultaExterna_Download.aspx?id={}".format(id)],
is_extra_edition= False,
Collaborator

I noticed that we have gazettes like 92 Supl 1 and 92, but some days don't have the Supl version.
Shouldn't it be considered an extra edition? @Irio @cuducos any idea?


from gazette.items import Gazette

class PrCuritibaSpider(scrapy.Spider):

start_urls = ['http://legisladocexterno.curitiba.pr.gov.br/DiarioConsultaExterna_Pesquisa.aspx']


def parse(self, response):
Collaborator

You could move all this logic from parse and scrap_year into the start_requests method (https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.start_requests), removing the start_urls.
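
A minimal sketch of the start_requests shape being suggested (the spider name is assumed for illustration; how much of the year logic can move in here depends on the ASP.NET form state, which only arrives with the first response):

import scrapy

class PrCuritibaSpider(scrapy.Spider):
    name = 'pr_curitiba'  # assumed name, for illustration only

    def start_requests(self):
        url = ('http://legisladocexterno.curitiba.pr.gov.br/'
               'DiarioConsultaExterna_Pesquisa.aspx')
        # Replaces start_urls: the first request is built explicitly here
        yield scrapy.Request(url, callback=self.parse)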

@beothorn
Contributor

beothorn commented Jun 5, 2018

Implemented all suggestions and fixes.
Unfortunately, we do need to set the user agent, or else regular editions will not work (yes, only regular editions 🤷).

@cuducos cuducos merged commit fd0b0bc into okfn-brasil:master Jul 10, 2018

def parse_year(self, response):
for i in range(12):
yield self.scrape_month(response, i)
Collaborator

Why create a new method that is used only here? Just yield what you are returning in L45 and avoid increasing the complexity of the spider.

)

def parse_year(self, response):
for i in range(12):
Collaborator

i is not a good variable name. It seems you are trying to search for months, right? Maybe month is a better name.
Anyway, this will fail in future months (for example, considering the date of this review, December/2018). Maybe include some validation to avoid trying to fetch future months.
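
A minimal sketch of the kind of validation meant here (the year being available in response.meta is an assumption for illustration, not something present in the diff):

import datetime as dt

def parse_year(self, response):
    today = dt.date.today()
    year = response.meta['year']  # assumed to be passed along with the request
    # Only request months that have already started
    last_month = today.month if year == today.year else 12
    for month in range(1, last_month + 1):
        yield self.scrape_month(response, month)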

scraped_at=dt.datetime.utcnow()
)

def scrap_not_extra_edition(self, response):
Collaborator

parse_regular_edition is a better name for this method (it is scrape, not scrap anyway).

cuducos added a commit that referenced this pull request Jul 24, 2018
@antoniovendramin's pull request (#42) was already merged when @rennerocha
offered a valuable code review. This commit addresses the concerns
raised there:

* Avoid non-meaningful variable names
* Reduce complexity of the spider
* Implement a date validation to avoid attempting to scrape future
gazettes

In addition:

* Sort and clean up imports
* Replace string formatting with f-strings
* Clean up minor details

#42
@cuducos cuducos mentioned this pull request Jul 24, 2018
@trevineju trevineju added this to the Capitais | Capital Cities milestone Oct 10, 2022
@trevineju trevineju added the spider Adiciona robô raspador para município(s) label Oct 24, 2022