
Finish Governador Valadares/MG spider #269

Merged
merged 30 commits into okfn-brasil:main from reform/mg_governador_valadares on Oct 29, 2020

Conversation

giovanisleite
Contributor

@giovanisleite giovanisleite commented Oct 4, 2020

This is part of the efforts needed for #221
Closes #19

  • Refactor the parse method
  • Changed the request headers to include only what is needed
  • Updated the city URL (added HTTPS; the endpoint changed too)
  • Get the URL dynamically (the URL changed while this PR was open)
  • Filter by start date (see the sketch below)

I was able to download all files:


```
$ ls data/full/*.txt | wc -l
1624
```
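
For reference, a minimal sketch of what the start-date filtering can look like in a Scrapy spider. The URL, selectors, and date format below are illustrative assumptions, not the code from this PR; only the `start_date` comparison mirrors what the PR does.

```python
# Illustrative sketch of start-date filtering in a gazette spider.
# URL, selectors and date format are assumptions, not this PR's code.
from datetime import date, datetime

import scrapy


class GovernadorValadaresSketchSpider(scrapy.Spider):
    name = "mg_governador_valadares_sketch"
    start_urls = ["https://example.com/diario-oficial"]  # placeholder URL
    start_date = date(2010, 1, 1)  # earliest date to collect; illustrative default

    def parse(self, response):
        for row in response.css("table tr"):
            raw_date = row.css("td.date::text").get()
            if not raw_date:
                continue
            gazette_date = datetime.strptime(raw_date.strip(), "%d/%m/%Y").date()
            if gazette_date < self.start_date:
                # Skip editions published before the requested start date
                continue
            yield {
                "date": gazette_date,
                "file_urls": [response.urljoin(row.css("a::attr(href)").get())],
            }
```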

@giovanisleite giovanisleite changed the title from "Finish Governador Valadares/MG" to "Finish Governador Valadares/MG spider" on Oct 4, 2020
@giovanisleite giovanisleite requested a review from jvanz October 10, 2020 19:03
Collaborator

@jvanz jvanz left a comment

I don't see anything wrong. But I would just extract some lines into functions with explanatory names.

@giovanisleite giovanisleite requested a review from jvanz October 10, 2020 20:22
@giovanisleite giovanisleite force-pushed the reform/mg_governador_valadares branch from f7e00fe to 749c232 Compare October 10, 2020 21:01
@jvanz
Collaborator

jvanz commented Oct 18, 2020

Sorry for the delay... =/

I've just run the spider. It works fine. Awesome work @giovanisleite! Thanks for your help on this!

@giovanisleite giovanisleite requested a review from jvanz October 20, 2020 11:33
@giovanisleite
Contributor Author

I don't see anything wrong. But I would just extract some lines into functions with explanatory names.

Done!

Thank you for your help, @jvanz

@sergiomario sergiomario added the hacktoberfest-accepted Pull Requests aprovados na Hacktoberfest label Oct 28, 2020
@jvanz jvanz merged commit 1714e47 into okfn-brasil:main Oct 29, 2020
@giovanisleite giovanisleite deleted the reform/mg_governador_valadares branch October 29, 2020 20:11
adorilson added a commit to adorilson/querido-diario that referenced this pull request Nov 8, 2020
* Add spider for Fernandopolis, Sao Paulo (okfn-brasil#225)

Add spider for Fernandopolis, Sao Paulo

* Palmas-TO spider refactor with improved date filtering (okfn-brasil#273)

Refactor Palmas-TO spider to allow filtering by date

* Obtain database connection info from Scrapy project settings

* Move database module to be accessible from spiders

* Set SQLite as default database to store extracted items

SQLite was chosen to make development easier (it doesn't require installing a
PostgreSQL instance on the development machine). When moving to the production
server, we will just need to update the connection string to point to the real
production database.

* Store extracted items in configured database

This pipeline stores the results of the scraping in a database. By default we
use a SQLite database to make development easier. In production we can set up
a proper database.
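
A rough sketch of how the three database commits above could fit together: the connection string is read from the Scrapy settings, defaults to SQLite, and a pipeline persists items. The setting name `DATABASE_URL`, the `Gazette` model, and its fields are illustrative assumptions, not necessarily the project's actual names.

```python
# Sketch: store scraped items in the database configured via Scrapy settings,
# defaulting to SQLite. Setting name, model and fields are assumptions.
from sqlalchemy import Column, Date, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class Gazette(Base):
    __tablename__ = "gazettes"
    id = Column(Integer, primary_key=True)
    territory_id = Column(String)
    date = Column(Date)
    file_url = Column(String)


class DatabaseStoragePipeline:
    @classmethod
    def from_crawler(cls, crawler):
        # Connection info comes from the project settings; SQLite by default
        database_url = crawler.settings.get("DATABASE_URL", "sqlite:///querido-diario.db")
        return cls(database_url)

    def __init__(self, database_url):
        engine = create_engine(database_url)
        Base.metadata.create_all(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        session = self.Session()
        session.add(
            Gazette(
                territory_id=item.get("territory_id"),
                date=item.get("date"),
                file_url=(item.get("file_urls") or [None])[0],
            )
        )
        session.commit()
        session.close()
        return item
```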

* Rename pipeline with a more generic name

We may change our database later, so we should not be tied to PostgreSQL at
this point.

* Add Sigpub base spider and all subspiders

SigpubGazetteSpider is a base spider for many systems. Most of them are
associations of cities, but some are not.

Spiders added:
- al_associacao_municipios
- am_associacao_municipios
- ba_associacao_municipios
- ce_associacao_municipios
- go_associacao_municipios_agm
- go_associacao_municipios_fgm
- mg_associacao_municipios
- ms_associacao_municipios
- mt_associacao_municipios
- pa_associacao_municipios
- pb_associacao_municipios
- pe_associacao_municipios
- pi_associacao_municipios
- pr_associacao_municipios
- rj_associacao_municipios
- rn_associacao_municipios
- ro_associacao_municipios
- rr_associacao_municipios
- rs_associacao_municipios
- se_associacao_municipios
- sp_associacao_municipios
- sp_macatuba
- sp_monte_alto_sigpub

* Update CITIES.md

* Limit requests according to start date argument provided

* File processed flag (okfn-brasil#306)

* .gitignore: sqlite database file

Adds the SQLite database file used to store the gazette data to .gitignore

Signed-off-by: José Guilherme Vanz <[email protected]>

* SQLite client

Updates the container used to run the spider, installing the package necessary
to access the SQLite database. Furthermore, the "sql" make target now opens the
sqlite command line, allowing the user to interact with the database if
necessary.

Signed-off-by: José Guilherme Vanz <[email protected]>

* Processed column

Adds a new column to the Gazette model. This column should be used by the
processing pipeline to find which gazettes are pending ingestion by the system.
This is not the ideal solution; a queue system would be a better approach, for
instance. But for the current state of the project it is enough.

Signed-off-by: José Guilherme Vanz <[email protected]>

* Add Itu SP Spider (okfn-brasil#303)

Co-authored-by: Giulio Carvalho <[email protected]>

* Add Piracicaba-SP spider (okfn-brasil#312)

* Add Piracicaba-SP spider

* Update power of gazette available

Co-authored-by: Giulio Carvalho <[email protected]>

* Remove text extraction from pipeline (okfn-brasil#310)

Processing of the content of downloaded files will be done in a separate
pipeline, so we don't need it in this part of the process.

* Store all files information available for a date (okfn-brasil#315)

Some cities have more than one file for the same gazette edition. We need to
store the uploaded file information (like URL, path and checksum) for all of
them. We were storing just the information of the first file available (which
is fine for the majority of cities that have only one file per date), but that
fails when more files are available.
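
A small sketch of the idea, assuming the standard `files` field that Scrapy's FilesPipeline fills in with one url/path/checksum entry per downloaded file; the helper name and record layout are made up for illustration.

```python
# Keep one record per downloaded file instead of only item["files"][0].
# The `files` field is populated by Scrapy's FilesPipeline; this helper and
# the record layout are illustrative, not the project's actual code.
def extract_file_records(item):
    return [
        {
            "url": file_info["url"],
            "path": file_info["path"],
            "checksum": file_info["checksum"],
        }
        for file_info in item.get("files", [])
    ]
```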

* Adding a spider for RN Mossoró

* Add spider for Ananindeua/PA (okfn-brasil#297)

Adds the spider for Ananindeua, Pará

* Troubleshooting documentation when building and running container images  (okfn-brasil#291)

Added instructions for adding the user to the docker group to resolve errors when building or running the images

Co-authored-by: Adorilson Bezerra <[email protected]>
Co-authored-by: José Guilherme Vanz <[email protected]>

* Creating a guide to help Windows users do their setup (okfn-brasil#301)

Windows user setup guide

* Campina Grande/PB spider  (okfn-brasil#287)

Add spider for Campina Grande/PB

Co-authored-by: Thiago Nóbrega <[email protected]>
Co-authored-by: José Guilherme Vanz <[email protected]>
Co-authored-by: Giulio Carvalho <[email protected]>

* Upgrade Scrapy to 2.4.0

Upgrades the Scrapy version used by the project to 2.4.0.

Signed-off-by: José Guilherme Vanz <[email protected]>

* Organize gazettes under directories

With Scrapy 2.4.0 it is possible to get information from the item to define
where the downloaded files will be stored. This commit removes the previous
hack that did the same. Now the gazette files are stored in directories
organized by territory ID and gazette date.

Signed-off-by: José Guilherme Vanz <[email protected]>
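
As a rough illustration of the Scrapy 2.4.0 feature mentioned above (the `item` argument now passed to `FilesPipeline.file_path`), a minimal sketch; the class name and item fields are assumptions.

```python
# Sketch: organize downloaded files under <territory_id>/<date>/ using the
# `item` argument Scrapy 2.4.0 passes to FilesPipeline.file_path.
# Class name and item fields are illustrative assumptions.
import os

from scrapy.pipelines.files import FilesPipeline


class GazetteFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Keep the default hashed file name, but nest it per territory/date
        default_path = super().file_path(request, response=response, info=info, item=item)
        filename = os.path.basename(default_path)
        return os.path.join(str(item["territory_id"]), str(item["date"]), filename)
```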

* Spidermon schema (okfn-brasil#302)

* enable spidermon and create JSON schema
* Drop items when they have validation errors

Co-authored-by: Renne Rocha <[email protected]>
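
For context, a sketch of the kind of settings this typically involves in Spidermon (item validation against a JSON Schema, dropping invalid items); the schema path and pipeline priority are illustrative, not necessarily what this commit used.

```python
# Sketch of Spidermon item-validation settings (settings.py).
# Schema path and pipeline priority are illustrative assumptions.
SPIDERMON_ENABLED = True

ITEM_PIPELINES = {
    "spidermon.contrib.scrapy.pipelines.ItemValidationPipeline": 800,
}

# JSON Schema describing a valid Gazette item
SPIDERMON_VALIDATION_SCHEMAS = ["gazette_item_schema.json"]

# Drop items that fail validation instead of storing them
SPIDERMON_VALIDATION_DROP_ITEMS_WITH_ERRORS = True
```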

* Add spider monitoring (okfn-brasil#304)

* creating ratio monitor

Co-authored-by: Renne Rocha <[email protected]>

* Remove the autogenerated gazette fields from the spiders (okfn-brasil#305)

* Remove setting `scraped_at` and `territory_id` from items creation on spiders

* Remove imported but unused `datetime`s

The file list was collected with flake8, as follows:

```
$ flake8 . | grep unused | grep datetime

./spiders/al_maceio.py:1:1: F401 'datetime.datetime' imported but unused
./spiders/ba_feira_de_santana.py:2:1: F401 'datetime as dt' imported but unused
./spiders/ba_vitoria_da_conquista.py:1:1: F401 'datetime.datetime' imported but unused
./spiders/df_brasilia.py:1:1: F401 'datetime.datetime' imported but unused
./spiders/es_associacao_municipios.py:2:1: F401 'datetime as dt' imported but unused
./spiders/go_aparecida_de_goiania.py:1:1: F401 'datetime as dt' imported but unused
./spiders/instar_base.py:1:1: F401 'datetime as dt' imported but unused
./spiders/pb_joao_pessoa.py:1:1: F401 'datetime.datetime' imported but unused
./spiders/pr_cascavel.py:3:1: F401 'datetime as dt' imported but unused
./spiders/pr_curitiba.py:2:1: F401 'datetime.datetime' imported but unused
./spiders/pr_londrina.py:1:1: F401 'datetime as dt' imported but unused
./spiders/rj_campos_goytacazes.py:1:1: F401 'datetime.datetime' imported but unused
./spiders/rn_natal.py:1:1: F401 'datetime.datetime' imported but unused
./spiders/rr_boa_vista.py:1:1: F401 'datetime as dt' imported but unused
./spiders/sc_florianopolis.py:2:1: F401 'datetime.datetime' imported but unused
./spiders/sc_joinville.py:3:1: F401 'datetime.datetime' imported but unused
./spiders/sp_bauru.py:5:1: F401 'datetime.datetime' imported but unused
./spiders/sp_fernandopolis.py:1:1: F401 'datetime.datetime' imported but unused
./spiders/sp_jau.py:4:1: F401 'datetime.datetime' imported but unused
./spiders/sp_presidente_prudente.py:2:1: F401 'datetime.datetime' imported but unused
./spiders/sp_sao_jose_dos_campos.py:3:1: F401 'datetime.datetime' imported but unused
./spiders/to_araguaina.py:4:1: F401 'datetime as dt' imported but unused
./spiders/to_palmas.py:2:1: F401 'datetime as dt' imported but unused
```

* [pr_curitiba][to_palmas] blackify spiders

* Default value for the processed column

The default value for the processed column in the database should be
false. Thus, all the files will be processed by the processing pipeline.

Signed-off-by: José Guilherme Vanz <[email protected]>
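
Putting the two processed-column commits together, the column might look roughly like the fragment below in a SQLAlchemy declarative model; the names are assumptions, as in the earlier database sketch.

```python
# Illustrative Gazette model fragment with the `processed` flag defaulting to
# False, so newly scraped gazettes are pending for the processing pipeline.
from sqlalchemy import Boolean, Column, Date, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Gazette(Base):
    __tablename__ = "gazettes"
    id = Column(Integer, primary_key=True)
    territory_id = Column(String)
    date = Column(Date)
    processed = Column(Boolean, default=False, nullable=False)
```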

* Fix power field to the values expected by spidermon (okfn-brasil#327)

* Fix power field to the values expected by spidermon

After the spidermon configuration, some spiders stopped working due to an
invalid value in the power item field. This commit replaces all these invalid
values with values allowed by the spidermon schema.

Signed-off-by: José Guilherme Vanz <[email protected]>

* Fix power field for legislative gazettes

Fix the spiders that use invalid values for the power field when the gazette
found is from the legislative power.

Signed-off-by: José Guilherme Vanz <[email protected]>

* Format code using black

* Automatically sort imports using isort module

Signed-off-by: Jonathan Schweder <[email protected]>

* Applying backfill formatting for `isort` and `black` utilities

Signed-off-by: Jonathan Schweder <[email protected]>

* São José dos Pinhais spider (okfn-brasil#325)

Adds the spider for São José dos Pinhais, PR

Co-authored-by: D <[email protected]>
Co-authored-by: José Guilherme Vanz <[email protected]>

* Fix code formatting

* Governador Valadares/MG spider (okfn-brasil#269)

Add spider for Governador Valadares, MG

* Add spider monitoring

Monitors included:
- Check that there are no errors
- Check that the spider finished correctly
- Check that there are no validation errors

Action added when spider finishes:
- Send message to Telegram
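
A minimal sketch of such a monitor suite using Spidermon's built-in monitors; the suite name is illustrative, and the Telegram action wiring is only hinted at in a comment.

```python
# Sketch of a close-spider monitor suite covering the three checks above.
# Suite name is illustrative; the Telegram action is only hinted at here.
from spidermon import MonitorSuite
from spidermon.contrib.scrapy.monitors import (
    ErrorCountMonitor,
    FinishReasonMonitor,
    ItemValidationMonitor,
)


class SpiderCloseMonitorSuite(MonitorSuite):
    monitors = [
        ErrorCountMonitor,       # no errors logged during the crawl
        FinishReasonMonitor,     # spider finished for an expected reason
        ItemValidationMonitor,   # no item validation errors
    ]
    # monitors_finished_actions would hold the action that sends the
    # Telegram message when the spider finishes.
```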

* Initialize database with territories information

* Add crawler for Canoas/RS

* Run Black and fix case for days without gazettes

* Rename START_DATE -> start_date

* Move Gazette JSON-Schema to a more suitable location

* Update project structure to allow spiders to be deployed in Scrapy Cloud

* Fix Telegram message when there are no items scraped

* Remove docker fixing conflicts (okfn-brasil#341)

* Remove Docker configuration

* Reorganize directories and update Makefile to remove Docker references

* Create directory to store downloaded files locally

* Update Travis CI removing Docker references

* Update docs to execute project without Docker

* Include isort as requirement

* Inserted guidelines for maintainers (okfn-brasil#276)

* Inserted guidelines for maintainers

I could open an issue to discuss this, but I have adopted a philosophy of starting a discussion with a proposal.

I created this chapter to propose an organization for the project maintainers. The idea is that it is not legislation with multiple responsibilities, but a guide that facilitates communication and reduces rework.

I am doing this as we invite @giuliocc to join our team of maintainers. 🥇  Thus, we are growing while creating a working model that prevents distress for people as generous as the Querido Diario maintainers.

"There is a criteria I call change latency, which is the round-trip time from identifying a problem to testing a solution. The faster the better. If maintainers cannot respond to pull requests as rapidly as people expect, they're not doing their job (or they need more hands)." (Peter Hintjens)
https://hintjens.gitbooks.io/social-architecture/content/chapter4.html

I would like opinions: @jvanz @rennerocha

* Update maintainer responsibilities

* Add Jinja2 to requirements.in/txt (okfn-brasil#346)

* Fix FILES_STORE and README (okfn-brasil#345)

* Fix readme instructions

* Fix FILES_STORE which is pointing to /mnt/data

* Misleading error message when writing to the database fails

The integrity error raised by SQLAlchemy is not raised only when the item is
already in the database. For instance, it can also be raised when the
territories table is missing some territory ID. Thus, this commit updates the
error message and adds the exception to the log message, so that users will be
able to detect which kind of error they are facing.

Signed-off-by: José Guilherme Vanz <[email protected]>
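
The kind of handling this describes might look like the sketch below; the function and model names are illustrative, and only the idea of logging the exception alongside the message comes from the commit.

```python
# Sketch: include the original exception in the log so a duplicate item can be
# told apart from, e.g., a missing territory ID. Names are illustrative.
import logging

from sqlalchemy.exc import IntegrityError

logger = logging.getLogger(__name__)


def save_gazette(session, gazette):
    try:
        session.add(gazette)
        session.commit()
    except IntegrityError as exc:
        session.rollback()
        # Not necessarily a duplicate: the territories table may be missing
        # the territory ID, for example.
        logger.warning("Could not save gazette %r: %s", gazette, exc)
```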

* Add missing requirements (okfn-brasil#350)

* Apply formatting rules to entire project

* Add pre-commit configuration

* Update documentation adding new step to install pre-commit hooks

* Update contributing file to warn about automatic code formatting

* Fix typo

* Add some info about data_collection in README

* Update path to data_collection in CONTRIBUTING

Co-authored-by: André Angeluci <[email protected]>
Co-authored-by: Renne Rocha <[email protected]>
Co-authored-by: Giulio Carvalho <[email protected]>
Co-authored-by: José Guilherme Vanz <[email protected]>
Co-authored-by: Gabriel (Gabu) Bellon <[email protected]>
Co-authored-by: Pedro Peixoto <[email protected]>
Co-authored-by: Vinicius Gasparini <[email protected]>
Co-authored-by: Camila Fracaro <[email protected]>
Co-authored-by: Thiago Curvelo <[email protected]>
Co-authored-by: Thiago Nóbrega <[email protected]>
Co-authored-by: José Guilherme Vanz <[email protected]>
Co-authored-by: Anderson Berg <[email protected]>
Co-authored-by: Jonathan Schweder <[email protected]>
Co-authored-by: Giovani Sousa <[email protected]>
Co-authored-by: D <[email protected]>
Co-authored-by: Lucas Rangel Cezimbra <[email protected]>
Co-authored-by: Vitor Baptista <[email protected]>
Co-authored-by: Mário Sérgio <[email protected]>
Co-authored-by: Rodrigo Vieira <[email protected]>