
Finish Governador Valadares/MG spider #269

Merged
merged 30 commits into okfn-brasil:main from reform/mg_governador_valadares on Oct 29, 2020

Conversation

giovanisleite
Contributor

@giovanisleite giovanisleite commented Oct 4, 2020

This is part of the efforts needed for #221
Closes #19

  • Refactor the parse method
  • Changed the request headers to include only what is needed
  • Updated the city URL (added HTTPS; the endpoint changed too)
  • Get the URL dynamically (the URL changed while this PR was open)
  • Filter by start date (see the sketch below)

I was able to download all files:


```
$ ls data/full/*.txt | wc -l
1624
```
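
For reference, a minimal sketch of what the start-date filtering can look like in a Scrapy spider. The URL, selectors, and date format below are illustrative assumptions, not the code from this PR; only the `start_date` comparison mirrors what the PR does.

```python
# Illustrative sketch of start-date filtering in a gazette spider.
# URL, selectors and date format are assumptions, not this PR's code.
from datetime import date, datetime

import scrapy


class GovernadorValadaresSketchSpider(scrapy.Spider):
    name = "mg_governador_valadares_sketch"
    start_urls = ["https://example.com/diario-oficial"]  # placeholder URL
    start_date = date(2010, 1, 1)  # earliest date to collect; illustrative default

    def parse(self, response):
        for row in response.css("table tr"):
            raw_date = row.css("td.date::text").get()
            if not raw_date:
                continue
            gazette_date = datetime.strptime(raw_date.strip(), "%d/%m/%Y").date()
            if gazette_date < self.start_date:
                # Skip editions published before the requested start date
                continue
            yield {
                "date": gazette_date,
                "file_urls": [response.urljoin(row.css("a::attr(href)").get())],
            }
```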

@giovanisleite giovanisleite changed the title from "Finish Governador Valadares/MG" to "Finish Governador Valadares/MG spider" on Oct 4, 2020
@giovanisleite giovanisleite requested a review from jvanz October 10, 2020 19:03
Collaborator

@jvanz jvanz left a comment

I don't see anything wrong. But I would just extract some lines into functions with explanatory names.

@giovanisleite giovanisleite requested a review from jvanz October 10, 2020 20:22
@giovanisleite giovanisleite force-pushed the reform/mg_governador_valadares branch from f7e00fe to 749c232 Compare October 10, 2020 21:01
@jvanz
Collaborator

jvanz commented Oct 18, 2020

Sorry for the delay... =/

I've just run the spider. It works fine. Awesome work @giovanisleite! Thanks for your help on this!

@giovanisleite giovanisleite requested a review from jvanz October 20, 2020 11:33
@giovanisleite
Contributor Author

I don't see anything wrong. But I would just extract some lines into functions with explanatory names.

Done!

Thank you for your help, @jvanz

@sergiomario sergiomario added the hacktoberfest-accepted Pull Requests aprovados na Hacktoberfest label Oct 28, 2020
@jvanz jvanz merged commit 1714e47 into okfn-brasil:main Oct 29, 2020
@giovanisleite giovanisleite deleted the reform/mg_governador_valadares branch October 29, 2020 20:11
adorilson added a commit to adorilson/querido-diario that referenced this pull request Nov 8, 2020
* Add spider for Fernandopolis, Sao Paulo (okfn-brasil#225)

Add spider for Fernandopolis, Sao Paulo

* Palmas-TO spider refactor with improved date filtering (okfn-brasil#273)

Refactor Palmas-TO spider to allow filtering by date

* Obtain database connection info from Scrapy project settings

* Move database module to be accessible from spiders

* Set SQLite as default database to store extracted items

SQLite was chosen to make development easier (it doesn't require installing a
PostgreSQL instance on the development machine). When moving to the production
server, we will just need to update the connection string to point to the real
production database.

* Store extracted items in configured database

This pipeline stores the results of the scraping in a database. By default we
use a SQLite database to make development easier. In production we can set up
a proper database.
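
A rough sketch of how the three database commits above could fit together: the connection string is read from the Scrapy settings, defaults to SQLite, and a pipeline persists items. The setting name `DATABASE_URL`, the `Gazette` model, and its fields are illustrative assumptions, not necessarily the project's actual names.

```python
# Sketch: store scraped items in the database configured via Scrapy settings,
# defaulting to SQLite. Setting name, model and fields are assumptions.
from sqlalchemy import Column, Date, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class Gazette(Base):
    __tablename__ = "gazettes"
    id = Column(Integer, primary_key=True)
    territory_id = Column(String)
    date = Column(Date)
    file_url = Column(String)


class DatabaseStoragePipeline:
    @classmethod
    def from_crawler(cls, crawler):
        # Connection info comes from the project settings; SQLite by default
        database_url = crawler.settings.get("DATABASE_URL", "sqlite:///querido-diario.db")
        return cls(database_url)

    def __init__(self, database_url):
        engine = create_engine(database_url)
        Base.metadata.create_all(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        session = self.Session()
        session.add(
            Gazette(
                territory_id=item.get("territory_id"),
                date=item.get("date"),
                file_url=(item.get("file_urls") or [None])[0],
            )
        )
        session.commit()
        session.close()
        return item
```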

* Rename pipeline with a more generic name

We may change our database later, so we should not be tied to PostgreSQL at
this point.

* Add Sigpub base spider and all subspiders

SigpubGazetteSpider is a base spider for many systems. Most of them are
associations of cities, but some are not.

Spiders added:
- al_associacao_municipios
- am_associacao_municipios
- ba_associacao_municipios
- ce_associacao_municipios
- go_associacao_municipios_agm
- go_associacao_municipios_fgm
- mg_associacao_municipios
- ms_associacao_municipios
- mt_associacao_municipios
- pa_associacao_municipios
- pb_associacao_municipios
- pe_associacao_municipios
- pi_associacao_municipios
- pr_associacao_municipios
- rj_associacao_municipios
- rn_associacao_municipios
- ro_associacao_municipios
- rr_associacao_municipios
- rs_associacao_municipios
- se_associacao_municipios
- sp_associacao_municipios
- sp_macatuba
- sp_monte_alto_sigpub

* Update CITIES.md

* Limit requests according to start date argument provided

* File processed flag (okfn-brasil#306)

* .gitignore: sqlite database file

Adds the SQLite database file used to store the gazette data to .gitignore

Signed-off-by: José Guilherme Vanz <[email protected]>

* SQLite client

Updates the container used to run the spider, installing the package necessary
to access the SQLite database. Furthermore, the "sql" make target now opens the
sqlite command line, allowing the user to interact with the database if
necessary.

Signed-off-by: José Guilherme Vanz <[email protected]>

* Processed column

Adds a new column to the Gazette model. This column should be used by the
processing pipeline to find which gazettes are pending ingestion by the system.
This is not the ideal solution; a queue system would be a better approach, for
instance. But for the current state of the project it is enough.

Signed-off-by: José Guilherme Vanz <[email protected]>

* Add Itu SP Spider (okfn-brasil#303)

Co-authored-by: Giulio Carvalho <[email protected]>

* Add Piracicaba-SP spider (okfn-brasil#312)

* Add Piracicaba-SP spider

* Update power of gazette available

Co-authored-by: Giulio Carvalho <[email protected]>

* Remove text extraction from pipeline (okfn-brasil#310)

Processing of the content of downloaded files will be done in a separate
pipeline, so we don't need it in this part of the process.

* Store all files information available for a date (okfn-brasil#315)

Some cities have more than one file for the same gazette edition. We need to
store the uploaded file information (like URL, path and checksum) for all of
them. We were storing just the information of the first file available (which
is fine for the majority of cities that have only one file per date), but that
fails when more files are available.
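
A small sketch of the idea, assuming the standard `files` field that Scrapy's FilesPipeline fills in with one url/path/checksum entry per downloaded file; the helper name and record layout are made up for illustration.

```python
# Keep one record per downloaded file instead of only item["files"][0].
# The `files` field is populated by Scrapy's FilesPipeline; this helper and
# the record layout are illustrative, not the project's actual code.
def extract_file_records(item):
    return [
        {
            "url": file_info["url"],
            "path": file_info["path"],
            "checksum": file_info["checksum"],
        }
        for file_info in item.get("files", [])
    ]
```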

* Adding a spider for RN Mossoró

* Add spider for Ananindeua/PA (okfn-brasil#297)

Adds the spider for Ananindeua, Pará

* Troubleshooting documentation when building and running container images  (okfn-brasil#291)

Added instructions for adding the user to the docker group to resolve errors when building or running the images

Co-authored-by: Adorilson Bezerra <[email protected]>
Co-authored-by: José Guilherme Vanz <[email protected]>

* Creating a guide to help Windows users do their setup (okfn-brasil#301)

Windows user setup guide

* Campina Grande/PB spider  (okfn-brasil#287)

Add spider for Campina Grande/PB

Co-authored-by: Thiago Nóbrega <[email protected]>
Co-authored-by: José Guilherme Vanz <[email protected]>
Co-authored-by: Giulio Carvalho <[email protected]>

* Upgrade Scrapy to 2.4.0

Upgrades the Scrapy version used by the project to 2.4.0.

Signed-off-by: José Guilherme Vanz <[email protected]>

* Organize gazettes under directories

With Scrapy 2.4.0 it is possible to get information from the item to define
where the downloaded files will be stored. This commit removes the previous
hack that did the same. Now the gazette files are stored in directories
organized by territory ID and gazette date.

Signed-off-by: José Guilherme Vanz <[email protected]>
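
As a rough illustration of the Scrapy 2.4.0 feature mentioned above (the `item` argument now passed to `FilesPipeline.file_path`), a minimal sketch; the class name and item fields are assumptions.

```python
# Sketch: organize downloaded files under <territory_id>/<date>/ using the
# `item` argument Scrapy 2.4.0 passes to FilesPipeline.file_path.
# Class name and item fields are illustrative assumptions.
import os

from scrapy.pipelines.files import FilesPipeline


class GazetteFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Keep the default hashed file name, but nest it per territory/date
        default_path = super().file_path(request, response=response, info=info, item=item)
        filename = os.path.basename(default_path)
        return os.path.join(str(item["territory_id"]), str(item["date"]), filename)
```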

* Spidermon schema (okfn-brasil#302)

* enable spidermon and create JSON schema
* Drop items when they have validation errors

Co-authored-by: Renne Rocha <[email protected]>
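
For context, a sketch of the kind of settings this typically involves in Spidermon (item validation against a JSON Schema, dropping invalid items); the schema path and pipeline priority are illustrative, not necessarily what this commit used.

```python
# Sketch of Spidermon item-validation settings (settings.py).
# Schema path and pipeline priority are illustrative assumptions.
SPIDERMON_ENABLED = True

ITEM_PIPELINES = {
    "spidermon.contrib.scrapy.pipelines.ItemValidationPipeline": 800,
}

# JSON Schema describing a valid Gazette item
SPIDERMON_VALIDATION_SCHEMAS = ["gazette_item_schema.json"]

# Drop items that fail validation instead of storing them
SPIDERMON_VALIDATION_DROP_ITEMS_WITH_ERRORS = True
```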

* Add spider monitoring (okfn-brasil#304)

* creating ratio monitor

Co-authored-by: Renne Rocha <[email protected]>

* Remove the autogenerated gazette fields from the spiders (okfn-brasil#305)

* Remove setting `scraped_at` and `territory_id` from items creation on spiders

* Remove imported but unused `datetime`s

The file list was collected with flake8, as follows:

```
$ flake8 . | grep unused | grep datetime

./spiders/al_maceio.py:1:1: F401 'datetime.datetime' imported but unused
./spiders/ba_feira_de_santana.py:2:1: F401 'datetime as dt' imported but unused
./spiders/ba_vitoria_da_conquista.py:1:1: F401 'datetime.datetime' imported but unused
./spiders/df_brasilia.py:1:1: F401 'datetime.datetime' imported but unused
./spiders/es_associacao_municipios.py:2:1: F401 'datetime as dt' imported but unused
./spiders/go_aparecida_de_goiania.py:1:1: F401 'datetime as dt' imported but unused
./spiders/instar_base.py:1:1: F401 'datetime as dt' imported but unused
./spiders/pb_joao_pessoa.py:1:1: F401 'datetime.datetime' imported but unused
./spiders/pr_cascavel.py:3:1: F401 'datetime as dt' imported but unused
./spiders/pr_curitiba.py:2:1: F401 'datetime.datetime' imported but unused
./spiders/pr_londrina.py:1:1: F401 'datetime as dt' imported but unused
./spiders/rj_campos_goytacazes.py:1:1: F401 'datetime.datetime' imported but unused
./spiders/rn_natal.py:1:1: F401 'datetime.datetime' imported but unused
./spiders/rr_boa_vista.py:1:1: F401 'datetime as dt' imported but unused
./spiders/sc_florianopolis.py:2:1: F401 'datetime.datetime' imported but unused
./spiders/sc_joinville.py:3:1: F401 'datetime.datetime' imported but unused
./spiders/sp_bauru.py:5:1: F401 'datetime.datetime' imported but unused
./spiders/sp_fernandopolis.py:1:1: F401 'datetime.datetime' imported but unused
./spiders/sp_jau.py:4:1: F401 'datetime.datetime' imported but unused
./spiders/sp_presidente_prudente.py:2:1: F401 'datetime.datetime' imported but unused
./spiders/sp_sao_jose_dos_campos.py:3:1: F401 'datetime.datetime' imported but unused
./spiders/to_araguaina.py:4:1: F401 'datetime as dt' imported but unused
./spiders/to_palmas.py:2:1: F401 'datetime as dt' imported but unused
```

* [pr_curitiba][to_palmas] blackify spiders

* Default value for the processed column

The default value for the processed column in the database should be
false. Thus, all the files will be processed by the processing pipeline.

Signed-off-by: José Guilherme Vanz <[email protected]>
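
Putting the two processed-column commits together, the column might look roughly like the fragment below in a SQLAlchemy declarative model; the names are assumptions, as in the earlier database sketch.

```python
# Illustrative Gazette model fragment with the `processed` flag defaulting to
# False, so newly scraped gazettes are pending for the processing pipeline.
from sqlalchemy import Boolean, Column, Date, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Gazette(Base):
    __tablename__ = "gazettes"
    id = Column(Integer, primary_key=True)
    territory_id = Column(String)
    date = Column(Date)
    processed = Column(Boolean, default=False, nullable=False)
```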

* Fix power field to the values expected by spidermon (okfn-brasil#327)

* Fix power field to the values expected by spidermon

After the spidermon configuration, some spiders stopped working due to an
invalid value in the power item field. This commit replaces all these invalid
values with values allowed by the spidermon schema.

Signed-off-by: José Guilherme Vanz <[email protected]>

* Fix power field for legislative gazettes

Fix the spiders that use invalid values for the power field when the gazette
found is from the legislative power.

Signed-off-by: José Guilherme Vanz <[email protected]>

* Format code using black

* Automatically sort imports using isort module

Signed-off-by: Jonathan Schweder <[email protected]>

* Applying backfill formatting for `isort` and `black` utilities

Signed-off-by: Jonathan Schweder <[email protected]>

* São José dos Pinhais spider (okfn-brasil#325)

Adds the spider for São José dos Pinhais, PR

Co-authored-by: D <[email protected]>
Co-authored-by: José Guilherme Vanz <[email protected]>

* Fix code formatting

* Governador Valadares/MG spider (okfn-brasil#269)

Add spider for Governador Valadares, MG

* Add spider monitoring

Monitors included:
- Check that there are no errors
- Check that the spider finished correctly
- Check that there are no validation errors

Action added when spider finishes:
- Send message to Telegram
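
A minimal sketch of such a monitor suite using Spidermon's built-in monitors; the suite name is illustrative, and the Telegram action wiring is only hinted at in a comment.

```python
# Sketch of a close-spider monitor suite covering the three checks above.
# Suite name is illustrative; the Telegram action is only hinted at here.
from spidermon import MonitorSuite
from spidermon.contrib.scrapy.monitors import (
    ErrorCountMonitor,
    FinishReasonMonitor,
    ItemValidationMonitor,
)


class SpiderCloseMonitorSuite(MonitorSuite):
    monitors = [
        ErrorCountMonitor,       # no errors logged during the crawl
        FinishReasonMonitor,     # spider finished for an expected reason
        ItemValidationMonitor,   # no item validation errors
    ]
    # monitors_finished_actions would hold the action that sends the
    # Telegram message when the spider finishes.
```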

* Initialize database with territories information

* Add crawler for Canoas/RS

* Run Black and fix case for days without gazettes

* Rename START_DATE -> start_date

* Move Gazette JSON-Schema to a more suitable location

* Update project structure to allow spiders to be deployed in Scrapy Cloud

* Fix Telegram message when there are no items scraped

* Remove docker fixing conflicts (okfn-brasil#341)

* Remove Docker configuration

* Reorganize directories and update Makefile to remove Docker references

* Create directory to store downloaded files locally

* Update Travis CI removing Docker references

* Update docs to execute project without Docker

* Include isort as requirement

* Inserted guidelines for maintainers (okfn-brasil#276)

* Inserted guidelines for maintainers

I could open an issue to discuss this, but I have adopted a philosophy of starting a discussion with a proposal.

I created this chapter to propose an organization for the project maintainers. The idea is that it is not legislation with multiple responsibilities, but a guide that facilitates communication and reduces rework.

I am doing this as we invite @giuliocc to join our team of maintainers. 🥇  Thus, we are growing while creating a working model that prevents distress for people as generous as the Querido Diario maintainers.

"There is a criteria I call change latency, which is the round-trip time from identifying a problem to testing a solution. The faster the better. If maintainers cannot respond to pull requests as rapidly as people expect, they're not doing their job (or they need more hands)." (Peter Hintjens)
https://hintjens.gitbooks.io/social-architecture/content/chapter4.html

I would like opinions: @jvanz @rennerocha

* Update maintainer responsibilities

* Add Jinja2 to requirements.in/txt (okfn-brasil#346)

* Fix FILES_STORE and README (okfn-brasil#345)

* Fix readme instructions

* Fix FILES_STORE which is pointing to /mnt/data

* Misleading error message when writing to the database fails

The integrity error raised by SQLAlchemy is not raised only when the item is
already in the database. For instance, it can also be raised when the
territories table is missing some territory ID. Thus, this commit updates the
error message and adds the exception to the log message, so that users will be
able to detect which kind of error they are facing.

Signed-off-by: José Guilherme Vanz <[email protected]>
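
The kind of handling this describes might look like the sketch below; the function and model names are illustrative, and only the idea of logging the exception alongside the message comes from the commit.

```python
# Sketch: include the original exception in the log so a duplicate item can be
# told apart from, e.g., a missing territory ID. Names are illustrative.
import logging

from sqlalchemy.exc import IntegrityError

logger = logging.getLogger(__name__)


def save_gazette(session, gazette):
    try:
        session.add(gazette)
        session.commit()
    except IntegrityError as exc:
        session.rollback()
        # Not necessarily a duplicate: the territories table may be missing
        # the territory ID, for example.
        logger.warning("Could not save gazette %r: %s", gazette, exc)
```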

* Add missing requirements (okfn-brasil#350)

* Apply formatting rules to entire project

* Add pre-commit configuration

* Update documentation adding new step to install pre-commit hooks

* Update contributing file to warn about automatic code formatting

* Fix typo

* Add some info about data_collection in README

* Update path to data_collection in CONTRIBUTING

Co-authored-by: André Angeluci <[email protected]>
Co-authored-by: Renne Rocha <[email protected]>
Co-authored-by: Giulio Carvalho <[email protected]>
Co-authored-by: José Guilherme Vanz <[email protected]>
Co-authored-by: Gabriel (Gabu) Bellon <[email protected]>
Co-authored-by: Pedro Peixoto <[email protected]>
Co-authored-by: Vinicius Gasparini <[email protected]>
Co-authored-by: Camila Fracaro <[email protected]>
Co-authored-by: Thiago Curvelo <[email protected]>
Co-authored-by: Thiago Nóbrega <[email protected]>
Co-authored-by: José Guilherme Vanz <[email protected]>
Co-authored-by: Anderson Berg <[email protected]>
Co-authored-by: Jonathan Schweder <[email protected]>
Co-authored-by: Giovani Sousa <[email protected]>
Co-authored-by: D <[email protected]>
Co-authored-by: Lucas Rangel Cezimbra <[email protected]>
Co-authored-by: Vitor Baptista <[email protected]>
Co-authored-by: Mário Sérgio <[email protected]>
Co-authored-by: Rodrigo Vieira <[email protected]>