Finish Governador Valadares/MG spider #269
Conversation
There is no need to log the failure.
The 12 appears in all file URLs; it is part of the URL. We don't know what it is, and we don't need to know.
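For illustration only, a minimal sketch of keeping that literal "12" segment in a URL template; the domain, path, and helper name below are hypothetical and not taken from the actual spider:

```python
# Hypothetical sketch: the "12" segment is treated as an opaque, fixed part
# of every gazette file URL. The domain and path are placeholders, not the
# real Governador Valadares endpoint.
FILE_URL_TEMPLATE = "https://www.example.gov.br/arquivos/12/{year}/{filename}"


def build_file_url(year, filename):
    # "12" stays hard-coded because it appears in all file URLs and its
    # meaning is unknown (and not needed) for scraping.
    return FILE_URL_TEMPLATE.format(year=year, filename=filename)
```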
I don't see anything wrong. But I would extract some lines into functions with explanatory names.
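As a rough illustration of that suggestion (the selectors and helper names below are hypothetical, not taken from mg_governador_valadares.py), the parsing steps could be pulled out into small functions with explanatory names:

```python
import datetime


def extract_gazette_date(row):
    # Parse the dd/mm/YYYY date shown in the first cell of a results row.
    raw_date = row.xpath(".//td[1]/text()").get(default="").strip()
    return datetime.datetime.strptime(raw_date, "%d/%m/%Y").date()


def extract_gazette_file_url(row, response):
    # Resolve the gazette PDF link of the row into an absolute URL.
    relative_url = row.xpath(".//a/@href").get()
    return response.urljoin(relative_url)
```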
Branch force-pushed from f7e00fe to 749c232.
Sorry for the delay... =/ I've just run the spider. It works fine. Awesome work, @giovanisleite! Thanks for your help on this!
Done! Thank you for your help, @jvanz
* Add spider for Fernandopolis, Sao Paulo (okfn-brasil#225)
* Palmas-TO spider refactor with improved date filtering (okfn-brasil#273): refactor the Palmas-TO spider to allow filtering by date
* Obtain database connection info from Scrapy project settings
* Move database module to be accessible from spiders
* Set SQLite as default database to store extracted items. SQLite was chosen to make development easier (it does not require installing a PostgreSQL instance on the development machine). When moving to the production server, we will only need to update the connection string to point to the real production database.
* Store extracted items in the configured database. This pipeline stores the results of the scraping in a database. By default we use a SQLite database to make development easier; in production we can set up a proper database (see the pipeline sketch after this list).
* Rename pipeline with a more generic name. We may change our database later, so we should not be tied to PostgreSQL at this point.
* Add Sigpub base spider and all subspiders. SigpubGazetteSpider is a base spider for many systems; most of them are associations of cities, but some are not. Spiders added: al_associacao_municipios, am_associacao_municipios, ba_associacao_municipios, ce_associacao_municipios, go_associacao_municipios_agm, go_associacao_municipios_fgm, mg_associacao_municipios, ms_associacao_municipios, mt_associacao_municipios, pa_associacao_municipios, pb_associacao_municipios, pe_associacao_municipios, pi_associacao_municipios, pr_associacao_municipios, rj_associacao_municipios, rn_associacao_municipios, ro_associacao_municipios, rr_associacao_municipios, rs_associacao_municipios, se_associacao_municipios, sp_associacao_municipios, sp_macatuba, sp_monte_alto_sigpub
* Update CITIES.md
* Limit requests according to the start date argument provided
* File processed flag (okfn-brasil#306)
* .gitignore: SQLite database file. Adds the file used by SQLite to store the gazette data to .gitignore. Signed-off-by: José Guilherme Vanz <[email protected]>
* SQLite client. Updates the container used to run the spider, installing the package necessary to access the SQLite database. Furthermore, the "sql" make target now opens the sqlite command line, allowing the user to interact with the database if necessary. Signed-off-by: José Guilherme Vanz <[email protected]>
* Processed column. Adds a new column to the Gazette model. This column should be used by the processing pipeline to find which gazettes are pending ingestion by the system. This is not the best solution; a queue system would be a better approach, for instance, but for the current state of the project it is enough. Signed-off-by: José Guilherme Vanz <[email protected]>
* Add Itu SP spider (okfn-brasil#303). Co-authored-by: Giulio Carvalho <[email protected]>
* Add Piracicaba-SP spider (okfn-brasil#312); update the power of the gazette available. Co-authored-by: Giulio Carvalho <[email protected]>
* Remove text extraction from pipeline (okfn-brasil#310). Processing of the content of downloaded files will be done in a separate pipeline, so we don't need it in this part of the process.
* Store all file information available for a date (okfn-brasil#315). Some cities have more than one file for the same gazette edition, and we need to store the uploaded file information (URL, path, and checksum) for all of them. We were storing only the information of the first file available (which is fine for the majority of cities, which have only one file per date), but that fails when more files are available.
* Add a spider for RN Mossoró
* Add spider for Ananindeua/PA (okfn-brasil#297)
* Troubleshooting documentation when building and running container images (okfn-brasil#291). Added instructions to add the user to the docker group to resolve errors when building or running. Co-authored-by: Adorilson Bezerra <[email protected]> Co-authored-by: José Guilherme Vanz <[email protected]>
* Create a guide to help Windows users with their setup (okfn-brasil#301)
* Campina Grande/PB spider (okfn-brasil#287). Add spider for Campina Grande/PB. Co-authored-by: Thiago Nóbrega <[email protected]> Co-authored-by: José Guilherme Vanz <[email protected]> Co-authored-by: Giulio Carvalho <[email protected]>
* Upgrade Scrapy to 2.4.0. Signed-off-by: José Guilherme Vanz <[email protected]>
* Organize gazettes under directories. With Scrapy 2.4.0 it is possible to use information from the item to define where the downloaded files will be stored; this commit removes the previous hack that did the same. The gazette files are now stored in directories organized by territory ID and gazette date. Signed-off-by: José Guilherme Vanz <[email protected]>
* Spidermon schema (okfn-brasil#302): enable spidermon, create the JSON schema, and drop items when they have validation errors. Co-authored-by: Renne Rocha <[email protected]>
* Add spider monitoring (okfn-brasil#304): create ratio monitor. Co-authored-by: Renne Rocha <[email protected]>
* Remove the autogenerated gazette fields from the spiders (okfn-brasil#305): stop setting `scraped_at` and `territory_id` on item creation in spiders and remove imported but unused `datetime`s. The file list was collected with flake8, as follows:

  ```
  $ flake8 . | grep unused | grep datetime
  ./spiders/al_maceio.py:1:1: F401 'datetime.datetime' imported but unused
  ./spiders/ba_feira_de_santana.py:2:1: F401 'datetime as dt' imported but unused
  ./spiders/ba_vitoria_da_conquista.py:1:1: F401 'datetime.datetime' imported but unused
  ./spiders/df_brasilia.py:1:1: F401 'datetime.datetime' imported but unused
  ./spiders/es_associacao_municipios.py:2:1: F401 'datetime as dt' imported but unused
  ./spiders/go_aparecida_de_goiania.py:1:1: F401 'datetime as dt' imported but unused
  ./spiders/instar_base.py:1:1: F401 'datetime as dt' imported but unused
  ./spiders/pb_joao_pessoa.py:1:1: F401 'datetime.datetime' imported but unused
  ./spiders/pr_cascavel.py:3:1: F401 'datetime as dt' imported but unused
  ./spiders/pr_curitiba.py:2:1: F401 'datetime.datetime' imported but unused
  ./spiders/pr_londrina.py:1:1: F401 'datetime as dt' imported but unused
  ./spiders/rj_campos_goytacazes.py:1:1: F401 'datetime.datetime' imported but unused
  ./spiders/rn_natal.py:1:1: F401 'datetime.datetime' imported but unused
  ./spiders/rr_boa_vista.py:1:1: F401 'datetime as dt' imported but unused
  ./spiders/sc_florianopolis.py:2:1: F401 'datetime.datetime' imported but unused
  ./spiders/sc_joinville.py:3:1: F401 'datetime.datetime' imported but unused
  ./spiders/sp_bauru.py:5:1: F401 'datetime.datetime' imported but unused
  ./spiders/sp_fernandopolis.py:1:1: F401 'datetime.datetime' imported but unused
  ./spiders/sp_jau.py:4:1: F401 'datetime.datetime' imported but unused
  ./spiders/sp_presidente_prudente.py:2:1: F401 'datetime.datetime' imported but unused
  ./spiders/sp_sao_jose_dos_campos.py:3:1: F401 'datetime.datetime' imported but unused
  ./spiders/to_araguaina.py:4:1: F401 'datetime as dt' imported but unused
  ./spiders/to_palmas.py:2:1: F401 'datetime as dt' imported but unused
  ```

* [pr_curitiba][to_palmas] blackify spiders
* Default value for the processed column. The default value for the processed column in the database should be false, so that all files will be processed by the processing pipeline. Signed-off-by: José Guilherme Vanz <[email protected]>
* Fix power field to the values expected by spidermon (okfn-brasil#327). After the spidermon configuration, some spiders stopped working due to an invalid value for the power item field; this commit replaces all these invalid values with the value allowed by the spidermon schema. Also fixes the power field for legislative gazettes, where the spider used invalid values when the gazette image found was from the legislative power. Signed-off-by: José Guilherme Vanz <[email protected]>
* Format code using black
* Automatically sort imports using the isort module. Signed-off-by: Jonathan Schweder <[email protected]>
* Apply backfill formatting for the `isort` and `black` utilities. Signed-off-by: Jonathan Schweder <[email protected]>
* São José dos Pinhais spider (okfn-brasil#325). Adds the spider for São José dos Pinhais, PR. Co-authored-by: D <[email protected]> Co-authored-by: José Guilherme Vanz <[email protected]>
* Fix code formatting
* Governador Valadares/MG spider (okfn-brasil#269). Add spider for Governador Valadares, MG
* Add spider monitoring. Monitors included: check that there are no errors, that the spider finished correctly, and that there are no validation errors. Action added when the spider finishes: send a message to Telegram.
* Initialize database with territories information
* Add crawler for Canoas/RS
* Run Black and fix case for days without gazettes
* Rename START_DATE -> start_date
* Move Gazette JSON-Schema to a more suitable location
* Update project structure to allow spiders to be deployed in Scrapy Cloud
* Fix Telegram message when there are no items scraped
* Remove Docker, fixing conflicts (okfn-brasil#341): remove Docker configuration, reorganize directories and update the Makefile to remove Docker references, create a directory to store downloaded files locally, update Travis CI removing Docker references, update docs to execute the project without Docker, and include isort as a requirement
* Insert guidelines for maintainers (okfn-brasil#276). I could open an issue to discuss this, but I have adopted a philosophy of starting a discussion with a proposal. I created this chapter to propose an organization for the project maintainers. The idea is not legislation with multiple responsibilities, but a guide that facilitates communication and reduces rework. I do this as we invite @giuliocc to compose our team of maintainers. 🥇 Thus, we are growing and creating a working model to prevent distress to people as generous as the Querido Diario maintainers. "There is a criteria I call change latency, which is the round-trip time from identifying a problem to testing a solution. The faster the better. If maintainers cannot respond to pull requests as rapidly as people expect, they're not doing their job (or they need more hands)." (Peter Hintjens) https://hintjens.gitbooks.io/social-architecture/content/chapter4.html I would like opinions: @jvanz @rennerocha. Also updates the maintainer responsibilities.
* Add Jinja2 to requirements.in/txt (okfn-brasil#346)
* Fix FILES_STORE and README (okfn-brasil#345): fix README instructions and fix FILES_STORE, which was pointing to /mnt/data
* Misleading error message when writing to the database fails. The integrity error raised by SQLAlchemy is launched not only when the item is already in the database; for instance, it can also be launched when the territories table is missing some territory ID. Thus, this commit updates the error message and adds the exception to the log message, so users will be able to detect which kind of error they are facing. Signed-off-by: José Guilherme Vanz <[email protected]>
* Add missing requirements (okfn-brasil#350)
* Apply formatting rules to the entire project
* Add pre-commit configuration
* Update documentation adding a new step to install pre-commit hooks
* Update contributing file to warn about automatic code formatting
* Fix typo
* Add some info about data_collection in README
* Update path to data_collection in CONTRIBUTING

Co-authored-by: André Angeluci <[email protected]>
Co-authored-by: Renne Rocha <[email protected]>
Co-authored-by: Giulio Carvalho <[email protected]>
Co-authored-by: José Guilherme Vanz <[email protected]>
Co-authored-by: Gabriel (Gabu) Bellon <[email protected]>
Co-authored-by: Pedro Peixoto <[email protected]>
Co-authored-by: Vinicius Gasparini <[email protected]>
Co-authored-by: Camila Fracaro <[email protected]>
Co-authored-by: Thiago Curvelo <[email protected]>
Co-authored-by: Thiago Nóbrega <[email protected]>
Co-authored-by: José Guilherme Vanz <[email protected]>
Co-authored-by: Anderson Berg <[email protected]>
Co-authored-by: Jonathan Schweder <[email protected]>
Co-authored-by: Giovani Sousa <[email protected]>
Co-authored-by: D <[email protected]>
Co-authored-by: Lucas Rangel Cezimbra <[email protected]>
Co-authored-by: Vitor Baptista <[email protected]>
Co-authored-by: Mário Sérgio <[email protected]>
Co-authored-by: Rodrigo Vieira <[email protected]>
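To make the database-related commits above more concrete, here is a minimal sketch of a Scrapy item pipeline writing gazettes to the default SQLite database, assuming SQLAlchemy 1.4+. The class, model, column set, and the QUERIDODIARIO_DATABASE_URL settings key are illustrative assumptions, not the project's actual code:

```python
# Sketch only: names and fields below are assumptions, not the real pipeline.
from sqlalchemy import Boolean, Column, Date, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class GazetteRecord(Base):  # hypothetical model mirroring the Gazette item
    __tablename__ = "gazettes"

    id = Column(Integer, primary_key=True)
    territory_id = Column(String)
    date = Column(Date)
    file_url = Column(String)
    processed = Column(Boolean, default=False)  # pending until the text pipeline runs


class SQLDatabasePipeline:
    def __init__(self, database_url):
        self.database_url = database_url

    @classmethod
    def from_crawler(cls, crawler):
        # Default to a local SQLite file so development needs no PostgreSQL;
        # the settings key is a placeholder name.
        url = crawler.settings.get(
            "QUERIDODIARIO_DATABASE_URL", "sqlite:///querido-diario.db"
        )
        return cls(url)

    def open_spider(self, spider):
        engine = create_engine(self.database_url)
        Base.metadata.create_all(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        # Field names ("territory_id", "date", "file_urls") are assumed from
        # the Gazette item described in the commit messages.
        session = self.Session()
        try:
            session.add(
                GazetteRecord(
                    territory_id=item.get("territory_id"),
                    date=item.get("date"),
                    file_url=(item.get("file_urls") or [None])[0],
                )
            )
            session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            session.close()
        return item
```

Swapping the connection string for a PostgreSQL URL is all that moving to the production database would require under this design.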
This is part of the efforts needed for #221
Closes #19
I was able to download all files: