Salvador spider #47

Merged
merged 4 commits into from
Jul 10, 2018

Conversation

rennerocha
Collaborator

For older gazettes, the site has the odd habit of publishing the file under a different date.
It is also not possible to tell whether a gazette is an extra edition without parsing its content. That will require a custom pipeline (specific to this city) to process the file content after it passes through PdfParsingPipeline.
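
A minimal sketch of what such a city-specific pipeline could look like, not the code in this PR. It relies only on Scrapy's `process_item(item, spider)` convention and the `ITEM_PIPELINES` ordering (lower numbers run first). The class name, the spider name `ba_salvador`, the item fields `source_text` and `is_extra_edition`, the marker string, and the module paths are all assumptions made for illustration.

```python
# Sketch only: class name, spider name, item fields and marker are hypothetical.
class SalvadorExtraEditionPipeline:
    """Flags extra editions for one city, using text extracted by an earlier pipeline."""

    SPIDER_NAME = "ba_salvador"  # assumed spider name

    def process_item(self, item, spider):
        # Leave items from every other city untouched.
        if getattr(spider, "name", None) != self.SPIDER_NAME:
            return item
        # Assumed field, expected to be filled by PdfParsingPipeline.
        text = item.get("source_text") or ""
        # Assumed marker; the real gazettes may need a more tolerant check.
        item["is_extra_edition"] = "EDIÇÃO EXTRA" in text.upper()
        return item


# Lower numbers run first, so the city-specific step runs after PDF parsing.
# The module paths below are placeholders.
ITEM_PIPELINES = {
    "gazette.pipelines.PdfParsingPipeline": 300,
    "gazette.pipelines.SalvadorExtraEditionPipeline": 400,
}
```

Checking the spider name inside `process_item` keeps the pipeline registered globally while making it a no-op for other cities.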

@rennerocha
Collaborator Author

The date is not wrong. I discovered that some gazettes cover more than one day (for example, http://www.dom.salvador.ba.gov.br/index.php?option=com_content&view=article&id=5815:dom-7118&catid=1:dom). I updated the spider to handle this situation (unfortunately, it requires an extra request for each gazette).
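
One possible way to represent a gazette that covers several days, shown only as an illustration and not as what the spider does: once the first and last covered dates are known (for instance, from the extra request to the gazette page), expand them into an explicit list of days. The helper name `covered_dates` and the sample dates are hypothetical.

```python
from datetime import date, timedelta
from typing import List


def covered_dates(first: date, last: date) -> List[date]:
    """Return every day from first to last, inclusive (hypothetical helper)."""
    if last < first:
        raise ValueError("last date must not precede first date")
    return [first + timedelta(days=i) for i in range((last - first).days + 1)]


# Illustrative dates only, not taken from a real gazette.
days = covered_dates(date(2018, 5, 12), date(2018, 5, 14))
```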

@rennerocha
Collaborator Author

To identify extra editions we need to parse the content of the PDF. Unfortunately, the converted text is not well structured, so I included a regex that is enough to classify the gazettes I have checked. However, it is not guaranteed to work for every gazette.
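
A rough sketch of this kind of regex-based classification. The pattern below is an assumption (that extra editions contain some form of the words "edição extra"), not the regex used in the spider, and `is_extra_edition` is a hypothetical helper name.

```python
import re

# Assumed pattern: tolerates missing accents and irregular whitespace,
# which are common in text converted from PDF.
EXTRA_EDITION_RE = re.compile(r"edi[çc][ãa]o\s+extra", re.IGNORECASE)


def is_extra_edition(text: str) -> bool:
    """Heuristic check; may misclassify gazettes with unusual layouts."""
    return EXTRA_EDITION_RE.search(text) is not None
```

Like any pattern applied to converted PDF text, this remains a heuristic: a gazette that merely mentions an extra edition in its body would also match.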

@rennerocha rennerocha changed the title [WIP] - initial version of Salvador spider Salvador spider May 22, 2018
@rennerocha rennerocha changed the title Salvador spider WIP Salvador spider May 23, 2018
@rennerocha rennerocha changed the title WIP Salvador spider Salvador spider May 23, 2018
@alfakini alfakini mentioned this pull request May 24, 2018
@giovanisleite
Contributor

Remember to update cities.md.

@cuducos cuducos merged commit 4eb4a92 into okfn-brasil:master Jul 10, 2018
@trevineju trevineju added this to the Capitais | Capital Cities milestone Oct 10, 2022
@trevineju trevineju added the spider Adiciona robô raspador para município(s) label Oct 24, 2022

4 participants