Description
Add spider for Belem - Pará. Issue #182.
I tried to run the spider built in PR 87, but it didn't work. Based on the spider code, I assume the website has changed since it was built (2 years ago).
Despite that, this PR is still a WIP because I couldn't find a way to download the PDF files.
Problem description
In this spider I am using a different URL (the API one), but you can check here that we can't get any PDF href from the website DOM. Since the file can be downloaded just by clicking the button, I guess the only possible solution is to install Selenium, right? Do you see any other possible solution?
I couldn't resolve it by scraping the DOM, because the page is rendered with JavaScript, and I wanted to check whether there is any option other than installing Selenium. So I decided to try the API URL and see if I could hack a way to download the gazette PDFs, but unfortunately I couldn't.
If I try to download the file specifying the Accept header, like here:
curl -X GET https://sistemas.belem.pa.gov.br/diario-consulta-api/diarios/13959 -H 'Accept: application/octet-stream'
the file is downloaded. If I don't specify it, which is what happens when we yield the Gazette, the downloaded content is just plain JSON.
What do you think? Do you see any other solution than installing Selenium?
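For reference, the curl call above can be sketched in Python using only the standard library. This is a minimal sketch, not the spider's code: the names `API_BASE`, `build_pdf_request` and `download_pdf` are hypothetical, and it assumes the API answers every gazette ID the same way it answers 13959.

```python
from urllib.request import Request, urlopen

# Base URL taken from the curl command above.
API_BASE = "https://sistemas.belem.pa.gov.br/diario-consulta-api/diarios"


def build_pdf_request(gazette_id: int) -> Request:
    """Build a GET request that asks the API for the binary file instead of JSON."""
    return Request(
        f"{API_BASE}/{gazette_id}",
        headers={"Accept": "application/octet-stream"},
        method="GET",
    )


def download_pdf(gazette_id: int, destination: str) -> None:
    """Fetch the file and write the raw bytes to disk (performs a network call)."""
    request = build_pdf_request(gazette_id)
    with urlopen(request) as response, open(destination, "wb") as out:
        out.write(response.read())
```

In Scrapy, the same header can be set through the `headers` argument of `scrapy.Request`. Whether the file download triggered by yielding the Gazette can be made to send it is something I have not verified.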