
Add spider for Belem - Pará #212

Merged
merged 3 commits into okfn-brasil:main on Aug 19, 2020

Conversation

rodolfolottin
Contributor

Description

Add spider for Belem - Pará. Issue #182.

I tried to run the spider built in PR #87, but it didn't work. Based on the spider code, I assume the website has changed since it was built (two years ago).

Despite that, this PR is still a WIP because I couldn't find a way to download the PDF files.

Problem description

In this spider I am using a different URL (the API one), but you can check here that we can't get any PDF href from the website DOM. Since we can download the file just by clicking the button, I guess the only possible solution we have is to install Selenium, right? Do you see any other possible solution?

Since I couldn't resolve it by scraping the DOM (because of JavaScript), and I wanted to check whether there was any option other than installing Selenium, I decided to try the API URL and see if I could hack a way out to download the gazette PDFs, but unfortunately I couldn't.

If I download the file specifying the Accept header, like this:

    curl -X GET https://sistemas.belem.pa.gov.br/diario-consulta-api/diarios/13959 -H 'Accept: application/octet-stream'

the file is downloaded. But if I don't specify it (which is what happens when we yield the Gazette), the downloaded content is just plain JSON.
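In other words, the API appears to do server-side content negotiation on the Accept header. A stdlib-only sketch of that logic (the function is hypothetical, just mirroring what the curl experiment suggests the API does):

```python
def negotiate(accept_header):
    """Hypothetical mirror of the diario-consulta-api behavior:
    the PDF bytes are served only when the client explicitly
    accepts application/octet-stream; otherwise JSON metadata."""
    if "application/octet-stream" in accept_header:
        return "application/octet-stream"  # the gazette PDF
    return "application/json"  # plain JSON metadata

# Matches the curl experiment above:
print(negotiate("application/octet-stream"))       # application/octet-stream
print(negotiate("text/html,application/xhtml+xml"))  # application/json
```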

What do you think? Do you see any solution other than installing Selenium?

@rodolfolottin rodolfolottin changed the title Add spider for Belem - Pará [WIP] Add spider for Belem - Pará Aug 17, 2020
@ogecece
Member

ogecece commented Aug 17, 2020

Hey! Great spider and analysis of the problem.

Thankfully Scrapy already has a solution for this :) If you change the DEFAULT_REQUEST_HEADERS setting, you can specify the 'Accept' header to include application/octet-stream.

I tested it by adding this class attribute to the spider, and the PDFs were downloaded correctly:

    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8,application/octet-stream'
        }
    }

Hope this helps!
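For context on why the spider-wide setting works: Scrapy applies DEFAULT_REQUEST_HEADERS to every outgoing request, with per-request headers taking precedence. A stdlib-only sketch of that merging behavior (not Scrapy's actual implementation; names are illustrative):

```python
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "*/*;q=0.8,application/octet-stream",
}

def effective_headers(per_request=None):
    # Defaults are applied first; any per-request headers override them.
    merged = dict(DEFAULT_REQUEST_HEADERS)
    merged.update(per_request or {})
    return merged

# Every request now advertises application/octet-stream by default,
# so the API serves the PDF bytes instead of JSON metadata.
print("application/octet-stream" in effective_headers()["Accept"])  # True
```

A per-request override (e.g. headers={"Accept": "application/octet-stream"} on only the download request) would also work if the broader default is undesirable.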

@rodolfolottin
Contributor Author

rodolfolottin commented Aug 17, 2020

You rock, @giuliocc! I didn't know about that.

I think the PR is ready now 😄

@rodolfolottin rodolfolottin changed the title [WIP] Add spider for Belem - Pará Add spider for Belem - Pará Aug 17, 2020
@jvanz jvanz merged commit 739bee0 into okfn-brasil:main Aug 19, 2020