
Add spider for Belem - Pará #212

Merged
merged 3 commits into okfn-brasil:main on Aug 19, 2020

Conversation

rodolfolottin
Contributor

Description

Add spider for Belem - Pará. Issue #182.

I tried to run the spider built in PR #87, but it didn't work. Based on the spider code, I assume the website has changed since it was built (two years ago).

Despite that, this PR is still a WIP because I couldn't find a way to download the PDF files.

Problem description

In this spider I am using a different URL (the API one), but you can check here that we can't get any PDF href from the website DOM. Since we can download the file just by clicking the button, I guess the only possible solution we have is to install Selenium, right? Do you see any other possible solution?

Since I couldn't resolve it by scraping the DOM (because of JavaScript), and I wanted to check whether there was any option other than installing Selenium, I decided to try the API URL and see if I could hack a way out to download the gazette PDFs, but unfortunately I couldn't.

If I download the file specifying the Accept header, like this:

    curl -X GET https://sistemas.belem.pa.gov.br/diario-consulta-api/diarios/13959 -H 'Accept: application/octet-stream'

the file is downloaded. But if I don't specify it (which is what happens when we yield the Gazette), the downloaded content is just plain JSON.
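In other words, the API appears to do server-side content negotiation on the Accept header. A stdlib-only sketch of that logic (the function is hypothetical, just mirroring what the curl experiment suggests the API does):

```python
def negotiate(accept_header):
    """Hypothetical mirror of the diario-consulta-api behavior:
    the PDF bytes are served only when the client explicitly
    accepts application/octet-stream; otherwise JSON metadata."""
    if "application/octet-stream" in accept_header:
        return "application/octet-stream"  # the gazette PDF
    return "application/json"  # plain JSON metadata

# Matches the curl experiment above:
print(negotiate("application/octet-stream"))       # application/octet-stream
print(negotiate("text/html,application/xhtml+xml"))  # application/json
```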

What do you think? Do you see any solution other than installing Selenium?

@rodolfolottin rodolfolottin changed the title Add spider for Belem - Pará [WIP] Add spider for Belem - Pará Aug 17, 2020
@ogecece
Member

ogecece commented Aug 17, 2020

Hey! Great spider and analysis of the problem.

Thankfully Scrapy already has a solution for this :) If you change the DEFAULT_REQUEST_HEADERS setting, you can specify the 'Accept' header to include application/octet-stream.

I tested it by adding this class attribute to the spider, and the PDFs were downloaded correctly:

    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8,application/octet-stream'
        }
    }

Hope this helps!
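For context on why the spider-wide setting works: Scrapy applies DEFAULT_REQUEST_HEADERS to every outgoing request, with per-request headers taking precedence. A stdlib-only sketch of that merging behavior (not Scrapy's actual implementation; names are illustrative):

```python
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "*/*;q=0.8,application/octet-stream",
}

def effective_headers(per_request=None):
    # Defaults are applied first; any per-request headers override them.
    merged = dict(DEFAULT_REQUEST_HEADERS)
    merged.update(per_request or {})
    return merged

# Every request now advertises application/octet-stream by default,
# so the API serves the PDF bytes instead of JSON metadata.
print("application/octet-stream" in effective_headers()["Accept"])  # True
```

A per-request override (e.g. headers={"Accept": "application/octet-stream"} on only the download request) would also work if the broader default is undesirable.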

@rodolfolottin
Contributor Author

rodolfolottin commented Aug 17, 2020

You rock, @giuliocc! I didn't know about that.

I think the PR is ready now 😄

@rodolfolottin rodolfolottin changed the title [WIP] Add spider for Belem - Pará Add spider for Belem - Pará Aug 17, 2020
@jvanz jvanz merged commit 739bee0 into okfn-brasil:main Aug 19, 2020