
pos: update pos spider #160

Merged
14 commits merged on Nov 1, 2017

Conversation

@spirosdelviniotis (Contributor) commented Aug 10, 2017

Description

Signed-off-by: Spyridon Delviniotis [email protected]

Related Issue

Closes #159
On top of #155

Motivation and Context

Checklist:

  • I have all the information that I need (if not, move to RFC and look for it).
  • I linked the related issue(s) in the corresponding commit logs.
  • I wrote good commit log messages.
  • My code follows the code style of this project.
  • I've added any new docs if API/utils methods were added.
  • I have updated the existing documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@spirosdelviniotis spirosdelviniotis self-assigned this Aug 10, 2017
@spirosdelviniotis spirosdelviniotis changed the title Hepcrawl pos spider pos: update pos spider Aug 10, 2017
@spirosdelviniotis spirosdelviniotis force-pushed the hepcrawl_pos_spider branch 2 times, most recently from 47b036d to 4044840 Compare August 10, 2017 14:38

Example:
::

$ scrapy crawl PoS -a source_file=file://`pwd`/tests/responses/pos/sample_pos_record.xml
$ scrapy crawl PoS -a source_file=file://`pwd`/tests/unit/responses/pos/
sample_pos_record.xml
Contributor:

Here you should use something like:

    $ my huge \
        big long \
        command

That way it renders nicely on the page and can be copied and pasted into the console; otherwise you get a huge command that runs off the screen.
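As an illustration of that advice (the `scrapy crawl PoS` invocation is the one from this PR; the `file:///tmp/...` path and the string comparison are only for demonstration), the wrapped and one-line forms are the same command to the shell:

```shell
# Trailing backslashes let the shell join a long command back into one line.
# We only echo the arguments here, to show the wrapped form is identical to
# the one-line form once the shell has parsed it.
wrapped=$(echo scrapy crawl PoS \
    -a source_file=file:///tmp/sample_pos_record.xml)
oneline=$(echo scrapy crawl PoS -a source_file=file:///tmp/sample_pos_record.xml)
echo "$wrapped"
```

The continuation backslash must be the last character on the line (no trailing spaces), or the shell treats the next line as a new command.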

Contributor (Author):

Nice!

"""
name = 'PoS'
pos_base_url = "https://pos.sissa.it/contribution?id="
conference_paper_url = "https://pos.sissa.it/contribution?id="
Contributor:

This name is very confusing; usually for this kind of 'base' url you use the word 'base', like base_conference_papers_url.
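A minimal sketch of the naming the review suggests (the helper function and the contribution id are hypothetical; the base url is the one from the diff above):

```python
# The 'base_*_url' prefix signals that the value is a url prefix which still
# needs an identifier appended to it before it can be requested.
base_conference_paper_url = "https://pos.sissa.it/contribution?id="


def build_conference_paper_url(base_url, contribution_id):
    """Append the PoS contribution id to the base url (hypothetical helper)."""
    return base_url + contribution_id
```

With this convention a reader can tell at the assignment site that the attribute is incomplete on its own, which is exactly what the bare name `conference_paper_url` hides.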

Contributor:

And all caps... though that is not used for other stuff here, so maybe leave the capitalization for a future refactor.

@@ -35,8 +41,13 @@ def scrape_pos_page_body():


@pytest.fixture
Contributor:

you can make this fixture scope='session', so it does not load it from disk every time.

Contributor (Author):

cool!

@spirosdelviniotis spirosdelviniotis force-pushed the hepcrawl_pos_spider branch 2 times, most recently from 8e67b40 to 63fd144 Compare August 21, 2017 11:49
request.meta["record"] = record.extract()
request.meta['url'] = response.url
request.meta['record'] = record.extract()
request.meta['identifier'] = identifier
yield request
Contributor:

for record in records:
    conf_paper_request = get_conf_paper(
        item_builder_method=build_conf_paper,
        ...
    )
    # Extracts the pdf from the paper url and builds the conference paper in a callback
    yield conf_paper_request

    conf_proceedings_request = get_conf_proceedings(...)
    # Extracts the url and builds the conference proceedings in a callback
    yield conf_proceedings_request
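The shape of that suggestion as a runnable sketch (all names hypothetical; in the real spider the two factory functions would return scrapy Requests whose callbacks build the items):

```python
def parse_records(records, make_conf_paper_request, make_conf_proceedings_request):
    # For each record, yield one request that will build the conference paper
    # and one that will build the conference proceedings. Each request carries
    # its own callback, so this parse loop stays a thin dispatcher.
    for record in records:
        yield make_conf_paper_request(record)
        yield make_conf_proceedings_request(record)
```

The point of the refactor is that the item-building logic moves into the callbacks, instead of accumulating in the loop body via request.meta assignments.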

@@ -50,6 +50,7 @@ def set_up_oai_environment():
'CRAWLER_ARGUMENTS': {
'source_file': 'file://' + package_location,
'base_conference_paper_url': 'https://server.local/contribution?id=',
'base_proceedings_url': 'https://server.local/cgi-bin/reader/conf.cgi?confid=',
Contributor:

s/set_up_oai_environment/set_up_environment/

@david-caro david-caro force-pushed the hepcrawl_pos_spider branch from ebe4380 to fc4ba5b Compare August 21, 2017 19:13
@@ -96,22 +96,22 @@ def parse(self, response):
response.selector.remove_namespaces()
records = response.selector.xpath('.//record')
for record in records:
Contributor:

probably this should be:

for record_xml_selector in record_xml_selectors:

@@ -277,7 +278,7 @@ def build_conference_proceedings_item(
record.add_value('journal_title', 'PoS')
record.add_value(
'journal_volume',
self._get_journal_volume(pos_id=pos_id),
self._get_journal_volume(identifier=pos_id),
Contributor:

identifier is too generic; is it a pos internal identifier? a pos external one? orcid? inspire? what?

@@ -32,7 +32,7 @@ def override_generated_fields(record):


@pytest.fixture(scope="function")
def set_up_oai_environment():
def set_up_environment():
Contributor:

This is actually a bad name: using a verb like set_up hints that the function does something (has side effects).
In this case it should probably be called something like configuration or similar, as what it does is give back a configuration.

Another option is to have two fixtures instead of one: one with the wait, and the other with the config.
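A sketch of that two-fixture split (fixture names hypothetical; the config keys and values are taken from the test configuration in the diff above):

```python
import pytest


def crawler_configuration():
    # Plain function: only returns data, no side effects.
    return {
        'CRAWLER_ARGUMENTS': {
            'base_conference_paper_url': 'https://server.local/contribution?id=',
            'base_proceedings_url': 'https://server.local/cgi-bin/reader/conf.cgi?confid=',
        },
    }


@pytest.fixture(scope="function")
def configuration():
    return crawler_configuration()


@pytest.fixture(scope="function")
def environment(configuration):
    # Side effects (waiting for services, cleanup) live in this second
    # fixture, which depends on the pure configuration fixture above.
    return configuration
```

Tests that only need the config can depend on `configuration`; tests that need the running environment depend on `environment`, and the naming no longer implies side effects where there are none.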

@spirosdelviniotis spirosdelviniotis force-pushed the hepcrawl_pos_spider branch 4 times, most recently from 0a33a49 to 8320ade Compare August 22, 2017 15:59
@spirosdelviniotis (Contributor, Author):

Build fixed! 😄

@ghost ghost added the in progress label Oct 28, 2017
spirosdelviniotis and others added 14 commits November 1, 2017 14:57
Signed-off-by: Spiros Delviniotis <[email protected]>
Signed-off-by: David Caro <[email protected]>
Signed-off-by: Spiros Delviniotis <[email protected]>
Addresses inspirehep#159

Signed-off-by: Spiros Delviniotis <[email protected]>
Signed-off-by: David Caro <[email protected]>
Addresses inspirehep#159

Signed-off-by: Spiros Delviniotis <[email protected]>
Signed-off-by: Spiros Delviniotis <[email protected]>
Signed-off-by: David Caro <[email protected]>
Signed-off-by: David Caro <[email protected]>
@david-caro david-caro merged commit 4f33b74 into inspirehep:master Nov 1, 2017
Linked issue: Create/update PoS spider