pos: update pos spider #160
5f8fe7f
to
f0f9cc3
Compare
47b036d
to
4044840
Compare
hepcrawl/spiders/pos_spider.py
Outdated
Example:
::

$ scrapy crawl PoS -a source_file=file://`pwd`/tests/responses/pos/sample_pos_record.xml
$ scrapy crawl PoS -a source_file=file://`pwd`/tests/unit/responses/pos/
sample_pos_record.xml
Here you should use something like:
$ my huge \\
big long \\
command
That way it renders nicely on the page and can be copied and pasted into the console; otherwise you get a huge command that runs off the screen.
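As a sketch of this suggestion, the snippet below puts a continued command inside a regular (non-raw) Python docstring. The class `ExampleSpider` and the helper `command_lines` are hypothetical names made up for this illustration; only the command itself comes from the diff. The backslash is doubled because `\\` in a non-raw string literal is stored as a single `\`, which is what the shell expects for a line continuation.

```python
class ExampleSpider(object):
    """Hypothetical spider, used only to illustrate the docstring layout.

    Example:
        ::

            $ scrapy crawl PoS \\
                -a source_file=file://`pwd`/tests/unit/responses/pos/sample_pos_record.xml
    """


def command_lines(docstring):
    """Return the stripped lines of the command found in ``docstring``.

    The command starts at the line beginning with ``$`` and continues
    for as long as the previous line ends with a backslash.
    """
    lines = []
    for line in docstring.splitlines():
        stripped = line.strip()
        if stripped.startswith('$') or (lines and lines[-1].endswith('\\')):
            lines.append(stripped)
    return lines
```

Calling `command_lines(ExampleSpider.__doc__)` gives back two lines, the first ending in a single backslash, so the text can be pasted into a console as one command.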
Nice!
hepcrawl/spiders/pos_spider.py
Outdated
"""
name = 'PoS'
pos_base_url = "https://pos.sissa.it/contribution?id="
conference_paper_url = "https://pos.sissa.it/contribution?id="
This name is very confusing. For this kind of 'base' URL you usually use the word 'base', like base_conference_papers_url.
And all caps... though that is not used for other stuff here, so maybe leave the capitalization for a future refactor.
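A minimal sketch of the naming the two comments point at, assuming the value is only ever used as a prefix. The function `get_conference_paper_url` and the identifier `EXAMPLE-ID` are hypothetical; the all-caps constant follows the second comment and is not what the spider currently uses.

```python
# The 'base' prefix signals that this is not a complete URL: an identifier
# still has to be appended to it. The all-caps spelling is an assumption.
BASE_CONFERENCE_PAPER_URL = 'https://pos.sissa.it/contribution?id='


def get_conference_paper_url(
    pos_id,
    base_conference_paper_url=BASE_CONFERENCE_PAPER_URL,
):
    """Build the full contribution URL for a single PoS identifier."""
    return base_conference_paper_url + pos_id


# 'EXAMPLE-ID' is a made-up identifier, used only to show the call.
print(get_conference_paper_url('EXAMPLE-ID'))
```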
tests/unit/test_pos.py
Outdated
@@ -35,8 +41,13 @@ def scrape_pos_page_body():


@pytest.fixture
You can make this fixture 'session'-scoped, so it does not load the file from disk every time.
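A minimal sketch of a session-scoped fixture, assuming the fixture only reads a response file. The names `read_file` and `pos_record_xml` are hypothetical; the path is the one from the example command earlier in this PR.

```python
import pytest


def read_file(path):
    """Read a text file from disk and return its content."""
    with open(path) as response_file:
        return response_file.read()


@pytest.fixture(scope='session')
def pos_record_xml():
    """Load the sample record once per test session.

    With the default ``scope='function'`` this body would run, and the
    file would be read, again for every test that requests the fixture.
    """
    # Hypothetical fixture name; the path comes from the example command.
    return read_file('tests/unit/responses/pos/sample_pos_record.xml')
```

One thing to keep in mind with this scope: every test receives the same object, so a test that mutates the returned value can affect the tests that run after it.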
cool!
4044840
to
1d6e8e8
Compare
8e67b40
to
63fd144
Compare
hepcrawl/spiders/pos_spider.py
Outdated
request.meta["record"] = record.extract()
request.meta['url'] = response.url
request.meta['record'] = record.extract()
request.meta['identifier'] = identifier
yield request
for record in records:
    conf_paper_request = get_conf_paper(
        item_builder_method=build_conf_paper,
        ...
    )
    # Extracts the pdf from the paper url and builds the conference paper in a callback
    yield conf_paper_request
    conf_proceedings_request = get_conf_proceedings(...)
    # Extracts the url and builds the conference paper in a callback
    yield conf_proceedings_request
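One way the shape suggested above could look, as a self-contained sketch rather than the spider's actual code. `Request` here is a stand-in namedtuple, not `scrapy.Request`, so the example runs without Scrapy; `get_request`, `build_conf_proceedings`, and the record keys `paper_id` and `conf_id` are hypothetical. The point it illustrates is that `parse` only yields requests, and each item is built later by the callback bound to the request.

```python
from collections import namedtuple
from functools import partial

# Stand-in for scrapy.Request, so the sketch runs without Scrapy installed.
Request = namedtuple('Request', ['url', 'callback'])


def build_conf_paper(response_body, record):
    """Hypothetical builder: combines the record with the fetched page."""
    return {'type': 'conference_paper', 'record': record, 'page': response_body}


def build_conf_proceedings(response_body, record):
    """Hypothetical builder for the proceedings the paper belongs to."""
    return {'type': 'conference_proceedings', 'record': record, 'page': response_body}


def get_request(base_url, identifier, record, item_builder_method):
    """Return a request whose callback already knows which record to build."""
    return Request(
        url=base_url + identifier,
        callback=partial(item_builder_method, record=record),
    )


def parse(records, base_conference_paper_url, base_proceedings_url):
    """Yield two requests per record; items are built in the callbacks."""
    for record in records:
        yield get_request(
            base_conference_paper_url, record['paper_id'], record, build_conf_paper,
        )
        yield get_request(
            base_proceedings_url, record['conf_id'], record, build_conf_proceedings,
        )
```

Binding the builder with `functools.partial` keeps the record out of shared state such as `request.meta`, at the cost of a slightly less obvious callback signature.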
tests/functional/pos/test_pos.py
Outdated
@@ -50,6 +50,7 @@ def set_up_oai_environment():
'CRAWLER_ARGUMENTS': {
'source_file': 'file://' + package_location,
'base_conference_paper_url': 'https://server.local/contribution?id=',
'base_proceedings_url': 'https://server.local/cgi-bin/reader/conf.cgi?confid=',
s/set_up_oai_environment/set_up_environment/
ebe4380
to
fc4ba5b
Compare
hepcrawl/spiders/pos_spider.py
Outdated
@@ -96,22 +96,22 @@ def parse(self, response):
response.selector.remove_namespaces()
records = response.selector.xpath('.//record')
for record in records:
probably this should be:
for record_xml_selector in record_xml_selectors:
hepcrawl/spiders/pos_spider.py
Outdated
@@ -277,7 +278,7 @@ def build_conference_proceedings_item(
record.add_value('journal_title', 'PoS')
record.add_value(
'journal_volume',
self._get_journal_volume(pos_id=pos_id),
self._get_journal_volume(identifier=pos_id),
identifier is too generic: is it a PoS internal identifier? A PoS external one? ORCID? INSPIRE? What?
tests/functional/pos/test_pos.py
Outdated
@@ -32,7 +32,7 @@ def override_generated_fields(record):


@pytest.fixture(scope="function")
def set_up_oai_environment():
def set_up_environment():
This is actually a bad name: using a verb like set_up hints that the function does something (has side effects). In this case it should probably be called something like configuration, since all it does is return a configuration.
Another option is to have two fixtures instead of one: one with the wait and the other with the config.
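A sketch of the two-fixture option, under a few assumptions: the settings keys are the ones visible in the diff, while `build_configuration`, `wait_for_services`, the placeholder package location, and the one-second sleep are all hypothetical and stand in for whatever the real fixture waits on.

```python
import time

import pytest


def build_configuration(package_location):
    """Return the crawler settings; this function has no side effects."""
    return {
        'CRAWLER_ARGUMENTS': {
            'source_file': 'file://' + package_location,
            'base_conference_paper_url': 'https://server.local/contribution?id=',
            'base_proceedings_url': 'https://server.local/cgi-bin/reader/conf.cgi?confid=',
        },
    }


@pytest.fixture(scope='function')
def configuration():
    """Noun name: the fixture only hands back data."""
    # The package location is a placeholder for this sketch.
    return build_configuration(package_location='/path/to/package')


@pytest.fixture(scope='function')
def wait_for_services():
    """Verb name: the fixture exists only for its side effect (hypothetical wait)."""
    time.sleep(1)
```

A test that needs both can request `configuration` and `wait_for_services` together, while a test that only inspects settings can skip the wait.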
0a33a49
to
8320ade
Compare
Build fixed! 😄
8320ade
to
16d2cd7
Compare
16d2cd7
to
8cd9ca1
Compare
8cd9ca1
to
c60a4e5
Compare
Signed-off-by: Spiros Delviniotis <[email protected]> Signed-off-by: David Caro <[email protected]>
Signed-off-by: Spiros Delviniotis <[email protected]>
Addresses inspirehep#159 Signed-off-by: Spiros Delviniotis <[email protected]>
Signed-off-by: David Caro <[email protected]>
Addresses inspirehep#159 Signed-off-by: Spiros Delviniotis <[email protected]>
Signed-off-by: Spiros Delviniotis <[email protected]>
Signed-off-by: David Caro <[email protected]>
Signed-off-by: David Caro <[email protected]>
Signed-off-by: David Caro <[email protected]>
Signed-off-by: David Caro <[email protected]>
Signed-off-by: David Caro <[email protected]>
Signed-off-by: David Caro <[email protected]>
Signed-off-by: David Caro <[email protected]>
Signed-off-by: David Caro <[email protected]>
c60a4e5
to
1dd708d
Compare
Description
Signed-off-by: Spyridon Delviniotis [email protected]
Related Issue
Closes #159
On top of #155
Motivation and Context
Checklist:
RFC