IOP Spider: improve and add tests #206
2173f2c to 8978e83
Some small comments. More generally, I think you should try to refactor this, looking at what's still needed and what can be done better with the more powerful builder we have now (e.g. dates could use `PartialDate`), and make sure that the XPath selectors are not too brittle.
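To illustrate the point about dates, here is a minimal sketch of what a partial date type buys you when the source XML carries only a year, or a year and a month. `SketchPartialDate` and `date_from_fields` are hypothetical names written for this example; they are not the actual `PartialDate` class, whose API may differ.

```python
from collections import namedtuple


class SketchPartialDate(namedtuple("SketchPartialDate", "year month day")):
    """Hypothetical stand-in for a partial date: month and day are optional."""

    def __new__(cls, year, month=None, day=None):
        if day is not None and month is None:
            raise ValueError("a day requires a month")
        return super(SketchPartialDate, cls).__new__(cls, year, month, day)

    def dumps(self):
        """Serialize only the parts that are known, e.g. '2017-03'."""
        parts = ["%04d" % self.year]
        if self.month is not None:
            parts.append("%02d" % self.month)
        if self.day is not None:
            parts.append("%02d" % self.day)
        return "-".join(parts)


def _to_int(value):
    """Convert a scraped text field to int, treating empty text as missing."""
    return int(value) if value and value.strip() else None


def date_from_fields(year, month=None, day=None):
    """Build a partial date from text fields as they might come out of XML."""
    return SketchPartialDate(int(year), _to_int(month), _to_int(day))
```

With a type like this, the spider does not have to invent a day or month that the record never stated: the serialized value simply stops at the last known part.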
hepcrawl/extractors/nlm.py
Outdated
@@ -146,10 +146,10 @@ def get_page_numbers(node):

     fpage = node.xpath(".//FirstPage/text()").extract_first()
     lpage = node.xpath(".//LastPage/text()").extract_first()
     if fpage and lpage:
         try:
This is already done by the literature builder, so it is not needed here.
hepcrawl/tohep.py
Outdated
@@ -243,7 +243,7 @@ def _filter_affiliation(affiliations):
     for author in crawler_record.get('authors', []):
         builder.add_author(builder.make_author(
             full_name=author['full_name'],
-            affiliations=_filter_affiliation(author['affiliations']),
+            affiliations=_filter_affiliation(author.get('affiliations', [])),
The `_filter_affiliation` thing is not needed, as the builder already cleans up empty values. Besides, this should be `raw_affiliations` instead of `affiliations`; see #185.
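The behavior under discussion can be sketched roughly as follows. `clean_author` is a hypothetical helper for illustration only, not the builder's API, and the assumption that free-text affiliations arrive as a list of strings under `raw_affiliations` is mine.

```python
def clean_author(author):
    """Hypothetical helper: normalize one crawler author dict.

    Uses .get() so a missing key does not raise KeyError, and drops
    empty or whitespace-only affiliation strings.
    """
    raw = author.get("raw_affiliations", [])
    return {
        "full_name": author["full_name"],
        "raw_affiliations": [a.strip() for a in raw if a and a.strip()],
    }


authors = [
    {"full_name": "Doe, Jane", "raw_affiliations": ["CERN", "", "  "]},
    {"full_name": "Roe, Richard"},  # no affiliations key at all
]
cleaned = [clean_author(a) for a in authors]
```

If the builder performs this kind of cleanup itself, as the comment says, the spider only needs the `.get()` default and can pass the list through untouched.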
@david-caro do we get feeds for IOP in the end? IIRC, @fschwenn was saying the other day that he's currently doing web scraping.
I'm getting updates through email from the stacks service. I've contacted them to see if we can use OAI, but no reply so far.
Well, I just got a reply saying that there's an OAI service available :). I've now asked for access.
@szymonlopaciuk so I guess it's safe to start working on the OAI version of this spider ;)
8978e83 to 3101fbf
tests/unit/test_parsers_nlm.py
Outdated
def test_field(field_name, expected, parser):
    # if field_name == 'authors':
    #     import pdb
    #     pdb.set_trace()
Remove this commented-out debugging code.
tests/unit/test_parsers_nlm.py
Outdated
    #     import pdb
    #     pdb.set_trace()

    result = getattr(parser, field_name)
assert field_name in expected
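A sketch of why the explicit membership assertion helps: if the comparison only runs when the field is present in the fixture, a missing field name lets the test pass without comparing anything. `FakeParser`, `check_field`, and the `expected` dict are made up for illustration.

```python
class FakeParser(object):
    """Made-up stand-in for the parser under test."""
    title = "A title"
    journal_title = "J. Phys."


def check_field(field_name, expected, parser):
    # Fail loudly if the fixture has no expectation for this field,
    # instead of silently skipping the comparison.
    assert field_name in expected, "no expected value for %r" % field_name
    result = getattr(parser, field_name)
    assert result == expected[field_name]


expected = {"title": "A title"}
check_field("title", expected, FakeParser())

# 'journal_title' has no expected value, so the check must fail.
try:
    check_field("journal_title", expected, FakeParser())
    missing_detected = False
except AssertionError:
    missing_detected = True
```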
tests/unit/test_parsers_nlm.py
Outdated
def test_print_publication_date(expected, parser):
    assert expected['print_publication_date'] == parser.print_publication_date.dumps()
assert 'print_publication_date' in expected
@@ -0,0 +1,334 @@
# -*- coding: utf-8 -*-
This should go into its own PR.
Signed-off-by: Szymon Łopaciuk <[email protected]>
3101fbf to 471cf26
Check `PublicationType` for `Published Erratum` too, if the `<Object>` check didn't return any matches. Add references to the NLM docs.
Signed-off-by: Szymon Łopaciuk <[email protected]>
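The fallback described in this commit message could look roughly like the sketch below. It uses the standard library's `xml.etree.ElementTree` rather than the selectors used in the spider, and `is_erratum`, the `ObjectList` wrapper, and the `Type="Erratum"` attribute value are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def is_erratum(article):
    """Return True if the article node looks like an erratum.

    Assumed layout: <Object Type="Erratum"> elements are checked first;
    <PublicationType> is consulted only when no <Object> matched.
    """
    objects = [o for o in article.iter("Object") if o.get("Type") == "Erratum"]
    if objects:
        return True
    pub_types = [(p.text or "").strip() for p in article.iter("PublicationType")]
    return "Published Erratum" in pub_types


with_object = ET.fromstring(
    '<Article><ObjectList><Object Type="Erratum"/></ObjectList></Article>'
)
with_pub_type = ET.fromstring(
    "<Article><PublicationType>Published Erratum</PublicationType></Article>"
)
plain = ET.fromstring(
    "<Article><PublicationType>Journal Article</PublicationType></Article>"
)
```

Checking `<Object>` first and falling back to `PublicationType` keeps the more specific signal authoritative while still catching records that only carry the publication type.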
90987ae to 75cc606
This adds test records from IOP and fixes some simple issues with the IOP spider to make the tests pass. It also introduces functional tests of the IOP spider.
Signed-off-by: Szymon Łopaciuk <[email protected]>
75cc606 to c2757ab
Depends on #209.
Description
This adds test records from IOP and fixes issues with the IOP spider.
Related Issue
Fixes #205.