Adds: mechanism for crawling only once #162
Conversation
hepcrawl/middlewares.py
Outdated
if new_value != self.db.get(key=key_in_db):
    spider.logger.debug('Storing to DB - UPDATED file:: {}'.format(request.url))
    return
self.stats.inc_value('crawl_once/ignored')
@staticmethod
def _is_newer(this_timestamp, than_this_timestamp):
    # A larger timestamp means a more recent file.
    return this_timestamp > than_this_timestamp

def _has_to_be_crawled(self, request, spider):
    request_db_key = self._get_key(request)
    if request_db_key not in self.db:
        return True
    new_request_timestamp = self._get_timestamp(request, spider)
    if self._is_newer(
        new_request_timestamp,
        self.db.get(key=request_db_key),
    ):
        return True
    return False

def process_request(self, request, spider):
    # ... populate crawl_once_value and key
    if not request.meta.get('crawl_once', self.default):
        return
    if not self._has_to_be_crawled(request, spider):
        self.stats.inc_value('crawl_once/ignored')
        raise IgnoreRequest()
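For context, a middleware like this only takes effect once it is registered in the project settings. A minimal sketch of the wiring, assuming the subclass lives in hepcrawl/middlewares.py; the priority values are illustrative, following the scrapy-crawl-once README:

# settings.py -- illustrative registration of the middleware.
SPIDER_MIDDLEWARES = {
    'hepcrawl.middlewares.HepcrawlCrawlOnceMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
    'hepcrawl.middlewares.HepcrawlCrawlOnceMiddleware': 50,
}

# scrapy-crawl-once options; CRAWL_ONCE_DEFAULT=False means spiders opt in
# per request via request.meta['crawl_once'].
CRAWL_ONCE_ENABLED = True
CRAWL_ONCE_DEFAULT = False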
Force-pushed from 7d832ae to c6a027a
hepcrawl/spiders/wsp_spider.py
Outdated
"""Return this articles' collection.""" | ||
conference = node.xpath('.//conference').extract() | ||
if conference or current_journal_title == "International Journal of Modern Physics: Conference Series": | ||
if conference or current_journal_title == "International Journal of Modern Physics:" \ |
if (
    conference or
    current_journal_title == (
        "basbalbalbal"
        "ntshnsthsnthsnth"
    )
):
    ...
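Applied to the actual condition from the diff above, that suggestion could read as follows; the split relies on Python's implicit concatenation of adjacent string literals:

if (
    conference or
    current_journal_title == (
        "International Journal of Modern Physics: "
        "Conference Series"
    )
):
    ...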
Force-pushed from 22a8457 to f694856
hepcrawl/middlewares.py
Outdated
""" | ||
|
||
def __init__(self, *args, **kwargs): | ||
super(HepcrawlCrawlOnceMiddleware, self).__init__(*args, **kwargs) |
delete
hepcrawl/middlewares.py
Outdated
request.meta['crawl_once_key'] = os.path.basename(request.url)
request.meta['crawl_once_value'] = self._get_timestamp(request, spider)

if not request.meta.get('crawl_once', self.default):
move this before line 125
hepcrawl/middlewares.py
Outdated
file_path = full_url.replace(
    '{0}://{1}/'.format(parsed_url.scheme, ftp_host),
    '',
)
You can move this to a method called get_relative_path or similar so it's clearer.
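A sketch of what that helper might look like; the method name and its being a staticmethod are assumptions, and urlparse is already imported by the surrounding code:

@staticmethod
def _get_relative_path(url, ftp_host):
    # Hypothetical helper: strip the '<scheme>://<host>/' prefix so only
    # the path relative to the FTP root remains.
    parsed_url = urlparse(url)
    return url.replace(
        '{0}://{1}/'.format(parsed_url.scheme, ftp_host),
        '',
    )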
hepcrawl/middlewares.py
Outdated
parsed_url = urlparse(request.url)
full_url = request.url
if parsed_url.scheme == 'ftp':
    ftp_host, params = ftp_connection_info(spider.ftp_host, spider.ftp_netrc)
You can move this whole if block to a get_ftp_timestamp method.
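A sketch of the extracted method, under the assumptions that ftp_connection_info returns the host plus a credentials dict (the key names below are guesses) and that ftputil is available:

def _get_ftp_timestamp(self, spider, url):
    # Hypothetical extraction of the 'ftp' branch shown above.
    ftp_host, params = ftp_connection_info(spider.ftp_host, spider.ftp_netrc)
    file_path = self._get_relative_path(url, ftp_host)
    with ftputil.FTPHost(
        ftp_host,
        params['ftp_user'],      # assumed key name
        params['ftp_password'],  # assumed key name
    ) as host:
        # Use the remote file's modification time as the crawl-once value.
        return host.path.getmtime(file_path)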
tests/fix_rights.c
Outdated
@@ -11,6 +11,7 @@
 char *VENV_PATH = "/hepcrawl_venv/";
 char *CODE_PATH = "/code/";
 char *TMP_PATH = "/tmp/";
+char *VAR_PATH = "/var/";
Instead of this, you should use a configuration override to set it to somewhere inside the test directory.
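A minimal sketch of such an override as a pytest fixture, assuming the DB location is controlled by scrapy-crawl-once's CRAWL_ONCE_PATH setting:

import pytest
from scrapy.utils.project import get_project_settings

@pytest.fixture
def crawler_settings(tmpdir):
    # Keep the crawl-once DB inside the per-test temporary directory
    # instead of a system-wide path such as /var.
    settings = get_project_settings()
    settings.set('CRAWL_ONCE_PATH', str(tmpdir.join('crawl_once')))
    return settings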
Force-pushed from abf2c97 to 9d28bfd
Force-pushed from 9d28bfd to cd67e47
Addresses inspirehep#161. Signed-off-by: Spiros Delviniotis <[email protected]>

* Adds: extends `scrapy-crawl-once` plug-in to support custom data fields in the DB per spider.
* Adds: enables the `scrapy-crawl-once` plug-in.
Closes inspirehep#161. Signed-off-by: Spiros Delviniotis <[email protected]>

Addresses inspirehep#161. Signed-off-by: Spiros Delviniotis <[email protected]>

* Adds: tests for `scrapy-crawl-once` over FTP and FILE.
Addresses inspirehep#161. Signed-off-by: Spiros Delviniotis <[email protected]>

* Adds: creates a temporary folder to unzip the crawled files; the `InspireCeleryPushPipeline.close_spider` method deletes this temporary folder.
Addresses inspirehep#161. Signed-off-by: Spiros Delviniotis <[email protected]>

* Adds: the new default argument of the `clean_dir` method is the DB folder generated by the `scrapy-crawl-once` plug-in.
Addresses inspirehep#161. Signed-off-by: Spiros Delviniotis <[email protected]>

* Adds: makes the staticmethod `_get_collections` a classmethod.
* Adds: minor indentation fix.
* Removes: unused import.
Signed-off-by: Spiros Delviniotis <[email protected]>

* Adds: makes `InspireAPIPushPipeline._cleanup` a staticmethod.
Signed-off-by: Spiros Delviniotis <[email protected]>

* Adds: variable in `settings.py` to specify where to store the `scrapy-crawl-once` DB.
Addresses inspirehep#161. Signed-off-by: Spiros Delviniotis <[email protected]>

That allows waiting properly for the services to be up, instead of adding a dummy sleep.
Signed-off-by: David Caro <[email protected]>

That way the scheme (ftp/http/local...) and the file name are used as the key, instead of just the file name, to avoid downloading a file over FTP and then not crawling it afterwards.
Signed-off-by: David Caro <[email protected]>

As it changes the actual URL of the files to download, the hash also changes.
Signed-off-by: David Caro <[email protected]>
Force-pushed from cd67e47 to bd72a4f
Description
* Adds `scrapy-crawl-once` plug-in to `setup.py`.
* Adds: extends `scrapy-crawl-once` plug-in to support custom data fields in the DB per spider.
* Adds: enables the `scrapy-crawl-once` plug-in.
* Adds: tests for `scrapy-crawl-once` over FTP and FILE.
* Fixes: `WSP` spider and `utils` module so they do not re-check already crawled records.
* Fixes: `arxiv` tests.
* Adds: creates a temporary folder to unzip the crawled files; the `InspireCeleryPushPipeline.close_spider` method deletes this temporary folder.
* Adds: the new default argument of the `clean_dir` method is the DB folder generated by the `scrapy-crawl-once` plug-in.
* Adds: `clean_dir` to tests that use the DB.
* Adds: makes the staticmethod `_get_collections` a classmethod.
* Adds: makes `InspireAPIPushPipeline._cleanup` a staticmethod.
* Adds: variable in `settings.py` to specify where to store the `scrapy-crawl-once` DB.

Signed-off-by: Spyridon Delviniotis [email protected]
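The `settings.py` variable mentioned above could look like the sketch below; CRAWL_ONCE_PATH is the setting scrapy-crawl-once reads, while the environment variable name is hypothetical:

# settings.py -- sketch of making the crawl-once DB location configurable.
import os

CRAWL_ONCE_PATH = os.environ.get(
    'APP_CRAWL_ONCE_PATH',  # hypothetical environment variable
    '/var/lib/scrapy/crawl_once',
)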
Related Issue
Closes #161
Motivation and Context
Checklist:
* … RFC … and look for it).