
Adds: mechanism for crawling only once #162

Merged
26 commits merged into inspirehep:master on Sep 20, 2017

Conversation

@spirosdelviniotis (Contributor) commented Aug 17, 2017

Description

  • Adds the scrapy-crawl-once plug-in to setup.py.
  • Extends the scrapy-crawl-once plug-in to support custom data fields in the DB per spider.
  • Enables the scrapy-crawl-once plug-in (a settings sketch follows this list).
  • Adds tests for scrapy-crawl-once covering the FTP and FILE cases.
  • Refactors the WSP spider and the utils module so they no longer check for already-crawled records.
  • Removes an unused function from the arXiv tests.
  • Creates a temporary folder to unzip the crawled files into and deletes it in InspireCeleryPushPipeline.close_spider.
  • Adds a new default argument to clean_dir, pointing to the DB folder generated by scrapy-crawl-once.
  • Adds clean_dir calls to the tests that use the DB.
  • Turns the _get_collections static method into a class method.
  • Fixes a minor indentation issue.
  • Removes an unused import.
  • Makes InspireAPIPushPipeline._cleanup a static method.
  • Adds a variable in settings.py to specify where to store the scrapy-crawl-once DB.
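For orientation, enabling the plug-in boils down to registering its middlewares and pointing its DB at a configurable path. The snippet below is only a sketch of what such a settings.py could look like, based on the standard scrapy-crawl-once settings (CRAWL_ONCE_ENABLED, CRAWL_ONCE_PATH); the PR itself registers a HepcrawlCrawlOnceMiddleware subclass, so the exact names and values differ.

# settings.py sketch; illustrative only, not the exact diff of this PR.
import os

SPIDER_MIDDLEWARES = {
    # The PR subclasses this as HepcrawlCrawlOnceMiddleware; the upstream
    # plug-in documents these order values.
    'scrapy_crawl_once.CrawlOnceMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawl_once.CrawlOnceMiddleware': 50,
}

CRAWL_ONCE_ENABLED = True
# Store the per-spider SQLite DBs in a configurable location instead of the
# plug-in's default folder under '.scrapy'.
CRAWL_ONCE_PATH = os.environ.get('CRAWL_ONCE_PATH', '/var/crawl_once/')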

Signed-off-by: Spyridon Delviniotis [email protected]

Related Issue

Closes #161

Motivation and Context

Checklist:

  • I have all the information that I need (if not, move to RFC and look for it).
  • I linked the related issue(s) in the corresponding commit logs.
  • I wrote good commit log messages.
  • My code follows the code style of this project.
  • I've added any new docs if API/utils methods were added.
  • I have updated the existing documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

if new_value != self.db.get(key=key_in_db):
    spider.logger.debug('Storing to DB - UPDATED file:: {}'.format(request.url))
    return
self.stats.inc_value('crawl_once/ignored')
@david-caro (Contributor) commented Aug 17, 2017:

@staticmethod
def _is_newer(this_timestamp, than_this_timestamp):
    # Crawl again only if the incoming timestamp is more recent than the stored one.
    return this_timestamp > than_this_timestamp

def _has_to_be_crawled(self, request, spider):
    request_db_key = self._get_key(request)

    if request_db_key not in self.db:
        return True

    new_request_timestamp = self._get_timestamp(request, spider)
    if self._is_newer(
        new_request_timestamp,
        self.db.get(key=request_db_key),
    ):
        return True

    return False

def process_request(self, request, spider):
    # ... populate crawl_once_value and key
    if not request.meta.get('crawl_once', self.default):
        return

    if not self._has_to_be_crawled(request, spider):
        self.stats.inc_value('crawl_once/ignored')
        raise IgnoreRequest()

@spirosdelviniotis force-pushed the hepcrawl_crawl_once branch 2 times, most recently from 7d832ae to c6a027a on August 18, 2017 13:01
"""Return this articles' collection."""
conference = node.xpath('.//conference').extract()
if conference or current_journal_title == "International Journal of Modern Physics: Conference Series":
if conference or current_journal_title == "International Journal of Modern Physics:" \
Contributor comment:

if (
    conference or
    current_journal_title == (
        "basbalbalbal"
        "ntshnsthsnthsnth"
    )
):
    ...

"""

def __init__(self, *args, **kwargs):
    super(HepcrawlCrawlOnceMiddleware, self).__init__(*args, **kwargs)
Contributor comment:
delete

request.meta['crawl_once_key'] = os.path.basename(request.url)
request.meta['crawl_once_value'] = self._get_timestamp(request, spider)

if not request.meta.get('crawl_once', self.default):
Contributor comment:

move this before line 125

file_path = full_url.replace(
    '{0}://{1}/'.format(parsed_url.scheme, ftp_host),
    '',
)
Contributor comment:

You can move this to a method called get_relative_path or similar so it's clearer.
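For instance, a minimal sketch of such a helper, reusing the full_url, parsed_url and ftp_host values from the snippet above (the name and signature are only illustrative):

@staticmethod
def _get_relative_path(full_url, parsed_url, ftp_host):
    """Strip the '<scheme>://<host>/' prefix, keeping only the path on the server."""
    return full_url.replace(
        '{0}://{1}/'.format(parsed_url.scheme, ftp_host),
        '',
    )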

parsed_url = urlparse(request.url)
full_url = request.url
if parsed_url.scheme == 'ftp':
    ftp_host, params = ftp_connection_info(spider.ftp_host, spider.ftp_netrc)
Contributor comment:

You can move all this if block to get_ftp_timestamp
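A rough sketch of what such a get_ftp_timestamp helper could look like, assuming the timestamp is read from the FTP server's MDTM reply via ftplib and that ftp_connection_info exposes the credentials as a dict; _get_relative_path is the hypothetical helper sketched above, and the real PR code may do this differently:

import ftplib

from six.moves.urllib.parse import urlparse

def _get_ftp_timestamp(self, request, spider):
    """Ask the FTP server for the modification time of the requested file."""
    parsed_url = urlparse(request.url)
    ftp_host, params = ftp_connection_info(spider.ftp_host, spider.ftp_netrc)
    file_path = self._get_relative_path(request.url, parsed_url, ftp_host)

    ftp = ftplib.FTP(ftp_host)
    # Assumes ftp_connection_info returns the user/password under these keys.
    ftp.login(params['ftp_user'], params['ftp_password'])
    try:
        # MDTM replies with e.g. '213 20170817153012'; keep the timestamp part.
        return ftp.voidcmd('MDTM {0}'.format(file_path)).split()[-1]
    finally:
        ftp.quit()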

@@ -11,6 +11,7 @@
char *VENV_PATH = "/hepcrawl_venv/";
char *CODE_PATH = "/code/";
char *TMP_PATH = "/tmp/";
char *VAR_PATH = "/var/";
Contributor comment:

instead of this you should use a configuration override to set it to somewhere inside the test directory.
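For example, a test could redirect the crawl-once DB into a temporary directory through a settings override instead of writing under /var; a minimal sketch, assuming the plug-in's CRAWL_ONCE_PATH setting and a throwaway helper for the test setup:

import os
import tempfile

from scrapy.utils.project import get_project_settings

def get_test_settings():
    """Project settings with the crawl-once DB redirected into a temp dir."""
    settings = get_project_settings()
    crawl_once_path = os.path.join(tempfile.mkdtemp(), 'crawl_once')
    os.makedirs(crawl_once_path)
    settings.set('CRAWL_ONCE_PATH', crawl_once_path)
    return settings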

@spirosdelviniotis force-pushed the hepcrawl_crawl_once branch 2 times, most recently from abf2c97 to 9d28bfd on August 21, 2017 15:06
Signed-off-by: Spiros Delviniotis <[email protected]>
Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
* Adds: extends `scrapy-crawl-once` plug-in for supporting custom data-fields in DB per spider.
* Adds: enables `scrapy-crawl-once` plug-in.

Closes inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
* Adds: tests about `scrapy-crawl-once` for FTP and FILE.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
* Adds: creates a temporary folder to unzip the crawled files into, and deletes it in the
	`InspireCeleryPushPipeline.close_spider` method.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
* Adds: a new default argument to the `clean_dir` method, defaulting to the DB folder generated by the `scrapy-crawl-once` plug-in.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
* Adds: makes the `_get_collections` static method a class method.
* Adds: minor indentation fix.
* Removes: unused import.

Signed-off-by: Spiros Delviniotis <[email protected]>
* Adds: makes `InspireAPIPushPipeline._cleanup` a static method.

Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis and others added 15 commits September 20, 2017 14:42
* Adds: variable in `settings.py` to specify where to store `scrapy-crawl-once` DB.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
Signed-off-by: David Caro <[email protected]>
Signed-off-by: David Caro <[email protected]>
That allows properly waiting for the services to be up, instead of
adding a dummy sleep.

Signed-off-by: David Caro <[email protected]>
That way the scheme (ftp/http/local...) and the file name are used as the
key, instead of just the file name, to avoid downloading a file from FTP
and then not crawling it afterwards (see the sketch below).

Signed-off-by: David Caro <[email protected]>
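Purely as an illustration of that keying scheme (the real key construction lives in the middleware and may differ), a hypothetical helper could look like:

import os

from six.moves.urllib.parse import urlparse

def crawl_once_key(url):
    """Combine the URL scheme and the file name, so an FTP download and a local
    file with the same name get distinct crawl-once entries."""
    parsed = urlparse(url)
    return '{0}://{1}'.format(parsed.scheme, os.path.basename(parsed.path))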
As it changes the actual URL of the files to download, the hash also
changes.

Signed-off-by: David Caro <[email protected]>
@david-caro merged commit d98cb2d into inspirehep:master on Sep 20, 2017