
Adds: mechanism for crawling only once #162

Merged
26 commits merged into inspirehep:master on Sep 20, 2017

Conversation

@spirosdelviniotis (Contributor) commented Aug 17, 2017

Description

  • Adds the scrapy-crawl-once plug-in to setup.py.
  • Extends the scrapy-crawl-once plug-in to support custom data fields in the DB per spider.
  • Enables the scrapy-crawl-once plug-in (a settings sketch follows this list).
  • Adds tests for scrapy-crawl-once covering the FTP and FILE cases.
  • Refactors the WSP spider and the utils module so they no longer check for already-crawled records.
  • Removes an unused function from the arXiv tests.
  • Creates a temporary folder to unzip the crawled files into and deletes it in InspireCeleryPushPipeline.close_spider.
  • Adds a new default argument to clean_dir, pointing to the DB folder generated by scrapy-crawl-once.
  • Adds clean_dir calls to the tests that use the DB.
  • Turns the _get_collections static method into a class method.
  • Fixes a minor indentation issue.
  • Removes an unused import.
  • Makes InspireAPIPushPipeline._cleanup a static method.
  • Adds a variable in settings.py to specify where to store the scrapy-crawl-once DB.
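For orientation, enabling the plug-in boils down to registering its middlewares and pointing its DB at a configurable path. The snippet below is only a sketch of what such a settings.py could look like, based on the standard scrapy-crawl-once settings (CRAWL_ONCE_ENABLED, CRAWL_ONCE_PATH); the PR itself registers a HepcrawlCrawlOnceMiddleware subclass, so the exact names and values differ.

# settings.py sketch; illustrative only, not the exact diff of this PR.
import os

SPIDER_MIDDLEWARES = {
    # The PR subclasses this as HepcrawlCrawlOnceMiddleware; the upstream
    # plug-in documents these order values.
    'scrapy_crawl_once.CrawlOnceMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawl_once.CrawlOnceMiddleware': 50,
}

CRAWL_ONCE_ENABLED = True
# Store the per-spider SQLite DBs in a configurable location instead of the
# plug-in's default folder under '.scrapy'.
CRAWL_ONCE_PATH = os.environ.get('CRAWL_ONCE_PATH', '/var/crawl_once/')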

Signed-off-by: Spyridon Delviniotis [email protected]

Related Issue

Closes #161

Motivation and Context

Checklist:

  • I have all the information that I need (if not, move to RFC and look for it).
  • I linked the related issue(s) in the corresponding commit logs.
  • I wrote good commit log messages.
  • My code follows the code style of this project.
  • I've added any new docs if API/utils methods were added.
  • I have updated the existing documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

if new_value != self.db.get(key=key_in_db):
    spider.logger.debug('Storing to DB - UPDATED file:: {}'.format(request.url))
    return
self.stats.inc_value('crawl_once/ignored')
@david-caro (Contributor) commented Aug 17, 2017:

@staticmethod
def _is_newer(this_timestamp, than_this_timestamp):
    # Crawl again only if the incoming timestamp is more recent than the stored one.
    return this_timestamp > than_this_timestamp

def _has_to_be_crawled(self, request, spider):
    request_db_key = self._get_key(request)

    if request_db_key not in self.db:
        return True

    new_request_timestamp = self._get_timestamp(request, spider)
    if self._is_newer(
        new_request_timestamp,
        self.db.get(key=request_db_key),
    ):
        return True

    return False

def process_request(self, request, spider):
    # ... populate crawl_once_value and key
    if not request.meta.get('crawl_once', self.default):
        return

    if not self._has_to_be_crawled(request, spider):
        self.stats.inc_value('crawl_once/ignored')
        raise IgnoreRequest()

@spirosdelviniotis force-pushed the hepcrawl_crawl_once branch 2 times, most recently from 7d832ae to c6a027a on August 18, 2017 13:01
"""Return this articles' collection."""
conference = node.xpath('.//conference').extract()
if conference or current_journal_title == "International Journal of Modern Physics: Conference Series":
if conference or current_journal_title == "International Journal of Modern Physics:" \
Contributor comment:

if (
    conference or
    current_journal_title == (
        "basbalbalbal"
        "ntshnsthsnthsnth"
    )
):
    ...

"""

def __init__(self, *args, **kwargs):
    super(HepcrawlCrawlOnceMiddleware, self).__init__(*args, **kwargs)
Contributor comment:
delete

request.meta['crawl_once_key'] = os.path.basename(request.url)
request.meta['crawl_once_value'] = self._get_timestamp(request, spider)

if not request.meta.get('crawl_once', self.default):
Contributor comment:

move this before line 125

file_path = full_url.replace(
    '{0}://{1}/'.format(parsed_url.scheme, ftp_host),
    '',
)
Contributor comment:

You can move this to a method called get_relative_path or similar so it's clearer.
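For instance, a minimal sketch of such a helper, reusing the full_url, parsed_url and ftp_host values from the snippet above (the name and signature are only illustrative):

@staticmethod
def _get_relative_path(full_url, parsed_url, ftp_host):
    """Strip the '<scheme>://<host>/' prefix, keeping only the path on the server."""
    return full_url.replace(
        '{0}://{1}/'.format(parsed_url.scheme, ftp_host),
        '',
    )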

parsed_url = urlparse(request.url)
full_url = request.url
if parsed_url.scheme == 'ftp':
    ftp_host, params = ftp_connection_info(spider.ftp_host, spider.ftp_netrc)
Contributor comment:

You can move all this if block to get_ftp_timestamp
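A rough sketch of what such a get_ftp_timestamp helper could look like, assuming the timestamp is read from the FTP server's MDTM reply via ftplib and that ftp_connection_info exposes the credentials as a dict; _get_relative_path is the hypothetical helper sketched above, and the real PR code may do this differently:

import ftplib

from six.moves.urllib.parse import urlparse

def _get_ftp_timestamp(self, request, spider):
    """Ask the FTP server for the modification time of the requested file."""
    parsed_url = urlparse(request.url)
    ftp_host, params = ftp_connection_info(spider.ftp_host, spider.ftp_netrc)
    file_path = self._get_relative_path(request.url, parsed_url, ftp_host)

    ftp = ftplib.FTP(ftp_host)
    # Assumes ftp_connection_info returns the user/password under these keys.
    ftp.login(params['ftp_user'], params['ftp_password'])
    try:
        # MDTM replies with e.g. '213 20170817153012'; keep the timestamp part.
        return ftp.voidcmd('MDTM {0}'.format(file_path)).split()[-1]
    finally:
        ftp.quit()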

@@ -11,6 +11,7 @@
char *VENV_PATH = "/hepcrawl_venv/";
char *CODE_PATH = "/code/";
char *TMP_PATH = "/tmp/";
char *VAR_PATH = "/var/";
Contributor comment:

instead of this you should use a configuration override to set it to somewhere inside the test directory.
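For example, a test could redirect the crawl-once DB into a temporary directory through a settings override instead of writing under /var; a minimal sketch, assuming the plug-in's CRAWL_ONCE_PATH setting and a throwaway helper for the test setup:

import os
import tempfile

from scrapy.utils.project import get_project_settings

def get_test_settings():
    """Project settings with the crawl-once DB redirected into a temp dir."""
    settings = get_project_settings()
    crawl_once_path = os.path.join(tempfile.mkdtemp(), 'crawl_once')
    os.makedirs(crawl_once_path)
    settings.set('CRAWL_ONCE_PATH', crawl_once_path)
    return settings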

@spirosdelviniotis force-pushed the hepcrawl_crawl_once branch 2 times, most recently from abf2c97 to 9d28bfd on August 21, 2017 15:06
Signed-off-by: Spiros Delviniotis <[email protected]>
Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
* Adds: extends `scrapy-crawl-once` plug-in for supporting custom data-fields in DB per spider.
* Adds: enables `scrapy-crawl-once` plug-in.

Closes inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
* Adds: tests about `scrapy-crawl-once` for FTP and FILE.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
* Adds: creates a temporary folder to unzip the crawled files into, and deletes it in the
	`InspireCeleryPushPipeline.close_spider` method.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
* Adds: a new default argument to the `clean_dir` method, defaulting to the DB folder generated by the `scrapy-crawl-once` plug-in.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
* Adds: makes the `_get_collections` static method a class method.
* Adds: minor indentation fix.
* Removes: unused import.

Signed-off-by: Spiros Delviniotis <[email protected]>
* Adds: makes `InspireAPIPushPipeline._cleanup` a static method.

Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis and others added 15 commits September 20, 2017 14:42
* Adds: variable in `settings.py` to specify where to store `scrapy-crawl-once` DB.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
Signed-off-by: David Caro <[email protected]>
Signed-off-by: David Caro <[email protected]>
That allows properly waiting for the services to be up, instead of
adding a dummy sleep.

Signed-off-by: David Caro <[email protected]>
That way the scheme (ftp/http/local...) and the file name are used as the
key, instead of just the file name, to avoid downloading a file from FTP
and then not crawling it afterwards (see the sketch below).

Signed-off-by: David Caro <[email protected]>
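Purely as an illustration of that keying scheme (the real key construction lives in the middleware and may differ), a hypothetical helper could look like:

import os

from six.moves.urllib.parse import urlparse

def crawl_once_key(url):
    """Combine the URL scheme and the file name, so an FTP download and a local
    file with the same name get distinct crawl-once entries."""
    parsed = urlparse(url)
    return '{0}://{1}'.format(parsed.scheme, os.path.basename(parsed.path))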
As it changes the actual URL of the files to download, the hash also
changes.

Signed-off-by: David Caro <[email protected]>
@david-caro merged commit d98cb2d into inspirehep:master on Sep 20, 2017