Add mechanism for crawling only once #161

Closed
spirosdelviniotis opened this issue Aug 17, 2017 · 0 comments · Fixed by #162

We need a way to avoid crawling the same records multiple times.

Expected Behavior

We are going to extend the scrapy-crawl-once plug-in.

Current Behavior

Hepcrawl re-crawls records generated from previous executions.

Steps to Reproduce (for bugs)

  1. Adapt the scrapy-crawl-once plug-in to Hepcrawl.
  2. Extend the scrapy-crawl-once plug-in so that it stores a key-value record in the DB for every request. The key is the unique file name (FTP and FILE requests) or the unique id in the request parameters (HTTP and HTTPS requests). The value is the last-modified timestamp (FTP and FILE requests) or the crawl timestamp (HTTP and HTTPS requests).
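
The key-value scheme described in step 2 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the plug-in's or Hepcrawl's actual code: the names `record_key` and `should_crawl`, the `id` query parameter, the use of the file's base name as the key, and the plain `dict` standing in for the DB are all hypothetical. It also assumes that an FTP or FILE record should be crawled again when its last-modified timestamp changes, which the issue does not state explicitly.

```python
import posixpath
import time
from urllib.parse import parse_qs, urlparse


def record_key(url, id_param="id"):
    """Return the DB key for a request URL (hypothetical helper)."""
    parsed = urlparse(url)
    if parsed.scheme in ("ftp", "file"):
        # FTP/FILE requests: assume the base file name is unique.
        return posixpath.basename(parsed.path)
    # HTTP/HTTPS requests: assume a unique id is passed as a query parameter.
    values = parse_qs(parsed.query).get(id_param)
    if not values:
        raise ValueError("no %r parameter in %s" % (id_param, url))
    return values[0]


def should_crawl(db, url, last_modified=None, now=None):
    """Decide whether to crawl `url`, recording it in `db` if so.

    `db` is any dict-like store. For simplicity the record is written
    before the crawl happens; a real middleware would more likely write
    it only after the item was scraped successfully.
    """
    key = record_key(url)
    if urlparse(url).scheme in ("ftp", "file"):
        # Value is the last-modified timestamp; skip unchanged files.
        if key in db and db[key] == last_modified:
            return False
        db[key] = last_modified
        return True
    # Value is the crawl timestamp; skip anything already seen.
    if key in db:
        return False
    db[key] = time.time() if now is None else now
    return True
```

With this sketch, calling `should_crawl` twice for the same FTP file and the same `last_modified` value returns `True` the first time and `False` the second, while a newer `last_modified` makes it return `True` again.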

Context

We are trying to crawl every record only once.


spirosdelviniotis self-assigned this Aug 17, 2017
spirosdelviniotis added a commit to spirosdelviniotis/hepcrawl that referenced this issue Aug 17, 2017
Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis added a commit to spirosdelviniotis/hepcrawl that referenced this issue Aug 17, 2017
* Adds: extends `scrapy-crawl-once` plug-in for supporting custom data-fields in DB per spider.
* Adds: enables `scrapy-crawl-once` plug-in.

Closes inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis added a commit to spirosdelviniotis/hepcrawl that referenced this issue Aug 17, 2017
* Adds: tests of `scrapy-crawl-once` for FTP and FILE.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis added a commit to spirosdelviniotis/hepcrawl that referenced this issue Aug 18, 2017
* Re-factored: `WSP` spider and `utils` module in order not to check the crawled records.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis added a commit to spirosdelviniotis/hepcrawl that referenced this issue Aug 18, 2017
* Adds: a temporary folder for unzipping the crawled files, which the `InspireCeleryPushPipeline.close_spider` method deletes.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis added a commit to spirosdelviniotis/hepcrawl that referenced this issue Aug 18, 2017
* Adds: new default argument to the `clean_dir` method, namely the DB folder generated by the `scrapy-crawl-once` plugin.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis added a commit to spirosdelviniotis/hepcrawl that referenced this issue Aug 18, 2017
* Adds: variable in `settings.py` to specify where to store `scrapy-crawl-once` DB.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis added a commit to spirosdelviniotis/hepcrawl that referenced this issue Aug 21, 2017
* Adds: extends `scrapy-crawl-once` plug-in for supporting custom data-fields in DB per spider.
* Adds: enables `scrapy-crawl-once` plug-in.

Closes inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis added a commit to spirosdelviniotis/hepcrawl that referenced this issue Aug 21, 2017
Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis added a commit to spirosdelviniotis/hepcrawl that referenced this issue Aug 21, 2017
* Adds: tests of `scrapy-crawl-once` for FTP and FILE.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis added a commit to spirosdelviniotis/hepcrawl that referenced this issue Aug 21, 2017
* Re-factored: `WSP` spider and `utils` module in order not to check the crawled records.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis added a commit to spirosdelviniotis/hepcrawl that referenced this issue Aug 21, 2017
* Adds: a temporary folder for unzipping the crawled files, which the `InspireCeleryPushPipeline.close_spider` method deletes.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis added a commit to spirosdelviniotis/hepcrawl that referenced this issue Aug 21, 2017
* Adds: new default argument to the `clean_dir` method, namely the DB folder generated by the `scrapy-crawl-once` plugin.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis added a commit to spirosdelviniotis/hepcrawl that referenced this issue Aug 21, 2017
* Adds: variable in `settings.py` to specify where to store `scrapy-crawl-once` DB.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro pushed a commit to david-caro/hepcrawl that referenced this issue Aug 25, 2017
Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro pushed a commit to david-caro/hepcrawl that referenced this issue Aug 25, 2017
* Adds: extends `scrapy-crawl-once` plug-in for supporting custom data-fields in DB per spider.
* Adds: enables `scrapy-crawl-once` plug-in.

Closes inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro pushed a commit to david-caro/hepcrawl that referenced this issue Aug 25, 2017
* Adds: tests of `scrapy-crawl-once` for FTP and FILE.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro pushed a commit to david-caro/hepcrawl that referenced this issue Aug 25, 2017
* Re-factored: `WSP` spider and `utils` module in order not to check the crawled records.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro pushed a commit to david-caro/hepcrawl that referenced this issue Aug 25, 2017
* Adds: a temporary folder for unzipping the crawled files, which the `InspireCeleryPushPipeline.close_spider` method deletes.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro pushed a commit to david-caro/hepcrawl that referenced this issue Aug 25, 2017
* Adds: new default argument to the `clean_dir` method, namely the DB folder generated by the `scrapy-crawl-once` plugin.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro pushed a commit to david-caro/hepcrawl that referenced this issue Aug 25, 2017
* Adds: variable in `settings.py` to specify where to store `scrapy-crawl-once` DB.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro pushed a commit to spirosdelviniotis/hepcrawl that referenced this issue Sep 20, 2017
* Adds: a temporary folder for unzipping the crawled files, which the `InspireCeleryPushPipeline.close_spider` method deletes.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro pushed a commit to spirosdelviniotis/hepcrawl that referenced this issue Sep 20, 2017
* Adds: new default argument to the `clean_dir` method, namely the DB folder generated by the `scrapy-crawl-once` plugin.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro pushed a commit to spirosdelviniotis/hepcrawl that referenced this issue Sep 20, 2017
* Adds: variable in `settings.py` to specify where to store `scrapy-crawl-once` DB.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro pushed a commit to spirosdelviniotis/hepcrawl that referenced this issue Sep 20, 2017
Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro pushed a commit to spirosdelviniotis/hepcrawl that referenced this issue Sep 20, 2017
* Adds: extends `scrapy-crawl-once` plug-in for supporting custom data-fields in DB per spider.
* Adds: enables `scrapy-crawl-once` plug-in.

Closes inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro pushed a commit to spirosdelviniotis/hepcrawl that referenced this issue Sep 20, 2017
* Adds: tests of `scrapy-crawl-once` for FTP and FILE.

Addresses inspirehep#161

Signed-off-by: Spiros Delviniotis <[email protected]>