Add mechanism for crawling only once #161
spirosdelviniotis
added a commit
to spirosdelviniotis/hepcrawl
that referenced
this issue
Aug 17, 2017
Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis
added a commit
to spirosdelviniotis/hepcrawl
that referenced
this issue
Aug 17, 2017
* Adds: extends `scrapy-crawl-once` plug-in for supporting custom data-fields in DB per spider. * Adds: enables `scrapy-crawl-once` plug-in. Closes inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis
added a commit
to spirosdelviniotis/hepcrawl
that referenced
this issue
Aug 17, 2017
* Adds: tests about `scrapy-clawl-once` for FTP and FILE. Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis
added a commit
to spirosdelviniotis/hepcrawl
that referenced
this issue
Aug 18, 2017
* Re-factored: `WSP` spider and `utils` module in order not to check the crawled records. Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis
added a commit
to spirosdelviniotis/hepcrawl
that referenced
this issue
Aug 18, 2017
* Adds: create temporary folder to unzip the crawled files and in `InspireCeleryPushPipeline.close_spider` methods deletes this temporary folder. Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis
added a commit
to spirosdelviniotis/hepcrawl
that referenced
this issue
Aug 18, 2017
* Adds: new default argument to `clean_dir` method is the generated DB folder from `scrapy-crawl-once` plugin. Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis
added a commit
to spirosdelviniotis/hepcrawl
that referenced
this issue
Aug 18, 2017
Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis
added a commit
to spirosdelviniotis/hepcrawl
that referenced
this issue
Aug 18, 2017
* Adds: variable in `settings.py` to specify where to store `scrapy-crawl-once` DB. Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis
added a commit
to spirosdelviniotis/hepcrawl
that referenced
this issue
Aug 21, 2017
* Adds: extends `scrapy-crawl-once` plug-in for supporting custom data-fields in DB per spider. * Adds: enables `scrapy-crawl-once` plug-in. Closes inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis
added a commit
to spirosdelviniotis/hepcrawl
that referenced
this issue
Aug 21, 2017
Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis
added a commit
to spirosdelviniotis/hepcrawl
that referenced
this issue
Aug 21, 2017
* Adds: tests about `scrapy-clawl-once` for FTP and FILE. Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis
added a commit
to spirosdelviniotis/hepcrawl
that referenced
this issue
Aug 21, 2017
* Re-factored: `WSP` spider and `utils` module in order not to check the crawled records. Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis
added a commit
to spirosdelviniotis/hepcrawl
that referenced
this issue
Aug 21, 2017
* Adds: create temporary folder to unzip the crawled files and in `InspireCeleryPushPipeline.close_spider` methods deletes this temporary folder. Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis
added a commit
to spirosdelviniotis/hepcrawl
that referenced
this issue
Aug 21, 2017
* Adds: new default argument to `clean_dir` method is the generated DB folder from `scrapy-crawl-once` plugin. Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
spirosdelviniotis
added a commit
to spirosdelviniotis/hepcrawl
that referenced
this issue
Aug 21, 2017
* Adds: variable in `settings.py` to specify where to store `scrapy-crawl-once` DB. Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro
pushed a commit
to david-caro/hepcrawl
that referenced
this issue
Aug 25, 2017
Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro
pushed a commit
to david-caro/hepcrawl
that referenced
this issue
Aug 25, 2017
* Adds: extends `scrapy-crawl-once` plug-in for supporting custom data-fields in DB per spider. * Adds: enables `scrapy-crawl-once` plug-in. Closes inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro
pushed a commit
to david-caro/hepcrawl
that referenced
this issue
Aug 25, 2017
* Adds: tests about `scrapy-clawl-once` for FTP and FILE. Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro
pushed a commit
to david-caro/hepcrawl
that referenced
this issue
Aug 25, 2017
* Re-factored: `WSP` spider and `utils` module in order not to check the crawled records. Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro
pushed a commit
to david-caro/hepcrawl
that referenced
this issue
Aug 25, 2017
* Adds: create temporary folder to unzip the crawled files and in `InspireCeleryPushPipeline.close_spider` methods deletes this temporary folder. Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro
pushed a commit
to david-caro/hepcrawl
that referenced
this issue
Aug 25, 2017
* Adds: new default argument to `clean_dir` method is the generated DB folder from `scrapy-crawl-once` plugin. Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro
pushed a commit
to david-caro/hepcrawl
that referenced
this issue
Aug 25, 2017
* Adds: variable in `settings.py` to specify where to store `scrapy-crawl-once` DB. Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro
pushed a commit
to spirosdelviniotis/hepcrawl
that referenced
this issue
Sep 20, 2017
* Adds: create temporary folder to unzip the crawled files and in `InspireCeleryPushPipeline.close_spider` methods deletes this temporary folder. Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro
pushed a commit
to spirosdelviniotis/hepcrawl
that referenced
this issue
Sep 20, 2017
* Adds: new default argument to `clean_dir` method is the generated DB folder from `scrapy-crawl-once` plugin. Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro
pushed a commit
to spirosdelviniotis/hepcrawl
that referenced
this issue
Sep 20, 2017
Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro
pushed a commit
to spirosdelviniotis/hepcrawl
that referenced
this issue
Sep 20, 2017
* Adds: variable in `settings.py` to specify where to store `scrapy-crawl-once` DB. Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro
pushed a commit
to spirosdelviniotis/hepcrawl
that referenced
this issue
Sep 20, 2017
* Adds: extends `scrapy-crawl-once` plug-in for supporting custom data-fields in DB per spider. * Adds: enables `scrapy-crawl-once` plug-in. Closes inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
david-caro
pushed a commit
to spirosdelviniotis/hepcrawl
that referenced
this issue
Sep 20, 2017
* Adds: tests about `scrapy-clawl-once` for FTP and FILE. Addresses inspirehep#161 Signed-off-by: Spiros Delviniotis <[email protected]>
We need a way to avoid crawling the same records multiple times.
Expected Behavior
We are going to extend the `scrapy-crawl-once` plug-in.
Current Behavior
Hepcrawl re-crawls records generated from previous executions.
Steps to Reproduce (for bugs)
1. Add the `scrapy-crawl-once` plug-in to Hepcrawl.
2. Extend the `scrapy-crawl-once` plug-in so that it stores a `key-value` record in the DB for every request. The `key` is the unique `file name` (FTP and FILE requests) or the unique `id` in the request parameters (HTTP and HTTPS requests). The `value` is the `last-modified` time stamp (FTP and FILE requests) or the crawling time stamp (HTTP and HTTPS requests).
Context
We are trying to crawl every record only once.
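To make the key-value scheme from the steps above concrete, here is a minimal sketch in plain Python. It is not the Hepcrawl implementation: the function names, the `id_param` default, the use of a plain `dict` in place of the plug-in's DB, and the rule that a file is re-crawled when its last-modified time stamp changes are all assumptions made for illustration.

```python
import posixpath
import time
from urllib.parse import parse_qs, urlparse

FILE_SCHEMES = ("ftp", "file")
HTTP_SCHEMES = ("http", "https")


def crawl_once_key(url, id_param="id"):
    """Return the DB key: file name (FTP/FILE) or id parameter (HTTP/HTTPS)."""
    parsed = urlparse(url)
    if parsed.scheme in FILE_SCHEMES:
        return posixpath.basename(parsed.path)
    if parsed.scheme in HTTP_SCHEMES:
        values = parse_qs(parsed.query).get(id_param)
        if not values:
            raise ValueError("no %r parameter in %s" % (id_param, url))
        return values[0]
    raise ValueError("unsupported scheme in %s" % url)


def crawl_once_value(url, last_modified=None, now=None):
    """Return the DB value: last-modified (FTP/FILE) or crawl time (HTTP/HTTPS)."""
    if urlparse(url).scheme in FILE_SCHEMES:
        if last_modified is None:
            raise ValueError("last_modified is required for FTP/FILE requests")
        return last_modified
    return time.time() if now is None else now


def should_crawl(db, url, last_modified=None):
    """Decide whether a request still needs to be crawled."""
    key = crawl_once_key(url)
    if key not in db:
        return True
    if urlparse(url).scheme in FILE_SCHEMES:
        # Assumption: a changed last-modified time stamp means a new version.
        return db[key] != last_modified
    return False


def record_crawl(db, url, last_modified=None, now=None):
    """Store the key-value record for a request that has been crawled."""
    db[crawl_once_key(url)] = crawl_once_value(url, last_modified, now)
```

A spider would call `should_crawl` before scheduling a request and `record_crawl` once the record has been processed. To our knowledge, `scrapy-crawl-once` lets a request supply its own key and value through the `crawl_once_key` and `crawl_once_value` entries in `request.meta`, which is where values like these would plug in; this should be checked against the plug-in's README.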
Screenshots (if appropriate):