HCF experiments #27

Conversation
…essing is added (they are deleted as soon as all their requests are processed, not when the spider finishes with a success).
* remove HS_FRONTIER and HS_CONSUME_FROM settings;
* use spider name as frontier by default;
* allow overriding of target slot and frontier using Request meta params;
* custom request serialization/deserialization using spider methods;
…ture for symmetry with hs_serialize_request
…ed in idle_spider
Hi, I did a quick review of the code. Instead of getting into the specific details, I have some global comments. I like the idea of reintroducing idle_spider; basically I see 2 main operation modes of the middleware:
I think both modes are useful and the middleware should support both and switch between them with settings. I think we can do this by adding settings to:
About the slot assignment, I've been thinking that the middleware should offer support for the most common cases (and the spider should always be able to override them if needed). I think we can use a policy-based approach where the slot policy is defined by a setting with the following values:
I think it is a good idea to add support for serializing/deserializing requests, and I was thinking of adding something like this, although I am not sure I like the idea of including the check for 'use_hcf' inside the serialize function. I think that logic should be left to the calling code, which will decide whether to call serialize or not; the spider can always decide to set the flag or not at request creation time.

About limiting the number of concurrent batches, I think it is better to do it by number of requests. Currently the number of requests per batch is 100, but this is an artifact of the current implementation and it may change in the future; if we limit by number of requests we won't need to worry about this changing later. What do you think?
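As a rough sketch of what "leaving the check to the calling code" could look like (hypothetical code, not from this PR; only the hs_serialize_request name follows the naming used elsewhere in the thread, the rest is invented for the example):

```python
from scrapy.http import Request

class HcfMiddlewareSketch(object):
    """Illustrative only: the middleware decides whether a request goes
    to HCF, and the spider hook only converts it to a frontier record."""

    def process_spider_output(self, response, result, spider):
        for item in result:
            if isinstance(item, Request) and item.meta.get('use_hcf'):
                # the 'use_hcf' decision lives here, not in the serializer
                self._store_in_frontier(spider.hs_serialize_request(item))
            else:
                yield item

    def _store_in_frontier(self, record):
        # placeholder for the actual HubStorage frontier write
        raise NotImplementedError
```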
Hi, thanks for the good feedback!
What do you think about extracting job rescheduling to a different extension? It could be useful for other spiders, and it would allow removing some code (and configuration options) from HcfMiddleware. I must admit I don't quite understand the existing start_job function though. E.g. why is hs_projectid passed to hsclient.start_job while "dummy" + consume_from_slot are passed to oldpanel_project.schedule? It also seems that the HS_MAX_LINK feature is roughly equivalent to what the CloseSpider middleware provides - is it necessary?

In the current implementation the job is rescheduled to the spider's HS project. But the spider's HS project could be different from the project in which the job must be started. For example, if there are "standard" and "broad" spiders, they will probably live in different scrapy projects (with different settings). If a "standard" spider schedules a request for the "broad" spiders (via HCF), they will probably have to use a single HS project despite being in different scrapy projects (several HS projects are currently unsupported in HcfMiddleware).

Also, please excuse my ignorance - could you explain when job rescheduling is preferred to a long-running spider with intermediate checkpoints? I see the following benefits of intermediate checkpoints: they
It could be argued that with the 'slot' argument all of these policies are already supported :) One way scrapylib.hcf could help is to provide some helper functions for slot generation, e.g.
This may be obvious, but I don't quite understand how the DISTRIBUTED or BROAD policy will work. Who will start the spiders with the appropriate incoming slots (especially in the BROAD case, where there could be hundreds of slots, and HCF doesn't have an API for slot enumeration)? I can also imagine other policies (e.g. put requests of one kind to slot A and requests of another kind to slot B) - to me it looks like slot assignment could be project-specific. I'm arguing for explicit 'slot' passing because of this: users will have to know exactly which slots are used to be able to start their spiders. Without auto-starting spiders, bundling DISTRIBUTED and BROAD policies looks premature; with only a single well-supported policy (DEFAULT) there is no need to go fancy and introduce policies. Could you explain how you currently use HS_NUMBER_OF_SLOTS?
Initially use_hcf was outside request serialization/deserialization. But then I realized that I don't know why we should enforce the use_hcf flag in case of custom serialization. A custom serializer could use callback-based routing, or it could just make HCF the default. For example, a simplistic implementation of transparent serialization with callback support:
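Something along these lines (a minimal sketch written for illustration, not the actual code from the PR; the 'fp'/'qdata' record layout, the routing table and the convention of returning None to skip HCF are assumptions):

```python
from scrapy import Request, Spider

class MySpider(Spider):
    name = 'example'
    # hypothetical routing table: requests for these callbacks go to HCF
    HCF_CALLBACKS = {'parse_category', 'parse_product'}

    def hs_serialize_request(self, request):
        # callback-based routing: no 'use_hcf' flag is needed at request
        # creation time, the callback itself decides the route
        cb = request.callback
        name = cb.__name__ if callable(cb) else cb
        if name not in self.HCF_CALLBACKS:
            return None  # middleware passes the request through untouched
        return {'fp': request.url, 'qdata': {'cb': name}}

    def hs_deserialize_request(self, data):
        return Request(url=data['fp'],
                       callback=getattr(self, data['qdata']['cb']))
```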
So, for greater flexibility, I moved use_hcf into the default serializer instead of hardcoding it somewhere else.
Currently HCF doesn't allow deleting individual requests, it works on batches - that's why the limit is on batches. Say we want to limit the number of requests between checkpoints. If the user sets it to 10 then we can't enforce this (because there may be no batches with <= 10 requests, and we need to process the whole batch to be able to delete it). If the user sets it to 110 we can enforce it (just process a single batch, which is guaranteed to have at most 100 requests). So we would need to validate the number of requests - it must not be less than 100. But this is hardcoding the number "100" again. And what if the user sets a 110-request limit and then HCF is changed to return 200 items per batch? The user-set 110-request limit will become invalid (unlike a 5-batch limit, which stays valid regardless of HCF changes).

The algorithm really works on batches, not on individual requests, and that's why the limit is on batches. You're right that it is not pretty, but the exact value only affects how much data we lose in case of a spider failure, and I think changing it (within a reasonable range) generally doesn't change much. We may implement a
Hi, I am sorry for not getting back to you earlier, I've been swamped with other tasks. (I am posting the reply again; I answered by email and the formatting was all screwed up.)
The disadvantage I see with using the CloseSpider middleware is that it is not aware of the HCF batches being processed, so when it issues the signal to close the spider a batch can be cut in the middle. The existing job rescheduling code needs some work; for example, it does not work in the new panel - it throws some low-level exception that I haven't had time to debug. In any case I think it is useful in some cases and it would be good to keep it as a non-default option. It may be a good idea to move it to its own middleware; this may be part of the refactoring that section needs.
As far as I know the most common scenario is a single project with multiple spiders; if we need different settings, those are set at the spider level (either in code or with parameters). I would be curious in which cases we need multiple projects.
Multiple jobs will have multiple result sets that can be more easily consumed. Right now I have a crawl with the following results pipeline:
With a long-running job it will be harder to do this partial processing, since there is no easy way to know which items were already processed. Also, if we need to stop the crawl it is simply a matter of disabling the job scheduling and letting the current jobs end. With a long-running job we need to send the close signal to the spider, and then it will most likely stop in the middle of a batch.
The slot argument still requires additional work from the developer; I think we should provide slot functions for the most common cases.
A DISTRIBUTED crawl would be the case of a big site that we want to crawl faster than a single spider can. In this case we use HS_NUMBER_OF_SLOTS to define how many slots and spiders will be used, and then we start that number of spiders to crawl the site in parallel. The slot for each request is calculated with the formula:
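Presumably something along these lines (a guess at the formula, written for illustration, not the one actually used): hash the URL and take it modulo the number of slots, so requests of the single big site are spread uniformly.

```python
from hashlib import md5

def distributed_slot(url, n_slots):
    # spread all URLs of the single big site uniformly over n_slots
    return str(int(md5(url.encode('utf-8')).hexdigest(), 16) % n_slots)
```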
A BROAD crawl is when we process many sites at the same time, and therefore need multiple spiders. In this case we also use HS_NUMBER_OF_SLOTS to define the number of slots, and then the slot for each site is calculated with the formula:
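Presumably the domain is hashed instead of the full URL (again a guess, written for illustration), so every URL of a site maps to the same slot:

```python
from hashlib import md5
from urllib.parse import urlparse

def broad_slot(url, n_slots):
    # all URLs of the same site share a slot: hash only the domain
    domain = urlparse(url).netloc
    return str(int(md5(domain.encode('utf-8')).hexdigest(), 16) % n_slots)
```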
This ensures that all the urls from the same site will go to the same slot, while the user still controls the number of slots.
OK, I can go with this.
The way I see MAX_LINKS, it is similar to how the CloseSpider middleware limits work: right now, if we set CLOSESPIDER_ITEMCOUNT=N it does not mean the spider will produce exactly N items, but that it will stop when it reaches N (most likely producing a few more while the close signal is processed). MAX_LINKS is a similar soft limit: if we set it to 1000, it will process batches until it reaches 1000 (but depending on the batch size it may process some more). The main advantage of doing it this way is that we are not coupled to the HCF batch size, which is currently 100 but may change down the road. If the batch size changes and we are counting batches, the actual number of requests processed by the spider will change without the developer or user noticing (this may or may not be an issue). Regards
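A minimal illustration of that soft limit (a hypothetical helper, not code from the middleware): batches are consumed whole, and the counter is only checked between batches, so the total may overshoot by at most one batch.

```python
def iter_links(batches, max_links):
    """Yield requests from HCF batches until at least max_links were seen."""
    seen = 0
    for batch in batches:        # each batch is a list of serialized requests
        for request in batch:
            yield request
        seen += len(batch)
        if seen >= max_links:    # soft limit: checked only per batch
            break
```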
Hi Gilberto, I failed to provide feedback on your last comment, sorry about that! You're right that CloseSpider is not aware of HCF, so it is better to have something HCF-aware (like a limit on the number of processed batches or a soft limit on the number of processed requests). I'm still not convinced it is a good idea to have DISTRIBUTED / BROAD / etc. built in - the implementation in this PR doesn't disallow them, and built-in solutions like these could be added later. I'm also thinking of making Request serialization smarter. This is a utility I used in one of the projects:
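A rough reconstruction of such a utility, based on the description in the next comment (the exact code isn't quoted here, and the 'fp'/'qdata' record layout is an assumption):

```python
from scrapy import Request

def request_to_hcf(request):
    # keep only url, callback name, priority and whitelisted meta keys
    stored_meta = {key: request.meta[key]
                   for key in request.meta.get('hcf_store_meta', [])}
    callback = request.callback.__name__ if request.callback else None
    return {'fp': request.url,
            'qdata': {'cb': callback,
                      'priority': request.priority,
                      'meta': stored_meta}}

def request_from_hcf(data, spider):
    qdata = data.get('qdata', {})
    callback = getattr(spider, qdata['cb']) if qdata.get('cb') else None
    return Request(url=data['fp'], callback=callback,
                   priority=qdata.get('priority', 0),
                   meta=qdata.get('meta', {}))
```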
The idea is the following: serialize the request url, callback and priority automatically, and also serialize all meta fields that are listed in meta['hcf_store_meta']. This way, for many kinds of requests, the HCF middleware can just be turned on/off and things will work exactly the same.
Or you can also use the built-in https://github.com/scrapy/scrapy/blob/master/scrapy/utils/reqser.py
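For reference, a short usage sketch of those helpers (they round-trip the full request, including body and meta, which is the overhead concern raised in the next comment):

```python
from scrapy import Request, Spider
from scrapy.utils.reqser import request_to_dict, request_from_dict

class ExampleSpider(Spider):
    name = 'example'

    def parse_page(self, response):
        pass

spider = ExampleSpider()
req = Request('http://example.com', callback=spider.parse_page, meta={'depth': 2})
d = request_to_dict(req, spider)          # callback is stored by method name
restored = request_from_dict(d, spider)   # resolved back via the spider
```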
@nramirezuy good point, I didn't know about reqser.py. I'm not sure we should use it as-is though, because serializing the full "meta" or the full request body by default seems like too much overhead.
Hi,
I don't know if this pull request should actually be merged (the approach to HcfMiddleware is quite different and it doesn't have all the current features); it is opened more for a code review.
In this pull request Gilberto's HCF code is modified to use a different approach:
* job re-scheduling is removed;
* the number of concurrently processed HCF batches is limited (to 5 by default);
* 'idle_spider' is back, but its purpose is a bit different because of the previous point;
* batches are deleted in 3 cases:
So, this introduces request-level tracking: we fetch a batch and track which requests from it are finished (i.e. the spider has handled them). I had a hard time trying to track all cases (e.g. DNS errors can't be tracked outside a downloader middleware because there is no signal or other indication of a download error in scrapy; the redirect middleware is also tricky) before realizing that, with a limited number of concurrently processed batches, the "spider_idle" signal provides a nice way to do "checkpoints": if the spider enters the "idle" state then even if some requests from the started batches are not marked as "done", they can still be deleted.
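A condensed, hypothetical sketch of that checkpoint idea (not the actual middleware code; the helper methods are placeholders invented for the example):

```python
from scrapy.exceptions import DontCloseSpider

class HcfCheckpointSketch(object):
    """Hypothetical, condensed version of the checkpoint logic."""

    def __init__(self):
        self.processing_batch_ids = []

    def spider_idle(self, spider):
        # The spider is idle and at most HCF_MAX_CONCURRENT_BATCHES batches
        # were started, so every request from them has been handled one way
        # or another; the whole in-flight chunk can be deleted and the next
        # chunk fetched from the frontier.
        if self.processing_batch_ids:
            self._delete_batches(self.processing_batch_ids)
            self.processing_batch_ids = []
        if self._start_next_batches(spider):
            raise DontCloseSpider  # more frontier work arrived, keep running

    def _delete_batches(self, batch_ids):
        # placeholder for the HubStorage frontier delete call
        pass

    def _start_next_batches(self, spider):
        # placeholder: fetch the next batches, remember their ids and
        # schedule their requests in the spider; return the new requests
        return []
```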
Request tracking puts some private garbage into request.meta. I think that now, as we have limited batch concurrency, we could remove request tracking and stop marking batches as completed if the spider doesn't finish normally. This would lead to a larger number of duplicated items (because the whole last chunk of batches would be rescheduled), but as we have intermediate "checkpoints" in idle_spider, the last chunk won't be larger than HCF_MAX_CONCURRENT_BATCHES. Granular tracking could still be desired, so maybe it is better to just make it optional.
There are other API changes:
I don't have a local HubStorage installation, so there are zero new tests, and Gilberto's tests are likely broken now; the code works for me (I tried different things locally), but the first production run using it hasn't finished yet.