
HCF experiments #27

Open · wants to merge 14 commits into master

Conversation

@kmike (Member) commented Jun 10, 2013

Hi,

I don't know if this pull request should actually be merged (the approach to HCFMiddleware is quite different and it doesn't have all the current features); it is opened more for a code review.

In this pull request Gilberto's HCF code is modified to use a different approach:

  • job re-scheduling is removed;

  • the number of concurrently processed HCF batches is limited (to 5 by default);

  • 'idle_spider' is back, but its purpose is a bit different because of the previous point;

  • batches are deleted in 3 cases:

    1. when the spider finishes normally;
    2. when the spider has finished handling requests from all currently scheduled batches and enters the 'idle' state;
    3. when the spider finishes uncleanly and all requests from a batch have been handled (batches with remaining requests are not deleted).

So, this introduces request-level tracking. We fetch a batch and track which of its requests are finished (i.e. the spider has handled them). I had a hard time trying to track all cases (e.g. DNS errors can't be tracked outside a downloader middleware because Scrapy has no signal or other indication of a download error; the redirect middleware is also tricky) before realizing that, with a limited number of concurrently processed batches, the "spider_idle" signal provides a nice way to do "checkpoints": if the spider has entered the "idle" state, then even batches with some requests not marked as "done" can still be deleted.
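
To make the checkpoint idea concrete, here is a minimal sketch of an idle handler (not the actual code from this PR; frontier.read_batches / frontier.delete_batch and started_batches are hypothetical helpers that stand in for the real HCF calls and bookkeeping):

from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class HcfCheckpointSketch(object):
    """Sketch of the idle-based checkpoint; not the middleware from this PR."""

    def __init__(self, crawler, frontier, max_batches=5):
        self.crawler = crawler
        self.frontier = frontier          # hypothetical thin wrapper around HCF
        self.max_batches = max_batches    # HCF_MAX_CONCURRENT_BATCHES
        self.started_batches = set()
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

    def spider_idle(self, spider):
        # The spider is idle, so every request from the started batches has been
        # handled; the batches can be deleted even if some requests were never
        # explicitly marked as "done".
        for batch_id in list(self.started_batches):
            self.frontier.delete_batch(batch_id)      # hypothetical helper
            self.started_batches.discard(batch_id)

        # Schedule the next chunk of batches and keep the spider alive.
        scheduled = False
        for batch_id, requests in self.frontier.read_batches(self.max_batches):
            self.started_batches.add(batch_id)
            for request in requests:
                # old engine.crawl(request, spider) signature of that era
                self.crawler.engine.crawl(request, spider)
            scheduled = True
        if scheduled:
            raise DontCloseSpider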

Request tracking puts some private garbage into request.meta. I think that, now that we have limited batch concurrency, we could remove request tracking and stop marking batches as completed when the spider doesn't finish normally. This would lead to a larger number of duplicated items (because the whole last chunk of batches would be rescheduled), but since we have intermediate "checkpoints" in idle_spider, the last chunk won't be larger than HCF_MAX_CONCURRENT_BATCHES. Granular tracking could still be desired, so maybe it is better to just make it optional.

There are other API changes:

  • target slot and frontier are now set via Request meta arguments (if missing, they default to the spider's incoming slot/frontier) - see the example after this list;
  • incoming slot and frontier are set via spider attributes; by default the input and output frontier is the spider name, and the slot is just '0';
  • with the previous 2 changes the global HS_SLOT and HS_FRONTIER options are no longer needed (and they are removed);
  • I removed HS_NUMBER_OF_SLOTS support for now because it can be easily implemented in user code (by passing different slot values as Request arguments); slot naming could be project-specific, and there is no built-in support for multiple input slots anyway;
  • spiders can implement custom Request serialization/deserialization (so non-url fingerprints and custom callbacks can be supported in user code, and the use_hcf option is not required).
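
For example, a single request could be routed to another slot and frontier roughly like this (a sketch: the 'slot' key also appears in the snippets below, while the 'frontier' key is assumed to follow the same pattern):

    # Sketch only; 'frontier' inside hcf_params is an assumption based on the
    # description above, the 'slot' key is used in the snippets further down.
    yield Request(url, meta={'hcf_params': {'slot': '3', 'frontier': 'broad-spider'}})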

I don't have a local HubStorage installation, so there are zero new tests, and Gilberto's tests are likely broken now; the code works for me (I tried various things locally), but its first production run hasn't finished yet.

kmike added 13 commits June 6, 2013 06:44
…essing is added (they are deleted as soon as all their requests are processed, not when the spider finishes with a success).
* remove HS_FRONTIER and HS_CONSUME_FROM settings;
* use spider name as frontier by default;
* allow overriding of target slot and frontier using Request meta params;
* custom request serialization/deserialization using spider methods;
@andresp99999 (Contributor)

Hi,

I did a quick review of the code rather than getting into the specific details; I have some global comments:

I like the idea of reintroducing idle_spider. Basically I see 2 main operation modes for the middleware:

  • Lots of small jobs, where every job processes a fixed set of batches and each job is a checkpoint. This is how the current code works.
  • Long-running jobs, where each job has multiple checkpoints based on idle_spider. This is the approach of this PR.

I think both modes are useful and the middleware should support both and switch between them with settings. I think we can do this by adding settings to:

  • Enable/Disable idle based checkpoint.
  • Enable/Disable individual request tracking.
  • Enable/Disable job rescheduling.

About slot assignment: I've been thinking, and I believe the middleware should offer support for the most common cases (and the spider should always be able to override them if needed). I think we can use a policy-based approach where the slot policy is defined by a setting with the following values:

  • DEFAULT. This policy will have the spider writing to a single slot. It can be used in the case where we just need the persistence provided by the HCF without needing to use multiple slots for scaling.

  • DISTRIBUTED. This policy will have the spider crawling a single domain but writing to multiple slots (number of slots controlled by a setting). This is the case when we are crawling a high performance site and we want to speed up the crawl with multiple instances of the spider. The slots will be assigned using a hash of the url/fingerprint.

  • BROAD. This policy will have the spider crawling from multiple domains and writing to multiple slots (number of slots controlled by a setting). The slot will be assigned based on the domain, so all the urls from a single domain end up in the same slot. This is useful for broad crawls.

    Note that in all cases a spider always reads from a single slot. I also see the following override hierarchy for the slot:

       1. Policy
       2. Spider instance variable
       3. Request meta key
    

    For the frontier I like the idea of using the spider name by default; then the frontier will be defined by:

     1. Spider name
     2. Spider instance variable
     3. Request meta key.
    

I think it is a good idea to add support for serializing/deserializing requests. I was thinking of adding something like this, although I am not sure I like the idea of including the check for 'use_hcf' inside the serialize function. I think that logic should be left in the calling code, which will decide whether to call serialize or not. The spider can always decide to set it or not at request creation time.

About limiting the number of concurrently processed batches: I think it is better to do it by number of requests. Currently the number of requests per batch is 100, but this is an artifact of the current implementation and it may change in the future; if we limit by number of requests we won't need to worry about this changing later.

What do you think?

@kmike (Member, Author) commented Jun 14, 2013

Hi,

Thanks for the good feedback!

I like the idea of reintroducing idle_spider. Basically I see 2 main operation modes for the middleware:

Lots of small jobs, where every job processes a fixed set of batches and each job is a checkpoint. This is how the current code works.

Long-running jobs, where each job has multiple checkpoints based on idle_spider. This is the approach of this PR.

I think both modes are useful and the middleware should support both and switch between them with settings. I think we can do this by adding settings to:

Enable/Disable idle based checkpoint.
Enable/Disable individual request tracking.
Enable/Disable job rescheduling.

What do you think about extracting job rescheduling into a different extension? It could be useful for other spiders, and it should allow removing some code (and configuration options) from HcfMiddleware. I must admit I don't quite understand the existing start_job function though. E.g. why is hs_projectid passed to hsclient.start_job while "dummy" + consume_from_slot are passed to oldpanel_project.schedule? It also seems that the HS_MAX_LINK feature is roughly equivalent to the features of the CloseSpider middleware - is it necessary?

In the current implementation the job is rescheduled in the spider's HS project. But the spider's HS project could be different from the project in which the job must be started. For example, if there are "standard" and "broad" spiders, they will probably live in different scrapy projects (with different settings). If a "standard" spider schedules a request for "broad" spiders (via HCF), they will probably use a single HS project despite being in different scrapy projects (several HS projects are currently unsupported in HcfMiddleware).

Also, please excuse my ignorance - could you please explain when job rescheduling is preferred to a long-running spider with intermediate checkpoints? I see the following benefits of intermediate checkpoints: they

  • look more efficient (there is no spider reloading);

  • allow in-process memory usage;

  • don't have the hs_project issue (described above);
  • don't require Panel to work;
  • provide combined resulting data and combined stats without extra work;
  • allow batches to be deleted more often.

About slot assignment: I've been thinking, and I believe the middleware should offer support for the most common cases (and the spider should always be able to override them if needed). I think we can use a policy-based approach where the slot policy is defined by a setting with the following values:

DEFAULT. This policy will have the spider writing to a single slot. It can be used in the case where we just need the persistence provided by the HCF without needing to use multiple slots for scaling.

DISTRIBUTED. This policy will have the spider crawling a single domain but writing to multiple slots (number of slots controlled by a setting). This is the case when we are crawling a high performance site and we want to speed up the crawl with multiple instances of the spider. The slots will be assigned using a hash of the url/fingerprint.

BROAD. This policy will have the spider crawling from multiple domains and writing to multiple slots (number of slots controlled by a setting). The slot will be assigned based on the domain, so all the urls from a single domain end up in the same slot. This is useful for broad crawls.

It could be argued that with the 'slot' argument all of these policies are already supported :) One way scrapylib.hcf could help is simply to provide some helper functions for slot generation, e.g.

from scrapylib.hcf import distributed_slot
# ...
    slot = distributed_slot(url, self.HS_SLOT_COUNT)
    yield Request(url, meta={'hcf_params': {'slot': slot}})
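
Such a helper doesn't exist in scrapylib yet; a possible sketch, using a stable hash like md5 instead of the built-in hash() so that every spider process computes the same slot for a given url (Python 2 style, as used at the time):

import hashlib

def distributed_slot(url, n_slots):
    """Map a url to one of n_slots slots using a stable hash of the url."""
    digest = hashlib.md5(url).hexdigest()   # url is a byte string on Python 2
    return str(int(digest, 16) % n_slots)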

Note that in all cases a spider always reads from a single slot.

This may be obvious, but I don't quite understand how the DISTRIBUTED or BROAD policy would work. Who will start the spiders with the appropriate incoming slots (especially in the BROAD case, when there could be hundreds of slots, and HCF doesn't have an API for slot enumeration)? I can also imagine other policies (e.g. put requests of one kind into slot A and requests of another kind into slot B) - to me it looks like slot assignment could be project-specific. I'm arguing for explicit 'slot' passing because of this. Users will have to know exactly which slots are used to be able to start their spiders. Without auto-starting spiders, bundling the DISTRIBUTED and BROAD policies looks premature; with only a single policy (DEFAULT, which is supported well) there is no need to go fancy and introduce policies. Could you explain how you currently use HS_NUMBER_OF_SLOTS?

I think it is a good idea to add support for serializing/deserializing requests. I was thinking of adding something like this, although I am not sure I like the idea of including the check for 'use_hcf' inside the serialize function. I think that logic should be left in the calling code, which will decide whether to call serialize or not. The spider can always decide to set it or not at request creation time.

Initially use_hcf was outside of request serialization/deserialization. But then I realized that I don't know why we should enforce the use_hcf flag in the case of custom serialization. A custom serializer could use callback-based routing, or it could just make HCF the default. For example, a simplistic implementation of transparent serialization with callback support:

    def hcf_serialize_request(self, request):
        return {
            'fp': request.url,
            'qdata': {'callback_name': request.callback.__name__}
        }

    def hcf_deserialize_request(self, hcf_params, batch_id):
        callback = getattr(self, hcf_params['qdata']['callback_name'])
        return Request(hcf_params['fp'], callback)

Also, the use_hcf flag could be misleading if the middleware is not active. If a spider implements custom serialization, then it is not as if it started using HCF unintentionally (which is what the default use_hcf = False flag prevents).

So for greater flexibility I moved use_hcf into the default serializer instead of hardcoding it somewhere else.

About limiting the number of concurrently processed batches: I think it is better to do it by number of requests. Currently the number of requests per batch is 100, but this is an artifact of the current implementation and it may change in the future; if we limit by number of requests we won't need to worry about this changing later.

Currently HCF doesn't allow deleting individual requests, it works on batches - that's why the limit is on batches. Say we want to limit the number of requests between checkpoints. If the user sets it to 10, we can't enforce this (there may be no batches with <= 10 requests, and we need to process a whole batch to be able to delete it). If the user sets it to 110, we can enforce it (just process a single batch, which is guaranteed to have at most 100 requests). So we would need to validate the user-set number of requests - it must not be less than 100. But this is hardcoding the number "100" again. And what if the user sets a 110-request limit and then HCF is changed to return 200 requests per batch? The user-set 110-request limit would become invalid (unlike a 5-batch limit, which will always be valid regardless of HCF changes).

The algorithm really works on batches, not on individual requests - that's why the limit is on batches. You're right that it is not pretty, but the exact value only affects how much data we lose in case of spider failure, and I think changing it (within a reasonable range) generally doesn't change much. We could implement an HCF_MIN_CONCURRENT_REQUESTS setting that would serve as a hint for HCF_MAX_CONCURRENT_BATCHES, but I'm not sure it's worth it, and the name HCF_MIN_CONCURRENT_REQUESTS would be quite confusing (despite being correct).
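
For what it's worth, such a hint could be translated into a batch count along these lines (a sketch only - neither setting exists in the code, and the batch size of 100 is just the current HCF behaviour):

import math

def batches_for_request_hint(min_requests, batch_size=100):
    """Smallest number of whole batches covering a HCF_MIN_CONCURRENT_REQUESTS hint."""
    return max(1, int(math.ceil(min_requests / float(batch_size))))

# e.g. batches_for_request_hint(110) -> 2 batches of up to 100 requests each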

@andresp99999 (Contributor)

Hi,

I am sorry for not getting back to you before; I've been swamped with other tasks. (I am posting the reply again - I answered by email and the formatting was all screwed up.)

What do you think about extracting job rescheduling into a different extension? It could be useful for other spiders, and it should allow removing some code (and configuration options) from HcfMiddleware. I must admit I don't quite understand the existing start_job function though. E.g. why is hs_projectid passed to hsclient.start_job while "dummy" + consume_from_slot are passed to oldpanel_project.schedule? It also seems that the HS_MAX_LINK feature is roughly equivalent to the features of the CloseSpider middleware - is it necessary?

The disadvantage I see with using the CloseSpider middleware is that it is not aware of the HCF batches being processed, so when it issues the signal to close the spider a batch can be cut off in the middle.

The existing job rescheduling code needs some work; for example, it does not work in the new panel - it throws some low-level exception that I haven't had time to debug. In any case I think it is useful in some cases, and it would be good to keep it as a non-default option. It may be a good idea to move it to its own middleware; this may be part of the refactoring that section needs.

In the current implementation the job is rescheduled in the spider's HS project. But the spider's HS project could be different from the project in which the job must be started. For example, if there are "standard" and "broad" spiders, they will probably live in different scrapy projects (with different settings). If a "standard" spider schedules a request for "broad" spiders (via HCF), they will probably use a single HS project despite being in different scrapy projects (several HS projects are currently unsupported in HcfMiddleware).

As far as I know the most common scenario is a single project with multiple spiders; if we need different settings, those are set at the spider level (either in code or with parameters). I would be curious in which cases we need to have multiple projects.

Also, please excuse my ignorance - could you please explain when job rescheduling is preferred to a long-running spider with intermediate checkpoints? I see the following benefits of intermediate checkpoints: they

  • look more efficient (there is no spider reloading);

  • allow in-process memory usage;

  • don't have the hs_project issue (described above);
  • don't require Panel to work;
  • provide combined resulting data and combined stats without extra work;
  • allow batches to be deleted more often.

Multiple jobs will have multiple result sets that can be more easily consumed. Right now I have a crawl with the following results pipeline:

  1. Each job produces a separate results file.
  2. A script consumes all the results files and prepares a partial delivery to the customer.
  3. The consumed jobs are tagged so they are not consumed again.

With a long-running job it will be harder to do this partial processing, since there is no easy way to know which items were already processed. Also, if we need to stop the crawl, it is simply a matter of disabling the job scheduling and letting the current jobs end. With a long-running job we need to send the close signal to the spider, but then it will most likely stop in the middle of a batch.

About slot assignment: I've been thinking, and I believe the middleware should offer support for the most common cases (and the spider should always be able to override them if needed). I think we can use a policy-based approach where the slot policy is defined by a setting with the following values:

DEFAULT. This policy will have the spider writing to a single slot. It can be used in the case where we just need the persistence provided by the HCF without needing to use multiple slots for scaling.

DISTRIBUTED. This policy will have the spider crawling a single domain but writing to multiple slots (number of slots controlled by a setting). This is the case when we are crawling a high performance site and we want to speed up the crawl with multiple instances of the spider. The slots will be assigned using a hash of the url/fingerprint.

BROAD. This policy will have the spider crawling from multiple domains and writing to multiple slots (number of slots controlled by a setting). The slot will be assigned based on the domain, so all the urls from a single domain end up in the same slot. This is useful for broad crawls.

It could be argued that with the 'slot' argument all of these policies are already supported :) One way scrapylib.hcf could help is simply to provide some helper functions for slot generation, e.g.

from scrapylib.hcf import distributed_slot
# ...
    slot = distributed_slot(url, self.HS_SLOT_COUNT)
    yield Request(url, meta={'hcf_params': {'slot': slot}})

Note that in all cases a spider always reads from a single slot.

The slot argument still requires additional work from the developer; I think we should provide slot functions for the most common cases.

This may be obvious, but I don't quite understand how the DISTRIBUTED or BROAD policy would work. Who will start the spiders with the appropriate incoming slots (especially in the BROAD case, when there could be hundreds of slots, and HCF doesn't have an API for slot enumeration)? I can also imagine other policies (e.g. put requests of one kind into slot A and requests of another kind into slot B) - to me it looks like slot assignment could be project-specific. I'm arguing for explicit 'slot' passing because of this. Users will have to know exactly which slots are used to be able to start their spiders. Without auto-starting spiders, bundling the DISTRIBUTED and BROAD policies looks premature; with only a single policy (DEFAULT, which is supported well) there is no need to go fancy and introduce policies. Could you explain how you currently use HS_NUMBER_OF_SLOTS?

A DISTRIBUTED crawl would be the case of a big site that we want to crawl faster than a single spider can. In this case we use HS_NUMBER_OF_SLOTS to define how many slots and spiders will be used. Then we start that number of spiders to crawl the site in parallel. The slot for each url is calculated with the formula:

   slot = hash(url) % HS_NUMBER_OF_SLOTS

A BROAD crawl is when we process many sites at the same time and therefore need to use multiple spiders. In this case we also use HS_NUMBER_OF_SLOTS to define the number of slots, and the slot for each site is calculated with the formula:

    slot = hash(domain) % HS_NUMBER_OF_SLOTS

This ensures that all the urls from the same site go to the same slot, while the user still controls the number of slots.
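
As an illustration, the BROAD formula could be implemented with a hypothetical helper like this (not part of the middleware; a stable hash keeps the assignment consistent across spider processes, and the Python 2 import matches the era):

import hashlib
from urlparse import urlparse   # Python 2; urllib.parse on Python 3

def broad_slot(url, n_slots):
    """Assign a slot from the url's domain, so one domain always maps to one slot."""
    domain = urlparse(url).netloc.lower()
    return str(int(hashlib.md5(domain).hexdigest(), 16) % n_slots)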

I think it is a good idea to add support for serializing/deserializing requests. I was thinking of adding something like this, although I am not sure I like the idea of including the check for 'use_hcf' inside the serialize function. I think that logic should be left in the calling code, which will decide whether to call serialize or not. The spider can always decide to set it or not at request creation time.

Initially use_hcf was outside of request serialization/deserialization. But then I realized that I don't know why we should enforce the use_hcf flag in the case of custom serialization. A custom serializer could use callback-based routing, or it could just make HCF the default. For example, a simplistic implementation of transparent serialization with callback support:

    def hcf_serialize_request(self, request):
        return {
            'fp': request.url,
            'qdata': {'callback_name': request.callback.__name__}
        }

    def hcf_deserialize_request(self, hcf_params, batch_id):
        callback = getattr(self, hcf_params['qdata']['callback_name'])
        return Request(hcf_params['fp'], callback)

Also, the use_hcf flag could be misleading if the middleware is not active. If a spider implements custom serialization, then it is not as if it started using HCF unintentionally (which is what the default use_hcf = False flag prevents).

So for greater flexibility I moved use_hcf into the default serializer instead of hardcoding it somewhere else.

OK, I can go with this.

About limiting the number of concurrently processed batches: I think it is better to do it by number of requests. Currently the number of requests per batch is 100, but this is an artifact of the current implementation and it may change in the future; if we limit by number of requests we won't need to worry about this changing later.

Currently HCF doesn't allow deleting individual requests, it works on batches - that's why the limit is on batches. Say we want to limit the number of requests between checkpoints. If the user sets it to 10, we can't enforce this (there may be no batches with <= 10 requests, and we need to process a whole batch to be able to delete it). If the user sets it to 110, we can enforce it (just process a single batch, which is guaranteed to have at most 100 requests). So we would need to validate the user-set number of requests - it must not be less than 100. But this is hardcoding the number "100" again. And what if the user sets a 110-request limit and then HCF is changed to return 200 requests per batch? The user-set 110-request limit would become invalid (unlike a 5-batch limit, which will always be valid regardless of HCF changes).

The algorithm really works on batches, not on individual requests - that's why the limit is on batches. You're right that it is not pretty, but the exact value only affects how much data we lose in case of spider failure, and I think changing it (within a reasonable range) generally doesn't change much. We could implement an HCF_MIN_CONCURRENT_REQUESTS setting that would serve as a hint for HCF_MAX_CONCURRENT_BATCHES, but I'm not sure it's worth it, and the name HCF_MIN_CONCURRENT_REQUESTS would be quite confusing (despite being correct).

The way I see it, MAX_LINKS is similar to how the CloseSpider middleware limits work: right now, if we set CLOSESPIDER_ITEMCOUNT=N, it does not mean the spider will produce exactly N items, but that it will stop when it reaches N (though it will most likely produce a few more while the close signal is processed).

MAX_LINKS is a similar soft limit: if we set it to 1000, the spider will process batches until it reaches 1000 requests (though depending on the batch size it may process a few more). The main advantage of doing it this way is that we are not coupled to the HCF batch size, which is currently 100 but may change down the road. If the batch size changes and we are counting batches, the actual number of requests processed by the spider will change without the developer or user noticing (this may or may not be an issue).
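
A sketch of how such a soft limit could behave (illustration only; MAX_LINKS is the only name taken from this discussion, and the batch reader is a hypothetical generator):

def read_until_max_links(batch_reader, max_links):
    """Consume whole (batch_id, requests) pairs until max_links is reached."""
    processed = 0
    for batch_id, requests in batch_reader:
        yield batch_id, requests
        processed += len(requests)
        if processed >= max_links:
            break   # soft limit: the current batch was still processed in full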

Regards

@kmike (Member, Author) commented Aug 19, 2013

Hi Gilberto,

I failed to provide feedback on your last comment, sorry about that!

You're right that CloseSpider is not aware of HCF, so it is better to have something that is HCF-aware (like a limit on the number of processed batches or a soft limit on the number of processed requests).

I'm still not convinced it is a good idea to have DISTRIBUTED / BROAD / etc. built in - the implementation in this PR doesn't disallow them, and built-in solutions like slot = hash(url) % HS_NUMBER_OF_SLOTS don't help much: to make use of them, the library user must read and understand the middleware's code and start all the necessary spiders themselves. IMHO writing the slot = hash(url) % HS_NUMBER_OF_SLOTS logic manually is easy enough.

I'm also thinking of making Request serialization smarter. This is a utility I used in one of the projects:

from scrapy.http import Request


def hcf_smart_serialize(spider, request):
    if not request.callback or not hasattr(spider, request.callback.__name__):
        raise ValueError("Unsupported callback")

    callback_name = request.callback.__name__
    fp = request.meta.get('hcf_fp', request.url)
    meta_keys = request.meta.get('hcf_store_meta', [])
    meta = dict((k, request.meta[k]) for k in meta_keys if k in request.meta)
    res = {
        'fp': fp,
        'qdata': {
            'callback_name': callback_name,
            'request_meta': meta,
        }
    }
    if fp != request.url:
        res['qdata']['request_url'] = request.url
    if request.priority:
        res['qdata']['request_priority'] = request.priority

    return res


def hcf_smart_deserialize(spider, hcf_params):
    qdata = hcf_params['qdata']
    callback = getattr(spider, qdata['callback_name'])
    fp = hcf_params['fp']
    priority = qdata.get('request_priority', 0)
    meta = qdata['request_meta']
    if meta:
        meta['hcf_store_meta'] = list(meta.keys())

    url = qdata.get('request_url', fp)
    if url != fp:
        meta['hcf_fp'] = fp

    return Request(url, callback, meta=meta, priority=priority)

The idea is the following: serialize the request url, callback and priority automatically, and also serialize all meta fields that are listed in meta['hcf_store_meta']. This way, for many kinds of requests, the HCF middleware can simply be turned on or off and things will work exactly the same.
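
A usage sketch for the utility above (parse_item and the 'category' key are made up for the example):

def parse(self, response):
    # 'category' is copied into qdata by hcf_smart_serialize because it is
    # listed in 'hcf_store_meta', and restored into request.meta on the way back.
    yield Request(
        'http://example.com/item/1',
        callback=self.parse_item,
        priority=10,
        meta={'category': 'books', 'hcf_store_meta': ['category']},
    )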

@nramirezuy (Contributor)

@kmike (Member, Author) commented Aug 19, 2013

@nramirezuy good point, I didn't know about reqser.py. I'm not sure we should use it as-is though, because it seems that serializing the full "meta" or the full request body by default can be too much overhead.
