You can't run multiple spiders #249

Open
blackhood5678 opened this issue May 24, 2024 · 3 comments
blackhood5678 commented May 24, 2024

If you have built multiple spiders and try to run them together, it creates concurrency issues.

Assume you have a list of spider classes:

    foreach ($spiders as $spider) {
        Roach::startSpider($spider);
    }

I would expect each spider to be run once; however, the first spider is run once, the second 2 times, the 3rd spider 3 times, and so on.

Maybe I'm doing something stupid, or maybe this isn't how I'm supposed to run multiple spiders. I don't know what the issue is, but I've been looking at the code and I have a suspicion it has to do with how the engine starts a new run, but I'm not sure.

Package versions

  • core: [3.0.0]
@blackhood5678 blackhood5678 added the bug Something isn't working label May 24, 2024
claytongray commented Jun 27, 2024

I have the same issue. I'm running 1 spider inside of a foreach loop, and because of that I see multiple, duplicate requests being made. By the 10th loop I have 10 spiders running, and each of those spiders seems to have a number of startUrls matching its index: on the 10th loop, spider 1 of 10 requests the link once, spider 2 of 10 requests the link twice, the third spider requests the link 3 times, etc. Even the RequestDeduplicationMiddleware doesn't seem to do anything.

I noticed that if I start 2 different Spider classes, even with 2 separate sets of URLs, multiple requests are made. So it seems that every time Roach::startSpider() is called, a new spider is created, but it still listens to any overrides, such as startUrls.

joelmellon commented Sep 1, 2024

Duplicate/related: #36

A workaround is to run spider jobs separately via Laravel queues or another way of "forking" into different PHP processes, e.g. something like the sketch below.
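
A minimal sketch of the queue approach, assuming Laravel plus roach-php/core; the RunSpiderJob class and its $spiderClass property are made up for illustration. Note this only isolates runs if each job actually executes in its own PHP process (e.g. queue:listen, or queue:work with --max-jobs=1); a long-lived worker would still share state between runs.

    <?php

    namespace App\Jobs;

    use Illuminate\Bus\Queueable;
    use Illuminate\Contracts\Queue\ShouldQueue;
    use Illuminate\Foundation\Bus\Dispatchable;
    use Illuminate\Queue\InteractsWithQueue;
    use RoachPHP\Roach;

    // Hypothetical job: runs a single spider per queued job so that
    // each Roach run lives in its own worker process.
    class RunSpiderJob implements ShouldQueue
    {
        use Dispatchable, InteractsWithQueue, Queueable;

        public function __construct(private string $spiderClass)
        {
        }

        public function handle(): void
        {
            Roach::startSpider($this->spiderClass);
        }
    }

Then dispatch one job per spider instead of looping over startSpider() directly:

    foreach ($spiders as $spider) {
        RunSpiderJob::dispatch($spider);
    }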

@mattheobjornson

A quick-and-dirty workaround/hack I've used:

  1. patch Roach.php (add function to Roach class):
    public static function killSpider(): void
    {
        // Drop the shared container so the next startSpider() call
        // builds a fresh one instead of reusing accumulated state.
        self::$container = null;
    }
  2. Then after each spider run, call it:
    Roach::startSpider(FirstSpider::class);
    Roach::killSpider();
    Roach::startSpider(SecondSpider::class);
    Roach::killSpider();
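
Presumably this works because Roach keeps that container in a static property across startSpider() calls, so state from earlier runs carries over until it is reset; that's an inference from the symptoms above, not a confirmed root cause.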
