Duplicate requests being dispatched even with RequestDeduplicationMiddleware in place #36

I have a list of URLs in the database and I'm scraping specific information from these URLs. I have split the URLs into batches of 50 and dispatch one job per batch, passing it the database offset to start from. Each job fetches its 50 URLs and the spider starts sending requests: 2 concurrent requests with a 1-second delay.

At some point the spider starts sending duplicate requests, as can be seen below, and the deduplication middleware doesn't report or drop these requests. Not sure what's going on here. Any thoughts?

Comments
Is it possible that multiple instances of the same spider are using the same requests?

---
Are these logs from multiple spider runs, or are they all from the same run? My first guess would be that you are dispatching multiple jobs at the same time and they all query the same records from the database. Can you maybe show what the code that dispatches your jobs looks like?

---
This is how I am dispatching jobs from a console command:

```php
public function handle(): int
{
    for ($offset = 1; $offset <= 1000; $offset = $offset + 50) {
        dispatch(new ScrapeStoreSocialLinksJob($offset));
    }

    return 0;
}
```

Below is what my job looks like:

```php
public $timeout = 300;

public function __construct(public int $offset)
{
}

public function handle()
{
    Roach::startSpider(StoreSocialLinksSpider::class, context: ['offset' => $this->offset]);
}
```

These logs are from different runs, but from the logs I can see that these runs start at the same time and end at the same time. I have even tried to …

---
Can you show what the `initialRequests` method looks like?

---
```php
protected function initialRequests(): array
{
    return ShopifyStore::query()
        ->offset($this->context['offset'])
        ->limit(50)
        ->get()
        ->map(function (ShopifyStore $shopifyStore) {
            $request = new Request(
                'GET',
                'https://' . $shopifyStore->url,
                [$this, 'parse']
            );

            return $request->withMeta('store_id', $shopifyStore->id);
        })
        ->toArray();
}
```

Behaviour I noticed in the logs: …

Below are some stats from the logs: …
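One thing worth noting about this query: `offset()` and `limit()` without an explicit `orderBy()` leave the row order up to the database, so two jobs paginating the same table are not guaranteed to get disjoint slices. A minimal sketch of a deterministic variant, assuming `ShopifyStore` has an auto-incrementing `id` primary key:

```php
// Sketch: an explicit ordering makes OFFSET/LIMIT pagination deterministic,
// so each job sees a distinct, stable slice of the table.
return ShopifyStore::query()
    ->orderBy('id')
    ->offset($this->context['offset'])
    ->limit(50)
    ->get()
    ->map(fn (ShopifyStore $store) => (new Request(
        'GET',
        'https://' . $store->url,
        [$this, 'parse']
    ))->withMeta('store_id', $store->id))
    ->toArray();
```

---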
This may be a silly question, but does your `ShopifyStore` table perhaps contain duplicate URLs?
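Something along these lines should surface them (a rough sketch, assuming the URLs live in a `url` column):

```php
use App\Models\ShopifyStore;

// Group the table by URL and keep only the values that occur more than once.
$duplicates = ShopifyStore::query()
    ->select('url')
    ->groupBy('url')
    ->havingRaw('COUNT(*) > 1')
    ->pluck('url');
```

---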
After your comment I went ahead and checked for duplicates in the table. There were indeed some duplicates; I removed them, but the problem is still happening. Below is my spider's full source code:

```php
<?php

namespace App\Spiders;

use App\Extractors\Stores\AssignCategory;
use App\Extractors\Stores\ExtractContactUsPageLink;
use App\Extractors\Stores\ExtractDescription;
use App\Extractors\Stores\ExtractFacebookProfileLink;
use App\Extractors\Stores\ExtractInstagramProfileLink;
use App\Extractors\Stores\ExtractLinkedInProfileLink;
use App\Extractors\Stores\ExtractTikTokProfileLink;
use App\Extractors\Stores\ExtractTitle;
use App\Extractors\Stores\ExtractTwitterProfileLink;
use App\Models\ShopifyStore;
use App\Processors\SocialLinksDatabaseProcessor;
use Generator;
use Illuminate\Pipeline\Pipeline;
use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;
use RoachPHP\Extensions\LoggerExtension;
use RoachPHP\Extensions\StatsCollectorExtension;
use RoachPHP\Http\Request;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use RoachPHP\Spider\ParseResult;

class StoreSocialLinksSpider extends BasicSpider
{
    public array $startUrls = [
        //
    ];

    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
    ];

    public array $spiderMiddleware = [
        //
    ];

    public array $itemProcessors = [
        // SocialLinksDatabaseProcessor::class,
    ];

    public array $extensions = [
        LoggerExtension::class,
        StatsCollectorExtension::class,
    ];

    public int $concurrency = 2;

    public int $requestDelay = 1;

    /**
     * @return Generator<ParseResult>
     */
    public function parse(Response $response): Generator
    {
        $storeData = [
            'store_id' => $response->getRequest()->getMeta('store_id'),
        ];

        [, $storeData] = app(Pipeline::class)
            ->send([$response, $storeData])
            ->through([
                ExtractTitle::class,
                ExtractDescription::class,
                ExtractTwitterProfileLink::class,
                ExtractFacebookProfileLink::class,
                ExtractInstagramProfileLink::class,
                ExtractTikTokProfileLink::class,
                ExtractLinkedInProfileLink::class,
                ExtractContactUsPageLink::class,
            ])
            ->thenReturn();

        yield $this->item($storeData);
    }

    protected function initialRequests(): array
    {
        return ShopifyStore::query()
            ->offset($this->context['offset'])
            ->limit(50)
            ->get()
            ->map(function (ShopifyStore $shopifyStore) {
                $request = new Request(
                    'GET',
                    'https://' . $shopifyStore->url,
                    [$this, 'parse']
                );

                return $request->withMeta('store_id', $shopifyStore->id);
            })
            ->toArray();
    }
}
```

My thinking here is that something is going on with the spider's instance and the container.

---
So my thinking is that the spiders aren't actually sending duplicate requests, but that the extensions (the Logger and the StatsCollector, specifically) are reacting to events from different spiders. A couple more questions: …

---
Hey @ksassnowski, you are right about the second part. In my … So your thinking about extensions like the Logger and StatsCollector sounds right to me.

---
Just wanted to chime in that I'm experiencing something similar. I have two spiders being executed from a single Laravel Command. Executing one (or the other) results in the StatsCollector outputting the expected results. However, if I execute both spiders, I get a third StatsCollector output that looks like a combination of both. Even if I put a `sleep(5)` between their executions in the Command, the third, cumulative StatsCollector output still occurs...

---
I understand why this happens in your case, @code-poel. Assuming your command looks something like this:

```php
public function handle()
{
    Roach::startSpider(MySpider1::class);
    Roach::startSpider(MySpider2::class);
}
```

This is because the same event dispatcher gets reused for every spider started within the same process, so the extensions registered by one run also react to events fired by the other.
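A minimal sketch of the effect with a bare Symfony EventDispatcher (the event name here is made up; Roach's real event classes differ):

```php
use Symfony\Component\EventDispatcher\EventDispatcher;
use Symfony\Contracts\EventDispatcher\Event;

$dispatcher = new EventDispatcher();

$run1Requests = 0;
$run2Requests = 0;

// "Run 1" registers its stats listener on the shared dispatcher...
$dispatcher->addListener('request.sent', function () use (&$run1Requests): void {
    $run1Requests++;
});

// ...and "run 2" registers another listener on the SAME instance.
$dispatcher->addListener('request.sent', function () use (&$run2Requests): void {
    $run2Requests++;
});

// A single request fired by either spider now reaches BOTH listeners,
// so each run's stats end up including the other run's activity.
$dispatcher->dispatch(new Event(), 'request.sent');

var_dump($run1Requests); // int(1)
var_dump($run2Requests); // int(1) — even though "run 2" never sent a request
```

---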
The solution might be to assign every run a unique ID and include it as part of the event payload. Then I could scope the events and all corresponding handlers to just that ID, even if multiple spiders get started in the same process. I have to check if this can be done without a BC break.
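Roughly sketched, with hypothetical names (this is the shape of the idea, not Roach's actual API), reusing the `$dispatcher` from the sketch above:

```php
// Hypothetical event that carries the id of the run that produced it.
final class RequestSent
{
    public function __construct(
        public readonly string $runId,
        public readonly string $uri,
    ) {}
}

// Every run generates its own id when it starts...
$runId = bin2hex(random_bytes(8));

// ...and its handlers drop events that belong to a different run.
$dispatcher->addListener(RequestSent::class, function (RequestSent $event) use ($runId): void {
    if ($event->runId !== $runId) {
        return; // fired by another spider in the same process
    }

    // Update stats / write logs for this run only.
});
```

---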
Yup, that's exactly right. Thanks for the clarification on the root cause!

---
This bug has existed for more than a year. Why hasn't it been fixed by now?

---
Because no one has opened a PR yet to fix it.

---
Usage environment: roach-php/laravel. I also encountered multiple repeated request records in the log, but I am not sure whether the requests were actually sent multiple times. Also, is there a way to determine whether a request is really being sent repeatedly, or is only being recorded repeatedly in the log?
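One way to tell the two apart is to count sends at the downloader level, before any extension logging. A rough sketch; the interface, trait, and method names follow my reading of roach-php's custom-middleware docs and should be verified against the installed version:

```php
use RoachPHP\Downloader\Middleware\RequestMiddlewareInterface;
use RoachPHP\Http\Request;
use RoachPHP\Support\Configurable;

final class SendCounterMiddleware implements RequestMiddlewareInterface
{
    use Configurable;

    /** @var array<string, int> How often each URI has passed through the downloader. */
    private static array $sent = [];

    public function handleRequest(Request $request): Request
    {
        $uri = (string) $request->getUri();
        self::$sent[$uri] = (self::$sent[$uri] ?? 0) + 1;

        if (self::$sent[$uri] > 1) {
            // Only reached when the request is genuinely dispatched again,
            // not when a cross-run extension merely logs it twice.
            error_log(sprintf('%s sent %d times', $uri, self::$sent[$uri]));
        }

        return $request;
    }
}
```

Registering it in the spider's `$downloaderMiddleware` next to `RequestDeduplicationMiddleware` would show whether duplicates actually reach the downloader or only appear in the extension logs.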