
bug: RequestQueue.addRequest's forefront option doesn't work #2669

Closed
1 task done
prakashgp opened this issue Sep 13, 2024 · 3 comments · Fixed by #2689
Labels
bug: Something isn't working.
t-tooling: Issues with this label are in the ownership of the tooling team.

Comments

@prakashgp

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

1. Add a large number of initial URLs to the request queue.
2. Once a URL is scraped, enqueue a second-level scraping request with forefront: true.
3. The second-level requests, derived from the results of the first-level requests, are the ones that push records to the dataset.
4. Due to the bug and the large number of initial requests, the second-level requests are still processed last, even though they were pushed to the front of the queue. As a result, the scraper runs for a long time without adding any records to the dataset.

Code sample

import { BasicCrawler, RequestQueue, Configuration } from 'crawlee'
import { Actor } from 'apify'

const requestQueue = await RequestQueue.open('CUSTOM', {
    config: new Configuration({ persistStorage: false })
})

const crawler = new BasicCrawler({
    requestQueue,
    async requestHandler(context) {
        const { request } = context
        const { url, label } = request
        if (label === 'LEVEL2') {
            await Actor.pushData({ url, label }) // this only runs at the very end of the crawl, even though `forefront` was set to `true`
        } else {
            // enqueue second-level requests; these should go to the front of the queue
            await context.addRequests([{ url: 'https://google.com/', label: 'LEVEL2' }], { forefront: true }) // <<< BUG: always scraped last, after all first-level URLs are done
            // if the number of initial URLs is large, the scraper therefore waits until all of them are scraped before pushing any items to the dataset
        }
    }
}, new Configuration({ persistStorage: false }))

await requestQueue.addRequests([
    // initial URLs to scrape
])

await crawler.run()

Package version

^3.11.3

Node.js version

v20.13.0

Operating system

Mac OS

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

Error: Detected incompatible Crawlee version used by the SDK. User installed 3.11.4-beta.0 but the SDK uses 3.11.3

Other context

No response

@prakashgp prakashgp added the bug Something isn't working. label Sep 13, 2024
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Sep 13, 2024
@barjin
Contributor

barjin commented Sep 17, 2024

Thank you for your report!

It indeed seems that the forefront option does not work as expected with the MemoryStorage implementation of RequestQueue.

As far as I can tell, this is caused by two different problems:

  1. The forefront option is encoded by a negative orderNo value. However, we do not take this into account in listHead and similar methods. This, IMO, is the main reason forefront requests are not handled any differently from regular requests.

for (const storageEntry of queue.requests.values()) {
    if (items.length === limit) {
        break;
    }
    // This is set to null when the request has been handled, so we don't need to re-fetch from fs

  2. The RequestQueue batching behavior - the RequestQueue reads requests in batches of 25 and then processes them. Once the current batch is done, it reads another batch. This works well for the append-only idea of a queue, but it breaks once we add "priority" requests - those aren't added to the current batch but to the "cold" queue (and, due to 1., they are actually just appended to the end of the queue).

I remember @vladfrangu did most of the work on RQv2 - are my assumptions correct here, or did I miss something? To be honest, I'm not sure how to solve this issue, really 🤷🏽 Maybe just add a special forefront store that stores the forefront-enqueued requests, aside from the actual queue? And add a performance warning to the option's documentation?
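
For illustration only, here is a minimal sketch of what addressing point 1 could look like if the head listing sorted pending entries by their orderNo before taking a batch. This is not the actual MemoryStorage code; the `queue.requests` map, `orderNo`, and `limit` names are taken from the excerpt above, and the null check is an assumption based on its comment:

// Hypothetical sketch: collect pending entries and sort them by orderNo,
// so forefront requests (negative orderNo) come out before regular ones.
const pending = [...queue.requests.values()]
    // assumption: orderNo is set to null once the request has been handled
    .filter((entry) => entry.orderNo !== null)
    // negative (forefront) values sort before positive (regular) values
    .sort((a, b) => a.orderNo - b.orderNo);

const items = pending.slice(0, limit);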

@vladfrangu
Member

Oof, nice catch. We'll either have to collect all requests, sort, then list (which sounds super inefficient), store requests sorted in-memory (with something like insertion sort), or split forefront into its own map and go from there
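
As an illustration of the last option (splitting forefront into its own store), a rough sketch follows. None of these names exist in crawlee; it only shows the intended ordering behavior:

// Hypothetical two-tier store: forefront requests are kept separately and
// always drained before the regular, append-only queue.
class TwoTierRequestStore {
    forefrontRequests = []; // added with { forefront: true }
    regularRequests = [];   // everything else, in insertion order

    add(request, { forefront = false } = {}) {
        (forefront ? this.forefrontRequests : this.regularRequests).push(request);
    }

    next() {
        // serve forefront requests first, then fall back to the regular queue
        return this.forefrontRequests.shift() ?? this.regularRequests.shift();
    }
}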

@B4nan
Member

B4nan commented Sep 17, 2024

> Tick me if you encountered this issue on the Apify platform

@prakashgp your code does not contain Actor.init() and Actor.exit() calls, so it will use the default memory storage even on the platform. If you add those, you will use the API instead of memory storage, where the forefront option works just fine.
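
For reference, a minimal sketch of the setup B4nan describes; the exact shape of the reporter's actor is an assumption, but `Actor.init()`, `Actor.exit()`, and `Actor.openRequestQueue()` are the standard Apify SDK calls for switching to the platform's API storage:

import { Actor } from 'apify';
import { BasicCrawler } from 'crawlee';

await Actor.init(); // on the Apify platform this switches storage from memory to the API

const requestQueue = await Actor.openRequestQueue('CUSTOM');
const crawler = new BasicCrawler({
    requestQueue,
    async requestHandler({ request }) {
        // same handler logic as in the reproduction sample above
    },
});

await crawler.run();
await Actor.exit();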

@barjin barjin changed the title forefront option doesn't work when persistStorage is false bug: RequestQueue.addRequest's forefront option doesn't work Sep 18, 2024
barjin added a commit that referenced this issue Oct 1, 2024
…2681)

The `MemoryStorage` queue implementation doesn't respect the `forefront`
enqueue option. This PR fixes and tests this.

This PR is necessary for solving #2669, but not sufficient - the request
batch-reading in `RequestQueue.fetchNextRequest` is the other puzzle
piece.
barjin added a commit that referenced this issue Oct 2, 2024
We neglected the `prolong` and `deleteRequestLock` methods
with #2681 , so these do not respect the `forefront`-enforced request
ordering. This PR fixes this omission.

Prerequisite for #2689 
Related to #2669
@barjin barjin closed this as completed in 03951bd Oct 3, 2024