
bug: RequestQueue.addRequest's forefront option doesn't work #2669

Closed
1 task done
prakashgp opened this issue Sep 13, 2024 · 3 comments · Fixed by #2689
Labels
bug: Something isn't working.
t-tooling: Issues with this label are in the ownership of the tooling team.

Comments

@prakashgp

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

1. Add a large number of initial URLs to the request queue.
2. Once a URL is scraped, enqueue a second-level scraping request with forefront: true.
3. The second-level requests, derived from the results of the first-level requests, are the ones that push records to the dataset.
4. Due to the bug and the large number of initial requests, the second-level requests are still processed last, even though they were pushed to the front of the queue. As a result, the scraper runs for a long time without adding any records to the dataset.

Code sample

import { BasicCrawler, RequestQueue, Configuration } from 'crawlee'
import { Actor } from 'apify'

const requestQueue = await RequestQueue.open('CUSTOM', {
    config: new Configuration({ persistStorage: false })
})

const crawler = new BasicCrawler({
    requestQueue,
    async requestHandler(context) {
        const { request } = context
        const { url, label } = request
        if (label === 'LEVEL2') {
            await Actor.pushData({ url, label }) // this only runs at the very end of the crawl, even though `forefront` was set to `true`
        } else {
            // enqueue second-level requests; these should go to the front of the queue
            await context.addRequests([{ url: 'https://google.com/', label: 'LEVEL2' }], { forefront: true }) // <<< BUG: always scraped last, after all first-level URLs are done
            // if the number of initial URLs is large, the scraper therefore waits until all of them are scraped before pushing any items to the dataset
        }
    }
}, new Configuration({ persistStorage: false }))

await requestQueue.addRequests([
    // initial URLs to scrape
])

await crawler.run()

Package version

^3.11.3

Node.js version

v20.13.0

Operating system

Mac OS

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

Error: Detected incompatible Crawlee version used by the SDK. User installed 3.11.4-beta.0 but the SDK uses 3.11.3

Other context

No response

@prakashgp prakashgp added the bug Something isn't working. label Sep 13, 2024
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Sep 13, 2024
@barjin
Contributor

barjin commented Sep 17, 2024

Thank you for your report!

It indeed seems that the forefront option does not work as expected with the MemoryStorage implementation of RequestQueue.

As far as I can tell, this is caused by two different problems:

  1. The forefront option is encoded by a negative orderNo value. However, we do not take this into account in listHead and similar methods. This, IMO, is the main reason forefront requests are not handled any differently from regular requests.

for (const storageEntry of queue.requests.values()) {
    if (items.length === limit) {
        break;
    }
    // This is set to null when the request has been handled, so we don't need to re-fetch from fs

  2. The RequestQueue batching behavior - the RequestQueue reads requests in batches of 25 and then processes them. Once the current batch is done, it reads another batch. This works well for the append-only idea of a queue, but it breaks once we add "priority" requests - those aren't added to the current batch but to the "cold" queue (and, due to 1., they are actually just appended to the end of the queue).

I remember @vladfrangu did most of the work on RQv2 - are my assumptions correct here, or did I miss something? To be honest, I'm not sure how to solve this issue, really 🤷🏽 Maybe just add a special forefront store that stores the forefront-enqueued requests, aside from the actual queue? And add a performance warning to the option's documentation?
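
For illustration only, here is a minimal sketch of what addressing point 1 could look like if the head listing sorted pending entries by their orderNo before taking a batch. This is not the actual MemoryStorage code; the `queue.requests` map, `orderNo`, and `limit` names are taken from the excerpt above, and the null check is an assumption based on its comment:

// Hypothetical sketch: collect pending entries and sort them by orderNo,
// so forefront requests (negative orderNo) come out before regular ones.
const pending = [...queue.requests.values()]
    // assumption: orderNo is set to null once the request has been handled
    .filter((entry) => entry.orderNo !== null)
    // negative (forefront) values sort before positive (regular) values
    .sort((a, b) => a.orderNo - b.orderNo);

const items = pending.slice(0, limit);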

@vladfrangu
Member

Oof, nice catch. We'll either have to collect all requests, sort, then list (which sounds super inefficient), store requests sorted in-memory (with something like insertion sort), or split forefront into its own map and go from there
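
As an illustration of the last option (splitting forefront into its own store), a rough sketch follows. None of these names exist in crawlee; it only shows the intended ordering behavior:

// Hypothetical two-tier store: forefront requests are kept separately and
// always drained before the regular, append-only queue.
class TwoTierRequestStore {
    forefrontRequests = []; // added with { forefront: true }
    regularRequests = [];   // everything else, in insertion order

    add(request, { forefront = false } = {}) {
        (forefront ? this.forefrontRequests : this.regularRequests).push(request);
    }

    next() {
        // serve forefront requests first, then fall back to the regular queue
        return this.forefrontRequests.shift() ?? this.regularRequests.shift();
    }
}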

@B4nan
Member

B4nan commented Sep 17, 2024

> Tick me if you encountered this issue on the Apify platform

@prakashgp your code does not contain Actor.init() and Actor.exit() calls, so it will use the default memory storage even on the platform. If you add those, you will use the API instead of memory storage, where the forefront option works just fine.
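
For reference, a minimal sketch of the setup B4nan describes; the exact shape of the reporter's actor is an assumption, but `Actor.init()`, `Actor.exit()`, and `Actor.openRequestQueue()` are the standard Apify SDK calls for switching to the platform's API storage:

import { Actor } from 'apify';
import { BasicCrawler } from 'crawlee';

await Actor.init(); // on the Apify platform this switches storage from memory to the API

const requestQueue = await Actor.openRequestQueue('CUSTOM');
const crawler = new BasicCrawler({
    requestQueue,
    async requestHandler({ request }) {
        // same handler logic as in the reproduction sample above
    },
});

await crawler.run();
await Actor.exit();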

@barjin barjin changed the title forefront option doesn't work when persistStorage is false bug: RequestQueue.addRequest's forefront option doesn't work Sep 18, 2024
barjin added a commit that referenced this issue Oct 1, 2024
…2681)

The `MemoryStorage` queue implementation doesn't respect the `forefront`
enqueue option. This PR fixes and tests this.

This PR is necessary for solving #2669, but not sufficient - the request
batch-reading in `RequestQueue.fetchNextRequest` is the other puzzle
piece.
barjin added a commit that referenced this issue Oct 2, 2024
We neglected the `prolong` and `deleteRequestLock` methods
with #2681 , so these do not respect the `forefront`-enforced request
ordering. This PR fixes this omission.

Prerequisite for #2689 
Related to #2669
@barjin barjin closed this as completed in 03951bd Oct 3, 2024