Use addLink in behaviors to crawl additional pages without scope limitation #71

Open
cmillet2127 opened this issue May 8, 2024 · 0 comments
Labels
enhancement New feature or request


I am trying to crawl subpages linked from a main page, selecting them with an XPath expression.

I can't navigate to the additional pages by setting window.location.href, because that throws "Execution context was destroyed", so I am trying to use ctx.Lib.addLink instead.
After reading the browsertrix-crawler code, it seems the addLink callback is not set in my case. It also seems that, when addLink is set, the links it adds are still restricted by the scopeType.

URL to crawl: https://group.bnpparibas/toutes-actualites/communique-de-presse

Behavior to crawl the additional pages (the first 8 articles):

```js
class BnpCommuniquesdePresseBehavior {
  static id = "BnpCommuniquesdePresse";

  static init() {
    return {
      state: { links: 0 },
      opts: {},
    };
  }

  // Run only on the press-release listing page.
  static isMatch() {
    return window.location.href === "https://group.bnpparibas/toutes-actualites/communique-de-presse";
  }

  async *run(ctx) {
    const { getState, xpathNodes, addLink } = ctx.Lib;

    yield getState(ctx, "BnpCommuniquesdePresseBehavior starting...");

    // Anchors of the first 8 articles in the listing.
    const aTags = Array.from(
      xpathNodes("//main//div//div//div//div//div//ul/li[position() <= 8]/article/a")
    );

    if (aTags.length) {
      yield getState(ctx, aTags.length + " hrefs found");
      // aTags is a plain array, so a regular for...of loop is enough.
      for (const aTag of aTags) {
        await addLink(aTag.href);
        // The third argument names the state counter to increment.
        yield getState(ctx, "Add a link to crawl: " + aTag.href, "links");
      }
    } else {
      yield getState(ctx, "no link found");
    }

    yield getState(ctx, "BnpCommuniquesdePresseBehavior done");
  }
}
```
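
To narrow down whether the addLink callback is actually wired up, the queueing step could be wrapped so that failures are reported instead of passing silently. This is only a sketch: `queueLinks` is a hypothetical helper, not part of browsertrix-behaviors, and what an unset callback really does at runtime (undefined, throwing, or silently doing nothing) is an assumption I have not verified.

```javascript
// Hypothetical helper, not part of browsertrix-behaviors.
// `addLink` is passed in as a parameter so the function can also be
// exercised outside the crawler with a stub.
async function queueLinks(anchors, addLink, limit = 8) {
  const seen = new Set();
  const queued = [];
  const failed = [];

  for (const a of anchors) {
    const href = a && a.href;
    // Skip anchors without an href and hrefs that were already handled.
    if (!href || seen.has(href)) continue;
    // Stop once `limit` distinct hrefs have been handled.
    if (seen.size >= limit) break;
    seen.add(href);

    if (typeof addLink !== "function") {
      failed.push({ href, reason: "addLink is not a function" });
      continue;
    }

    try {
      await addLink(href);
      queued.push(href);
    } catch (e) {
      // Record the error instead of aborting the whole behavior.
      failed.push({ href, reason: String(e) });
    }
  }

  return { queued, failed };
}
```

Inside `run`, this could be called as `const { queued, failed } = await queueLinks(aTags, addLink);`, followed by one `yield getState(ctx, ...)` per entry in `failed`, so that the behavior log shows whether any link was rejected. It would not show whether the crawler later drops a queued link because of the scopeType.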

The Docker command line:
docker run -p 6080:6080 -p 9223:9223 -v c:\tmp\crawls\:/crawls/ -v c:\tmp\custom-behaviors\:/custom-behaviors/ -it webrecorder/browsertrix-crawler:latest crawl --url https://group.bnpparibas/toutes-actualites/communique-de-presse --generateWACZ final-to-warc --text --wait-until domcontentloaded --screenshot thumbnail,view,fullPage --scopeType page --customBehaviors /custom-behaviors/ --pageLimit 10 --screencastPort 9223 --profile "/crawls/profiles/group.bnpparibas.tar.gz" --behaviors siteSpecific

cmillet2127 added the enhancement label May 8, 2024