[Feature] Stop loading of page #3238

Closed
RebliNk17 opened this issue Sep 13, 2018 · 41 comments
Labels
chromium Issues with Puppeteer-Chromium

Comments

@RebliNk17

RebliNk17 commented Sep 13, 2018

In Chrome, there is an option to cancel loading of a page by clicking the X which is replaced by the refresh button when the page is loading.

There are some websites that keep on loading, even after 90s I keep on getting timeout errors.

If there were an option to stop loading the page (as there is in Chrome), I would get the content that was already loaded and prevent Puppeteer from throwing a timeout.

I tried to use page.keyboard.press('Escape'); but with no luck.

Another solution would be to stop loading the page after X ms with something like this:

page.setPageLimitLoadingTime(30000);
which will stop the page from continuing the loading process and return all the data it already got...

Chromium API reference:
https://chromedevtools.github.io/devtools-protocol/tot/Page#method-stopLoading

Tell us about your environment:

Thank you.

** if there already is an option for my proposal I'm sorry, I just couldn't find anything...

@aslushnikov
Contributor

@RebliNk17 there's a window.stop method, would it be helpful to you?

await page.evaluate(() => window.stop());

@RebliNk17
Author

Correct me if I'm wrong, but I think that the evaluate function only runs after page.goto finishes...

Anyway, it's not working.

something like this is partially working:
await page._client.send("Page.stopLoading");
but I cannot find a way to tell puppeteer that the page has finished loading...
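
For readers who would rather not depend on the private page._client field, the same protocol command can probably be sent through a CDP session instead. This is a minimal sketch, assuming a Puppeteer version that exposes page.createCDPSession() (older versions expose it as page.target().createCDPSession()). The helper name stopLoading is hypothetical and not part of Puppeteer.

```javascript
// Hypothetical helper (not a Puppeteer API): sends the DevTools
// `Page.stopLoading` command for a page without touching private fields.
// Assumes `page` is a Puppeteer Page object.
async function stopLoading(page) {
  // Newer Puppeteer versions expose createCDPSession() on the page itself;
  // older ones only expose it on the page's target.
  const session = typeof page.createCDPSession === 'function'
    ? await page.createCDPSession()
    : await page.target().createCDPSession();
  try {
    await session.send('Page.stopLoading');
  } finally {
    // Detach so the extra session does not linger; ignore detach failures.
    await session.detach().catch(() => {});
  }
}
```

As the comment above points out, this only asks the browser to stop loading; it does not by itself settle a pending page.goto promise.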

@RebliNk17
Author

RebliNk17 commented Sep 16, 2018

Don't know how and why this code now works:
await page._client.send("Page.stopLoading");

It stops loading the page and returns all the data from goto...

A few days ago it didn't return any data and threw a timeout, but now it does...

@RebliNk17
Author

Sorry, not working as I thought.
With waitUntil set to networkidle0 or networkidle2 in page.goto, I still get a timeout.

With 'domcontentloaded' or 'load' I don't get all the data from some websites, but then I don't get timeout errors.

@aslushnikov Any thought on how to do it?

I've tried this:
https://github.com/RebliNk17/puppeteer/blob/master/lib/Page.js

But I'm still missing something...

@RebliNk17 RebliNk17 reopened this Sep 16, 2018
@aslushnikov
Contributor

@RebliNk17 what do you expect to see when you "stop" loading?

If you just want the navigation promise to not hang, I'd implement stopping somehow like this:

let stopCallback = null;
const stopPromise = new Promise(x => stopCallback = x);

const navigationPromise = Promise.race([
  page.goto(url).catch(e => void e),
  stopPromise
]);

// Do something; once you want to "stop" navigation, call `stopCallback`.
stopCallback();

@RebliNk17
Author

@aslushnikov
I'll try to explain my problem better.

What I want is for the page.goto promise to return the website's HTML content and HTTP requests after X seconds, without an exception, even if the page has not finished loading.

Currently, if the page did not finish loading a Timeout exception is thrown and no data (HTML / HTTP requests etc) is returned.

Expected result:
When stopLoading is called, the page will stop all process (Just like when pressing ESC or the X button on a regular browser) and will "display" all the content that has been loaded until that press.

Is it clearer now?
If not, I will create a short video to explain it (English is not my native language).

@aslushnikov
Contributor

Is it clearer now?

@RebliNk17 I'm still not sure what's not working.

Expected result:
When stopLoading is called, the page will stop all process (Just like when pressing ESC or the X button on a regular browser) and will "display" all the content that has been loaded until that press.

So the following approach should yield the expected result:

  • the await page._client.send("Page.stopLoading"); will stop page loading, as if you hit "X" in the browser
  • you can get page's content after that with await page.content()
  • you can catch and ignore Timeout exception from page.goto:
await page.goto(url).catch(e => void e); // catch and ignore exception

So what's not working?
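
The three steps listed above can be combined into one helper. This is a minimal sketch, assuming `page` is a Puppeteer Page, that page.createCDPSession() is available in the version in use, and that sending Page.stopLoading through that session behaves like the page._client call. The name snapshotAfter and its parameters are hypothetical.

```javascript
// Hypothetical helper: navigate, give up after `timeoutMs`, stop loading,
// and return whatever HTML the page holds at that point.
async function snapshotAfter(page, url, timeoutMs) {
  // Catch and ignore the navigation error (for example a timeout);
  // `response` is null when goto rejected.
  const response = await page
    .goto(url, { timeout: timeoutMs })
    .catch(() => null);

  // Stop loading, as if the browser's "X" button had been pressed.
  // Assumption: createCDPSession() exists on the page in this version.
  const session = await page.createCDPSession();
  await session.send('Page.stopLoading').catch(() => {});
  await session.detach().catch(() => {});

  // Read the content that was loaded so far.
  const html = await page.content();
  return { response, html };
}
```

Note that this waits for the full goto timeout before stopping, so it does not cover a stop that has to be triggered earlier by some other condition.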

@vsemozhetbyt
Contributor

vsemozhetbyt commented Sep 16, 2018

Maybe page.goto(url, { waitUntil: 'domcontentloaded' }) will suffice?

@RebliNk17
Author

RebliNk17 commented Sep 17, 2018

Maybe page.goto(url, { waitUntil: 'domcontentloaded' }) will suffice?

That's not loading all the javascript in the page.

  • the await page._client.send("Page.stopLoading"); will stop page loading, as if you hit "X" in the browser

When using networkidle0 or networkidle2 that's not enough, it will still hang and throw timeout exception.

await page.goto(url).catch(e => void e); // catch and ignore exception

this will still hang until timeout.

what I found to be working is something like this:
I changed the code a little bit in lib\Page.js#goto

  async goto(url, options = {}) {
    ......

    const pageLoadingStoppedFunc = pageLoadingStopped.bind(this);

    let ensureNewDocumentNavigation = false;
    let error = await Promise.race([
      navigate(this._client, url, referrer),
      watcher.timeoutOrTerminationPromise(),
      pageLoadingStoppedFunc()
    ]);
    if (!error) {
      error = await Promise.race([
        watcher.timeoutOrTerminationPromise(),
        ensureNewDocumentNavigation ? watcher.newDocumentNavigationPromise() : watcher.sameDocumentNavigationPromise(),
        pageLoadingStoppedFunc(),
      ]);
    }
    watcher.dispose();
    helper.removeEventListeners(eventListeners);
    if (error)
      throw error;
    const request = requests.get(mainFrame._navigationURL);
    this._finished = true;
    return request ? request.response() : null;
    ...

    /* Not sure if this is the right approach for this function... */
    async function pageLoadingStopped() {
      const _this = this;
      return new Promise(function(resolve, reject) {
        const interval = setInterval(() => {
          if (_this._stopped || _this._finished) {
            clearInterval(interval);
            resolve();
          }
        }, 100);
      });
    }
  }

  async stopPageLoading() {
    await this._client.send('Page.stopLoading');
    this._stopped = true;
  }

This now waits for page loading to finish or to be stopped, and does not hang at all.

Is it possible to add it to the official puppeteer API?

@RebliNk17
Author

RebliNk17 commented Oct 2, 2018

@aslushnikov Any thought on the code I shared above?
It works as expected, but I'm not sure whether there is a better approach...
If it's good, should I create a PR?

@aslushnikov
Contributor

@RebliNk17 sorry for the delay, I was busy with other stuff.

Any thought on the code I shared above?

Can we step back and re-iterate since I still don't understand what's not working.

If I understand correctly, there's a website that takes a lot of time to load. We want to constrain wait time to certain amount and get content from the page after this time. Is this correct?

If yes, why's the following not working for you?

const puppeteer = require('puppeteer');
(async() => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  try {
    // Constrain loading time to 30 seconds
    await page.goto('https://bestmodelsbrasil.blogspot.co.il', {waitUntil: 'networkidle0', timeout: 30000});
  } catch (e) {
  }
  console.log(await page.content());
  await browser.close();
})();

@RebliNk17
Author

RebliNk17 commented Oct 11, 2018

Sorry, I did not get any notification about your comment.

Your code will work, but sometimes the stop condition is not a fixed time; it can also depend on other code running in the background, as in my situation.

Adding this "stopPageLoading" which exists in the Chromium API, will make it possible...
It's something that Puppeteer should have...
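
When the stop condition is an event rather than a timer, the Promise.race idea from earlier in the thread can be wrapped in a handle that any other code path can trigger. This is a minimal sketch under stated assumptions: `page` is a Puppeteer Page, page.createCDPSession() exists in the version in use, and startStoppableNavigation is a hypothetical name, not a Puppeteer API.

```javascript
// Hypothetical helper: starts a navigation and returns a handle whose
// `stop()` can be called from any other code path, not only from a timer.
function startStoppableNavigation(page, url, gotoOptions = {}) {
  let resolveStopped;
  const stopped = new Promise((resolve) => { resolveStopped = resolve; });

  // Swallow goto errors: after a stop, goto may still reject later
  // (for example with a timeout) when nobody is awaiting it any more.
  const navigation = page.goto(url, gotoOptions).catch(() => null);

  return {
    // Settles with the navigation response, or null if the navigation
    // was stopped or failed first.
    done: Promise.race([navigation, stopped.then(() => null)]),

    // Asks the browser to stop loading, then releases whoever awaits `done`.
    async stop() {
      try {
        const session = await page.createCDPSession();
        await session.send('Page.stopLoading');
        await session.detach();
      } catch (_) {
        // Best effort: the page may already be closed or detached.
      } finally {
        resolveStopped();
      }
    },
  };
}
```

A caller would keep the returned handle, call stop() from whatever background code decides that loading has gone on long enough, await done, and then read page.content(). The underlying page.goto promise is not cancelled by this; it is only ignored.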

@RebliNk17
Author

@aslushnikov Any thoughts?
I see two people voted for this...

@chigix

chigix commented Oct 26, 2018

window.stop() really works!

Below is my TypeScript code sample:

page.goto('https://some site slow...');
await new Promise(resolve => setTimeout(resolve, 3000));
await page.evaluate(_ => window.stop());
await browser.close().catch(reason => console.error(reason));

There is actually no unhandled promise rejection from the goto call.

@aslushnikov
Contributor

Your code will work, but sometimes the stop condition is not a fixed time; it can also depend on other code running in the background, as in my situation.

Wouldn't the solution in #3238 (comment) work in this case?

@RebliNk17
Author

Your code will work, but sometimes the stop condition is not a fixed time; it can also depend on other code running in the background, as in my situation.

Wouldn't the solution in #3238 (comment) work in this case?

No, it wouldn't work...
I added my code (from that comment) as a patch to v1.8 and it's working great...
I want to upgrade to the latest version with this function available.
Is it that hard to implement it in the code?

Again, this is something that exists in the Chromium API... all that is needed is the implementation in Puppeteer...

@aslushnikov
Contributor

No, it wouldn't work...

Can you explain why?

Again, this is something that exists in the Chromium API... all that is needed is the implementation in Puppeteer...

It's very important to understand what we add and why - otherwise we risk bloating the API with one-off solutions.

@RebliNk17
Author

No, it wouldn't work...

Can you explain why?

Again, this is something that exists in the Chromium API... all that is needed is the implementation in Puppeteer...

It's very important to understand what we add and why - otherwise we risk bloating the API with one-off solutions.

This code does not return the page if stopPromise wins the race...

@aslushnikov
Contributor

This code does not return the page if stopPromise wins the race...

Well, that's very easy to address:

let stopCallback = null;
const stopPromise = new Promise(x => stopCallback = x);

const navigationPromise = Promise.race([
  page.goto(url).catch(e => void e),
  stopPromise,
]).then(() => page);

// Do something; once you want to "stop" navigation, call `stopCallback`.
stopCallback();

@RebliNk17
Author

RebliNk17 commented Nov 14, 2018

When I try to use page.status() with your code, I get "page.status is not a function".

@RebliNk17
Author

@aslushnikov any update?

@aslushnikov
Contributor

When I try to use page.status() with your code, I get "page.status is not a function".

@RebliNk17 There's no status method on a page object: Page API.

I guess what you want is a navigation response, not the page object.
In this case, the following should work:

let stopCallback = null;
const stopPromise = new Promise(x => stopCallback = x);

let navigationRequest = null;

const onRequest = r => {
  if (r.isNavigationRequest())
    navigationRequest = r;
};

page.on('request', onRequest);
const navigationPromise = Promise.race([
  page.goto(url).catch(e => void e),
  stopPromise,
]).then(() => {
  page.removeListener('request', onRequest);
  return navigationRequest ? navigationRequest.response() : null;
});

// Do something; once you want to "stop" navigation, call `stopCallback`.
stopCallback();

@RebliNk17
Author

Sorry, I meant the status of the page response. It exists...
I am currently using it. I will check your code, but I am not sure that this is what I am looking for.

@aslushnikov aslushnikov added the chromium Issues with Puppeteer-Chromium label Dec 6, 2018
@aslushnikov
Contributor

I'll close this for now - let me know if we can be helpful.

@RebliNk17
Author

Hi, sorry for the delay, I've been on vacation...

I've tested your code, it's not what I need...
I think the only solution is this

@nylen

nylen commented Apr 8, 2019

@aslushnikov I am running into some difficulties here too. I think part of the problem is just that there are a lot of different ways a page load can fail.

I'm building an archiving tool, and I'd like to give page.goto a chance to load a page for a while, and then stop the loading process and inspect what (if anything) is loaded so far.

If the page is partially or mostly loaded, but the browser gets stuck "Connecting..." to the server for a resource in the main rendering path, then page.goto will fail. In this case (and potentially others) we can do better than the default timeout behavior: we want to wait a while and let the browser try to load the page, then stop the loading process and see if we've got any usable content.

Currently I am getting the best results from the combination of the Promise.race technique above, and page._client.send( 'Page.stopLoading' ). This works in more cases than window.stop(). For example, when the request for the main page is still "Connecting...", window.stop() won't work because no DOM has loaded.

One improvement in Puppeteer would be to add a Page.stopLoading() method, and/or make page.goto abort when a Page.stopLoading is sent to the same page, or when the page load is otherwise aborted (user clicks the Stop button, for example). As things stand now, if you stop its page load elsewhere then page.goto will just time out, so some other code is needed to make this possible.

@ccornici

ccornici commented Jun 1, 2019

Same issue here, am in need of a "page.stopLoading()"

@RebliNk17
Author

RebliNk17 commented Jun 2, 2019

They don't care,
I gave them a sample, but a lot of things changed since then...
I can implement it again, but it's not very smart to adjust the code every time a new version is out and I want to upgrade...

@aslushnikov
Contributor

@aslushnikov I am running into some difficulties here too. I think part of the problem is just that there are a lot of different ways a page load can fail.

@nylen can you please help me understand what the problem is? Our previous discussion on the subject with @RebliNk17 has stalled.

You suggest adding a new method: page.stopLoading():

One improvement in Puppeteer would be to add a Page.stopLoading() method, and/or make page.goto abort when a Page.stopLoading is sent to the same page, or when the page load is otherwise aborted (user clicks the Stop button, for example). As things stand now, if you stop its page load elsewhere then page.goto will just time out, so some other code is needed to make this possible.

What would the method do? Why doesn't the workaround work for you?

@nylen

nylen commented Jun 3, 2019

@aslushnikov I think page.stopLoading() would just be a convenience method to do page._client.send('Page.stopLoading') behind the scenes.

The Promise.race workaround above is a decent starting point - it got me on the right track so thanks for that. However, it is pretty clunky, and it doesn't actually stop the navigation. This means you could get more requests, a subsequent navigation redirect, etc which may be undesirable depending on the use case.

As said above, I'd also suggest that page.goto should be smart enough to detect (I suppose reject its promise with an appropriate error?) when navigation is stopped elsewhere, for example via 'Page.stopLoading' or maybe if a user clicks the browser's Stop button. This would allow simplifying the code to conditionally stop a page load even further, you'd just need to set a timeout that calls page.stopLoading, and then await page.goto(...) would still work as expected in the "main" code flow.

@aslushnikov
Contributor

aslushnikov commented Jun 3, 2019

This means you could get more requests, a subsequent navigation redirect, etc which may be undesirable depending on the use case.

@nylen that's a good point. page.stop would prevent these, but page's javascript - if any - can still cause navigations and requests. Besides javascript, meta redirects might still cause navigations.

I think the biggest problem I have with page.stopLoading() is that it doesn't actually guarantee much.
It does abort all current in-flight requests - but I don't see why this can be useful.

Could it be that you rely on some specific behavior of page.stopLoading that I'm not aware of? Or maybe you can share your specific scenario so that I can feel your pain better?

@nylen

nylen commented Jun 3, 2019

My specific use case: I was building a web archiving tool that (ideally) should work with arbitrary pages, and I found there are certain kinds of navigation timeouts that can be avoided or shortened, like when a page is stuck Connecting... to a resource that's in the main rendering path. I think the issue in the OP is similar.

I agree there are other things that could cause navigations after a page is "stopped". I am assuming that "aborting all current in-flight requests" is good enough for my use case, and so far it seems to be working. For this part page._client.send('Page.stopLoading') is fine, but it was a bit hard to track down the correct call. At least now that is documented in this issue.

So I am mostly just looking for potential ways to improve the code of puppeteer users here. Hence the suggestion to make page.goto aware of "navigation aborted" events, because I think this would allow getting rid of the Promise.race in the examples above.

I don't think any of this is particularly urgent. Thanks for all of your work on Puppeteer.

@superryeti

superryeti commented Jul 30, 2019

@aslushnikov Thank you for this

@RebliNk17 sorry for the delay, I was busy with other stuff.

Any thought on the code I shared above?

Can we step back and re-iterate since I still don't understand what's not working.

If I understand correctly, there's a website that takes a lot of time to load. We want to constrain wait time to certain amount and get content from the page after this time. Is this correct?

If yes, why's the following not working for you?

const puppeteer = require('puppeteer');
(async() => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  try {
    // Constrain loading time to 30 seconds
    await page.goto('https://bestmodelsbrasil.blogspot.co.il', {waitUntil: 'networkidle0', timeout: 30000});
  } catch (e) {
  }
  console.log(await page.content());
  await browser.close();
})();

I am using pyppeteer and had the same problem (I couldn't think of a way to get the DOM and cookies after a timeout). This solved my problem. I can access the DOM with

await page.content()

and cookies by

await page.cookies()

I don't understand what everyone else is complaining about. Again, thank you so much. Saved me a couple of hours.

@sheikalthaf

@RebliNk17 what do you expect to see when you "stop" loading?

If you just want the navigation promise to not hang, I'd implement stopping somehow like this:

let stopCallback = null;
const stopPromise = new Promise(x => stopCallback = x);

const navigationPromise = Promise.race([
  page.goto(url).catch(e => void e),
  stopPromise
]);

// Do something; once you want to "stop" navigation, call `stopCallback`.
stopCallback();

I tried your solution and it works well, but when I try to take a screenshot I get this error:

error: Error: Protocol error (Page.captureScreenshot): Unable to capture screenshot
    at Promise (/node_modules/puppeteer/lib/Connection.js:183:56)
    at new Promise (<anonymous>)
    at CDPSession.send (/node_modules/puppeteer/lib/Connection.js:182:12)
    at Page._screenshotTask (/node_modules/puppeteer/lib/Page.js:951:39)
    at process._tickCallback (internal/process/next_tick.js:68:7)
  -- ASYNC --
    at Page.<anonymous> (/node_modules/puppeteer/lib/helper.js:111:15)
    at htmlBrowser (/dist/apps/botminds-browser/main.js:1079:45)
    at process._tickCallback (internal/process/next_tick.js:68:7)

@Mister-Fil

To stop page loading (among other things; this can also close an alert()):

await page.keyboard.press('Escape')

If it doesn't work, then duplicate the line several times

await page.keyboard.press('Escape')
await page.keyboard.press('Escape')
await page.keyboard.press('Escape')

@otachkin

Can someone help me stop this page from loading continuously?

https://mbd.baidu.com/newspage/data/landingpage?s_type=news&dsp=wise&context=%7B%22nid%22%3A%22news_9644758218931914527%22%7D&pageType=1&n_type=1&p_from=-1&rec_src=52

await page.evaluate(() => window.stop());

It's not working; Puppeteer just gets stuck.

@wesleyscholl

This worked for me:

await page.goto(this.url, { waitUntil: 'domcontentloaded' })

Thanks!

@heaven

heaven commented Nov 18, 2023

@aslushnikov The problem is when setting a timeout with page.goto, even when it fails with TimeoutError, the page keeps running in the browser. This slows down the entire process. Sometimes browser.pages() takes 10+ seconds. Working in a Lambda environment leads to unpredictable behavior and various errors.

Whenever we reach the timeout, it would be great to have a way to stop the page immediately. I agree that stopping the ongoing requests most likely won't help much, but it would be better than nothing.

Here's an example:

Function Logs
START RequestId: 44a07ce9-8800-4d98-bc5f-fdb236a34202 Version: $LATEST
2023-11-18T19:03:59.438Z	44a07ce9-8800-4d98-bc5f-fdb236a34202	INFO	Launching the browser
2023-11-18T19:04:01.978Z	44a07ce9-8800-4d98-bc5f-fdb236a34202	INFO	Creating new incognito context
2023-11-18T19:04:01.980Z	44a07ce9-8800-4d98-bc5f-fdb236a34202	INFO	Opening new page
2023-11-18T19:04:02.033Z	44a07ce9-8800-4d98-bc5f-fdb236a34202	INFO	Loading page
2023-11-18T19:04:05.452Z	44a07ce9-8800-4d98-bc5f-fdb236a34202	INFO	Main frame navigated to:  ...
2023-11-18T19:04:27.043Z	44a07ce9-8800-4d98-bc5f-fdb236a34202	INFO	Timeout error, skipping the page
2023-11-18T19:04:27.043Z	44a07ce9-8800-4d98-bc5f-fdb236a34202	INFO	Cleaning up
2023-11-18T19:04:27.050Z	44a07ce9-8800-4d98-bc5f-fdb236a34202	INFO	CLEANUP: Loading pages (await browser.pages())
2023-11-18T19:04:47.091Z	44a07ce9-8800-4d98-bc5f-fdb236a34202	INFO	CLEANUP: Closing pages (await Promise.all(pages.map(p => p.close().catch(() => {}))))
2023-11-18T19:04:47.111Z	44a07ce9-8800-4d98-bc5f-fdb236a34202	INFO	CLEANUP: disabling disconnect event handler (browser.off('disconnected'))
2023-11-18T19:04:47.111Z	44a07ce9-8800-4d98-bc5f-fdb236a34202	INFO	CLEANUP: Disconnecting from the browser (await browser.disconnect())
2023-11-18T19:04:47.112Z	44a07ce9-8800-4d98-bc5f-fdb236a34202	INFO	CLEANUP: Done
2023-11-18T19:04:47.112Z	44a07ce9-8800-4d98-bc5f-fdb236a34202	INFO	Gzipping
2023-11-18T19:04:47.113Z	44a07ce9-8800-4d98-bc5f-fdb236a34202	INFO	Responding
END RequestId: 44a07ce9-8800-4d98-bc5f-fdb236a34202
REPORT RequestId: 44a07ce9-8800-4d98-bc5f-fdb236a34202	Duration: 47680.63 ms	Billed Duration: 47681 ms	Memory Size: 1536 MB	Max Memory Used: 662 MB	Init Duration: 972.51 ms

You can see await browser.pages() took 20 seconds. What's worse, with Lambda the page can keep running in the browser even after the function is restarted. So the next event starts opening a new page and then that context.newPage() takes an enormous amount of time. The timeout is set to 25 seconds but the job took almost 47.

@kduffie

kduffie commented Dec 14, 2023

Our product crawls our customer's website as part of our overall solution. We are using Puppeteer for this and, overall, it works great. But we have the same problem discussed here. We can't know a priori what the appropriate timeout behavior needs to be on any given page or site.

When page.goto throws a TimeoutError, it doesn't necessarily mean that the page is unusable -- but after catching the error we can't access the HttpResponse that is returned when there is no exception. If a new method, page.response(), for example, returned the response object if it is available, we'd be happy. I realize that in some timeout scenarios the response will not be available (such as if the timeout is at the network layer). It may also be a good idea for Puppeteer to emulate a "stop" when it throws an error, but I don't see that I need to be part of that.

So something like the following would be desirable:

const browser = await launch();
const page = await browser.newPage();
try {
  await page.goto(url, {waitUntil: ['load', 'networkidle2'], timeout: 30000});
} catch (err) {
  // perhaps check for errors other than timeout
}
const response = await page.response();
if (response) {
  // use various information from the page itself, but also from response, such as url, status, headers
}
// clean up
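
The page.response() method above is a proposal in this comment, not an existing Puppeteer API. In the meantime, the main-frame response can probably be recorded with a 'response' listener registered before navigating. This is a minimal sketch; gotoKeepingResponse is a hypothetical name, and it assumes that request.isNavigationRequest(), request.frame(), page.mainFrame() and page.off() are available in the Puppeteer version in use.

```javascript
// Hypothetical helper: navigates and returns the last main-frame navigation
// response seen, even when page.goto itself rejects (e.g. on a timeout).
async function gotoKeepingResponse(page, url, gotoOptions = {}) {
  let lastNavigationResponse = null;

  const onResponse = (response) => {
    const request = response.request();
    // Keep only document navigations of the main frame (no iframes, no assets).
    if (request.isNavigationRequest() && request.frame() === page.mainFrame())
      lastNavigationResponse = response;
  };

  page.on('response', onResponse);
  let error = null;
  try {
    const response = await page.goto(url, gotoOptions);
    if (response) lastNavigationResponse = response;
  } catch (e) {
    // Typically a TimeoutError; the caller decides whether it matters.
    error = e;
  } finally {
    page.off('response', onResponse);
  }
  return { response: lastNavigationResponse, error };
}
```

If the timeout happens before any response arrives (for example at the network layer, as noted above), `response` stays null and only `error` is set.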

Kikobeats added a commit to microlinkhq/browserless that referenced this issue Apr 22, 2024
I discovered there is no way to cancel `page.goto` unless you take care about it.

Long issue with several workarounds: puppeteer/puppeteer#3238
Kikobeats added a commit to microlinkhq/browserless that referenced this issue May 5, 2024
* fix(goto): abort page.goto after timeout

I discovered there is no way to cancel `page.goto` unless you take care about it.

Long issue with several workarounds: puppeteer/puppeteer#3238

* v10.5.0-alpha.0
@Mahmoud-Skafi

Mahmoud-Skafi commented Jun 24, 2024

For some reason this works for me:

await page
  .goto(url, { waitUntil: "domcontentloaded", timeout: 3000 })
  .catch((e) => void e);

await new Promise((resolve) => setTimeout(resolve, 3000));
await page.evaluate((_) => window.stop());

Thanks to @chigix

@sharee-tech

I used .preventDefault() for this:
const labelElement = await page.getByRole('link', { name: 'See The Lakes' });

await labelElement.evaluate((element) => {
  element.addEventListener('click', (event) => {
    event.preventDefault();
  });
});

await labelElement.click();
