
Error: Target page, context or browser has been closed #44

Closed

EthanZ1996 opened this issue Dec 27, 2021 · 4 comments

Comments

@EthanZ1996

Hi, elacuesta,

I use your handler in my Scrapy project and it runs well and crawls the information I need. However, some errors occur before the items are processed in the item pipeline. Here is an example:

2021-12-27 16:50:26 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-184' coro=<Route.continue_() done, defined at /home/ethanz/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py:710> exception=Error('Target page, context or browser has been closed')>
Traceback (most recent call last):
  File "/home/ethanz/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 748, in continue_
    await self._async(
  File "/home/ethanz/.local/lib/python3.8/site-packages/playwright/_impl/_network.py", line 239, in continue_
    await self._channel.send("continue", cast(Any, overrides))
  File "/home/ethanz/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 39, in send
    return await self.inner_send(method, params, False)
  File "/home/ethanz/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 63, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed

Sometimes this error occurs 5 or 6 times per run, but I have also had runs with no errors at all. The only difference among these errors is the task number, i.e. Task-180, Task-181, Task-182, and so on.

I guess the error is related to the coroutine or asyncio, but I am not familiar with them. Do you know what is going on? Do I need to change any settings? Thanks! BTW, I am using a VM running Ubuntu 20.04 on Windows 10.

Regards,
Ethan

@lime-n

lime-n commented Jan 6, 2022

I have received the same message when trying to reach the next few pages of a URL. I'll provide some further information on my approach here:

I'm building a scraper that follows the link for each post, then moves on to the next results page, keeps doing this, and finally grabs the info from each page it linked to.

import hashlib
import logging
from pathlib import Path
from typing import Generator, Optional
from scrapy import Spider
from scrapy.crawler import CrawlerProcess
from scrapy.http.response import Response
import scrapy
from scrapy_playwright.page import PageCoroutine

cookies = {
    'VISITOR_ID': '3c553849d1dc612f60515f04d9316813',
    'INEU': '1',
    'PJBJOBSEEKER': '1',
    'LOCATIONJOBTYPEID': '3079',
    'AnonymousUser': 'MemberId=c3e94bc3-3fcf-423d-bc25-e5a5818cd2b9&IsAnonymous=True',
    'visitorid': '46315422-d835-4a38-b428-4b9c5d6243d3',
    's_fid': '6468EA5E39AF374B-2F7C971BB196D965',
    'sc_vid': '7c12948068d2cb92c1f1622aeaabc62d',
    'listing_page__qualtrics': 'empty',
    'SsaSessionCookie': 'fea6276c-cee9-43b3-9d57-9f00d6bcd32b',
    's_cc': 'true',
    'SessionCookie': '1d7806bc-4b79-47e9-839d-2d94ec224abb',
    'FreshUserTemp': 'https://www.jobsite.co.uk/',
    'bm_mi': '6BF6AA183A047F87BAC664C92ACA8E41~1Fku4TDwEBxz2+fwhUGUWjUhP3vaQED08Ala3VmmARyewb9/OjQUmvPEWw88MUA7USOzt+0MSpdyPmY/3N+iY08InyOy4DnNHgTq88AWwBigf1XhufLstD/eUhUJBgXQRSa1rVlO5SB5mlkhezcDRmv8bL+Gt4NZdsjVC4ZlVc3ptkbKY9cBB65yW2tyjZBLtxsQnz/rFJXo4a9PTKOvF/Betnb8S/XQrpNDsXOojdhtQrrU9V6XSziX+tHXT6xj1osB8XQtm0VGC7L6+4+bgQ==',
    'gpv_pn': '%2FJobSearch%2FResults.aspx',
    'ak_bmsc': '77506B6768E0463D238EEE24AE5B3A72~000000000000000000000000000000~YAAQFsITAnaZtJ99AQAA4LdCLw4PU4xFjE3/FbxxIG7pSjNqX9TClutWaS1MLKKy/9hAM9d6bcEN5Mr9Fbb8+1Jy3rrCsFO5TvxstcVAjaGbbvDCF/mXxeqJQAU1h/cvrZEH68FZyDuslnE+Ae7DuCs1QmNkNP6+0dvA4GT+/MENayQQk8szCo8ch3IfCK1j5/JL+jjbb04pmnpibV3XvUcLeqTJMY1IG9PlTuBIFWF8gXREI+ug2bb8pL+r7T1v1s9gVmfo633B0BoVcXIfWcDgtyFJjFNVayz2lHxUdtnInaWvi1ubzsjQ7cfUDdHTorHsJ0rP1RXB0utZ80GIBNbGdAzd1jkWy9BMIqdIcbBXM4+rCf3fbPw+qui+0Sr4RIxM5N41mvrOQ6W8s9bPR7GySeJr/2HGSmxTjf+4QDVY',
    'TJG-Engage': '1',
    'CONSENTMGR': 'c1:0%7Cc2:0%7Cc3:0%7Cc4:0%7Cc5:0%7Cc6:0%7Cc7:0%7Cc8:0%7Cc9:1%7Cc10:0%7Cc11:0%7Cc12:0%7Cc13:0%7Cc14:0%7Cc15:0%7Cts:1641470409597%7Cconsent:true',
    'utag_main': 'v_id:017e24e57d970023786b817ac51005079001e071009e2_sn:16$_se:5$_ss:0$_st:1641472209641$ses_id:1641470390747%3Bexp-session$_pn:3%3Bexp-session$PersistedFreshUserValue:0.1%3Bexp-session$PersistedClusterId:OTHER--9999%3Bexp-session',
    's_ppvl': '%2FJobSearch%2FResults.aspx%2C13%2C13%2C741%2C409%2C741%2C1600%2C900%2C2%2CL',
    's_ppv': '%2FJobSearch%2FResults.aspx%2C100%2C13%2C6616%2C423%2C741%2C1600%2C900%2C2%2CL',
    's_sq': 'stepstone-jobsite-uk%3D%2526c.%2526a.%2526activitymap.%2526page%253D%25252FJobSearch%25252FResults.aspx%2526link%253DNext%2526region%253Dapp-unifiedResultlist-db2486f4-fb7d-469f-8cfd-f31a3eafb692%2526pageIDType%253D1%2526.activitymap%2526.a%2526.c%2526pid%253D%25252FJobSearch%25252FResults.aspx%2526pidt%253D1%2526oid%253Dhttps%25253A%25252F%25252Fwww.jobsite.co.uk%25252Fjobs%25253Fpage%25253D3%252526action%25253Dpaging_next%2526ot%253DA',
    'bm_sv': '4C178898519D2A4ADEBB840C0B682999~sanqWSDI/ZT0KWrdWhNRc7UtVtqAZ61oPSoLv/MnCD1e0a7vUTSzpggIj9dt/bN4nXEmOaM48hugBFRwdBveJlobrjEcMZ1gHS3S3KXYaHfZPjq6IIf8/Fs1QUlg0s7oLp6DsZbkAWWOnNQiI/uaq7XT7EHnd+n/46ra5jgwfhA=',
    '_abck': '508823E0A454CEF8D6A48101DB66BDB8~0~YAAQFsITAjWdtJ99AQAAr21ELwcAj2rhoIgnvsHOkxQPuREHCA9mDMHsyk68FBhxQ0Jto+6FqaEHJkrrVEUGuYveQAjVJ7CGS+2ajmbcVkG/KIQn8ttCaGvn58jkwzpWm6Fjx4FsLBJyLsceRWSqw5rV2ezEeLrBd/ZToRMpdZop4yqixh5vquandn+h9ysqacaeHPO90VnvctIfvKTUvY5GrrHubGVMkD9/elxRI5whsBdH7ovATyGsLEgYx+e604lY2sQIahSvweclTI4Ud1hTQbSQTebWs52PiYdSU5wq9+YC/7Sr0JuQZCUMyGGqZgtXpfAdc9LDa8X3JfcdO25EZQHxsfEfT/pp7tjbxaXD/pgun9ozymRMy/hBuCj5/Bfln/LzAqOsdDv7q6WVerNr6qivHGDE0m2/~-1~-1~-1',
    'bm_sz': '747D15CEB59AC2C0003BD8479C4BF482~YAAQFsITAjadtJ99AQAAr21ELw5vDH+lMq9NICfxNXHGiXcPcBSrWov2Hy8Y0wgN/OAL7NJWfJ7Lkum/OqG3WNj9/+e8oJhNRQ96ksn+zk0N0gNnoPhUv46am0wktHih1PPfYRlqdPSQSdgE92eHwG3CsFaSeRROKu/1q89aNDH4+JBUk/TDdTmeBqsvJffzvP0S1gAv54dOecx0z2LSW6PEj0e0VtWqmBjFSQxCqH8LZ4r7TwqDpxAKArzWGDMlqR/xZcWvAm8ijUTG+mIuF3N7aBEDGdB90wdyaJGt2CGP3VinhBNtxV7vT8ebY9oWu2rJ+UmugGgJ/dasQP8=~3424837~3354936',
    'EntryUrl': '/jobs?page=3&action=paging_next',
    'SearchResults': '96094529,96094530,96094528,96094527,96094526,96094525,96094524,96094522,96094521,96094520,96094519,96094517,96094518,96094514,96094513,96094509,96094510,96094511,96094508,96094507,96094506,96094503,96094502,96094500,96094499',
}
headers = {
    'authority': 'www.jobsite.co.uk',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"',
    'accept': 'application/json',
    'content-type': 'application/json',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    'sec-ch-ua-platform': '"macOS"',
    'origin': 'https://www.jobsite.co.uk',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.jobsite.co.uk/jobs?page=3&action=paging_next',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}


class JobSpider(scrapy.Spider):
    name = 'job_pages'
    start_urls = ['https://www.jobsite.co.uk/jobs/Degree-Accounting-and-Finance']
    
    custom_settings = {
        'USER_AGENT':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15'
    }
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url = url,
                callback = self.parse,
                dont_filter = True,
                meta= dict(
                    playwright = True,
                    playwright_include_page = True,
                    playwright_page_coroutines = [
                        PageCoroutine('wait_for_selector', 'div.row.job-results-row')
                        ]
                )
            )
    def parse(self, response):
       stuff = response.xpath("//div[@class='ResultsSectionContainer-sc-gdhf14-0 kteggz']/div[@class='Wrapper-sc-11673k2-0 gIBPSk']")
       
       for items in stuff:
           for jobs in items.xpath('//article//div//div[position() mod 7 = 6]/a//@href'):
               yield response.follow(
                   jobs, 
                   callback = self.parse_jobs,
                   meta={
                    "playwright": True,
                    "playwright_include_page": True})

       next_page = response.xpath('(//div)[position() mod 5=3][83]/a[2]//@href').get()
       if next_page:
           yield scrapy.Request(
               url=next_page,
               callback=self.parse,
               meta=dict(
                   playwright=True,
                   playwright_include_page=True,
                   playwright_page_coroutines=[
                       PageCoroutine('wait_for_selector', 'div.row.job-results-row')
                   ],
               ),
           )


    async def parse_jobs(self, response):
        url_sha256 = hashlib.sha256(response.url.encode("utf-8")).hexdigest()
        page = response.meta["playwright_page"]
        await page.screenshot(
            path=Path(__file__).parent / "job_test" / f"{url_sha256}.png", full_page=True
        )
        await page.close()
        yield {
            "url": response.url,
            "title": response.xpath("//h1[@class='brand-font']//text()").get(),
            "price": response.xpath("//li[@class='salary icon']//div//text()").get(),
            "organisation": response.xpath("//a[@id='companyJobsLink']//text()").get(),
            "image": f"job_test/{url_sha256}.png",
        }
if __name__ == "__main__":
    process = CrawlerProcess(
        settings={
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
            "DOWNLOAD_HANDLERS": {
                "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
                "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            },
            "CONCURRENT_REQUESTS": 32,
            "CLOSESPIDER_ITEMCOUNT": 100,
            "FEED_URI":'jobs.jl',
            "FEED_FORMAT":'jsonlines',
        }
    )
    process.crawl(JobSpider)
    logging.getLogger("scrapy.core.engine").setLevel(logging.WARNING)
    logging.getLogger("scrapy.core.scraper").setLevel(logging.WARNING)
    process.start()

Here's the error output:

    result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
2022-01-06 15:23:14 [scrapy-playwright] INFO: Closing browser
2022-01-06 15:23:14 [scrapy-playwright] INFO: Closing browser
2022-01-06 15:23:14 [scrapy-playwright] DEBUG: Browser context closed: 'default'

@elacuesta
Member

Please, provide a minimal, reproducible example (the provided code sample is hardly minimal).

@lime-n

lime-n commented Feb 28, 2022

@elacuesta
It's been a while, but I remember clearly that the error was in my script rather than in scrapy_playwright. It's a long shot, but I presume the author of this issue likely has a similar problem, i.e. that their script may be the cause.

@elacuesta
Member

elacuesta commented Mar 27, 2022

Upon closer inspection this seems like a duplicate of #15, which I'm aiming to solve at #74. Feel free to reopen with more information if that's not the case.
I would suggest defining a request errback or a spider middleware with a process_spider_exception method to recover from these errors.
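As a rough sketch of both options (assuming a spider like the one above; the errback and middleware names below are only placeholders, not something shipped with scrapy-playwright):

import scrapy
from playwright.async_api import Error as PlaywrightError


class JobSpider(scrapy.Spider):
    name = "job_pages"

    def start_requests(self):
        yield scrapy.Request(
            "https://www.jobsite.co.uk/jobs/Degree-Accounting-and-Finance",
            callback=self.parse,
            errback=self.errback_close_page,  # invoked when the request fails
            meta={"playwright": True, "playwright_include_page": True},
        )

    def parse(self, response):
        ...

    async def errback_close_page(self, failure):
        # Close the page that travelled with the failed request, so it does not
        # leak once the target/context is already gone.
        page = failure.request.meta.get("playwright_page")
        if page:
            await page.close()


class HandleClosedTargetMiddleware:
    # Spider middleware variant: swallow the Playwright error raised while the
    # callback output is being consumed, so the crawl keeps going.
    def process_spider_exception(self, response, exception, spider):
        if isinstance(exception, PlaywrightError) and "has been closed" in str(exception):
            spider.logger.warning("Ignoring closed target for %s", response.url)
            return []  # returning an iterable stops the exception from propagating
        return None

The middleware would need to be enabled through the SPIDER_MIDDLEWARES setting, as usual for Scrapy spider middlewares.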
