Many errors with broad crawl #15
Comments
Are there any updates on this? I am experiencing a similar issue. I suspect that the cause is in

My env:
I scraped the first 2K domains from Majestic Million, with

In any case, I think this use case would benefit from #6; I have a few ideas but I haven't decided on one just yet. I will resume work on that after #13 is merged and released. Additionally, if you're just taking screenshots I'd suggest using a
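A minimal sketch of one way to take screenshots with scrapy-playwright (not necessarily what was being suggested above), assuming a recent release where `PageMethod` is available (older releases exposed the same idea as `PageCoroutine`) and assuming the usual scrapy-playwright handler settings are already configured; the URL and file path are placeholders:

```python
import scrapy
from scrapy_playwright.page import PageMethod


class ScreenshotSpider(scrapy.Spider):
    name = "screenshots"

    def start_requests(self):
        for url in ["https://example.com"]:  # placeholder URL
            yield scrapy.Request(
                url,
                meta={
                    "playwright": True,
                    "playwright_page_methods": [
                        # Executed on the page before the response is returned,
                        # so the callback never needs to hold the page open.
                        PageMethod("screenshot", path="example.png", full_page=True),
                    ],
                },
            )

    def parse(self, response):
        yield {"url": response.url}
```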
Hi @elacuesta, thanks for the awesome information. I changed to use

Maybe I now understand the problem with my success rate: I only used a server with 8 CPU cores and 64 GB RAM to run my script, so I need to decrease

Anyway, I'm waiting for the next release of the package to fix the
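As a hedged illustration of the concurrency knobs usually involved when a single machine is the bottleneck, a Scrapy settings fragment might look like this (the values are placeholders, not recommendations, and `PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT` assumes a scrapy-playwright version that supports it):

```python
# settings.py fragment (illustrative values only)
CONCURRENT_REQUESTS = 8                    # overall cap; lower it if the browser starves CPU/RAM
CONCURRENT_REQUESTS_PER_DOMAIN = 1         # broad crawls rarely need more per site
DOWNLOAD_TIMEOUT = 60                      # seconds before Scrapy gives up on a response
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30 * 1000  # milliseconds, used by scrapy-playwright
```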
Hi @elacuesta, I crawled the first 2k domains from Majestic and the script worked as described above. However, when I increased the crawl to about 10k domains (without changing any settings), the script only got about 2400-2500 results and stopped; the debug logs contained
Do you have a guess why, and how to fix the bug? My main Playwright code:
One Playwright contributor said that reusing the browser context may lead to the error.
Hi @elacuesta, could you estimate when
Can confirm getting the latter bug, Task exception was never retrieved, just from scraping a single website. Everything seems to work and I get the scraped items, so it doesn't look like it hurts anything. It only happens once in a while, not on every page. I am running each spider in a separate process using Celery. Any way to suppress the error?

[2021-07-05 22:42:30,646: DEBUG/ForkPoolWorker-100] Browser context closed: '1'
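For what it's worth, one possible way to mute just that message (a sketch, not something provided by this project): the "Task exception was never retrieved" report is normally emitted through the standard `asyncio` logger by the event loop's default exception handler, so a logging filter can drop that specific record while keeping everything else:

```python
import logging


class DropUnretrievedTaskErrors(logging.Filter):
    """Drop only the 'Task exception was never retrieved' records."""

    def filter(self, record: logging.LogRecord) -> bool:
        return "Task exception was never retrieved" not in record.getMessage()


# Attach the filter to the asyncio logger, which is where the default
# event-loop exception handler reports unretrieved task exceptions.
logging.getLogger("asyncio").addFilter(DropUnretrievedTaskErrors())
```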
A quick update from my side on this topic. I tried using scrapy-playwright built from #13 to see if this would solve the memory issue (mentioned in microsoft/playwright#6319). The issue is still that after a few hundred scraped items the

Unfortunately, it didn't help. What I changed was that I'm now using a new context for each logical group of items on the website I am scraping. Previously, my spider would just process all items in a single context. The page is closed directly after it is used in the callback, and I am only scraping one website. I suspect that adding more contexts actually caused the memory issue to surface quicker, since I now scrape fewer items before none of the browser contexts work anymore and no new ones are added. Unfortunately, I don't have more details to share. I don't think this is very helpful for the investigation, but I wanted to share my findings here. Hopefully, microsoft/playwright#6319 will lead to some better memory management and this won't be an issue in the future.
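A rough sketch of the "close the page right after the callback uses it" pattern described above, using scrapy-playwright's `playwright_include_page` meta key; the URL, context name, and item fields are placeholders:

```python
import scrapy


class GroupSpider(scrapy.Spider):
    name = "groups"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/group/1",  # placeholder URL
            meta={
                "playwright": True,
                "playwright_include_page": True,  # hand the Playwright page to the callback
                "playwright_context": "group-1",  # one context per logical group of items
            },
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        html = await page.content()
        await page.close()  # release the page as soon as the callback is done with it
        return {"url": response.url, "html_length": len(html)}
```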
Hi @Obeyed, thanks for sharing your test. I also tested on
It's been a while, but I think I understand what's happening now: #74.
I have the same issue, but I finally yielded different pages in different contexts by context name ('playwright_context': 'xxx'), and it works.
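A sketch of that workaround, routing requests to differently named browser contexts via the `playwright_context` meta key; the per-domain naming scheme is just one possible choice:

```python
from urllib.parse import urlparse

import scrapy


class PerDomainContextSpider(scrapy.Spider):
    name = "per_domain_contexts"
    start_urls = ["https://example.com", "https://example.org"]  # placeholder URLs

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={
                    "playwright": True,
                    # Each distinct name gets its own browser context.
                    "playwright_context": urlparse(url).netloc,
                },
            )

    def parse(self, response):
        yield {"url": response.url, "context": response.meta.get("playwright_context")}
```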
Hello,
I'm using the scrapy-playwright package to capture screenshots and get the HTML content of 2000 websites; my main code looks simple:
There were many errors when I ran the script. I changed CONCURRENT_REQUESTS from 30 to 1, but the results were no different. My test included 2000 websites, but the Scrapy script scraped only 511 results (about a 25% success rate), and the script keeps running without producing more results or error logs.
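For context, the baseline configuration scrapy-playwright documents looks roughly like this; this is only a sketch to make the discussion concrete, not the full settings used here:

```python
# settings.py (baseline sketch)
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

CONCURRENT_REQUESTS = 30  # the starting value mentioned above, later lowered to 1
```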
Please guide me on how to fix this; thanks in advance.
My error logs:
My Scrapy settings look like:
My env: