TypeError: 'NoneType' object is not iterable #107
Comments
I have the same problem. I tried rolling back to an older version, but it does not help. I saw that this problem had already come up with this library. `google = GoogleImageCrawler(storage={"root_dir": path})`
at the output I get |
I have the same problem. Works with Bing and Baidu, but does not work with Google. I keep getting the following errors: |
got the same problem. |
The same problem |
The problem is in builtin/google.py replace the parse function around line 148 with this...
|
Much better, but it still doesn't work. It generates errors of the following type: 2022-12-15 09:27:02,994 - ERROR - downloader - Exception caught when downloading file //www.gstatic.com/images/branding/googlelogo/svg/googlelogo_clr_160x56px.svg, error: '', remaining retry times: 2 |
I can confirm the comments from @philborman and @Viachaslau85. This particular issue seems to be solved by using the provided code snippet (use my fork to pull the relevant changes: pip install ***@***.*** --upgrade) but it yields the new downloading error. |
The downloading error is easy to fix.
Just add this line before uris.append:
if 'images/branding/' not in img['src']:
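A minimal, self-contained sketch of that filter, outside the library. The helper name `keep_src` and the second sample URL are made up for illustration; only the `'images/branding/'` check comes from the comment above.

```python
def keep_src(src):
    """Return True unless the src points at Google's own branding assets."""
    return 'images/branding/' not in src


# Illustrative src values; the first is the URL from the error log quoted earlier,
# the second is an invented example.
srcs = [
    '//www.gstatic.com/images/branding/googlelogo/svg/googlelogo_clr_160x56px.svg',
    'https://example.com/cat.jpg',
]

# Same idea as guarding uris.append(...) with the if statement.
uris = [src for src in srcs if keep_src(src)]
print(uris)
```

Inside the parser, the same condition would simply wrap the existing `uris.append` call.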
|
Thank you @philborman!
Somehow my URLs lacked the protocol. I did not properly test this (I can only confirm it worked for google images) so I can not create a pull request, but if anyone wants to use my repo to fix this, feel free: https://github.com/jfreyberg/icrawler |
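For the missing-protocol case mentioned above, one possible approach (a sketch, not necessarily what the linked fork does) is to resolve each URL against the page it came from using the standard library. The helper name `add_scheme` and the default base URL are assumptions for illustration.

```python
from urllib.parse import urljoin


def add_scheme(url, base="https://www.google.com/"):
    """Resolve a protocol-relative URL (starting with //) against a base URL.

    URLs that already carry a scheme are returned unchanged.
    """
    return urljoin(base, url)


print(add_scheme("//www.gstatic.com/images/branding/googlelogo/svg/googlelogo_clr_160x56px.svg"))
print(add_scheme("https://example.com/cat.jpg"))
```

`urljoin` takes the scheme from the base when the URL starts with `//`, so the downloader receives a complete URL.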
Everything was fine until yesterday. At least the following code is working.
import requests
from bs4 import BeautifulSoup
import os

def save_images(save_dir, keywords):
    os.makedirs(save_dir, exist_ok=True)
    for keyword in keywords:
        url = f"https://www.google.com/search?q={keyword}&tbm=isch"
        res = requests.get(url)
        soup = BeautifulSoup(res.text, "html.parser")
        img_tags = soup.find_all("img")
        for i, img in enumerate(img_tags):
            try:
                img_url = img["src"]
                res = requests.get(img_url)
                with open(f"{save_dir}/{keyword}{str(i).zfill(5)}.jpg", "wb") as f:
                    f.write(res.content)
            except Exception:
                # Skip tags without a usable src and failed downloads
                continue

keywords = ["cat"]
save_dir = "train"
save_images(save_dir, keywords)
So I changed
class GoogleParser(Parser):
    def parse(self, response):
        soup = BeautifulSoup(response.text, 'html.parser')
        image_tags = soup.find_all("img")
        uris = []
        for img in image_tags:
            try:
                img_url = img["src"]
                res = requests.get(img_url)  # Experiments only
                uris.append(img_url)
            except Exception:
                continue
        print(len(uris))  # Experiments Result goes 0
        return [{'file_url': uri} for uri in uris]
and then ran the following code.
from icrawler.builtin import GoogleImageCrawler
google_crawler = GoogleImageCrawler(storage={'root_dir': 'train'})
google_crawler.crawl(keyword='cat', max_num=1)
but I got nothing.
Environment: icrawler (0.6.6) |
Could any one let me know if it's still persistent on 0.6.7? |
It's still not working for me. |
Also same error! Broke this recently... tried different python versions but still no progress. Please help to fix this. |
So found the solution, at least for my case, and it has to do with this line here: icrawler/icrawler/builtin/google.py Line 155 in ad5633c
Problem is that it's just a basic substring check. A proper solution would be a refined regex, or perhaps even parsing the actual data structure into a native dict, or something.
My lazy solution was to add single quotes around the keys as part of the substring. Much less likely to incorrectly match some random other string that way. `if "'ds:0'" in txt or "'ds:1'" not in txt`
Again, this is not a robust solution, but a quick hack to get things working 😆 I just created a custom parser class that inherits from GoogleParser and adds that change, and it at least worked for me. So if you're looking for a quick fix, try that. I may try to get a more robust fix pushed as a PR, unless someone else wants to take a crack at it. |
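To illustrate why quoting the keys helps, here is a small standalone sketch. It assumes the original check is the same condition without the single quotes (inferred from the comment above, not copied from the library), and the sample script text is invented.

```python
def should_skip(txt, quoted):
    """Mimic the skip condition, with or without single quotes around the keys."""
    if quoted:
        return "'ds:0'" in txt or "'ds:1'" not in txt
    # Assumed shape of the original check: plain substrings, no quotes.
    return "ds:0" in txt or "ds:1" not in txt


# Invented script text: the key is 'ds:1', but an image URL happens to contain "ds:0".
sample = "AF_initDataCallback({key: 'ds:1', data: ['https://example.com/ads:0/cat.jpg']});"

print(should_skip(sample, quoted=False))  # unquoted check matches inside "ads:0"
print(should_skip(sample, quoted=True))   # quoted check only matches the key itself
```

The unquoted variant skips this script because `ds:0` appears inside an unrelated string; the quoted variant does not.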
Well hit the issue again with other searches, so guess that was just 'one' of the issues with it 😆 |
Seemed weird that the failures for this are sporadic for me. Not always failing on the same download. So did more debugging, and noticed that the response data for the failed parsing didn't have the same div elements as what was expected. Noticed some script content referencing XSRF, so that didn't seem like a great sign.
Added some logging to the parser. Noticed that successful requests have no cookie, but every failed request has a Cookie set in the request and response headers. I cleared all browser cookies for Google, and then it seemed to get further, but I guess the cookie gets regenerated at some point.
The feeder/parser/downloader are all sharing a common session, so maybe that has to do with it. I'm definitely not an expert on web requests/security stuff, but it seems that the cookie is getting generated at some point and the request is then flagged as XSRF, or something like that. So that's why the parser fails to extract the image URLs.
Tried some tricks I found online to try and block the session from using cookies, but that didn't work. Figured I'd post my findings though, as it seems that this is likely the main culprit (though the other issue I reported is still valid). |
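The logging described above was not posted, so the following is only a guess at what such a check could look like. The helper `describe_cookies` is hypothetical; it takes plain header mappings, so with a `requests` response one would pass `response.request.headers` and `response.headers`.

```python
def describe_cookies(request_headers, response_headers):
    """Summarize whether a cookie was sent and whether the server set one."""
    # Header names are case-insensitive, so normalize before looking them up.
    req = {name.lower() for name in request_headers}
    resp = {name.lower() for name in response_headers}
    sent = "cookie" in req
    received = "set-cookie" in resp
    return f"cookie sent={sent}, cookie set by server={received}"


print(describe_cookies({"User-Agent": "test"}, {"Content-Type": "text/html"}))
print(describe_cookies({"Cookie": "a=b"}, {"Set-Cookie": "a=b; Path=/"}))
```

Logging this line next to each "parsing result page" message would show whether failures line up with cookies being present, as reported above.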
So this isn't exactly a solution, but I decided to just switch and try out the BingImageCrawler, and it's been working perfectly fine. I think it boils down to Google more actively trying to block scraping, which leads to the issues I'd mentioned. Bing doesn't seem to care 😆 The GoogleImageCrawler probably needs a more sophisticated implementation to avoid getting blocked. Imho, I don't see any real difference between using Google or Bing, so I'd just recommend swapping over and giving the other a shot. Again, not a fix, but I'd just avoid the GoogleImageCrawler for now. |
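For anyone wanting to try the swap, a minimal sketch follows. It mirrors the GoogleImageCrawler snippet posted earlier in the thread with the class name changed; the wrapper function `crawl_with_bing` is just for illustration, and it needs icrawler installed and network access to actually download anything.

```python
def crawl_with_bing(keyword, root_dir, max_num):
    """Download up to max_num images for keyword into root_dir using Bing."""
    # Imported here so that merely defining the function does not require icrawler.
    from icrawler.builtin import BingImageCrawler

    bing_crawler = BingImageCrawler(storage={'root_dir': root_dir})
    bing_crawler.crawl(keyword=keyword, max_num=max_num)
```

Calling `crawl_with_bing('cat', 'train', 1)` corresponds to the Google example above.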
There is a log line right above the error "INFO - parser - parsing result page" - this is the URL where the error happened. Copy this URL into a browser and verify it works. And it looks like people have been examining the HTML but not saving/posting it, so it's almost impossible to diagnose. Using a random offset makes it definitely impossible. You could hack up
or:
That's not a solution, but it would help find the core problem. I found it when I was hacking on the search filters. I sent in garbage and got no results. Logging the URL and actual page, as above, will help. Sorry about any formatting, the code blocks seem inconsistent. |
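The original snippets did not survive in this thread, so here is one possible way to save the URL together with the returned page for later diagnosis. The function `dump_result_page` and the `debug_pages` directory are made-up names, and where to call it inside the parser is left open.

```python
import hashlib
from pathlib import Path


def dump_result_page(url, html, out_dir="debug_pages"):
    """Write the result page to a file, with the URL recorded on the first line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Derive a short, filesystem-safe file name from the URL.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest()[:12]
    path = out / f"{name}.html"
    path.write_text(f"<!-- {url} -->\n{html}", encoding="utf-8")
    return path
```

With a `requests` response this could be called as `dump_result_page(response.url, response.text)`, and the saved file can then be attached to a bug report.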
I am facing the same issue. Is there any stable solution to this problem?
It worked fine for 4 months and then suddenly this problem occurred. |
First, "stable" will never happen, because the search providers can change their results pages at any time. Second, when I have hit this error, I created a log file of the actual results. I just had to retry Google and it worked the second time. But the log didn't show anything interesting to fix, as far as I've had time to investigate. And finally, just last week Google changed their results page to assume JavaScript is enabled, and the results are not in the same format. Looks like they are actively fighting against projects like this one, but I haven't had time to really dig in. The interesting parts seem to be "encrypted" (their word for the property) somehow. There is a noscript tag with a redirect, but I haven't figured out a good way to insert that back in the queue. The current logic also expects script tags for each image, so it would have to be updated for the noscript page results too. So all scriptless crawlers are broken until fixed. |
So, stable fixes cannot be possible. Thank you for the reply. Appreciate your time. |
To be clear, the same problem exists for all crawlers. The results could
change at any time.
I use image downloader sometimes, there's a fork in my repos with some
fixes. I think it still works with Google, but it uses Chrome driver to
actually run a browser, for Google results. so I prefer icrawler most of
the time.
|
Please let me know if 0.6.8 fixes this issue~ |
For me the GoogleImageCrawler of icrawler doesn't work anymore. I updated the user agent in crawler.py since that seemed to work in the past, but no luck here. I tried it on both Python 3.8 and 3.9 (Apple silicon, but that shouldn't matter). Again, it worked in the past (like 3-6 months ago). Even the simple example
gives
Does anyone know how to fix this, or have the same issue in July 2022?