TypeError: 'NoneType' object is not iterable #107

Open
sgttwld opened this issue Jul 9, 2022 · 23 comments

@sgttwld

sgttwld commented Jul 9, 2022

For me, the GoogleImageCrawler of icrawler doesn't work anymore. I updated the user agent in crawler.py, since that seemed to work in the past, but no luck this time. I tried it on both Python 3.8 and 3.9 (Apple silicon, but that shouldn't matter). Again, it worked in the past (about 3-6 months ago).

Even the simple example

from icrawler.builtin import GoogleImageCrawler
searchterm = 'ANY SEARCHTERM'
google_crawler = GoogleImageCrawler(storage={'root_dir': 'test'})
google_crawler.crawl(keyword=searchterm, max_num=1)

gives

...lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File ".../python3.8/site-packages/icrawler-0.6.6-py3.8.egg/icrawler/parser.py", line 104, in worker_exec
    for task in self.parse(response, **kwargs):
TypeError: 'NoneType' object is not iterable

Does anyone know how to fix this, or have the same issue in July 2022?

@Kir-1

Kir-1 commented Jul 10, 2022

I have the same problem. I tried rolling back to an older version, but it does not help. I saw that this library has had this problem before.
I am executing the following code:

self.__search_word = 'cat'
self.__count = 10

try:
    google = GoogleImageCrawler(storage={"root_dir": path})
    filters = dict(
        size='>1024x768',
        date=((2020, 1, 1), (2021, 11, 30)))
    google.crawl(keyword=self.__search_word, max_num=self.__count, filters=filters, offset=rnd.randint(0, 500))
except Exception as _ex:
    # Pass the exception through a %s placeholder so logging can format it
    logger.error("Something happened when uploading images: %s", _ex)

The output is:
2022-07-10 10:36:53,907 - INFO - icrawler.crawler - start crawling...
2022-07-10 10:36:53,907 - INFO - icrawler.crawler - starting 1 feeder threads...
2022-07-10 10:36:53,914 - INFO - feeder - thread feeder-001 exit
2022-07-10 10:36:53,915 - INFO - icrawler.crawler - starting 1 parser threads...
2022-07-10 10:36:53,916 - INFO - icrawler.crawler - starting 1 downloader threads...
2022-07-10 10:36:54,117 - INFO - parser - parsing result page https://www.google.com/search?q=apex&ijn=1&start=150&tbs=isz%3Alt%2Cislt%3Axga%2Ccdr%3A1%2Ccd_min%3A01%2F01%2F2020%2Ccd_max%3A11%2F30%2F2021&tbm=isch
Traceback (most recent call last):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Administrator\PycharmProjects\BPG\venv37\lib\site-packages\icrawler\parser.py", line 104, in worker_exec
    for task in self.parse(response, **kwargs):
TypeError: 'NoneType' object is not iterable
python-BaseException

@Viachaslau85

Viachaslau85 commented Jul 27, 2022

I have the same problem. Works with Bing and Baidu, but does not work with Google. I keep getting the following errors:
2022-07-27 18:52:22,851 - INFO - icrawler.crawler - start crawling...
2022-07-27 18:52:22,852 - INFO - icrawler.crawler - starting 1 feeder threads...
2022-07-27 18:52:22,852 - INFO - icrawler.crawler - starting 1 parser threads...
2022-07-27 18:52:22,853 - INFO - icrawler.crawler - starting 4 downloader threads...
2022-07-27 18:52:23,323 - INFO - parser - parsing result page https://www.google.com/search?q=cat&ijn=0&start=0&tbs=isz%3Al%2Cic%3Aspecific%2Cisc%3Aorange%2Csur%3Afmc%2Ccdr%3A1%2Ccd_min%3A01%2F01%2F2017%2Ccd_max%3A11%2F30%2F2017&tbm=isch
Exception in thread parser-001:
Traceback (most recent call last):
  File "C:\Python310\lib\threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "C:\Python310\lib\threading.py", line 946, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Python310\lib\site-packages\icrawler\parser.py", line 104, in worker_exec
    for task in self.parse(response, **kwargs):
TypeError: 'NoneType' object is not iterable
2022-07-27 18:52:27,857 - INFO - downloader - no more download task for thread downloader-001
2022-07-27 18:52:27,858 - INFO - downloader - thread downloader-001 exit
2022-07-27 18:52:27,858 - INFO - downloader - no more download task for thread downloader-003
2022-07-27 18:52:27,858 - INFO - downloader - thread downloader-003 exit
2022-07-27 18:52:27,858 - INFO - downloader - no more download task for thread downloader-004
2022-07-27 18:52:27,858 - INFO - downloader - thread downloader-004 exit
2022-07-27 18:52:27,859 - INFO - downloader - no more download task for thread downloader-002
2022-07-27 18:52:27,859 - INFO - downloader - thread downloader-002 exit
2022-07-27 18:52:27,894 - INFO - icrawler.crawler - Crawling task done!

@feay1234

got the same problem.

@dravicenna

The same problem

ZhiyuanChen removed the "needs reproduce" label Aug 30, 2022
@philborman

The problem is in builtin/google.py. Replace the parse function around line 148 with this:

def parse(self, response):
    soup = BeautifulSoup(
        response.content.decode('utf-8', 'ignore'), 'lxml')
    images = soup.find_all(name='img')
    uris = []
    for img in images:
        if img.has_attr('src'):
            uris.append(img['src'])
    return [{'file_url': uri} for uri in uris]

@Viachaslau85

Viachaslau85 commented Dec 15, 2022

The parse function posted by @philborman above is much better, but it still doesn't work. It generates errors of the following type:

2022-12-15 09:27:02,994 - ERROR - downloader - Exception caught when downloading file //www.gstatic.com/images/branding/googlelogo/svg/googlelogo_clr_160x56px.svg, error: '', remaining retry times: 2
2022-12-15 09:27:02,996 - ERROR - downloader - Exception caught when downloading file //www.gstatic.com/images/branding/googlelogo/svg/googlelogo_clr_160x56px.svg, error: '', remaining retry times: 1

jfreyberg added a commit to jfreyberg/icrawler that referenced this issue Dec 19, 2022
@jfreyberg

I can confirm the comments from @philborman and @Viachaslau85. The provided code snippet seems to solve this particular issue (to pull the relevant changes from my fork: pip install --upgrade git+https://github.com/jfreyberg/icrawler@master), but it then produces the new download error.

@philborman

philborman commented Dec 19, 2022 via email

@jfreyberg

Thank you @philborman!
I had to modify it some more for the thing to work:

if 'images/branding/' not in img['src']:
    img_src = img['src']
    if not img_src.startswith('http'):
        img_src = 'https:' + img_src
    uris.append(img_src)

Somehow my URLs lacked the protocol.
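
A minimal, standalone sketch of the same clean-up, in case it helps anyone adapt it. It is not the code in the fork: the function name, the default page URL, and the extra skip for data: URIs (which an HTTP downloader cannot fetch) are assumptions made for illustration. It uses urljoin so that protocol-relative sources such as //www.gstatic.com/... inherit the scheme of the page they came from.

```python
from urllib.parse import urljoin

# Assumed base page; in the crawler this would be the result page URL.
PAGE_URL = "https://www.google.com/search?q=cat&tbm=isch"


def normalize_image_urls(srcs, page_url=PAGE_URL):
    """Turn raw <img src> values into absolute URLs, dropping unusable ones."""
    urls = []
    for src in srcs:
        if not src or src.startswith("data:"):
            # Assumption: inline data URIs are skipped, not downloaded.
            continue
        if "images/branding/" in src:
            # Same filter as in the comment above: skip the Google logo.
            continue
        # urljoin leaves absolute URLs alone and completes "//host/path".
        urls.append(urljoin(page_url, src))
    return urls
```

The returned list can then be wrapped as [{'file_url': uri} for uri in uris], as in the parse function above.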

I did not test this properly (I can only confirm it worked for Google Images), so I cannot create a pull request, but if anyone wants to use my repo to fix this, feel free: https://github.com/jfreyberg/icrawler

@masa8

masa8 commented Mar 29, 2023

Everything was fine until yesterday.
I got the same problem today.
I thought a few changes might get it to work, but they didn't.

At least the following code works.

import os

import requests
from bs4 import BeautifulSoup

def save_images(save_dir, keywords):
    os.makedirs(save_dir, exist_ok=True)
    for keyword in keywords:
        url = f"https://www.google.com/search?q={keyword}&tbm=isch"
        res = requests.get(url)
        soup = BeautifulSoup(res.text, "html.parser")
        img_tags = soup.find_all("img")
        for i, img in enumerate(img_tags):
            try:
                img_url = img["src"]
                img_res = requests.get(img_url)
            except (KeyError, requests.RequestException):
                # No src attribute, or a src that requests cannot fetch
                continue
            with open(f"{save_dir}/{keyword}{str(i).zfill(5)}.jpg", "wb") as f:
                f.write(img_res.content)

keywords = ["cat"]
save_dir = "train"
save_images(save_dir, keywords)

So I changed the parse method like this:

class GoogleParser(Parser):
    def parse(self, response):
        soup = BeautifulSoup(response.text, 'html.parser')
        image_tags = soup.find_all("img")
        uris = []
        for img in image_tags:
            try:
                img_url = img["src"]
                res = requests.get(img_url)  # Experiment only: check that the URL can be fetched
                uris.append(img_url)
            except (KeyError, requests.RequestException):
                continue
        print(len(uris))  # Experiment result: prints 0
        return [{'file_url': uri} for uri in uris]

and then ran the following code:

from icrawler.builtin import GoogleImageCrawler

google_crawler = GoogleImageCrawler(storage={'root_dir': 'train'})
google_crawler.crawl(keyword='cat', max_num=1)

but I got nothing.

Environment

icrawler(0.6.6)
Pillow (8.4.0)
six (1.16.0)
lxml(4.9.2)
beautifulsoup4 (4.11.2)
requests (2.27.1)
soupsieve(2.4)
charset-normalizer(2.0.12)
idna(3.4)
certifi (2022.12.7)
urllib3(1.26.15)

@ZhiyuanChen
Collaborator

Could anyone let me know if this still persists on 0.6.7?

@simonmcnair

It's still not working for me.

@megayounus786

Also same error!
Exception in thread parser-001:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/younus/.local/lib/python3.10/site-packages/icrawler/parser.py", line 94, in worker_exec
    for task in self.parse(response, **kwargs):
TypeError: 'NoneType' object is not iterable
2023-07-25 09:04:15,067 - INFO - downloader - no more download task for thread downloader-001
2023-07-25 09:04:15,069 - INFO - downloader - thread downloader-001 exit
2023-07-25 09:04:15,073 - INFO - icrawler.crawler - Crawling task done!

This broke recently. I tried different Python versions but made no progress. Please help fix this.

@bretdavi

bretdavi commented Dec 5, 2023

So found the solution, at least for my case, and it has to do with this line here:

if "ds:0" in txt or "ds:1" not in txt:

ds:0 and ds:1 are keys in the AF_initDataCallback data structure, and that's supposed to match key ds:1 and ignore ones that have the ds:0 key.

Problem is that it's just a basic <sub_str> in <str> check, so if an unrelated ds:0 substring appears in that massive block of text, it will skip it, and so it doesn't get parsed. Sure enough, did some debugging, and there was some unrelated string in the block that matched, breaking the parse logic.

Proper solution would be refined regex or perhaps even parsing the actual data structure into a native dict, or something. My lazy solution was to add single quotes around the keys as part of the substring. Much less likely to incorrectly match random other string that way.

if "'ds:0'" in txt or "'ds:1'" not in txt:

Again, this is not a robust solution, but a quick hack to get things working 😆

I just created a custom parser class that inherits from the GoogleParser that added that change, and at least worked for me.

So if you're looking for a quick fix, try that. May try to get a more robust fix pushed as a PR, or if someone else wants to take a crack at it.
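
A minimal sketch of the "refined regex" idea, for comparison with the substring check. It assumes the script text has the shape AF_initDataCallback({key: 'ds:1', ...}), with the key in single quotes as described above. The helper names and the sample string are made up for illustration and are not part of icrawler.

```python
import re

# Assumed shape of the script text: AF_initDataCallback({key: 'ds:1', ...});
KEY_RE = re.compile(r"key\s*:\s*'(ds:\d+)'")


def callback_key(txt):
    """Return the 'ds:N' key declared by the callback, or None if absent."""
    match = KEY_RE.search(txt)
    return match.group(1) if match else None


def is_image_data_block(txt):
    """True only when the block's own key is ds:1, whatever the payload contains."""
    return callback_key(txt) == "ds:1"


# Made-up ds:1 block whose payload happens to contain the substring "ds:0".
sample = "AF_initDataCallback({key: 'ds:1', hash: '2', data: [\"ads:0 banner\"]});"

# The original substring test would skip this block; the key-based test keeps it.
naive_skip = "ds:0" in sample or "ds:1" not in sample
print(naive_skip, is_image_data_block(sample))
```

Matching on the declared key rather than on any occurrence of the substring means unrelated text in the payload cannot cause the block to be skipped.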

@bretdavi

bretdavi commented Dec 6, 2023

Well, I hit the issue again with other searches, so I guess the substring check described in my previous comment was just one of the issues with it 😆

@bretdavi

bretdavi commented Dec 6, 2023

Seemed weird that the failures for this are sporadic for me. Not always failing on the same download. So did more debugging, and noticed that the response data for the failed parsing didn't have the same div elements as what was expected. Noticed some script content referencing XSRF, so that didn't seem like a great sign.

Added some logging to the parser worker_exec function to dump the response and request headers to a log file to see if that showed anything, and sure enough it did.

Noticed that successful requests have no cookie, but every failed request that occurs has a Cookie set in the request and response headers. I cleared all browser cookies for google, and then it seemed to get further, but guess it gets regenerated at some point. The feeder/parser/downloader are all sharing a common session, so maybe that has to do with it.

I'm definitely not an expert on web requests/security stuff, but seems that the cookie is getting generated at some point and it's then flagging the request as XSRF, or something like that. So that's why the parser fails to extract the image urls. Tried some tricks I found online to try and block the session from using cookies, but that didn't work.

Figured I'd post my findings though, as seems that this is likely the main culprit (though the other issue I reported is still valid)
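
For anyone who wants to repeat this check, a rough sketch of the kind of logging described above. It is not the code that was actually used: the function and logger names are hypothetical, and it only assumes a response object shaped like requests.Response, with .headers and .request.headers as mappings. The SimpleNamespace object is a stand-in so the sketch runs without a network call.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("cookie-debug")  # hypothetical logger name


def cookie_summary(response):
    """Return (sent, received): the Cookie request header and the Set-Cookie response header."""
    sent = response.request.headers.get("Cookie")
    received = response.headers.get("Set-Cookie")
    return sent, received


def log_cookie_state(response, url):
    """Log whether cookies were involved in this request; return True if they were."""
    sent, received = cookie_summary(response)
    logger.info("url=%s cookie_sent=%s set_cookie=%s",
                url, sent is not None, received is not None)
    return sent is not None or received is not None


# Stand-in for a real response, for demonstration only.
fake = SimpleNamespace(headers={"Set-Cookie": "X=1"},
                       request=SimpleNamespace(headers={}))
print(log_cookie_state(fake, "https://www.google.com/search?q=cat&tbm=isch"))
```

Comparing this flag against whether parsing succeeded for each result page would show whether the cookie correlation holds in other setups.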

@bretdavi

bretdavi commented Dec 7, 2023

So this isn't exactly a solution, but I decided to just switch and try out the BingImageCrawler, and it's been working perfectly fine.

I think it boils down to google more actively trying to block scraping, which leads to the issues I'd mentioned. Bing doesn't seem to care 😆 Probably need some more sophisticated implementation for the GoogleImageCrawler to avoid getting blocked.

Imho, don't see any real difference between using Google or Bing, so I'd just recommend swapping over and giving the other a shot. Again, not a fix, but I'd just avoid the GoogleImageCrawler for now

@Patty-OFurniture

There is a log line right above the error "INFO - parser - parsing result page" - this is the URL where the error happened. Copy this URL into a browser and verify it works. And it looks like people have been examining the HTML but not saving/posting it, so it's almost impossible to diagnose. Using a random offset makes it definitely impossible.

You could hack up parser.py:

self.logger.info(f"parsing result page {url}")
self.logger.debug(response.content)

or:

task_list = self.parse(response, **kwargs)
if not task_list:
    self.logger.debug("self.parse() returned no tasks")
    # The file is opened in binary mode, so the separator must be bytes too
    with open("task_list_error.log", 'ab') as f:
        f.write(response.content)
        f.write(b"\n")

That's not a solution, but it would help find the core problem. I found it when I was hacking on the search filters. I sent in garbage and got no results. Logging the URL and actual page, as above, will help.

Sorry about any formatting, the code blocks seem inconsistent.
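
The same idea as a standalone sketch that runs outside icrawler. The helper name, the logger name, and the stub response are hypothetical. It treats a None or empty result from parse() as "no tasks", which avoids the TypeError, and saves the raw page for later inspection.

```python
import logging
import os
import tempfile

logger = logging.getLogger("parser-debug")  # hypothetical logger name


def tasks_or_dump(parse, response, dump_path, **kwargs):
    """Return parse()'s tasks as a list; on None or empty, save the page and return []."""
    tasks = parse(response, **kwargs)
    if tasks:
        return list(tasks)
    logger.debug("parse() returned no tasks; saving page to %s", dump_path)
    with open(dump_path, "ab") as f:
        f.write(response.content)
        f.write(b"\n")
    return []


class StubResponse:
    """Stand-in for a response object; only .content is used here."""
    content = b"<html>no image data here</html>"


if __name__ == "__main__":
    dump = os.path.join(tempfile.mkdtemp(), "task_list_error.log")
    print(tasks_or_dump(lambda response: None, StubResponse(), dump))
```

Iterating over the returned list instead of the raw parse() result keeps the parser thread alive, and the dump file holds the page that needs to be diagnosed.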

@axy1976

axy1976 commented Apr 8, 2024

I am facing the same issue. Is there any stable solution to this problem?

2024-04-08 10:48:52,455 - INFO - icrawler.crawler - start crawling...
2024-04-08 10:48:52,455 - INFO - icrawler.crawler - starting 1 feeder threads...
2024-04-08 10:48:52,455 - INFO - feeder - thread feeder-001 exit
2024-04-08 10:48:52,455 - INFO - icrawler.crawler - starting 2 parser threads...
2024-04-08 10:48:52,456 - INFO - icrawler.crawler - starting 4 downloader threads...
2024-04-08 10:48:54,083 - INFO - parser - parsing result page https://www.google.com/search?q=Computer+Networks+Advanced&ijn=0&start=0&tbs=sur%3Afmc&tbm=isch
Exception in thread parser-001:
Traceback (most recent call last):
  File "/Users/axyzcodes/.pyenv/versions/3.9.18/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/Users/axyzcodes/.pyenv/versions/3.9.18/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/axyzcodes/.pyenv/versions/3.9.18/lib/python3.9/site-packages/icrawler/parser.py", line 94, in worker_exec
    for task in self.parse(response, **kwargs):
TypeError: 'NoneType' object is not iterable
2024-04-08 10:48:54,456 - INFO - parser - no more page urls for thread parser-002 to parse
2024-04-08 10:48:54,456 - INFO - parser - thread parser-002 exit
2024-04-08 10:48:57,462 - INFO - downloader - no more download task for thread downloader-001
2024-04-08 10:48:57,463 - INFO - downloader - no more download task for thread downloader-002
2024-04-08 10:48:57,463 - INFO - downloader - no more download task for thread downloader-003
2024-04-08 10:48:57,463 - INFO - downloader - thread downloader-003 exit
2024-04-08 10:48:57,463 - INFO - downloader - no more download task for thread downloader-004
2024-04-08 10:48:57,464 - INFO - downloader - thread downloader-004 exit
2024-04-08 10:48:57,463 - INFO - downloader - thread downloader-002 exit
2024-04-08 10:48:57,463 - INFO - downloader - thread downloader-001 exit
2024-04-08 10:48:57,478 - INFO - icrawler.crawler - Crawling task done!

It worked fine for 4 months and then this problem suddenly occurred.

@Patty-OFurniture

First, "stable" will never happen because the search providers can change their results pages at any time.

Second, when I have hit this error, I created a log file of the actual results. I just had to re-try google and it worked the second time. But the log didn't show anything interesting to be fixed, as far as I've had time to investigate.

And finally, google just last week changed their results page to assume javascript is enabled. And the results are not in the same format. Looks like they are actively fighting against projects like this one but I haven't had time to really dig in. The interesting parts seem to be "encrypted" (their word for the property) somehow. There is a noscript tag with a redirect, but I haven't figured out a good way to insert that back in the queue. But the current logic expects script tags for each image, so it would also have to be updated for the noscript page results.

So all scriptless crawlers are broken until fixed.
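
A rough sketch of how the noscript redirect might be pulled out of a page so it could be queued again. The markup shape is an assumption (a meta refresh tag inside noscript), since the actual page was not posted here. The class and function names are hypothetical, and nothing here is part of icrawler.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class NoscriptRefreshFinder(HTMLParser):
    """Collect the target of the first <meta http-equiv="refresh"> inside <noscript>."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # greater than 0 while inside a <noscript> element
        self.target = None   # raw redirect target, if one was found

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self._depth += 1
        elif tag == "meta" and self._depth and self.target is None:
            attr = dict(attrs)
            if (attr.get("http-equiv") or "").lower() == "refresh":
                # Assumed content format: "0;url=/search?..."
                match = re.search(r"url\s*=\s*(.+)", attr.get("content") or "", re.I)
                if match:
                    self.target = match.group(1).strip(" '\"")

    def handle_endtag(self, tag):
        if tag == "noscript" and self._depth:
            self._depth -= 1


def noscript_redirect(html, page_url):
    """Return the absolute redirect URL from a noscript meta refresh, or None."""
    finder = NoscriptRefreshFinder()
    finder.feed(html)
    return urljoin(page_url, finder.target) if finder.target else None
```

This only finds the URL. As noted above, the parsing logic would still need to be updated for whatever format the noscript results page uses.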

@axy1976

axy1976 commented Apr 8, 2024

So a stable fix isn't possible.

Thank you for the reply. Appreciate your time.

@Patty-OFurniture

Patty-OFurniture commented Apr 8, 2024 via email

ZhiyuanChen added a commit to ZhiyuanChen/icrawler that referenced this issue May 15, 2024
ZhiyuanChen added a commit to ZhiyuanChen/icrawler that referenced this issue May 15, 2024
@ZhiyuanChen
Collaborator

Please let me know if 0.6.8 fixes this issue~
