You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I ran into a bug in get_filename that loses some images. The core problem seems to be that the dl counter is not threadsafe. Should be a simple fix, adding a threadlock around counter.
Issue
Here's the observed behavior. I ran the crawler asking for 50 google images:
Storage dir ends up with 48 images. Some numbers are missing: there's no 000003.jpg and no 000007.jpg.
Diagnosing
So I modify ImageDownloader.get_filename to save all url-filename pairs in a threadsafe dict . That gives the dict shown below under Output. There are several name collisions: four 1's, two 4's, and no 3's or 7's. Other numbers seem to be present.
ImageDownloader.get_filename uses an index number to assign a filename to each url. Index is generated with this line:
file_idx_offset should be 0 as I didn't pass any value and storage dir is empty.
I didn't trace where self.fetched_num is generated but it appears to not be threadsafe. Looks like a thread race condition on the counter that generates fetched_num, with perhaps an initialization error (four threads, four 1's - seems like all threads start at 1).
Output
Here's the dict logged from modified get_filename.
I found where fetched_num is set, it does use a lock. I think the problem may be that a new lock instance is created for each threadpool object. See thread_pool.py :
So each threadpool instance has its own separate lock object. Downloader subclasses ThreadPool and inherits this lock. This could be the problem, if each thread is a separate ThreadPool object. Pool usually implies a manager that handles all the threads, but maybe that's not how the Pool works. Each downloader thread needs to access the same lock object for proper locking.
I haven't traced through how Downloader objects are made. Is there only one Downloader object per crawl () with multiple threads accessing that one Downloader? It doesn't seem so, because that would cause problems with other instance vars. But perhaps.
If one call to crawl () does spawn multiple Downloader objects (one for each thread) then self.lock needs to be a shared object. For instance (simple dumb way) - make a global var dl_lock = Lock () and have Downloader objects use that. If that's not fine grained enough, you can create a new lock () object in each crawl () method and pass it to Downloader objects in init. Or whatever.
I haven't traced through how thread spawning works in icrawler so maybe I'm wrong. But that looks like it could be the problem.
I ran into a bug in get_filename that loses some images. The core problem seems to be that the dl counter is not threadsafe. Should be a simple fix, adding a threadlock around counter.
Issue
Here's the observed behavior. I ran the crawler asking for 50 google images:
Storage dir ends up with 48 images. Some numbers are missing: there's no 000003.jpg and no 000007.jpg.
Diagnosing
So I modify ImageDownloader.get_filename to save all url-filename pairs in a threadsafe dict . That gives the dict shown below under Output. There are several name collisions: four 1's, two 4's, and no 3's or 7's. Other numbers seem to be present.
ImageDownloader.get_filename uses an index number to assign a filename to each url. Index is generated with this line:
file_idx_offset should be 0 as I didn't pass any value and storage dir is empty.
I didn't trace where
self.fetched_num
is generated but it appears to not be threadsafe. Looks like a thread race condition on the counter that generatesfetched_num
, with perhaps an initialization error (four threads, four 1's - seems like all threads start at 1).Output
Here's the dict logged from modified get_filename.
Hope this helps.
The text was updated successfully, but these errors were encountered: