Cannot complete download of large photosets - checking existing files requires connection which can fail + slow #63

Open
eggplantedd opened this issue Sep 26, 2022 · 5 comments

@eggplantedd

This is worth opening a separate issue for.

This is not about the flickr-api being slow, or even that it returns errors; it's about how the program handles them in my attempt to run it unattended.

The program will occasionally hit an HTTP 500 server error, which forces it to close. Thankfully, upon restart it will check whether a file has already been downloaded. The issue is how this check is handled.

It seems a working connection to the Flickr website is required to check files. I'm guessing it starts the download/connection process with the latest photoset, and only then checks whether each photo exists locally.

This leaves the program liable to fail at the checking stage due to API errors, making it very hard to reach the point where downloading resumes on large photosets, as well as being slow.

For example, I am at 17,402 images. I would guess it can check about 1.6 photos a second on average, which means roughly three hours without any API issues before it can resume downloading.

Using https://github.com/chebum/Supervisor to automatically restart the program, I have left it running all day only to find it stuck checking files.

eggplantedd changed the title from "Cannot complete download of large photosets - cannot finish checking existing files quick enough to continue downloading after API error" to "Cannot complete download of large photosets - checking existing files requires connection which can fail" on Sep 26, 2022
@eggplantedd (Author) commented Sep 26, 2022

So the solution: make the checking process for existing files an 'offline' job.

Make the program create a list of the folders and files (plus file sizes*) before starting the download, and check them off as it works down the list.

When the program is restarted after a 'fatal' API error, it can whizz through this list and resume right where it left off (see the sketch below).

This isn't just about knowing where to resume; the whole file-checking process is shifted offline. You could even redownload the list at program restart to validate the previous one. Even if this took a while, I would be happy to work this way.

I know files could be moved around in the meantime, but the same could be said about the current check process, so nothing changes there.

*I don't know if the flickr-api lets you check file size without downloading.
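
Roughly what I have in mind, as a minimal Python sketch. The manifest file name and helper functions are hypothetical, not part of flickr_download:

```python
import json
import os

MANIFEST = "download_manifest.json"  # hypothetical local checklist file


def load_manifest():
    """Load the locally stored checklist of completed downloads."""
    if os.path.exists(MANIFEST):
        with open(MANIFEST) as fh:
            return json.load(fh)
    return {}


def already_downloaded(manifest, set_name, file_name):
    """Offline check: no API call, just a lookup in the manifest."""
    return file_name in manifest.get(set_name, [])


def record_download(manifest, set_name, file_name):
    """Tick the file off the checklist and persist it immediately."""
    manifest.setdefault(set_name, []).append(file_name)
    with open(MANIFEST, "w") as fh:
        json.dump(manifest, fh)
```

On restart the program would load the manifest and skip anything already recorded, instead of asking the API about each photo again.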

eggplantedd changed the title from "Cannot complete download of large photosets - checking existing files requires connection which can fail" to "Cannot complete download of large photosets - checking existing files requires connection which can fail + slow" on Sep 26, 2022
@beaufour (Owner)

I've never focused much on download speed here, or on downloading massive sets, so it's not surprising that it isn't working super well :) Currently the logic is that it gets the list of all photosets and then starts from one end, downloading each set. The issues are: 1) the Flickr API is quite slow, and 2) we need to make an API call for each photo in the set to get its metadata.

I'm not fully understanding your logic change here. To build the list of files and folders, we'd have to call the Flickr API in the first place. And to know whether a photo has been downloaded, we need its metadata. At least, that's the current logic. Let me have a look at what the API returns and see if we can optimize something here.
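
To make the cost concrete, the flow described above looks roughly like this; the callables are placeholders, not the actual flickr_download code:

```python
import os
from typing import Callable, Iterable


def download_all(
    get_photosets: Callable[[], Iterable[str]],
    get_photos: Callable[[str], Iterable[str]],
    get_photo_info: Callable[[str], dict],
    download: Callable[[str, str], None],
    target_dir: str,
) -> None:
    """Illustrative shape of the current flow (placeholder callables).

    The per-photo metadata call is what ties the 'already downloaded'
    check to the network: the local path is only known after that call.
    """
    for photoset in get_photosets():            # one API call
        for photo in get_photos(photoset):      # paged API calls
            meta = get_photo_info(photo)        # one API call per photo
            path = os.path.join(target_dir, photoset, meta["name"])
            if os.path.exists(path):            # local check, but after the API call
                continue
            download(photo, path)
```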

@eggplantedd (Author) commented Sep 26, 2022

Yeah, I realized a couple of hours later that getting a list of files would rely on calling the API in the first place.

So I would think the next best thing is to use a headless web browser to scrape just the folder/photo names and photo links, relying only on the website being up, with no API to foul things up.

I would be comfortable with comparing folder + file name as the check for whether a photo has been downloaded (sketched below), but I'm wondering if some metadata could be scraped off the site as well, such as dimensions? You would have to tell me what's used in a check.

I was planning to do a similar thing if required: just feeding the program a list of album links I scraped.

And yes, totally understandable why it might not be working well here. Although getting to about 12,000 photos before really stumbling is pretty impressive!
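
A rough sketch of the name-based check I mean, assuming photos land in <download_root>/<album>/<title>.jpg, which may not match the tool's actual naming scheme:

```python
import os


def is_present_locally(download_root, album_name, photo_title, ext=".jpg"):
    """Offline existence check using only album and photo names.

    Assumes files are stored as <download_root>/<album_name>/<photo_title><ext>;
    the real naming scheme may differ.
    """
    path = os.path.join(download_root, album_name, photo_title + ext)
    return os.path.exists(path)


def matches_expected_size(path, expected_bytes):
    """Optional extra check if a size (or dimensions) can be scraped off the site."""
    return os.path.exists(path) and os.path.getsize(path) == expected_bytes
```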

@beaufour (Owner)

The slowness is a duplicate of #22, btw, so I'll keep track of it there. The intermittent API errors are a different challenge.

@eggplantedd (Author)

Yes, I'm wondering if the issue here is actually that a 500 error throws you out of the program rather than triggering a re-attempt. It doesn't happen on any particular photo.
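
Something like a retry wrapper around each API call, instead of exiting; a minimal sketch, assuming the 500s are transient (the wrapped call and error handling are illustrative, not the tool's actual code):

```python
import time


def with_retries(call, max_attempts=5, base_delay=2.0):
    """Retry a flaky API call with exponential backoff instead of exiting.

    `call` is any zero-argument function that performs one Flickr API request.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as err:  # ideally narrowed to the HTTP 500 error type
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1))
            print(f"API error ({err}), retrying in {delay:.0f}s "
                  f"(attempt {attempt}/{max_attempts})")
            time.sleep(delay)
```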

@beaufour beaufour added the bug label Mar 16, 2024