[Feature Request] Avoiding Duplicates with Global Database #151
Comments
Good idea, +1. I'd also like to see a feature that scans local content (images, videos) and then turns to Tumblr.com to look for the _raw and/or _1280 files, i.e. it would try to locate raw or HD versions of what is already on disk.
I would donate to the team if my idea and yours are considered for development. I have a huge amount (2.5 TB) of content downloaded with another Tumblr downloader. If I could import it into TumblThree, redownload everything from 1280 to raw, wipe duplicates across all blogs, and then move it from one drive to a bigger 8 TB drive without redownloading the whole insane structure again, that would totally get my money. Just saying.
That would be so good! Hmm, I wonder if the Google Reverse Image Search and Yandex Reverse Image Search APIs could be featured... Perhaps another project :)
So to summarize:
How things work right now: each blog has a simple database (just a class stored in JSON format) that contains all the image/video/audio file names, or the post IDs for text posts, once they have been downloaded. When TumblThree downloads a post for a blog, it first checks this database. If the post is already there, it skips the download; otherwise it requests the size of the file to download from the webserver. Then it checks whether a file with that name already exists on disk. If there is an image/video/audio file in the blog's download folder but not in the database and nothing can be resumed, TumblThree just adds the file to its database as completed. If it can be resumed because the download was stopped, the file is resumed and then acknowledged as completed and stored in the database. If there is no file at all, it is downloaded completely.

In principle we could either check all blogs' databases instead, or add a global database. The global database has the advantage that you could remove blogs from TumblThree and the files would still not be downloaded again. Maybe we should use a real database rather than a simple class stored in memory, I don't know. Some performance testing would be necessary, since that database is likely to contain several million entries.
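For readers skimming the thread, here is a minimal, hypothetical sketch of the per-blog check flow described above. The class and method names are illustrative only and are not TumblThree's actual code; the size comparison used to decide between "register as completed" and "resume" is an assumption.

```csharp
// Hypothetical sketch of the described per-blog check flow; not TumblThree's real classes.
using System.Collections.Generic;
using System.IO;

public class BlogFilesDatabase
{
    // Simple in-memory set, persisted as JSON, one per blog (as described above).
    private readonly HashSet<string> downloadedFiles = new HashSet<string>();

    public bool Contains(string fileName) => downloadedFiles.Contains(fileName);
    public void Add(string fileName) => downloadedFiles.Add(fileName);
}

public static class DownloadDecision
{
    // Returns true if the file should be downloaded (or resumed), false if it can be skipped.
    public static bool ShouldDownload(BlogFilesDatabase db, string downloadFolder,
                                      string fileName, long remoteSize)
    {
        if (db.Contains(fileName))
            return false;                  // post already acknowledged in the blog database: skip

        string localPath = Path.Combine(downloadFolder, fileName);
        if (File.Exists(localPath))
        {
            long localSize = new FileInfo(localPath).Length;
            if (localSize >= remoteSize)
            {
                db.Add(fileName);          // file on disk but missing from the database: register as completed
                return false;
            }
            return true;                   // partial file: resume the download
        }
        return true;                       // nothing on disk: full download
    }
}
```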
Oh, I forgot to mention: I'm not a huge fan of a (hard) link approach. I'm a Linux user myself, and you encounter links more frequently in a POSIX OS; once you move anything out of your file structure, you end up with a mess. Since we're talking about several hundreds of thousands of files (and possibly links), I honestly think this isn't the right thing to do. People move things when space runs out, hard disk capacities keep getting larger, and so on, so this will surely result in problems. I'm almost certain about it.
Overall it's a nice idea and has been requested several times already. It should be more or less straightforward to add, since the code is rather clean and I can already imagine where you could easily extend it to add this feature.
Hello. First, I apologize for cross-posting a week ago; I was just asking whether the database you were talking about was the same one that could facilitate my request. I was out of bounds and maybe came across as pushy.

Now, on your explanation: it makes sense to add a global database and make "download duplicates" an option, on or off by default depending on how you feel about it. Or you could build the global database on each run by collecting the local databases. Each approach has its pros. A persistent global database is easier to read and write, I guess, but it is monolithic and folders can't be moved in and out easily. A runtime-generated database takes a while to build (i.e. reading 1000+ text files, as in my case) but allows folders and indexes to be moved around, and added to or removed from storage easily. I am in favor of the second approach.

Hardlinks are something the "other" Tumblr downloader I currently use does, and they have positives and negatives. A positive is that when the user opens a folder, all images appear there even though they don't eat extra space. The negatives are plenty, from bloating the archive when moving files from one storage to another (hardlinks, I think, convert to regular files) to slowing Explorer when it reads a folder of 10,000+ files. If you don't feel like adding hardlinks, that's totally OK as long as we can avoid duplicates.
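As an illustration of the second approach (a global index rebuilt on each run by collecting the per-blog databases), here is a minimal sketch. It assumes each blog folder contains a `*_files` JSON file holding a plain array of already-downloaded file names; the file naming and JSON shape are assumptions, not TumblThree's actual on-disk format.

```csharp
// Minimal sketch of a runtime-merged global index, built from per-blog JSON databases.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

public static class GlobalIndex
{
    public static HashSet<string> Build(string downloadRoot)
    {
        var index = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        // Assumed naming convention: one "<blog>_files" JSON file per blog folder.
        foreach (string dbFile in Directory.EnumerateFiles(downloadRoot, "*_files", SearchOption.AllDirectories))
        {
            // Assumed shape: a JSON array of file names.
            var names = JsonSerializer.Deserialize<List<string>>(File.ReadAllText(dbFile));
            if (names == null) continue;

            foreach (string name in names)
                index.Add(name);
        }
        return index;   // consult this set before queuing any download
    }
}
```

The trade-off the comment describes is visible here: the merge costs one pass over all local databases at startup, but nothing global has to be kept in sync when folders are added, moved, or removed.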
I like the idea of a global database; it would be very good if the program didn't download files with the same filename that have already been downloaded from other blogs. If the global database is implemented in the program, I would also like to see the following two options:
This would make things on my end a billion times better. I have over 2k blogs and an ungodly amount of duplicates; my duplicate file finder is running almost constantly just to remove them all.
- Implements an optional global file existence check before downloading (#151). The implementation loads all _files databases into memory and checks if any of them already contains the file. - Improves inlined photo and video detection in the hidden tumblr crawler.
- Implements an optional global file existence check before downloading (#151). The implementation loads all _files databases into memory and checks if any of them already contains the file. - Updates Chinese translation.
So, after a few weeks: does comparing the filenames across all loaded tumblrs actually help, or are there still plenty of duplicates with different names, or does it work at all?
There are still a few duplicates that slip by, but it's a lot better. I do still have to delete the blog info in the index and re-add the blogs every time I close the program, though. I've tried leaving the program on overnight to load everything, but it still locks up when I download a blog.
Maybe it just seems to lock up, but it's actually doing comparisons. Have you enabled the "force recheck" option? With the global database you lose some parallelism, since I'm using a global lock for the file checking; otherwise two different tasks (blogs) could happen to download the same file at the same moment. The crawler always adds every post from the whole blog; it doesn't do any preliminary existence checking. So if the blog has 250,000 posts and you've already downloaded 240,000, you'll have to wait until the downloader has compared the 240,000 posts you've already downloaded against all other posts in TumblThree. If you disable the "force recheck" option, the crawler stops at the first post (based on the post ID) that was already downloaded. Since it starts with the newest ones, only newly published posts end up in the "comparison" queue, which is what you want in this case. Either way, adding a "skipping already downloaded file .." progress update is only a few lines of code. I'll add that. Thanks!
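For clarity, here is a sketch of the global-lock idea mentioned above: one lock serializes the "has anyone already downloaded this file?" check so that two blog tasks cannot claim the same file at the same moment. The class and method names are hypothetical, not TumblThree's actual implementation.

```csharp
// Sketch of a globally locked existence check shared by all blog download tasks.
using System.Collections.Generic;

public class GlobalFileRegistry
{
    private readonly object gate = new object();
    private readonly HashSet<string> knownFiles = new HashSet<string>();

    // Returns true if the caller may download the file; false if it is already
    // present in any loaded blog database or has been claimed by another task.
    public bool TryClaim(string fileName)
    {
        lock (gate)                    // global lock: costs parallelism, but avoids duplicate downloads
        {
            if (knownFiles.Contains(fileName))
                return false;
            knownFiles.Add(fileName);  // claim it so a concurrent task will skip it
            return true;
        }
    }
}
```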
OK, I've used TumblThree myself for two weeks now, and there are still duplicates with different tumblr hashes. That's unfortunate, because the current implementation can filter duplicates before the actual download, but since the tumblr hash can differ for the same file, we need to hash the files ourselves if we want to detect those duplicates. That's not a big issue; hashing is a common task and already implemented in .NET, so it's probably just a couple of lines. We can then add the hashes to the per-blog databases and/or generate a global database with just the hashes. The downside is, as already mentioned, that we need to download the file (twice), then generate the hash, then delete the duplicate. I'll try to implement this once I have some spare time, probably sometime in March.
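A minimal sketch of that post-download, content-hash based duplicate check, using the hashing that ships with .NET (SHA-256 here). The registry and the "delete the duplicate after download" policy follow the comment above but are assumptions, not TumblThree's actual code.

```csharp
// Sketch: hash a freshly downloaded file and delete it if the content was seen before.
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public static class DuplicateCheck
{
    // Hashes of all files kept so far; would be persisted in the (global) database.
    private static readonly HashSet<string> knownHashes = new HashSet<string>();

    public static string HashFile(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
            return BitConverter.ToString(sha.ComputeHash(stream)).Replace("-", "");
    }

    // Returns true if the downloaded file was a duplicate and has been removed.
    public static bool RemoveIfDuplicate(string downloadedPath)
    {
        string hash = HashFile(downloadedPath);
        if (!knownHashes.Add(hash))      // Add returns false if the hash was already known
        {
            File.Delete(downloadedPath); // same content under a different tumblr hash/name
            return true;
        }
        return false;                    // new content: keep it and remember its hash
    }
}
```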
Hello,
I am currently using another GitHub downloader project (I won't name it to avoid a conflict of interest; it is also almost abandoned at this point) that has a very important feature: it spends time indexing all downloaded content first, then downloads only unique files, and when it finds a duplicate it creates a hardlink.
I don't know whether this app already has this feature (please excuse me if the answer is yes), but if not, it would be a huge boon for people who download tons of tumblr blogs that share similar content.
I am currently caching ~1000 tumblr blogs on a 4 TB drive. I would consider switching to your app if this feature were available, and would pay for it if needed.
Thanks.