[test] Help testing the upcoming release! #179
Comments
Hello, thank you very much for considering my request and implementing this. I see one positive and several drawbacks with this implementation. Pros:
Cons:
IMHO the best solution would be to load all indexes in memory and apply them in real time against scanned blog contents. I would prefer if every blog had a setting: [ ] Skip other tumblog content. However, I also see the appeal of your implementation and would not suggest that you remove it - it has great potential! I will be testing it over the next week and let you know if I encounter issues.
I can add this. It's not hard to do and is actually how it was before this release. It certainly consumes more memory, but that might not be much of an issue on today's systems, and of course it somewhat scales with the usage. Download-speed-wise it should certainly be the best option. Since the old releases also leaked some memory, I have no real comparison whatsoever. If you are going to test this, I'll add it later this week. I'd like someone to test it who has downloaded way more content than I have.
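A minimal sketch of the in-memory approach being discussed here, assuming each blog's index is a plain text list of already-downloaded file URLs (the file naming and format are illustrative, not TumblThree's actual layout):

```python
import os

def load_all_indexes(index_dir):
    """Read every per-blog index into one in-memory set of known URLs."""
    known = set()
    for name in os.listdir(index_dir):
        if not name.endswith("_files.txt"):   # hypothetical index naming
            continue
        with open(os.path.join(index_dir, name), encoding="utf-8") as fh:
            known.update(line.strip() for line in fh if line.strip())
    return known

def should_download(url, known_urls):
    """Apply the merged index in real time against newly scanned posts."""
    if url in known_urls:
        return False              # some other blog already fetched this file
    known_urls.add(url)           # remember it for the rest of the session
    return True
```

The memory cost is roughly one set entry per downloaded file, which is why it scales with usage but avoids any disk lookups during the crawl.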
Thank you! I will definitely commit to testing this with a setup of 100+ blogs. I will test both the OP "merged directories" method and the latest proposal, when added. I also started to like the "merged directories" functionality a LOT, so please really don't remove it.
Downloading images works fine, but the program does not download image metadata (in the images.txt format). Does this mean that this function has disappeared completely from the program? There are XML files now; do they replace this function? I looked inside the XML files: they contain more data, although for me there is not much more useful information than in the previous metadata files (the only useful new field I found was the note count). There is more unnecessary information (like avatar URLs and links to smaller files) and just a lot of redundant text.
It actually was a regression. I messed up the refactoring, which broke downloading of all text posts. I've brought it back in the -a release that I've uploaded in the entry post of this issue. I think dumping the crawler data only makes sense if you use the svc branch. I've uploaded a v1.0.7.40a release that you might want to test. The .json files the svc crawler generates look similar to this (100 posts).
I downloaded the v1.0.7.40a release; image metadata is downloaded normally, thanks! I noticed that if I enable the option to download files at a specific size (1280), the program downloads JSON files for a single file in at least two versions (250/400/540 and 1280). And if there are files from photosets, the program creates files with different counters for them, but with the same content inside. There are also other files with names consisting only of digits, and they also repeat the content of the tumblr_*.json files. In general, if I run a search by same size and content in Total Commander, it finds many duplicate files.
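For anyone who wants to reproduce that duplicate check without an external file manager, here is a generic sketch that groups files by identical size and then by content hash (not part of TumblThree, just an illustration of the "same size and content" search):

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    """Group files by (size, SHA-256 digest) to spot duplicated .json dumps."""
    by_size = defaultdict(list)
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    groups = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue                      # a unique size cannot be a duplicate
        for path in paths:
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            groups[(size, digest)].append(path)
    return {key: paths for key, paths in groups.items() if len(paths) > 1}
```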
Hello Johannes, I just downloaded the new version and found that I cannot add more than 50-60 blogs from the clipboard at a time if I copy a lot more addresses. The first blogs are added, and then the process stops completely and further blogs are not added. It seems to me that the reason is not any particular blog; the new version simply does not process more than 50-60 blogs. Here is a list of 3500 blogs; you can check whether the problem appears on other machines if you copy the entire list: the program adds only the first 50-60 blogs and nothing more.
Ah, I see. I can already imagine what the problem is. I'll fix it right away. Thanks for testing! Edit: Should be fixed.
Using 1.0.8.33 on Windows 10. I have a suspicion that the new version crashes on attempts to add 18+ blogs. I add blogs directly via the entry field in the app, and the application crashes. If I use the clipboard, it doesn't catch the blog. I tried the same with kitten blogs, and not a single one caused such an issue, which makes me suspect that this may be related to NSFW tags. Here is one fairly tame 18+ blog http://torikev.tumblr.com/ that crashes every time on an attempt to add it. EDIT: Seems to be a random thing, maybe not 18+ related. Many blogs work, others (such as the above) cause a crash every time.
I downloaded the new version 1.0.8.33d; the program now adds more blogs, but not all of them. Of 3500 blogs (most of them are definitely online) the program added only 747. When I removed the already added 747 blogs from the list of 3500 and copied the remaining, previously missed addresses, the program added only part of them (288 of the 2751 blogs it had previously missed). The third time, when I again removed the successfully added blogs from the list and copied it, the program once more added only part of them. So the program still does not fully cope with adding all the addresses. Update.
Ah, it looks like there were two issues with the new detection. The ones that aren't added in your 20-blog text file example are safe-mode blogs.
If you want to test, here is a list of 2608 blogs; all of their metadata was downloaded with the previous version a few days ago, so they are all definitely online. Do not use the list of 3500 blogs; it's my old list and some of those blogs are probably already offline.
With the new version, adding blogs has really improved! I see that all blogs from the list have been added. Very good! Thanks!
Thanks for the quick response! Using the 33e version, all blogs that crashed seem to resolve properly now and get caught by the clipboard service. No crashes so far.
Has anyone encountered a very long pause, without proceeding to the next blog, after a blog seems to be downloaded completely and has spent a long time evaluating >100,000 Tumblr posts?
Yeah, after a blog has finished downloading it stalls on that last download and won't move to the next blog.
It's too general. I briefly tested it: I added these 3 small blogs, queued them all, enabled all options (with/without load all databases), and they all finished downloading: http://wallpaperfx.tumblr.com/ So, what options did you enable? Can you point to a specific blog? I'll probably not code anything until next year, so if it always happens for you, I'll remove the latest releases and see if I can fix it next year.
I had the global database enabled, with images, linked and reblogged selected. Would it also be possible to have an option to load all blog databases into one single file? I delete the folder after a blog goes offline, because I use the folder names to generate a list to add blogs back to the program's blog list when they disappear due to crashes etc.
Hello. I had the global database enabled, images + videos, included reblogs, set a folder (shared with 1 other blog) and a minimum date of 20171101 with no maximum date. Forced rescan is on. After the minimum date was matched, the blog went on to collect the HTML of all blog posts without downloading them (probably because of the forced rescan). Then it stalled at 48500 of 48474 posts.
Well, I cannot reproduce any error, stall after a complete crawl, or hang after shutdown. I've added 400 and more blogs, downloaded 15000 and more posts, turned the global database on/off with otherwise default settings, shut down TumblThree during the crawl, added blogs during the crawl -- it behaves exactly like version v1.0.8.32 for me. There was someone on my website who apparently has the same issue, but I am unable to fix it if I cannot reproduce it. If someone here can provide a more detailed description, or is capable of programming and can debug the issue themselves, that would be wonderful.
I have image, video, reblog, linked and force rescan enabled, and I also have the global option selected. The blogs begin to scan like normal with no apparent issues but seemingly stall at the last file. No matter how long I leave it, the crawl is never completed: no error message, no crash, nothing; it just won't complete the blog crawl. If I press stop, nothing seems to happen; the blog just continues to be unsuccessfully downloaded. The thing is, when you re-released this new update it was downloading fine from what I could tell, but after shutting the program down for the night and restarting it the next day, it began giving me issues again. Deleting the app data and redownloading the program doesn't seem to change anything. I didn't change or do anything different in between; it just randomly decided it would no longer work.
Could you upload the database files (_files.tumblr + .tumblr or similar) for a blog that stalls, and your settings from C:\Users\AppData\Local\TumblThree\Settings?
I've also noticed that, on top of the previous symptoms, if I wipe the blog list, add blogs and crawl them directly without closing the program, everything downloads normally: it references the global database and blocks re-downloading. But if I close the program, reopen it and then crawl a blog, whether it's an old blog or a newly added one that I haven't done anything with, the crawler downloads all of the content even if the files are repeated, so it's ignoring the global database, and then it hangs on the final file in the blog and remains incomplete. However, I can remove the blog from the index but keep the _files files, re-add the blog to the crawl list and crawl without closing the program, and it will work properly and remember the files that were downloaded during the failed crawl. So if I just keep deleting the blog list, leaving the files and reloading them into the blog list, I am able to download things normally, but only if I constantly reload the blog list every time I open the program and sit through it loading the blogs back in.
Thanks for the description! Based on it, I have an idea what the problem might be: maybe the _files databases aren't completely loaded and restored into memory when you start crawling. As most of the code is async, including the database loading code, it runs concurrently and might not be finished when the GUI is up. Thus, if you have several hundred blogs with a lot of data and you start the crawl right after TumblThree is up, the method loading the blogs' _files databases might still be running, but the crawl already tries to access them, and that failure is something I'm catching somewhere else in the code. Maybe if I find some time in the next days I can quickly add a notification event that fires when everything is loaded. Otherwise you might want to check the Task Manager, watch TumblThree's disk I/O after starting it, and then start the crawl a bit delayed. If that fixes the problem, then that was the issue. I've tested your settings and the single blog file with 2-3 additionally added blogs, closed TumblThree, re-opened it, and it worked. Edit: That also means there should be no issue if you disable the "global database" option.
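The fix described above amounts to gating the crawl on an "everything is loaded" signal. TumblThree itself is a C# application; the sketch below only illustrates the idea in Python with asyncio, and all names are made up for the example:

```python
import asyncio

class DatabaseLoader:
    """Illustrative only: signal when every per-blog database is in memory."""

    def __init__(self, blogs):
        self.blogs = blogs
        self.loaded = asyncio.Event()   # the "everything is loaded" notification

    async def _load_one(self, blog):
        await asyncio.sleep(0)          # placeholder for reading a _files database
        blog["files"] = set()           # the restored index would go here

    async def load_all(self):
        await asyncio.gather(*(self._load_one(b) for b in self.blogs))
        self.loaded.set()               # fire once all databases are restored

    async def crawl(self, blog):
        await self.loaded.wait()        # don't touch the indexes before they exist
        # ... the actual crawl would start here ...
```

Starting the crawl only after the event is set removes the race between the GUI being ready and the databases still loading in the background.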
Hi all.
I've refactored the code a bit, and since the current release is rather stable I thought I'd publish a pre-production release via the "Issues" in the hope that some brave people are willing to test it. All feedback is welcome.
I've fixed some issues:
New features are:
I thought of this feature as some kind of trade-off for the feature suggestion to avoid downloading duplicates across different blogs (see [Feature Request] Avoiding Duplicates with Global Database #151).
Now you can download everything into a single folder, but the blog databases/indexes stay separate. I'm not sure if that's a good approach. There is still a back-mapping missing if you want to "restore" the files for each blog. Since the .tumblr databases are text files, one could script that (see the sketch below). This feature will certainly break the metadata files if you download multiple blogs at once, as each downloader will compete for access to the files, but it will most likely not break the dumping of the crawler data mentioned in the bullet point above.
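A rough sketch of such a back-mapping script, assuming each blog's *_files.tumblr database can be parsed into a list of file names. The "Links" key and the naming pattern are assumptions, so extract_file_names() would need to be adapted to the real on-disk format:

```python
import json
import os
import shutil

def extract_file_names(raw_text):
    """Placeholder parser -- the databases are text, so adapt this as needed."""
    try:
        data = json.loads(raw_text)
        return [os.path.basename(link) for link in data.get("Links", [])]  # assumed key
    except ValueError:
        return [line.strip() for line in raw_text.splitlines() if line.strip()]

def restore_blog_folders(merged_dir, database_dir, output_dir):
    """Copy files from the merged download folder back into per-blog folders."""
    for db_name in os.listdir(database_dir):
        if not db_name.endswith("_files.tumblr"):
            continue
        blog = db_name[: -len("_files.tumblr")]
        with open(os.path.join(database_dir, db_name), encoding="utf-8") as fh:
            file_names = extract_file_names(fh.read())

        target = os.path.join(output_dir, blog)
        os.makedirs(target, exist_ok=True)
        for name in file_names:
            source = os.path.join(merged_dir, name)
            if os.path.exists(source):
                # copy rather than move, so files shared by several blogs stay available
                shutil.copy2(source, os.path.join(target, name))
```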
The global database probably requires a lot of testing. I don't have any experience with how a disk-based approach like SQLite will behave and perform if multiple concurrent downloads want to access it, nor how it will handle several million entries and/or how large the memory usage will be. And a server-client based solution is certainly overkill.
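To give a feel for what the disk-based approach involves, here is a minimal SQLite sketch (plain Python sqlite3, not TumblThree's actual implementation): WAL mode lets concurrent readers work alongside a single writer, and a primary key on the URL keeps the check-and-insert cheap even with millions of rows:

```python
import sqlite3

def open_global_db(path):
    """One shared 'already downloaded' table for all blogs."""
    conn = sqlite3.connect(path, check_same_thread=False)
    conn.execute("PRAGMA journal_mode=WAL")     # readers don't block the writer
    conn.execute("CREATE TABLE IF NOT EXISTS downloaded (url TEXT PRIMARY KEY)")
    return conn

def try_claim(conn, url):
    """Return True if this URL is new; False if some blog already has it."""
    with conn:                                  # keeps the check-and-insert atomic
        cur = conn.execute(
            "INSERT OR IGNORE INTO downloaded (url) VALUES (?)", (url,)
        )
    return cur.rowcount == 1
```

Since the table lives on disk and is indexed by its primary key, memory usage stays roughly constant no matter how many entries accumulate.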
I'm not sure about this at all, so I might remove it again if it has no use. This is completely untested; it describes how it should work in theory. I cannot easily test it due to lack of data/time.
Release a:
You can enable it in the settings->global. You'll have to restart TumblThree afterwards.
Release b:
Release c:
Release e:
Release v1.0.8.34:
Thanks!