Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache reuse for small subset of files #1064

Closed
chapmanjacobd opened this issue Sep 21, 2023 · 4 comments · Fixed by #1072
Closed

cache reuse for small subset of files #1064

chapmanjacobd opened this issue Sep 21, 2023 · 4 comments · Fixed by #1072

Comments

@chapmanjacobd
Copy link

chapmanjacobd commented Sep 21, 2023

I noticed that czkawka is taking >12 hours on a small directory so I popped open strace to investigate. It looks like despite only referencing one directory with no subdirectories nor hard/soft links inside of it, czkawka is checking that every file in the cache exists:

$ strace -p (pgrep -n czkawka)
statx(AT_FDCWD, "/home/xk/d/91_New_Art/blahblah.jpg", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=494475, ...}) = 0

When I override the HOME environment variable it only takes 2 seconds to run:

HOME=$(mktemp -d) czkawka image -d /home/xk/d/98_Me

Perhaps it would be better to only run statx on each path only if they are a subpath of any defined directories.

ls ~/.cache/czkawka/cache_similar_images_16_Gradient_Nearest_50.bin 
Permissions Size User Date Modified Name
.rw-r--r--@ 2.2G xk   17 Sep 20:31  /home/xk/.cache/czkawka/cache_similar_images_16_Gradient_Nearest_50.bin

Additionally, czkawka stats each file in the cache even if --image_delete_outdated_cache_entries is false in ~/.config/czkawka/czkawka_gui_config_4.txt

@qarmin
Copy link
Owner

qarmin commented Oct 5, 2023

I rechecked how exactly this works and I don't really see a big problem in algorithm(at least with disabled delete outdated result option).

  • Valid files are collected
  • Cache file is loaded with all entries
  • if option to remove outdated files is enabled, then cache files that not exists are removed
  • Returned is structure with all entries
  • Structure split into two parts:
    • already cached files - files that have same modification date, size and path like tested files
    • non cached files - files that changed or that were not cached
  • non cached files are checked and at after operation are connected with cached, and saved to file

Few things that can be improved(I'm working on it):

  • multithreading checking for outdated files
  • code is duplicated in several parts

Additionally, czkawka stats each file in the cache even if --image_delete_outdated_cache_entries is false in ~/.config/czkawka/czkawka_gui_config_4.txt

this looks like bug, but with current master branch I cannot reproduce problem, can you reproduce this with e.g. #1072 (should use new cache files format)?

@chapmanjacobd
Copy link
Author

chapmanjacobd commented Oct 5, 2023

I don't really see a big problem in algorithm

I guess my main point in the prior message is that the algorithm should probably pre-filter the list of files via directory prefix before doing IO. It is much, much faster for the CPU to do a substring check than for the CPU to ask IO about a specific file. It really makes a big difference, especially in my specific case where the cache has many files outside of the directory that was specified for a specific run.

To put it into context, the user specifies a list of directories -d /dir1/ -d /dir2/. If the files from the cache are neither in /dir1/ nor /dir2/ it seems unnecessary for czkawka to do IO lookups outside of those two directories.

But yes, I see you have spent a lot of time thinking about it in #1072 so I will try it out and see if I still experience the above issue.

Thank you for looking into this

@chapmanjacobd
Copy link
Author

chapmanjacobd commented Oct 5, 2023

After building from loading_saving e9765e1

cargo run --release --bin czkawka_cli image -d /home/xk/d/98_Me

it runs in only a few seconds so I think you have fixed the issue :)

@qarmin
Copy link
Owner

qarmin commented Oct 7, 2023

With #1064, loading from ssd ~100000 cached results(in duplicate files mode) with testing if all files exists takes less than second, so not sure what exactly is/was a problem(I'm talking about 12 hours of cache processing.

RUST_LOG=debug ./czkawka - should provide more info about timings, if issue will still persist.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants