sqlite3 by default uses insufficiently unique entries that can lead to skipped file downloads #5746
Comments
Doesn't really address the default config issue since it's dependent on the extractor, but with kemono/coomer, the API returns file hashes (IIRC SHA256 is used), which can be used as a more specific way to ensure duplicates aren't downloaded, like so:
Yes, I originally thought it may affect more sites, but given that these two websites in question seem to be pretty unique in how they provide multiple revisions of a download target, perhaps addressing this issue is better done on a per-website/extractor basis. I am not sure if other websites use a similar revision system, but if they do, then a similar solution could be used for their extractor, depending on the info that can be extracted. And I agree, the file hash for this extractor would be a much better solution to ensure no shared entries in sqlite3. I'll make a new PR.
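To illustrate the hash-based idea discussed in the two comments above, here is a minimal Python sketch. It is not gallery-dl code and not the configuration from the earlier comment. The metadata field name `hash`, the format string `HASH_FORMAT`, the file contents, and the third filename are all assumptions made up for the example; the hashes are computed locally with SHA-256 only so the sketch is self-contained, whereas the thread says the API supplies them.

```python
import hashlib

# Hypothetical archive format keyed on the file hash instead of "num".
HASH_FORMAT = "{service}_{user}_{hash}"


def make_file(filename, content):
    """Build a metadata dict; 'hash' is an assumed field name."""
    return {
        "service": "fansly",
        "user": "307507152082186240",
        "num": 1,  # both revisions share num=1, as described in the issue
        "filename": filename,
        "hash": hashlib.sha256(content).hexdigest(),
    }


files = [
    make_file("577611769514565632_preview", b"preview bytes"),
    make_file("577608964548603905", b"full-size bytes"),
    # Hypothetical third entry with the same content as the second one.
    make_file("577608964548603905_copy", b"full-size bytes"),
]

seen = set()
kept = []
for meta in files:
    key = HASH_FORMAT.format_map(meta)
    if key in seen:
        continue  # identical content, so this one is a true duplicate
    seen.add(key)
    kept.append(meta["filename"])
```

With a key built this way, the two different revisions are both kept even though they share the same `num`, and only the entry with byte-identical content is skipped.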
With these you'll only download unique files. Use Kemono's API to sort the files afterwards. There is literally no point in trying to sort files while downloading from Kemono because of how they handle revisions.
Edit: After experimenting with it myself, it seems like you might just need:
or I misunderstood the issue (more likely)
@sntrenter From my own tests, using
Quick edit: Reminder that, to enable the sqlite3 archive, you also need to use
The Issue
I am not sure if this can be called a bug, but it is a setting that might not produce the intended results. An issue exists where, if extractor.*.skip is true, some files with multiple revisions, such as those from kemonoparty and coomerparty, will not be downloaded while extractor.*.archive-format is set to its default of "{service}_{user}_{id}_{num}", which can be checked using the -E option.
How To Reproduce
For the following URL,
we extract session info using:
If the conditions discussed above are set, the object entries with attributes "filename": "577611769514565632_preview" and "filename": "577608964548603905" will both get assigned "num": 1. Only one of these files will be downloaded; the second one in the download order will be skipped, because both files share the same num value and therefore produce an identical entry in the sqlite3 archive. Both files generate the following entry in spite of having different filenames: coomerpartyfansly_307507152082186240_577611859612409857_1.
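The collision described above can be sketched in a few lines of Python. This is a simplified stand-in, not gallery-dl's implementation: the `archive` table layout, the `simulate` helper, and the idea that the category name `coomerparty` is prepended to the formatted string are assumptions inferred from the entry quoted above.

```python
import sqlite3

DEFAULT_FORMAT = "{service}_{user}_{id}_{num}"
PREFIX = "coomerparty"  # assumed prefix, inferred from the quoted entry

# Metadata values taken from the issue text; both files share num=1.
files = [
    {"service": "fansly", "user": "307507152082186240",
     "id": "577611859612409857", "num": 1,
     "filename": "577611769514565632_preview"},
    {"service": "fansly", "user": "307507152082186240",
     "id": "577611859612409857", "num": 1,
     "filename": "577608964548603905"},
]


def archive_key(meta, fmt=DEFAULT_FORMAT, prefix=PREFIX):
    """Build the archive entry for one file's metadata."""
    return prefix + fmt.format_map(meta)


def simulate(files, fmt=DEFAULT_FORMAT):
    """Mimic 'skip if already archived' with an in-memory database."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE archive (entry TEXT PRIMARY KEY)")
    downloaded, skipped = [], []
    for meta in files:
        key = archive_key(meta, fmt)
        row = con.execute(
            "SELECT 1 FROM archive WHERE entry = ?", (key,)
        ).fetchone()
        if row:
            skipped.append(meta["filename"])  # key already present
        else:
            con.execute("INSERT INTO archive (entry) VALUES (?)", (key,))
            downloaded.append(meta["filename"])
    con.close()
    return downloaded, skipped


downloaded, skipped = simulate(files)
```

Because the filename never enters the key, the second file finds the first file's entry already in the table and is treated as a duplicate.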
Workarounds
1. Set extractor.*.archive-format to something more unique, like "{service}_{user}_{id}_{filename}_{extension}_{num}".
2. Set extractor.*.skip to false (which should have the same(?) effect as using the --no-skip option). This will download everything again, so it is not the best solution.
The first option will break legacy support for previous entries already in the sqlite3 archive. Still, if this behavior is indeed unintended, then the first option is probably the best solution.
Other URLs Also Affected