make_file_id: not correct for --overwrite #825

bertsky · 2022-03-22T22:33:31Z

If neither the input fileGrp nor the page ID is directly contained in the file ID of the input file, then make_file_id determines the index of that file in the input fileGrp, and calculates a new ID based on that index and the output fileGrp. But if that ID already exists, the index is incremented until a free one becomes available.

Alas,

- the list returned by OcrdMets.find_files is sorted not by page ID but by mets:file element order, so that index may deviate from the page ID.
- the increment strategy is wrong in combination with --overwrite, because it creates multiple files for the same page ID, only the first of which will be considered by follow-up processors (so nothing is actually overwritten, and the new files are ignored entirely).
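A minimal sketch of the behaviour described above, to make both problems reproducible. It is not the actual OCR-D code: the `File` class, `index_based_id`, `run_processor` and the `<fileGrp>_<4-digit index>` ID pattern are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class File:
    """Hypothetical stand-in for a mets:file entry (not the OcrdFile API)."""
    ID: str
    fileGrp: str
    pageId: str


def index_based_id(files: List[File], input_file: File, output_grp: str) -> str:
    """Index-plus-increment strategy as described above (simplified)."""
    # 1-based position of the input file within its fileGrp, in mets:file order.
    siblings = [f.ID for f in files if f.fileGrp == input_file.fileGrp]
    n = siblings.index(input_file.ID) + 1
    existing = {f.ID for f in files}
    # Increment until an unused ID is found (the ID pattern is an assumption).
    while f"{output_grp}_{n:04d}" in existing:
        n += 1
    return f"{output_grp}_{n:04d}"


def run_processor(files: List[File], input_grp: str, output_grp: str) -> List[File]:
    """Simulate one processor run: add one output file per input file."""
    created = []
    for f in [f for f in files if f.fileGrp == input_grp]:
        new = File(index_based_id(files, f, output_grp), output_grp, f.pageId)
        files.append(new)
        created.append(new)
    return created


# Input files listed out of page order, with IDs unrelated to fileGrp or page ID.
mets = [File("img_b", "IMG", "PHYS_0002"), File("img_a", "IMG", "PHYS_0001")]

first = run_processor(mets, "IMG", "OCR")
second = run_processor(mets, "IMG", "OCR")  # re-run, as with --overwrite

for f in first + second:
    print(f.ID, f.pageId)
```

In this toy setup the first run assigns `OCR_0001` to page `PHYS_0002` (index and page ID disagree), and the second run adds `OCR_0003` and `OCR_0004` instead of reusing the existing IDs, so each page ends up with two files in the output fileGrp.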

To address both issues, I suggest calculating the output ID based on the (output fileGrp and) page ID of the input file.
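As a rough illustration of this suggestion (not the code that was eventually merged), the ID could be a pure function of output fileGrp and page ID. The function name `page_based_id`, the `<fileGrp>_<pageId>` pattern and the dict standing in for the METS file list are all hypothetical.

```python
def page_based_id(output_grp: str, page_id: str) -> str:
    """Derive the output file ID from output fileGrp and page ID only."""
    return f"{output_grp}_{page_id}"


# Hypothetical stand-in for the METS: a mapping from file ID to content.
files_by_id = {}

# Two runs over the same pages, visited out of page order.
for run in range(2):
    for page_id in ["PHYS_0002", "PHYS_0001"]:
        file_id = page_based_id("OCR", page_id)
        # Same page and fileGrp give the same ID, so the entry is replaced.
        files_by_id[file_id] = f"run {run}, page {page_id}"

print(sorted(files_by_id.items()))
```

Because the ID depends neither on file order nor on which IDs already exist, a second run targets the same ID for the same page, which is what --overwrite needs.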

bertsky · 2022-05-13T14:36:48Z

@kba this is a very nasty bug that prevents --overwrite for me in a lot of cases (and makes repairing the METS afterwards very hard). RFC

kba · 2022-05-13T16:07:01Z

> RFC

Sorry, this one got lost in the shuffle.

> To address both issues, I suggest calculating the output ID based on the (output fileGrp and) page ID of the input file.

Agreed in general, though we need a fallback for the case that a file has no pageId. That should not happen in real-life data, but a pageId is not strictly required.
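One possible shape for such a fallback, as a sketch only: this thread does not say which fallback was chosen, so the function `make_id`, its parameters and the ID patterns below are assumptions. The fallback uses the input file's position in its fileGrp but, unlike the old behaviour, does not increment on collision, so repeated runs yield the same ID.

```python
from typing import List, Optional


def make_id(output_grp: str, page_id: Optional[str],
            input_id: str, sibling_ids: List[str]) -> str:
    """Prefer the page ID; fall back to the input file's index if there is none."""
    if page_id:
        return f"{output_grp}_{page_id}"
    # Fallback (assumption): 1-based position of the input file among the
    # IDs of its fileGrp, without any increment-until-free loop.
    n = sibling_ids.index(input_id) + 1
    return f"{output_grp}_{n:04d}"


siblings = ["img_b", "img_a"]
print(make_id("OCR", "PHYS_0001", "img_a", siblings))  # file with a pageId
print(make_id("OCR", None, "img_b", siblings))         # file without a pageId
```

The fallback still inherits the ordering weakness of the index, so it only makes sense for the rare files that really lack a pageId.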

I'll prepare a PR.

kba · 2022-10-25T14:02:09Z

Was hopefully finally fixed in #861 and released in 2.39.0.

kba self-assigned this May 13, 2022

kba added the bug label May 13, 2022

kba mentioned this issue May 13, 2022

make_file_id: use pageId for generating file ID if possible #860

Closed

bertsky mentioned this issue Jun 10, 2022

fix make_file_id #861

Merged

bertsky mentioned this issue Jun 28, 2022

allow re-running slub/ocrd_manager#26

Merged

kba mentioned this issue Aug 14, 2022

https://github.com/OCR-D/core/pull/861 continued bertsky/core#5

Closed

kba closed this as completed Oct 25, 2022
