If neither the input fileGrp nor the page ID is directly contained in the file ID of the input file, then `make_file_id` determines the index of that file in the input fileGrp, and calculates a new ID based on that index and the output fileGrp. But if that ID already exists, the index is incremented until a free one becomes available.
Alas,

- the list returned by `OcrdMets.find_files` is not sorted by page ID (but by `mets:file` element order), so that index may deviate from the page ID.
- the increment strategy is wrong in combination with `--overwrite`, because it will create multiple files for the same page ID, only the first of which will be considered by follow-up processors (so nothing is actually overwritten, and the new files will be ignored entirely).
To address both issues, I suggest calculating the output ID based on the (output fileGrp and) page ID of the input file.
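A minimal sketch of the suggestion, not the actual `make_file_id` implementation: `InputFile` and `page_based_file_id` are hypothetical names, and the sketch assumes every input file carries a page ID. Because the result depends only on the output fileGrp and the page ID, it is unaffected by `mets:file` order, and a re-run with `--overwrite` yields the same ID instead of a new one.

```python
from dataclasses import dataclass


@dataclass
class InputFile:
    """Hypothetical stand-in for an input file entry (not the OCR-D class)."""
    ID: str
    pageId: str


def page_based_file_id(input_file: InputFile, output_file_grp: str) -> str:
    # Derive the ID only from the output fileGrp and the page ID,
    # never from the file's position in the input fileGrp.
    return f"{output_file_grp}_{input_file.pageId}"


# Files listed in mets:file element order, which here differs from page order.
files = [InputFile("img_b", "PHYS_0002"), InputFile("img_a", "PHYS_0001")]
ids = [page_based_file_id(f, "OCR-D-BIN") for f in files]
print(ids)
```

Calling the helper twice for the same page returns the same ID, so an overwriting run would target the existing file rather than creating a second one for that page.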
> To address both issues, I suggest calculating the output ID based on the (output fileGrp and) page ID of the input file.
Agreed in general, though we need a fallback for the case that a file has no pageId. That should not happen in real-life data, but a pageId is not strictly required.
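One possible shape for such a fallback, purely as an assumption since the thread does not specify one: use the page ID when present, otherwise fall back to the file's position in the input fileGrp. The function name, its parameters, and the zero-padded index format are all hypothetical.

```python
from typing import Optional, Sequence


def file_id_with_fallback(
    page_id: Optional[str],
    input_id: str,
    input_grp_ids: Sequence[str],
    output_file_grp: str,
) -> str:
    """Hypothetical helper: prefer the page ID, else use the fileGrp index."""
    if page_id:
        return f"{output_file_grp}_{page_id}"
    # Assumed fallback: 1-based position of the file within its input fileGrp.
    index = list(input_grp_ids).index(input_id) + 1
    return f"{output_file_grp}_{index:04d}"


grp = ["img_a", "img_b"]
with_page = file_id_with_fallback("PHYS_0003", "img_b", grp, "OCR-D-SEG")
without_page = file_id_with_fallback(None, "img_b", grp, "OCR-D-SEG")
print(with_page, without_page)
```

The index-based branch keeps the ordering problem described in the issue, so it would only apply to files that have no page ID at all.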