Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make_file_id: not correct for --overwrite #825

Closed
bertsky opened this issue Mar 22, 2022 · 3 comments
Closed

make_file_id: not correct for --overwrite #825

bertsky opened this issue Mar 22, 2022 · 3 comments
Assignees
Labels

Comments

@bertsky
Copy link
Collaborator

bertsky commented Mar 22, 2022

If neither the input fileGrp nor the page ID is directly contained in the file ID of the input file, then make_file_id determines the index of that file in the input fileGrp, and calculates a new ID based on that index and the output fileGrp. But if that ID already exists, the index is incremented until a free one becomes available.

Alas,

  1. the list returned by OcrdMets.find_files is not sorted by page ID (but mets:file element order), so that index may deviate from the page ID.
  2. the increment strategy is wrong in combination with --overwrite, because it will create multiple files for the same page ID – only the first one of which will be considered by follow-up processors (so nothing is actually overwritten; and the new files will be ignored entirely)

To address both issues, I suggest calculating the output ID based on the (output fileGrp and) page ID of the input file.

@bertsky
Copy link
Collaborator Author

bertsky commented May 13, 2022

@kba this is a very nasty bug that prevents --overwrite for me in a lot of cases (and makes repairing the METS afterwards very hard). RFC

@kba
Copy link
Member

kba commented May 13, 2022

RFC

Sry, this one got lost in the shuffle.

To address both issues, I suggest calculating the output ID based on the (output fileGrp and) page ID of the input file.

Agreed in general, though we need a fallback for the case that a file has no pageId - which should not happen in real life data but is not strictly required.

I'll prepare a PR.

@kba
Copy link
Member

kba commented Oct 25, 2022

Was hopefully finally fixed in #861 and released in 2.39.0.

@kba kba closed this as completed Oct 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants