Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix make_file_id #861

Merged
merged 19 commits into from
Oct 21, 2022
Merged

fix make_file_id #861

merged 19 commits into from
Oct 21, 2022

Conversation

bertsky
Copy link
Collaborator

@bertsky bertsky commented May 13, 2022

First we need to have a test for --overwrite that fails…

Copy link
Member

@kba kba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I grok this correctly, the problem is that make_file_id is not aware of the overwrite_mode of the workspace and should behave differently. It should not try to avoid clashes because that would lead to inconsistent state, i.e. more files with different IDs instead of overwriting existing files.

One solution could be to add a new kwarg overwrite to the make_file_id method. Not ideal because that requires all the procesors to be adapted and can lead to easy-to-miss inconsistencies across processors.

Since overwrite is a state of the OcrdWorkspace class and there is no link back from OcrdMets to the workspace it represents, we cannot go OcrdFile -> OcrdMets -> OcrdWorkspace - but maybe this should be possible. We could add a reference from OcrdMets to OcrdWorkspace in the latter's __init__.

Then we could check OcrdFile.mets.workspace.overwrite_mode and skip the loop to create unique IDs.

ocrd_utils/ocrd_utils/str.py Show resolved Hide resolved
@bertsky
Copy link
Collaborator Author

bertsky commented May 17, 2022

If I grok this correctly, the problem is that make_file_id is not aware of the overwrite_mode of the workspace and should behave differently. It should not try to avoid clashes because that would lead to inconsistent state, i.e. more files with different IDs instead of overwriting existing files.

Yes. But also, conversely, running against an existing grp without --overwrite should produce clashes when it involves the existing page IDs.

(Of course, that's nothing we can ensure at the level of make_file_id anymore – see my mirror review

One solution could be to add a new kwarg overwrite to the make_file_id method. Not ideal because that requires all the procesors to be adapted and can lead to easy-to-miss inconsistencies across processors.

Yes. Possible but not a good design.

Since overwrite is a state of the OcrdWorkspace class and there is no link back from OcrdMets to the workspace it represents, we cannot go OcrdFile -> OcrdMets -> OcrdWorkspace - but maybe this should be possible. We could add a reference from OcrdMets to OcrdWorkspace in the latter's __init__.

Then we could check OcrdFile.mets.workspace.overwrite_mode and skip the loop to create unique IDs.

Yes, but should OcrdMets really depend on OcrdWorkspace?

My best idea for the moment still is:

But we probably also have to enforce this outside of make_file_id, probably in Workspace.add_file: When adding a PAGE-XML file for a pageId that already exists in that fileGrp,

  • if overwrite, then re-use that fileId
  • otherwise raise an exception

@kba
Copy link
Member

kba commented May 18, 2022

But we probably also have to enforce this outside of make_file_id, probably in Workspace.add_file: When adding a PAGE-XML file for a pageId that already exists in that fileGrp,

  • if overwrite, then re-use that fileId
  • otherwise raise an exception

When OcrdWorkspace.add_file is called in a workspace in overwrite_mode, it will set the force kwarg when doing the call to OcrdMets.add_file. And that in turn checks whether a file with the same ID already exists and if it does, changes it to match the info on the file to add.

IIUC you would replace

mets_file = next(self.find_files(ID=ID), None)

with

mets_file = next(self.find_files(fileGrp=fileGrp, pageId=pageId)

and if that is found and force is True, then change the found file accordingly but retain its old ID.

@bertsky
Copy link
Collaborator Author

bertsky commented May 18, 2022

IIUC you would replace

mets_file = next(self.find_files(ID=ID), None)

with

mets_file = next(self.find_files(fileGrp=fileGrp, pageId=pageId)

and if that is found and force is True, then change the found file accordingly but retain its old ID.

No. Like I said, keep OcrdMets.add_file unchanged, but before delegating to it, in Workspace.add_file, do:

    if kwargs['mimetype'] == MIMETYPE_PAGE:
        # only one PAGE per page
        file_ = next(self.mets.find_files(fileGrp=file_grp, pageId=kwargs['page_id'], mimetype=kwargs['mimetype']), None)
        if file_:
            kwargs['ID'] = file_.ID # let overwrite_mode decide what to do (re-use / raise)
    ret = self.mets.add_file(file_grp, **kwargs)

I'm not sure about file_.local_filename though: OcrdMets.add_file will force re-using the existing file path, but (unless we provide the content= anyway and don't care about the exact filename afterwards) we want our new file path to replace it.

bertsky referenced this pull request May 19, 2022
@bertsky
Copy link
Collaborator Author

bertsky commented Jun 10, 2022

Can we get this moving @kba? #825 is still quite a blow you know...

ocrd_utils/ocrd_utils/str.py Outdated Show resolved Hide resolved
ocrd_models/ocrd_models/ocrd_mets.py Outdated Show resolved Hide resolved
ocrd_models/ocrd_models/ocrd_mets.py Outdated Show resolved Hide resolved
ocrd_models/ocrd_models/ocrd_mets.py Outdated Show resolved Hide resolved
ocrd_models/ocrd_models/ocrd_mets.py Outdated Show resolved Hide resolved
ocrd_utils/ocrd_utils/str.py Outdated Show resolved Hide resolved
@kba kba marked this pull request as ready for review October 21, 2022 10:07
ocrd_utils/ocrd_utils/str.py Show resolved Hide resolved
ocrd_utils/ocrd_utils/str.py Show resolved Hide resolved
ocrd_utils/ocrd_utils/str.py Show resolved Hide resolved
ocrd_utils/ocrd_utils/str.py Show resolved Hide resolved
ocrd/ocrd/workspace.py Show resolved Hide resolved
ocrd_models/ocrd_models/ocrd_mets.py Show resolved Hide resolved
tests/model/test_ocrd_mets.py Show resolved Hide resolved
tests/processor/test_processor.py Show resolved Hide resolved
tests/processor/test_processor.py Show resolved Hide resolved
tests/processor/test_processor.py Show resolved Hide resolved
@kba kba merged commit ea064f9 into OCR-D:master Oct 21, 2022
@bertsky
Copy link
Collaborator Author

bertsky commented Oct 21, 2022

You said re-review, which I did, but now you already merged without my suggestions…

@kba
Copy link
Member

kba commented Oct 21, 2022

You said re-review, which I did, but now you already merged without my suggestions…

I'll amend.

@kba
Copy link
Member

kba commented Oct 21, 2022

You said re-review, which I did, but now you already merged without my suggestions…

I'll amend.

Done in https://github.com/OCR-D/core/tree/861-re-review. Many thanks for the documentation fixes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants