OcrdPage: need a mechanism to avoid segment ID clashes #793

Open
bertsky opened this issue Feb 2, 2022 · 0 comments

PAGE-XML uses the XML Schema ID type for the id of every structural segment, which makes duplicate IDs invalid. However, when doing segmentation on an OcrdPage instance where some segments already exist and are not removed, one always risks adding segments whose IDs clash with existing ones, e.g. a region0001, or a word0001 somewhere deeper in the hierarchy.

I can see no mechanism that would help prevent such clashes. One could always call page.get_AllRegions(), then enumerate all elements down to the Grapheme level and build a list of all pre-existing segment IDs. But obviously this is a lot of clutter you'd have to copy and paste into each processor, and you would still have to evade the existing IDs when instantiating new segments (more clutter, and difficult).
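To illustrate the kind of per-processor clutter meant here, the following is a minimal sketch that works on raw PAGE-XML with the standard library rather than on the OcrdPage API. The helper names `collect_ids` and `unique_id`, the sample document, and the zero-padded numbering scheme are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Tiny hypothetical PAGE-XML document with pre-existing segments.
PAGE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="x.png" imageWidth="100" imageHeight="100">
    <TextRegion id="region0001">
      <TextLine id="line0001">
        <Word id="word0001"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def collect_ids(root):
    """Return the set of all id attribute values found anywhere below root."""
    return {el.get("id") for el in root.iter() if el.get("id") is not None}


def unique_id(prefix, taken, width=4):
    """Return the first prefix+number ID not in taken, and reserve it."""
    n = 1
    while True:
        candidate = f"{prefix}{n:0{width}d}"
        if candidate not in taken:
            taken.add(candidate)
            return candidate
        n += 1


root = ET.fromstring(PAGE_XML)
taken = collect_ids(root)
new_region_id = unique_id("region", taken)  # skips the existing region0001
```

Every processor that adds segments would need something equivalent to both helpers, which is exactly the duplication this issue is about.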

Wouldn't it be better if we patched the parser to create an initial dictionary mapping IDs to instances, and then also patched all segment type constructors (and set_id()) so that the dictionary gets updated and clashes are avoided automatically?

Related: #510 #313 #699

(The latter gave us page_from_file(... with_tree=True) which we could leverage for the initial ID mapping. But we would still have to patch the constructors and set_id to update that same mapping somehow.)
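The idea of a shared ID mapping that constructors and set_id() keep up to date could look roughly like the sketch below. It uses stand-in classes, not the generated ocrd_models types: `IdRegistry`, `Segment`, and the suffixing strategy for resolving clashes are all hypothetical, and how the registry would be passed to real segment constructors is left open.

```python
class IdRegistry:
    """Hypothetical mapping from segment ID to segment instance."""

    def __init__(self):
        self._by_id = {}

    def register(self, segment, wanted):
        """Register segment under wanted, or under a suffixed variant on clash."""
        candidate, n = wanted, 1
        while candidate in self._by_id and self._by_id[candidate] is not segment:
            candidate = f"{wanted}_{n}"
            n += 1
        self._by_id[candidate] = segment
        return candidate

    def release(self, segment_id):
        """Forget an ID, e.g. when a segment is renamed."""
        self._by_id.pop(segment_id, None)

    def __contains__(self, segment_id):
        return segment_id in self._by_id


class Segment:
    """Stand-in for a segment type whose constructor and set_id() are patched."""

    def __init__(self, registry, id):
        self._registry = registry
        self.id = None
        self.set_id(id)

    def set_id(self, id):
        # Drop the old entry first, then claim a clash-free ID.
        if self.id is not None:
            self._registry.release(self.id)
        self.id = self._registry.register(self, id)

    def get_id(self):
        return self.id


registry = IdRegistry()
first = Segment(registry, "region0001")
second = Segment(registry, "region0001")  # clash is resolved automatically
```

Silently renaming a clashing segment is only one possible policy; raising an error instead would be an equally plausible choice and would make the clash visible to the processor author.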
