OcrdPage: need a mechanism to avoid segment ID clashes #793

bertsky · 2022-02-02T22:07:21Z

PAGE-XML uses the XML Schema ID type (xs:ID) for the IDs of all structure segments, which precludes duplicates (a document containing them is invalid). However, when doing segmentation on an OcrdPage instance, if some segments already exist and are not removed, one always risks adding segments whose IDs clash, e.g. region0001 or a word0001 somewhere deeper.

I can see no mechanism that would help prevent such clashes. One could always call page.get_AllRegions() and then enumerate all elements down to the Grapheme level first, building a list of all pre-existing segment IDs. But obviously this is a lot of clutter that you would have to copy and paste into each processor, and during instantiation you would still have to evade the existing IDs (more clutter, and difficult to get right).
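To make the amount of boilerplate concrete, here is a minimal sketch of that manual approach. It works on the raw XML with the standard library rather than on the OcrdPage API, and `collect_ids` and `unique_id` are hypothetical helper names, not anything that exists in core:

```python
import xml.etree.ElementTree as ET

# Tiny PAGE-XML sample with pre-existing segments (illustrative only).
PAGE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="x.png" imageWidth="100" imageHeight="100">
    <TextRegion id="region0001">
      <Coords points="0,0 10,0 10,10 0,10"/>
      <TextLine id="line0001">
        <Coords points="0,0 10,0 10,10 0,10"/>
        <Word id="word0001"><Coords points="0,0 5,0 5,5 0,5"/></Word>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def collect_ids(root):
    """Return the set of all id attribute values anywhere below root."""
    return {el.get("id") for el in root.iter() if el.get("id") is not None}


def unique_id(prefix, existing, width=4):
    """Return the first prefix+counter ID not in existing, and reserve it."""
    n = 1
    while True:
        candidate = f"{prefix}{n:0{width}d}"
        if candidate not in existing:
            existing.add(candidate)  # reserve, so the next call skips it
            return candidate
        n += 1


root = ET.fromstring(PAGE_XML)
existing_ids = collect_ids(root)
new_region_id = unique_id("region", existing_ids)
new_word_id = unique_id("word", existing_ids)
print(new_region_id, new_word_id)
```

Every processor that adds segments would need both helpers, plus the discipline to route each new ID through `unique_id`, which is exactly the clutter described above.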

Wouldn't it be better if we patched the parser to create an initial dictionary mapping IDs to instances, and then also patched all segment type constructors (and set_id()) such that the dictionary gets updated and clashes are automatically avoided?

Related: #510 #313 #699

(The latter gave us page_from_file(... with_tree=True) which we could leverage for the initial ID mapping. But we would still have to patch the constructors and set_id to update that same mapping somehow.)
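As a rough illustration of the idea, the following sketch uses stand-in classes only. `IdRegistry` and `Segment` are hypothetical and not part of the generated PAGE bindings, and resolving clashes by appending a numeric suffix is just one possible policy (raising an exception would be another):

```python
class IdRegistry:
    """Shared mapping from segment ID to segment instance (hypothetical)."""

    def __init__(self):
        self._by_id = {}

    def register(self, segment, wanted):
        """Store segment under wanted, or under a suffixed variant if taken."""
        if self._by_id.get(wanted) is segment:
            return wanted
        final = wanted
        n = 1
        while final in self._by_id:
            final = f"{wanted}_{n}"
            n += 1
        self._by_id[final] = segment
        return final

    def release(self, segment_id):
        """Forget an ID, e.g. before the segment is renamed."""
        self._by_id.pop(segment_id, None)

    def get(self, segment_id):
        return self._by_id.get(segment_id)


class Segment:
    """Stand-in for a region/line/word type whose ID goes through the registry."""

    def __init__(self, registry, id):
        self._registry = registry
        self._id = registry.register(self, id)

    def get_id(self):
        return self._id

    def set_id(self, new_id):
        if new_id == self._id:
            return
        self._registry.release(self._id)
        self._id = self._registry.register(self, new_id)


registry = IdRegistry()
a = Segment(registry, "region0001")
b = Segment(registry, "region0001")  # clashes with a, so it gets a suffix
print(a.get_id(), b.get_id())
```

In the real classes, the registry would presumably be filled once by the parser, and the constructors and set_id() would need some way to reach it, which is the open question here.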
