PAGE-XML uses the XMLSchema `ID` type for all structure segments, which precludes duplicates (makes such usage invalid). However, when doing segmentation on an `OcrdPage` instance, if some segments already exist and are not removed, one always risks adding segments that clash, e.g. `region0001` or a `word0001` somewhere deeper.
I can see no mechanism that would aid in preventing such clashes. One could always call `page.get_AllRegions()` and then enumerate all elements down to the `Grapheme` first, and make a list of all pre-existing segments. But obviously this is a lot of clutter you'd have to copy and paste into each processor, and then during instantiation you'd still have to evade the existing IDs (more clutter, and difficult).
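To make the clutter concrete, here is a minimal sketch of that manual workaround. It deliberately uses only the standard library on a toy PAGE-like document instead of the `OcrdPage` API, and the helpers `collect_ids` and `next_free_id` are hypothetical names for illustration, not existing functions:

```python
import xml.etree.ElementTree as ET

# Toy PAGE-like document (assumption: heavily simplified, only the id attributes matter here)
PAGE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="x.png" imageWidth="100" imageHeight="100">
    <TextRegion id="region0001">
      <TextLine id="line0001">
        <Word id="word0001"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

def collect_ids(root):
    """Walk the whole tree and gather every pre-existing id (hypothetical helper)."""
    return {el.get("id") for el in root.iter() if el.get("id") is not None}

def next_free_id(prefix, taken, width=4):
    """Return the first '<prefix>NNNN' that is not yet taken (hypothetical helper)."""
    n = 1
    while f"{prefix}{n:0{width}d}" in taken:
        n += 1
    return f"{prefix}{n:0{width}d}"

root = ET.fromstring(PAGE_XML)
taken = collect_ids(root)

# Every new segment has to be checked against, and added to, the same set by hand
new_id = next_free_id("region", taken)
taken.add(new_id)
```

Every processor that adds segments would need something like this, and would have to remember to update `taken` after each new segment.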
Wouldn't it be better if we patched the parser to create some initial dictionary of IDs and instances, and then also patched all segment type constructors (and `set_id()`) such that the dictionary gets updated and clashes are avoided automatically?
(#699 gave us `page_from_file(... with_tree=True)`, which we could leverage for the initial ID mapping. But we would still have to patch the constructors and `set_id()` to update that same mapping somehow.)
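A rough sketch of what I have in mind, independent of the generated classes. `IdRegistry` and `Segment` are hypothetical stand-ins (not existing `ocrd_models` classes), and the suffix-based renaming is only one possible policy:

```python
class IdRegistry:
    """Hypothetical mapping of IDs to segment instances, shared per document."""

    def __init__(self):
        self._by_id = {}

    def register(self, segment, wanted):
        """Claim `wanted` for `segment`, appending a numeric suffix if it is taken."""
        candidate, n = wanted, 1
        while candidate in self._by_id and self._by_id[candidate] is not segment:
            n += 1
            candidate = f"{wanted}_{n}"
        self._by_id[candidate] = segment
        return candidate

    def release(self, segment_id):
        """Free an ID, e.g. when a segment is renamed."""
        self._by_id.pop(segment_id, None)


class Segment:
    """Stand-in for a patched segment type (assumption: real classes would differ)."""

    def __init__(self, registry, id):
        self._registry = registry
        self._id = None
        self.set_id(id)

    def set_id(self, id):
        # Keep the shared mapping in sync on every (re)assignment
        if self._id is not None:
            self._registry.release(self._id)
        self._id = self._registry.register(self, id)

    def get_id(self):
        return self._id


registry = IdRegistry()
first = Segment(registry, "region0001")
second = Segment(registry, "region0001")  # clash: gets a suffixed ID instead
```

Silently renaming is debatable, since other parts of the document (such as `regionRef` in the reading order) point at IDs; raising an error on a clash would be the stricter alternative.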
Related: #510 #313 #699