PageType.get_AllRegions to list all kinds of regions #479

Merged
merged 36 commits into from
Jun 4, 2020
Changes from 10 commits
36 commits
abef352
PageType.get_AllRegions to list all kinds of regions
kba May 13, 2020
3445f87
Update ocrd_models/ocrd_page_user_methods.py
bertsky May 14, 2020
a48b8c1
update generateds page, add region filter if using reading order, wip
kba May 14, 2020
f51a2e4
Merge branch 'hotfix-ocrd-page-exports' into get-all-regions
kba May 14, 2020
8da3f3c
Merge branch 'get-all-regions' of https://github.com/kba/ocrd-core in…
kba May 14, 2020
d2a01bb
refactoring: move generateDS methods to their own files
kba May 15, 2020
be7f026
get_AllRegions: adapt to signature proposed in #240, test with order=…
kba May 15, 2020
e1740f7
README: explain how to add user methods to PAGE API
kba May 15, 2020
6f9163e
Update ocrd_models/README.md
kba May 28, 2020
0c73b3e
Update ocrd_models/README.md
kba May 28, 2020
5c2f3a8
Update ocrd_models/README.md
kba May 28, 2020
6a57506
recursion (with both finite or arbitrary depth) for get_AllRegions
kba May 28, 2020
a9072c8
regenerate PAGE API
kba May 28, 2020
ac62b85
get_AllRegions: clean-up merge artifacts and reorganize
kba May 28, 2020
fd6d545
Update ocrd_models/ocrd_page_user_methods/get_AllRegions.py
kba May 28, 2020
86a7133
get_AllRegions: _region_id method unneccessary now
kba May 28, 2020
ce06392
Merge branch 'get-all-regions' of https://github.com/kba/ocrd-core in…
kba May 28, 2020
5c8d89b
regenerate PAGE API
kba May 28, 2020
f6e3da5
:art: pylint
kba May 28, 2020
8351056
add_AllIndexed -> extend_AllIndexed
kba May 28, 2020
f202205
get_AllRegions: differentiate "reading-order"/"reading-order-only"
kba May 28, 2020
ffba6f9
get_AllRegions: catch negative depth, test depth==0
kba May 29, 2020
207f396
:memo: get_AllRegions: document example
bertsky May 29, 2020
9ced315
get_AllRegions: fix recursion
kba May 29, 2020
629f38d
get_AllRegions: Update example
kba May 29, 2020
e958559
wip
kba May 29, 2020
1964563
reading order test sample: add unorderedgroups for testing
kba May 29, 2020
27e256f
add get_UnorderedGroupChildren, let get_AllIndexed handle UnorderedGr…
kba May 29, 2020
1b17e3f
get_AllIndexed: allow filtering by child type
kba May 29, 2020
ae613cf
get_AllIndexed: index_sort parameter to enable/disable sorting
kba May 29, 2020
b1df95f
add sort_AllIndexed to sort in-place
kba May 29, 2020
fd9dc83
extend_AllIndexed: increment @index when adding elements
kba May 29, 2020
9d0e539
Merge branch 'master' into get-all-regions
kba May 29, 2020
84f1d33
:memo: changelog
kba May 29, 2020
0e14633
Document extend_AllIndexed validate_contiunuity param
kba Jun 3, 2020
b79474a
Merge branch 'master' into get-all-regions
kba Jun 4, 2020
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,22 @@ Versioned according to [Semantic Versioning](http://semver.org/).

 ## Unreleased

+Added:
+
+* OcrdPage: `get_AllRegions`: retrieve all regions, sorted by document or reading order, #479
+* OcrdPage: `sort_AllIndexed`: sort all children by `@index` in-place
+* OcrdPage: `clear_AllIndexed`: clear all `@index` children
+* OcrdPage: `extend_AllIndexed`: Add elements with incrementing `@index`
+* OcrdPage: Replace empty reading order groups with equivalent `RegionRef` on export
+* OcrdPage: `get_UnorderedGroupChildren`: get reading order elements of an `UnorderedGroup`
+
+
+Changed:
+
+* OcrdPage: `get_AllIndexed`: allow filtering by child type
+* OcrdPage: `get_AllIndexed`: index_sort parameter to enable/disable sorting
+
+
 ## [2.7.1] - 2020-05-27

 Fixed:
158 changes: 135 additions & 23 deletions ocrd_models/ocrd_models/ocrd_page_generateds.py
@@ -2,7 +2,7 @@
 # -*- coding: utf-8 -*-

 #
-# Generated Fri May 29 16:34:32 2020 by generateDS.py version 2.35.20.
+# Generated Fri May 29 23:29:23 2020 by generateDS.py version 2.35.20.
 # Python 3.6.9 (default, Apr 18 2020, 01:56:04) [GCC 8.4.0]
 #
 # Command line options:
@@ -2908,8 +2908,8 @@ def get_AllRegions(self, classes=None, order='document', depth=0):

         For example, to get all text anywhere on the page in reading order, use:
         ::
-            '\n'.join(line.get_TextEquiv()[0].Unicode
-                for region in page.get_AllRegions(classes='Text', depth=0, order='reading-order')
+            '\\n'.join(line.get_TextEquiv()[0].Unicode
+                for region in page.get_AllRegions(classes=['Text'], depth=0, order='reading-order')
                 for line in region.get_TextLine())
         """
         if order not in ['document', 'reading-order', 'reading-order-only']:
@@ -5433,21 +5433,65 @@ def buildChildren(self, child_, node, nodeName_, fromsubclass_=False, gds_collec
             obj_.original_tagname_ = 'UnorderedGroupIndexed'
     def __hash__(self):
         return hash(self.id)
-    def get_AllIndexed(self):
-        return sorted(self.get_RegionRefIndexed() + self.get_OrderedGroupIndexed() + self.get_UnorderedGroupIndexed(), key=lambda x : x.index)
+    # pylint: disable=invalid-name,missing-module-docstring,line-too-long
+    def get_AllIndexed(self, classes=None, index_sort=True):
+        """
+        Get all indexed children sorted by their ``@index``.

+        Arguments:
+            classes (list): Type of children to return. Default: ['RegionRef', 'OrderedGroup', 'UnorderedGroup']
+            index_sort (boolean): Whether to sort by ``@index``
+        """
+        if not classes:
+            classes = ['RegionRef', 'OrderedGroup', 'UnorderedGroup']
+        ret = []
+        for class_ in classes:
+            ret += getattr(self, 'get_{}Indexed'.format(class_))()
+        if index_sort:
+            return sorted(ret, key=lambda x: x.index)
+        return ret
     def clear_AllIndexed(self):
         ret = self.get_AllIndexed()
         self.set_RegionRefIndexed([])
         self.set_OrderedGroupIndexed([])
         self.set_UnorderedGroupIndexed([])
         return ret

-    # pylint: disable=line-too-long,invalid-name,missing-module-docstring,missing-function-docstring
-    def extend_AllIndexed(self, elements):
+    # pylint: disable=line-too-long,invalid-name,missing-module-docstring
+    def extend_AllIndexed(self, elements, validate_continuity=False):
+        """
+        Add all elements in list ``elements``, respecting ``@index`` order.
+        """
         if not isinstance(elements, list):
             elements = [elements]
-        for element in sorted(elements, key=lambda x: x.index):
+        siblings = self.get_AllIndexed()
+        highest_sibling_index = siblings[-1].index if siblings else -1
+        if validate_continuity:
+            elements = sorted(elements, key=lambda x: x.index)
+            lowest_element_index = elements[0].index
+            if lowest_element_index <= highest_sibling_index:
+                raise Exception("@index already used: {}".format(lowest_element_index))
+        else:
+            for element in elements:
+                highest_sibling_index += 1
+                element.index = highest_sibling_index
+        for element in elements:
+            if isinstance(element, RegionRefIndexedType): # pylint: disable=undefined-variable
+                self.add_RegionRefIndexed(element)
+            elif isinstance(element, OrderedGroupIndexedType): # pylint: disable=undefined-variable
+                self.add_OrderedGroupIndexed(element)
+            elif isinstance(element, UnorderedGroupIndexedType): # pylint: disable=undefined-variable
+                self.add_UnorderedGroupIndexed(element)
+        return self.get_AllIndexed()
+
+    # pylint: disable=line-too-long,invalid-name,missing-module-docstring
+    def sort_AllIndexed(self, validate_uniqueness=True):
+        """
+        Sort all indexed children in-place.
+        """
+        elements = self.get_AllIndexed(index_sort=True)
+        self.clear_AllIndexed()
+        for element in elements:
             if isinstance(element, RegionRefIndexedType): # pylint: disable=undefined-variable
                 self.add_RegionRefIndexed(element)
             elif isinstance(element, OrderedGroupIndexedType): # pylint: disable=undefined-variable
@@ -5464,13 +5508,18 @@ def exportChildren(self, outfile, level, namespaceprefix_='', namespacedef_='xml
         for Labels_ in self.Labels:
             Labels_.export(outfile, level, namespaceprefix_, namespacedef_='', name_='Labels', pretty_print=pretty_print)
         cleaned = []
+        def replaceWithRRI(group):
+            rri = RegionRefIndexedType.factory(parent_object_=self) # pylint: disable=undefined-variable
+            rri.index = group.index
+            rri.regionRef = group.regionRef
+            cleaned.append(rri)
         # remove emtpy groups and replace with RegionRefIndexedType
         for entry in self.get_AllIndexed():
-            if isinstance(entry, (UnorderedGroupIndexedType, OrderedGroupIndexedType)) and not entry.get_AllIndexed(): # pylint: disable=undefined-variable
-                rri = RegionRefIndexedType.factory(parent_object_=self) # pylint: disable=undefined-variable
-                rri.index = entry.index
-                rri.regionRef = entry.regionRef
-                cleaned.append(rri)
+            # pylint: disable=undefined-variable
+            if isinstance(entry, (OrderedGroupIndexedType)) and not entry.get_AllIndexed():
+                replaceWithRRI(entry)
+            elif isinstance(entry, UnorderedGroupIndexedType) and not entry.get_UnorderedGroupChildren():
+                replaceWithRRI(entry)
             else:
                 cleaned.append(entry)
         for entry in cleaned:
@@ -5811,6 +5860,13 @@ def buildChildren(self, child_, node, nodeName_, fromsubclass_=False, gds_collec
             obj_.original_tagname_ = 'UnorderedGroup'
     def __hash__(self):
         return hash(self.id)
+    def get_UnorderedGroupChildren(self):
+        """
+        List all non-metadata children of an UnorderedGroup
+        """
+        # TODO: should not change order
+        return self.get_RegionRef() + self.get_OrderedGroup() + self.get_UnorderedGroup()
+
 # end class UnorderedGroupIndexedType

@@ -6223,21 +6279,65 @@ def buildChildren(self, child_, node, nodeName_, fromsubclass_=False, gds_collec
             obj_.original_tagname_ = 'UnorderedGroupIndexed'
     def __hash__(self):
         return hash(self.id)
-    def get_AllIndexed(self):
-        return sorted(self.get_RegionRefIndexed() + self.get_OrderedGroupIndexed() + self.get_UnorderedGroupIndexed(), key=lambda x : x.index)
+    # pylint: disable=invalid-name,missing-module-docstring,line-too-long
+    def get_AllIndexed(self, classes=None, index_sort=True):
+        """
+        Get all indexed children sorted by their ``@index``.

+        Arguments:
+            classes (list): Type of children to return. Default: ['RegionRef', 'OrderedGroup', 'UnorderedGroup']
+            index_sort (boolean): Whether to sort by ``@index``
+        """
+        if not classes:
+            classes = ['RegionRef', 'OrderedGroup', 'UnorderedGroup']
+        ret = []
+        for class_ in classes:
+            ret += getattr(self, 'get_{}Indexed'.format(class_))()
+        if index_sort:
+            return sorted(ret, key=lambda x: x.index)
+        return ret
     def clear_AllIndexed(self):
         ret = self.get_AllIndexed()
         self.set_RegionRefIndexed([])
         self.set_OrderedGroupIndexed([])
         self.set_UnorderedGroupIndexed([])
         return ret

-    # pylint: disable=line-too-long,invalid-name,missing-module-docstring,missing-function-docstring
-    def extend_AllIndexed(self, elements):
+    # pylint: disable=line-too-long,invalid-name,missing-module-docstring
+    def extend_AllIndexed(self, elements, validate_continuity=False):
+        """
+        Add all elements in list ``elements``, respecting ``@index`` order.
+        """
         if not isinstance(elements, list):
             elements = [elements]
-        for element in sorted(elements, key=lambda x: x.index):
+        siblings = self.get_AllIndexed()
+        highest_sibling_index = siblings[-1].index if siblings else -1
+        if validate_continuity:
+            elements = sorted(elements, key=lambda x: x.index)
+            lowest_element_index = elements[0].index
+            if lowest_element_index <= highest_sibling_index:
+                raise Exception("@index already used: {}".format(lowest_element_index))
+        else:
+            for element in elements:
+                highest_sibling_index += 1
+                element.index = highest_sibling_index
+        for element in elements:
+            if isinstance(element, RegionRefIndexedType): # pylint: disable=undefined-variable
+                self.add_RegionRefIndexed(element)
+            elif isinstance(element, OrderedGroupIndexedType): # pylint: disable=undefined-variable
+                self.add_OrderedGroupIndexed(element)
+            elif isinstance(element, UnorderedGroupIndexedType): # pylint: disable=undefined-variable
+                self.add_UnorderedGroupIndexed(element)
+        return self.get_AllIndexed()
+
+    # pylint: disable=line-too-long,invalid-name,missing-module-docstring
+    def sort_AllIndexed(self, validate_uniqueness=True):
+        """
+        Sort all indexed children in-place.
+        """
+        elements = self.get_AllIndexed(index_sort=True)
+        self.clear_AllIndexed()
+        for element in elements:
             if isinstance(element, RegionRefIndexedType): # pylint: disable=undefined-variable
                 self.add_RegionRefIndexed(element)
             elif isinstance(element, OrderedGroupIndexedType): # pylint: disable=undefined-variable
@@ -6254,13 +6354,18 @@ def exportChildren(self, outfile, level, namespaceprefix_='', namespacedef_='xml
         for Labels_ in self.Labels:
             Labels_.export(outfile, level, namespaceprefix_, namespacedef_='', name_='Labels', pretty_print=pretty_print)
         cleaned = []
+        def replaceWithRRI(group):
+            rri = RegionRefIndexedType.factory(parent_object_=self) # pylint: disable=undefined-variable
+            rri.index = group.index
+            rri.regionRef = group.regionRef
+            cleaned.append(rri)
         # remove emtpy groups and replace with RegionRefIndexedType
         for entry in self.get_AllIndexed():
-            if isinstance(entry, (UnorderedGroupIndexedType, OrderedGroupIndexedType)) and not entry.get_AllIndexed(): # pylint: disable=undefined-variable
-                rri = RegionRefIndexedType.factory(parent_object_=self) # pylint: disable=undefined-variable
-                rri.index = entry.index
-                rri.regionRef = entry.regionRef
-                cleaned.append(rri)
+            # pylint: disable=undefined-variable
+            if isinstance(entry, (OrderedGroupIndexedType)) and not entry.get_AllIndexed():
+                replaceWithRRI(entry)
+            elif isinstance(entry, UnorderedGroupIndexedType) and not entry.get_UnorderedGroupChildren():
+                replaceWithRRI(entry)
             else:
                 cleaned.append(entry)
         for entry in cleaned:
@@ -6585,6 +6690,13 @@ def buildChildren(self, child_, node, nodeName_, fromsubclass_=False, gds_collec
             obj_.original_tagname_ = 'UnorderedGroup'
     def __hash__(self):
         return hash(self.id)
+    def get_UnorderedGroupChildren(self):
+        """
+        List all non-metadata children of an UnorderedGroup
+        """
+        # TODO: should not change order
+        return self.get_RegionRef() + self.get_OrderedGroup() + self.get_UnorderedGroup()
+
 # end class UnorderedGroupType

2 changes: 2 additions & 0 deletions ocrd_models/ocrd_page_user_methods.py
@@ -101,7 +101,9 @@ def _add_method(class_re, method_name):
     _add_method(r'^(OrderedGroupType|OrderedGroupIndexedType)$', 'get_AllIndexed'),
     _add_method(r'^(OrderedGroupType|OrderedGroupIndexedType)$', 'clear_AllIndexed'),
     _add_method(r'^(OrderedGroupType|OrderedGroupIndexedType)$', 'extend_AllIndexed'),
+    _add_method(r'^(OrderedGroupType|OrderedGroupIndexedType)$', 'sort_AllIndexed'),
     _add_method(r'^(OrderedGroupType|OrderedGroupIndexedType)$', 'exportChildren'),
+    _add_method(r'^(UnorderedGroupType|UnorderedGroupIndexedType)$', 'get_UnorderedGroupChildren'),
     _add_method(r'^(PageType)$', 'get_AllRegions'),
 )

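The class-name patterns passed to `_add_method` are anchored with `^` and `$`, so each user method should end up only on classes whose full name matches. A minimal sketch of that matching, assuming the patterns are applied to generated class names with Python's `re` module; the helper `classes_receiving` and the sample class list are hypothetical and not part of ocrd_models:

```python
import re

# Patterns copied from the _add_method calls in the diff; the dict is illustrative only.
PATTERNS = {
    'sort_AllIndexed': r'^(OrderedGroupType|OrderedGroupIndexedType)$',
    'get_UnorderedGroupChildren': r'^(UnorderedGroupType|UnorderedGroupIndexedType)$',
    'get_AllRegions': r'^(PageType)$',
}

# Hypothetical sample of generated class names.
CLASS_NAMES = [
    'PageType',
    'OrderedGroupType',
    'OrderedGroupIndexedType',
    'UnorderedGroupType',
    'UnorderedGroupIndexedType',
    'RegionRefIndexedType',
]


def classes_receiving(method_name):
    """Return the sample class names that fully match the method's pattern."""
    pattern = re.compile(PATTERNS[method_name])
    return [name for name in CLASS_NAMES if pattern.match(name)]


for method in PATTERNS:
    print(method, '->', classes_receiving(method))
```

Because of the anchors, `UnorderedGroupType` does not pick up the `OrderedGroupType` methods even though one name contains the other.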
15 changes: 10 additions & 5 deletions ocrd_models/ocrd_page_user_methods/exportChildren.py
@@ -6,13 +6,18 @@ def exportChildren(self, outfile, level, namespaceprefix_='', namespacedef_='xml
     for Labels_ in self.Labels:
         Labels_.export(outfile, level, namespaceprefix_, namespacedef_='', name_='Labels', pretty_print=pretty_print)
     cleaned = []
+    def replaceWithRRI(group):
+        rri = RegionRefIndexedType.factory(parent_object_=self) # pylint: disable=undefined-variable
+        rri.index = group.index
+        rri.regionRef = group.regionRef
+        cleaned.append(rri)
     # remove emtpy groups and replace with RegionRefIndexedType
     for entry in self.get_AllIndexed():
-        if isinstance(entry, (UnorderedGroupIndexedType, OrderedGroupIndexedType)) and not entry.get_AllIndexed(): # pylint: disable=undefined-variable
-            rri = RegionRefIndexedType.factory(parent_object_=self) # pylint: disable=undefined-variable
-            rri.index = entry.index
-            rri.regionRef = entry.regionRef
-            cleaned.append(rri)
+        # pylint: disable=undefined-variable
+        if isinstance(entry, (OrderedGroupIndexedType)) and not entry.get_AllIndexed():
+            replaceWithRRI(entry)
+        elif isinstance(entry, UnorderedGroupIndexedType) and not entry.get_UnorderedGroupChildren():
+            replaceWithRRI(entry)
         else:
             cleaned.append(entry)
     for entry in cleaned:
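The export change replaces an empty group with a plain `RegionRefIndexed` that keeps the group's `@index` and `@regionRef`. A minimal sketch of that cleaning step, using hypothetical stand-in classes (`Indexed`, `RegionRefIndexed`, `OrderedGroupIndexed`, `UnorderedGroupIndexed`) and a hypothetical function `clean_for_export` instead of the generated PAGE types:

```python
class Indexed:
    """Hypothetical base for stand-ins carrying @index and @regionRef."""
    def __init__(self, index, regionRef):
        self.index = index
        self.regionRef = regionRef


class RegionRefIndexed(Indexed):
    """Stand-in for RegionRefIndexedType."""


class OrderedGroupIndexed(Indexed):
    """Stand-in for OrderedGroupIndexedType; children carry an index."""
    def __init__(self, index, regionRef, children=()):
        super().__init__(index, regionRef)
        self.children = list(children)

    def get_AllIndexed(self):
        return sorted(self.children, key=lambda x: x.index)


class UnorderedGroupIndexed(Indexed):
    """Stand-in for UnorderedGroupIndexedType; children carry no index."""
    def __init__(self, index, regionRef, children=()):
        super().__init__(index, regionRef)
        self.children = list(children)

    def get_UnorderedGroupChildren(self):
        return list(self.children)


def clean_for_export(entries):
    """Replace empty groups by an equivalent RegionRefIndexed, keep the rest."""
    cleaned = []
    for entry in entries:
        if isinstance(entry, OrderedGroupIndexed) and not entry.get_AllIndexed():
            cleaned.append(RegionRefIndexed(entry.index, entry.regionRef))
        elif isinstance(entry, UnorderedGroupIndexed) and not entry.get_UnorderedGroupChildren():
            cleaned.append(RegionRefIndexed(entry.index, entry.regionRef))
        else:
            cleaned.append(entry)
    return cleaned


# Entries modelled on the test sample added in this PR; child is a placeholder string.
entries = [
    RegionRefIndexed(19, 'Gutachten2-2_region0017'),
    UnorderedGroupIndexed(20, 'unordered-group-for-testing',
                          ['unordered-group-for-testing_region0001']),
    UnorderedGroupIndexed(21, 'empty-group-for-testing'),
]
cleaned = clean_for_export(entries)
print([type(entry).__name__ for entry in cleaned])
```

The separate branch for unordered groups matters because, per the registration list in `ocrd_page_user_methods.py`, `get_AllIndexed` is only attached to the ordered group types.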
20 changes: 17 additions & 3 deletions ocrd_models/ocrd_page_user_methods/extend_AllIndexed.py
@@ -1,8 +1,22 @@
-# pylint: disable=line-too-long,invalid-name,missing-module-docstring,missing-function-docstring
-def extend_AllIndexed(self, elements):
+# pylint: disable=line-too-long,invalid-name,missing-module-docstring
+def extend_AllIndexed(self, elements, validate_continuity=False):
+    """
+    Add all elements in list ``elements``, respecting ``@index`` order.
+    """
     if not isinstance(elements, list):
         elements = [elements]
-    for element in sorted(elements, key=lambda x: x.index):
+    siblings = self.get_AllIndexed()
+    highest_sibling_index = siblings[-1].index if siblings else -1
+    if validate_continuity:
+        elements = sorted(elements, key=lambda x: x.index)
+        lowest_element_index = elements[0].index
+        if lowest_element_index <= highest_sibling_index:
+            raise Exception("@index already used: {}".format(lowest_element_index))
+    else:
+        for element in elements:
+            highest_sibling_index += 1
+            element.index = highest_sibling_index
+    for element in elements:
         if isinstance(element, RegionRefIndexedType): # pylint: disable=undefined-variable
             self.add_RegionRefIndexed(element)
         elif isinstance(element, OrderedGroupIndexedType): # pylint: disable=undefined-variable
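The two modes of `extend_AllIndexed` differ in what happens to the incoming `@index` values: by default they are overwritten with consecutive numbers after the highest existing sibling, while `validate_continuity=True` keeps them and only rejects a clash with existing indexes. A minimal sketch on plain lists, assuming a hypothetical `Ref` stand-in and a hypothetical free function `extend_indexed`; it raises `ValueError` where the method in the diff raises a bare `Exception`:

```python
class Ref:
    """Hypothetical stand-in for an element carrying an @index."""
    def __init__(self, name, index):
        self.name = name
        self.index = index


def extend_indexed(siblings, elements, validate_continuity=False):
    """Mirror the index handling of extend_AllIndexed on a plain list."""
    siblings = sorted(siblings, key=lambda x: x.index)
    highest_sibling_index = siblings[-1].index if siblings else -1
    if validate_continuity:
        # Keep the given indexes, but refuse to reuse an existing one.
        elements = sorted(elements, key=lambda x: x.index)
        lowest_element_index = elements[0].index
        if lowest_element_index <= highest_sibling_index:
            raise ValueError("@index already used: {}".format(lowest_element_index))
    else:
        # Ignore the given indexes and continue counting after the siblings.
        for element in elements:
            highest_sibling_index += 1
            element.index = highest_sibling_index
    return siblings + list(elements)


existing = [Ref('a', 0), Ref('b', 1)]
renumbered = extend_indexed(existing, [Ref('c', 0), Ref('d', 7)])
kept = extend_indexed(existing, [Ref('f', 5)], validate_continuity=True)
print([(ref.name, ref.index) for ref in renumbered])
print([(ref.name, ref.index) for ref in kept])
```

Note that, as written in the diff, the validating branch only compares the lowest new index against the highest existing one, so a gap such as 1 followed by 5 is accepted.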
18 changes: 16 additions & 2 deletions ocrd_models/ocrd_page_user_methods/get_AllIndexed.py
@@ -1,3 +1,17 @@
-def get_AllIndexed(self):
-    return sorted(self.get_RegionRefIndexed() + self.get_OrderedGroupIndexed() + self.get_UnorderedGroupIndexed(), key=lambda x : x.index)
+# pylint: disable=invalid-name,missing-module-docstring,line-too-long
+def get_AllIndexed(self, classes=None, index_sort=True):
+    """
+    Get all indexed children sorted by their ``@index``.

+    Arguments:
+        classes (list): Type of children to return. Default: ['RegionRef', 'OrderedGroup', 'UnorderedGroup']
+        index_sort (boolean): Whether to sort by ``@index``
+    """
+    if not classes:
+        classes = ['RegionRef', 'OrderedGroup', 'UnorderedGroup']
+    ret = []
+    for class_ in classes:
+        ret += getattr(self, 'get_{}Indexed'.format(class_))()
+    if index_sort:
+        return sorted(ret, key=lambda x: x.index)
+    return ret
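A minimal sketch of how the new `classes` and `index_sort` parameters behave. The `Item` and `Group` classes are hypothetical stand-ins for the generated PAGE types; only the body of `get_AllIndexed` follows the diff:

```python
class Item:
    """Hypothetical stand-in for an indexed reading-order element."""
    def __init__(self, index):
        self.index = index


class Group:
    """Hypothetical stand-in for an OrderedGroup holding per-type child lists."""
    def __init__(self, region_refs, ordered_groups, unordered_groups):
        self._region_refs = region_refs
        self._ordered_groups = ordered_groups
        self._unordered_groups = unordered_groups

    def get_RegionRefIndexed(self):
        return self._region_refs

    def get_OrderedGroupIndexed(self):
        return self._ordered_groups

    def get_UnorderedGroupIndexed(self):
        return self._unordered_groups

    def get_AllIndexed(self, classes=None, index_sort=True):
        # Body follows the new version shown in the diff.
        if not classes:
            classes = ['RegionRef', 'OrderedGroup', 'UnorderedGroup']
        ret = []
        for class_ in classes:
            ret += getattr(self, 'get_{}Indexed'.format(class_))()
        if index_sort:
            return sorted(ret, key=lambda x: x.index)
        return ret


group = Group([Item(2), Item(0)], [Item(1)], [Item(3)])
all_sorted = [x.index for x in group.get_AllIndexed()]
all_unsorted = [x.index for x in group.get_AllIndexed(index_sort=False)]
refs_and_unordered = [
    x.index for x in group.get_AllIndexed(classes=['UnorderedGroup', 'RegionRef'])
]
print(all_sorted, all_unsorted, refs_and_unordered)
```

With `index_sort=False` the result is grouped by type in the order the type names were passed, which is why the unsorted list starts with the region references.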
4 changes: 2 additions & 2 deletions ocrd_models/ocrd_page_user_methods/get_AllRegions.py
@@ -56,8 +56,8 @@ def get_AllRegions(self, classes=None, order='document', depth=0):

     For example, to get all text anywhere on the page in reading order, use:
     ::
-        '\n'.join(line.get_TextEquiv()[0].Unicode
-            for region in page.get_AllRegions(classes='Text', depth=0, order='reading-order')
+        '\\n'.join(line.get_TextEquiv()[0].Unicode
+            for region in page.get_AllRegions(classes=['Text'], depth=0, order='reading-order')
             for line in region.get_TextLine())
     """
     if order not in ['document', 'reading-order', 'reading-order-only']:
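Both docstring fixes in this hunk can be reproduced in isolation. The sketch below first shows that an unescaped `\n` inside a regular (non-raw) docstring becomes a real line break, and then why a bare string for `classes` is risky. The second part assumes that `get_AllRegions` iterates over `classes` to build getter names, as `get_AllIndexed` does; the helper `getter_names` and the `get_{}Region` pattern are illustrative assumptions, since that part of the implementation is not shown in this diff:

```python
def escaped_example():
    """
    Example:
    ::
        '\\n'.join(lines)
    """


def unescaped_example():
    """
    Example:
    ::
        '\n'.join(lines)
    """


def getter_names(classes):
    """Build one getter name per entry of ``classes`` (illustrative pattern)."""
    return ['get_{}Region'.format(class_) for class_ in classes]


# The escaped docstring renders the two characters backslash and n ...
print(escaped_example.__doc__)
# ... while the unescaped one contains an actual line break inside the quotes.
print(unescaped_example.__doc__)

# Iterating a bare string yields its characters, not the class name.
print(getter_names('Text'))
print(getter_names(['Text']))
```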
@@ -0,0 +1,7 @@
+def get_UnorderedGroupChildren(self):
+    """
+    List all non-metadata children of an UnorderedGroup
+    """
+    # TODO: should not change order
+    return self.get_RegionRef() + self.get_OrderedGroup() + self.get_UnorderedGroup()
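The `TODO` above refers to the fact that concatenating the per-type lists groups children by type instead of preserving their order in the document. A minimal sketch of that effect, using a hypothetical list of `(type, name)` pairs in place of real PAGE elements:

```python
# Hypothetical children of one UnorderedGroup, in the order they appear in the XML.
document_order = [
    ('OrderedGroup', 'g1'),
    ('RegionRef', 'r1'),
    ('UnorderedGroup', 'u1'),
    ('RegionRef', 'r2'),
]


def by_type(kind):
    """Return the names of all children of one type, in document order."""
    return [name for child_kind, name in document_order if child_kind == kind]


def unordered_group_children():
    """Concatenate per-type lists, as get_UnorderedGroupChildren does."""
    return by_type('RegionRef') + by_type('OrderedGroup') + by_type('UnorderedGroup')


print([name for _, name in document_order])
print(unordered_group_children())
```

For an unordered group the sequence carries no reading-order meaning, so this mainly affects callers that expect a round trip to preserve the original element order.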

16 changes: 16 additions & 0 deletions ocrd_models/ocrd_page_user_methods/sort_AllIndexed.py
@@ -0,0 +1,16 @@
+# pylint: disable=line-too-long,invalid-name,missing-module-docstring
+def sort_AllIndexed(self, validate_uniqueness=True):
+    """
+    Sort all indexed children in-place.
+    """
+    elements = self.get_AllIndexed(index_sort=True)
+    self.clear_AllIndexed()
+    for element in elements:
+        if isinstance(element, RegionRefIndexedType): # pylint: disable=undefined-variable
+            self.add_RegionRefIndexed(element)
+        elif isinstance(element, OrderedGroupIndexedType): # pylint: disable=undefined-variable
+            self.add_OrderedGroupIndexed(element)
+        elif isinstance(element, UnorderedGroupIndexedType): # pylint: disable=undefined-variable
+            self.add_UnorderedGroupIndexed(element)
+    return self.get_AllIndexed()
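"In-place" here means that the per-type child lists themselves end up ordered by `@index`, because all children are removed and re-added in sorted order. A minimal sketch of that clear-and-re-add pattern, with hypothetical `Item` and `Group` stand-ins that keep one list per element type:

```python
class Item:
    """Hypothetical stand-in for an indexed element of a given type."""
    def __init__(self, kind, index):
        self.kind = kind
        self.index = index


class Group:
    """Hypothetical stand-in for an OrderedGroup with one list per child type."""
    KINDS = ('RegionRef', 'OrderedGroup', 'UnorderedGroup')

    def __init__(self, items):
        self.lists = {kind: [] for kind in self.KINDS}
        for item in items:
            self.lists[item.kind].append(item)

    def get_AllIndexed(self):
        merged = [item for kind in self.KINDS for item in self.lists[kind]]
        return sorted(merged, key=lambda x: x.index)

    def clear_AllIndexed(self):
        ret = self.get_AllIndexed()
        for kind in self.KINDS:
            self.lists[kind] = []
        return ret

    def sort_AllIndexed(self):
        # Same pattern as in the diff: fetch sorted, clear, re-add in order.
        elements = self.get_AllIndexed()
        self.clear_AllIndexed()
        for element in elements:
            self.lists[element.kind].append(element)
        return self.get_AllIndexed()


group = Group([Item('RegionRef', 5), Item('OrderedGroup', 3), Item('RegionRef', 1)])
before = [item.index for item in group.lists['RegionRef']]
result = [item.index for item in group.sort_AllIndexed()]
after = [item.index for item in group.lists['RegionRef']]
print(before, result, after)
```

In the version shown in the diff, the `validate_uniqueness` parameter is accepted but not referenced in the function body, so the lines shown do not reject duplicate `@index` values.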

5 changes: 5 additions & 0 deletions tests/model/TEMP1_Gutachten2-2.xml
@@ -117,6 +117,11 @@
         </pc:OrderedGroupIndexed>
         <pc:RegionRefIndexed index="18" regionRef="Gutachten2-2_region0016"/>
         <pc:RegionRefIndexed index="19" regionRef="Gutachten2-2_region0017"/>
+        <pc:UnorderedGroupIndexed id="unordered-group-for-testing_group" regionRef="unordered-group-for-testing" index="20">
+          <pc:RegionRef regionRef="unordered-group-for-testing_region0001"/>
+        </pc:UnorderedGroupIndexed>
+        <pc:UnorderedGroupIndexed id="empty-group-for-testing_group" regionRef="empty-group-for-testing" index="21">
+        </pc:UnorderedGroupIndexed>
       </pc:OrderedGroup>
     </pc:ReadingOrder>
     <pc:TextRegion id="Gutachten2-2_region0001">
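The two groups added to the test sample cover both export branches: one unordered group with a child and one without. A minimal sketch that locates the empty one with the standard library, assuming the PAGE 2019-07-15 namespace URI (the test file may declare a different version) and a hypothetical wrapper element with `id="ro"` so the fragment parses on its own:

```python
import xml.etree.ElementTree as ET

# Assumption: PAGE 2019-07-15 namespace; adjust to whatever the sample file declares.
PAGE_NS = 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15'
NS = {'pc': PAGE_NS}

# Fragment modelled on the lines added in the diff; the wrapper id is hypothetical.
FRAGMENT = """
<pc:OrderedGroup xmlns:pc="{ns}" id="ro">
  <pc:RegionRefIndexed index="19" regionRef="Gutachten2-2_region0017"/>
  <pc:UnorderedGroupIndexed id="unordered-group-for-testing_group" regionRef="unordered-group-for-testing" index="20">
    <pc:RegionRef regionRef="unordered-group-for-testing_region0001"/>
  </pc:UnorderedGroupIndexed>
  <pc:UnorderedGroupIndexed id="empty-group-for-testing_group" regionRef="empty-group-for-testing" index="21">
  </pc:UnorderedGroupIndexed>
</pc:OrderedGroup>
""".format(ns=PAGE_NS).strip()

root = ET.fromstring(FRAGMENT)

# An element with no child elements has length 0, even if it contains whitespace.
empty_groups = [
    group.get('regionRef')
    for group in root.findall('pc:UnorderedGroupIndexed', NS)
    if len(group) == 0
]
print(empty_groups)
```

On export, the group found here is the one that the new `exportChildren` logic would write out as a `RegionRefIndexed` with index 21.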