[WIP] Generateds with etree #313

kba · 2019-09-13T10:30:17Z

@bertsky Here's a quick shot at getting etree references into the generated PAGE API, so you can access parent elements/objects, search the tree with xpath etc.

This is only the proof of concept how to set this up, we still need to discuss an API.

We could dig into generateDS some more and find out how to customize the element classes (get_element) or add methods on the root object (get_element_for_object). I'd prefer the latter because it would also allow easy get_object_for_element).

codecov-io · 2019-09-13T10:34:59Z

Codecov Report

Merging #313 into master will decrease coverage by 8.73%.
The diff coverage is 69.05%.

@@            Coverage Diff             @@
##           master     #313      +/-   ##
==========================================
- Coverage   91.18%   82.44%   -8.74%     
==========================================
  Files          30       39       +9     
  Lines        1610     2353     +743     
  Branches      308      424     +116     
==========================================
+ Hits         1468     1940     +472     
- Misses        107      340     +233     
- Partials       35       73      +38

Impacted Files	Coverage Δ
..._validators/ocrd_validators/parameter_validator.py	`100% <ø> (ø)`	⬆️
ocrd/ocrd/__init__.py	`100% <ø> (ø)`	⬆️
ocrd_utils/ocrd_utils/constants.py	`100% <ø> (ø)`	⬆️
ocrd/ocrd/cli/zip.py	`67.44% <0%> (ø)`
ocrd/ocrd/processor/__init__.py	`100% <100%> (ø)`
ocrd/ocrd/resolver.py	`98.85% <100%> (+0.05%)`	⬆️
ocrd_utils/ocrd_utils/logging.py	`100% <100%> (ø)`	⬆️
ocrd/ocrd/workspace_bagger.py	`100% <100%> (ø)`	⬆️
ocrd/ocrd/cli/validate.py	`100% <100%> (ø)`
ocrd/ocrd/cli/__init__.py	`100% <100%> (ø)`
... and 27 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update db55b59...8ee1e67. Read the comment docs.

bertsky · 2019-09-13T21:16:09Z

Very interesting! I can see a parent_object_ member in the direct model now, and lots of nice functions like getparent() and other getters, iterators and so on lxml model, i.e. the object from to_etree(). (But how do I get back from the etree to the DOM?)

This is only the proof of concept how to set this up, we still need to discuss an API.

Splendid! Yes, maybe we could look for examples of APIs for similar schemas?

The original Java API will probably not help, because it is too far away from the XSD identifiers. But a bare etree API is inadequate, too (e.g. tei-reader simply exposes the lxml etree via .node).

We could dig into generateDS some more and find out how to customize the element classes (get_element) or add methods on the root object (get_element_for_object). I'd prefer the latter because it would also allow easy get_object_for_element).

I am afraid I do not understand that really. My original idea (in #240) was to use generateDS's user methods to supplement the code, not the runtime lxml injection.

kba · 2019-10-08T14:10:40Z

We could dig into generateDS some more and find out how to customize the element classes (get_element) or add methods on the root object (get_element_for_object). I'd prefer the latter because it would also allow easy get_object_for_element).

I am afraid I do not understand that really. My original idea (in #240) was to use generateDS's user methods to supplement the code, not the runtime lxml injection.

Say you have TextLineType t object and want to get the etree element e for that object, to execute an Xpath or whatever.

This could be accomplished with generateDS user methods, so you could do e.g. e = t.getElement(). But how do you get t back from e?

The other option would be to extend/monkey-patch just the PcGtsType or use pure functions:

p = page_from_file(...)
# traverse with PAGE API until TextLineType level
e = p.getElement(t)
t = p.getObject(e)
# or with a function:
e = getElement(t)
t = getObject(t, p)

If you can do that in both directions, would that not be enough?

bertsky · 2019-10-09T20:02:36Z

This could be accomplished with generateDS user methods, so you could do e.g. e = t.getElement().

Oh, okay, so this was already about how to avoid runtime injection, great.

But how do you get t back from e?

Exactly my question earlier above (from the etree to the DOM). I cannot believe lxml would not offer this.

If you can do that in both directions, would that not be enough?

All 3 options to get the parent look viable (although I would prefer the first).

I even think we should not expose the etree unnecessarily, it would be sufficient to have getParent() or pcgts.getParentFor(hierarchy_element) or getParentFor(hierarchy_element, pcgts).

But we do not need the other direction, because we have PcGtsType.get_Page(), PageType.get_TextRegion(), TextRegionType.get_TextLine(), TextLineType.get_Word(), WordType.get_Glyph() already.

What we do still need (as already mentioned in #240) though, is:

a general PageType.get_Region() (regardless of type)
a region iterator along the (recursive) reading order

bertsky · 2020-01-07T13:24:36Z

@kba we already have parent_object_ members ever since a131d14, e.g.

core/ocrd_models/ocrd_models/ocrd_page_generateds.py

Line 870 in 8e080d1

self.parent_object_ = kwargs_.get('parent_object_')

Is this intended? Is it going to stay?

kba · 2020-01-07T13:27:16Z

Is this intended? Is it going to stay?

Was not intended but is going to stay. Do you need any API beyond what generateDS provides with etree features retained in core/ocrd_models/ocrd_models/ocrd_page_generateds.py?

bertsky · 2020-01-07T14:36:32Z

Was not intended but is going to stay.

Fabulous!

Do you need any API beyond what generateDS provides with etree features retained in core/ocrd_models/ocrd_models/ocrd_page_generateds.py?

Yes (but not quite as urgent): as mentioned above...

What we do still need (as already mentioned in #240) though, is:

a general PageType.get_Region() (regardless of type)

a region iterator along the (recursive) reading order

Especially reading order handling is hard to get right, because of the many options in PAGE-XML (ordered vs unordered inside ordered vs unordered, recursive). If we want to get correct+complete implementations for this (not just top-level ordered group), we should offer a simple to use abstraction soon. For an example of (how I believe) how it should be done, see here.

bertsky · 2020-01-22T09:48:44Z

Additional demand:

something like obj2el() (either on the root PcGtsType or on any type) for easy lookup of elements by IDs without having to (recursively) search the tree

(A first use-case for this would be ocrd-segment-repair with correct-coords going through the page_validator.CoordinateValidityError and page_validator.CoordinateConsistencyError cases, trying to repair them one by one. These exceptions don't reference the actual element objects, but only their IDs.)

bertsky · 2020-01-22T11:21:19Z

A first use-case for this would be ocrd-segment-repair with correct-coords going through the page_validator.CoordinateValidityError and page_validator.CoordinateConsistencyError cases, trying to repair them one by one. These exceptions don't reference the actual element objects, but only their IDs.

Or should we rather enrich these exception classes by references to the layout element objects?

…DS/etree

kba · 2020-01-22T18:28:39Z

I've implemented a rudimentary extension of the generateDS code to allow

getting etree elements for generateDS objects and vice versa
getting etree elements and generateDS objects by xml:id
iterating over all page:*RegionType elements

The API will have to change but at least this gives developers the possibility to use xpath and get elements by ID. Once we can update generateDS, we should invest the time to create a proper integration by subclassing the generated classes.

@bertsky Can you live with this for the time being and does this allow you to move forward?

cneud · 2020-01-22T22:09:20Z

ocrd_models/ocrd_models/ocrd_page.py

@@ -61,6 +67,14 @@

 from .constants import NAMESPACES

+def di(ID):


Generated code, monkey patch and then this - this PR clearly presents a path to the dark side...

More seriously, please convince me this is necessary yet safe enough to use here and not encounter side effects as also mentioned in the SO post comments. A very positive remark from the pilot libraries testing was that there didn't occur any crashes really.

My goodness, this is becoming serious! I agree with @cneud this might be a Pandorra's box and cost us stability and debuggability.

On the other hand, I am unqualified to suggest any alternatives for getting back to object references from etree nodes. This seems indeed to be the missing link. And get_obj_by_id does look like the ultimate prize for writing better processors.

We could patch generateDS to keep actual references instead of memory adress srepresentations as the mapping key (i.e. replace the id(obj) calls with just obj). The intention with id() is to not prevent GC from freeing memory of references to generateDS classes that would be kept around as long as the mapping dict remains in scope. However, this is only in one direction, the reverse mapping uses etree Elements as key. And if anything changes in the PAGE via API, these references should be invalidated because the Elements do not reflect the state of the PAGE document anymore. If no dict is passed for the mapping_ parameter, it uses a method-local dict to hold the references and this should not be a problem since the references would go out of scope immediately.

For the "search-by-id" and "iterate-over-all-region-types" operation, walking through the object tree every time nstead of using an index that can become invalidated all the time might be a worthy tradeoff. Essentially, reimplementing ET.findall.

In any case, monkey-patching is a bad idea, obviously, it's brittle and becomes invalid for every change to the document.

cneud · 2020-01-22T22:16:11Z

ocrd_models/ocrd_models/ocrd_page.py

+        self.mapping_obj2el = {}
+        self.root_el = self.root_obj.to_etree(None, name_='PcGts', mapping_=self.mapping_obj2el)
+        self.mapping_el2obj = dict(((v, k) for k, v in self.mapping_obj2el.items()))
+        LOG.debug("OcrdPageExt.update_mappings took %ss", perf_counter()-t0)


Curious whether you tested how this affects performance - thinking of e.g. large documents like newspapers or pages with polygonal glyph segmentation that can have thousands of elements.

update_mappings will degrade performance, I cannot say by how much. But certainly enough not to call it every time the document changes.

bertsky

If there were any other way...

I recommend to postpone this for our 3.0 milestone, so we can test thoroughly.

In the meantime, for coordinate validation/repair and reading order iterators/repair, we should patch our classes with simple stupid id2obj dicts as needed.

bertsky · 2020-01-23T00:30:30Z

ocrd_models/ocrd_models/ocrd_page.py

@@ -61,6 +67,14 @@

 from .constants import NAMESPACES

+def di(ID):


My goodness, this is becoming serious! I agree with @cneud this might be a Pandorra's box and cost us stability and debuggability.

On the other hand, I am unqualified to suggest any alternatives for getting back to object references from etree nodes. This seems indeed to be the missing link. And get_obj_by_id does look like the ultimate prize for writing better processors.

kba · 2020-06-09T17:26:07Z

I think we should close this. We have a region iterator now and the get_AllAlternativeImages user method can be used as a basis for methods that require an etree document. The solution here is an interesting proof-of-concept but not production-ready. Instead of a generalized etree/generateDS mapping solution with all its pitfalls, I recommend we extend the generateDS API with methods for specific use cases, like #510.

kba added 2 commits September 13, 2019 12:26

POC to get etree references in generateDS code

d08466e

regenerate PAGE code

db55b59

EEngl52 assigned kba Oct 7, 2019

bertsky mentioned this pull request Dec 6, 2019

Feature request: status line text with segment IDs qurator-spk/dinglehopper#9

Open

kba added this to the Final workshop milestone Jan 22, 2020

kba added 3 commits January 22, 2020 15:54

Merge branch 'master' into generateds-with-etree

82f9809

🙈-patching generateDS coded with extension class for mapping generate…

206eead

…DS/etree

get_all_regions to iterate over all the different PAGE region types

8ee1e67

kba requested review from bertsky, cneud and wrznr January 22, 2020 18:28

cneud approved these changes Jan 22, 2020

View reviewed changes

bertsky approved these changes Jan 23, 2020

View reviewed changes

kba modified the milestones: Final workshop, 3.0.0 Jan 23, 2020

kba closed this Jun 9, 2020

bertsky mentioned this pull request Jun 30, 2021

ocrd_page with_etree and improved docstrings #699

Closed

bertsky mentioned this pull request Feb 2, 2022

OcrdPage: need a mechanism to avoid segment ID clashes #793

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Generateds with etree #313

[WIP] Generateds with etree #313

kba commented Sep 13, 2019

codecov-io commented Sep 13, 2019 •

edited

Loading

bertsky commented Sep 13, 2019

kba commented Oct 8, 2019

bertsky commented Oct 9, 2019

bertsky commented Jan 7, 2020

kba commented Jan 7, 2020

bertsky commented Jan 7, 2020 •

edited

Loading

bertsky commented Jan 22, 2020

bertsky commented Jan 22, 2020

kba commented Jan 22, 2020

cneud Jan 22, 2020

bertsky Jan 23, 2020

kba Jan 23, 2020

cneud Jan 22, 2020

kba Jan 23, 2020

bertsky left a comment

bertsky Jan 23, 2020

kba commented Jun 9, 2020

		@@ -61,6 +67,14 @@

		from .constants import NAMESPACES

		def di(ID):

[WIP] Generateds with etree #313

[WIP] Generateds with etree #313

Conversation

kba commented Sep 13, 2019

codecov-io commented Sep 13, 2019 • edited Loading

Codecov Report

bertsky commented Sep 13, 2019

kba commented Oct 8, 2019

bertsky commented Oct 9, 2019

bertsky commented Jan 7, 2020

kba commented Jan 7, 2020

bertsky commented Jan 7, 2020 • edited Loading

bertsky commented Jan 22, 2020

bertsky commented Jan 22, 2020

kba commented Jan 22, 2020

cneud Jan 22, 2020

Choose a reason for hiding this comment

bertsky Jan 23, 2020

Choose a reason for hiding this comment

kba Jan 23, 2020

Choose a reason for hiding this comment

cneud Jan 22, 2020

Choose a reason for hiding this comment

kba Jan 23, 2020

Choose a reason for hiding this comment

bertsky left a comment

Choose a reason for hiding this comment

bertsky Jan 23, 2020

Choose a reason for hiding this comment

kba commented Jun 9, 2020

codecov-io commented Sep 13, 2019 •

edited

Loading

bertsky commented Jan 7, 2020 •

edited

Loading