Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Generateds with etree #313

Closed
wants to merge 5 commits into from
Closed

Conversation

kba
Copy link
Member

@kba kba commented Sep 13, 2019

@bertsky Here's a quick shot at getting etree references into the generated PAGE API, so you can access parent elements/objects, search the tree with xpath etc.

This is only the proof of concept how to set this up, we still need to discuss an API.

We could dig into generateDS some more and find out how to customize the element classes (get_element) or add methods on the root object (get_element_for_object). I'd prefer the latter because it would also allow easy get_object_for_element).

@codecov-io
Copy link

codecov-io commented Sep 13, 2019

Codecov Report

Merging #313 into master will decrease coverage by 8.73%.
The diff coverage is 69.05%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #313      +/-   ##
==========================================
- Coverage   91.18%   82.44%   -8.74%     
==========================================
  Files          30       39       +9     
  Lines        1610     2353     +743     
  Branches      308      424     +116     
==========================================
+ Hits         1468     1940     +472     
- Misses        107      340     +233     
- Partials       35       73      +38
Impacted Files Coverage Δ
..._validators/ocrd_validators/parameter_validator.py 100% <ø> (ø) ⬆️
ocrd/ocrd/__init__.py 100% <ø> (ø) ⬆️
ocrd_utils/ocrd_utils/constants.py 100% <ø> (ø) ⬆️
ocrd/ocrd/cli/zip.py 67.44% <0%> (ø)
ocrd/ocrd/processor/__init__.py 100% <100%> (ø)
ocrd/ocrd/resolver.py 98.85% <100%> (+0.05%) ⬆️
ocrd_utils/ocrd_utils/logging.py 100% <100%> (ø) ⬆️
ocrd/ocrd/workspace_bagger.py 100% <100%> (ø) ⬆️
ocrd/ocrd/cli/validate.py 100% <100%> (ø)
ocrd/ocrd/cli/__init__.py 100% <100%> (ø)
... and 27 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update db55b59...8ee1e67. Read the comment docs.

@bertsky
Copy link
Collaborator

bertsky commented Sep 13, 2019

Very interesting! I can see a parent_object_ member in the direct model now, and lots of nice functions like getparent() and other getters, iterators and so on lxml model, i.e. the object from to_etree(). (But how do I get back from the etree to the DOM?)

This is only the proof of concept how to set this up, we still need to discuss an API.

Splendid! Yes, maybe we could look for examples of APIs for similar schemas?

The original Java API will probably not help, because it is too far away from the XSD identifiers. But a bare etree API is inadequate, too (e.g. tei-reader simply exposes the lxml etree via .node).

We could dig into generateDS some more and find out how to customize the element classes (get_element) or add methods on the root object (get_element_for_object). I'd prefer the latter because it would also allow easy get_object_for_element).

I am afraid I do not understand that really. My original idea (in #240) was to use generateDS's user methods to supplement the code, not the runtime lxml injection.

@kba
Copy link
Member Author

kba commented Oct 8, 2019

We could dig into generateDS some more and find out how to customize the element classes (get_element) or add methods on the root object (get_element_for_object). I'd prefer the latter because it would also allow easy get_object_for_element).

I am afraid I do not understand that really. My original idea (in #240) was to use generateDS's user methods to supplement the code, not the runtime lxml injection.

Say you have TextLineType t object and want to get the etree element e for that object, to execute an Xpath or whatever.

This could be accomplished with generateDS user methods, so you could do e.g. e = t.getElement(). But how do you get t back from e?

The other option would be to extend/monkey-patch just the PcGtsType or use pure functions:

p = page_from_file(...)
# traverse with PAGE API until TextLineType level
e = p.getElement(t)
t = p.getObject(e)
# or with a function:
e = getElement(t)
t = getObject(t, p)

If you can do that in both directions, would that not be enough?

@bertsky
Copy link
Collaborator

bertsky commented Oct 9, 2019

This could be accomplished with generateDS user methods, so you could do e.g. e = t.getElement().

Oh, okay, so this was already about how to avoid runtime injection, great.

But how do you get t back from e?

Exactly my question earlier above (from the etree to the DOM). I cannot believe lxml would not offer this.

If you can do that in both directions, would that not be enough?

All 3 options to get the parent look viable (although I would prefer the first).

I even think we should not expose the etree unnecessarily, it would be sufficient to have getParent() or pcgts.getParentFor(hierarchy_element) or getParentFor(hierarchy_element, pcgts).

But we do not need the other direction, because we have PcGtsType.get_Page(), PageType.get_TextRegion(), TextRegionType.get_TextLine(), TextLineType.get_Word(), WordType.get_Glyph() already.

What we do still need (as already mentioned in #240) though, is:

  • a general PageType.get_Region() (regardless of type)
  • a region iterator along the (recursive) reading order

@bertsky
Copy link
Collaborator

bertsky commented Jan 7, 2020

@kba we already have parent_object_ members ever since a131d14, e.g.

self.parent_object_ = kwargs_.get('parent_object_')

Is this intended? Is it going to stay?

@kba
Copy link
Member Author

kba commented Jan 7, 2020

Is this intended? Is it going to stay?

Was not intended but is going to stay. Do you need any API beyond what generateDS provides with etree features retained in core/ocrd_models/ocrd_models/ocrd_page_generateds.py?

@bertsky
Copy link
Collaborator

bertsky commented Jan 7, 2020

Was not intended but is going to stay.

Fabulous!

Do you need any API beyond what generateDS provides with etree features retained in core/ocrd_models/ocrd_models/ocrd_page_generateds.py?

Yes (but not quite as urgent): as mentioned above...

What we do still need (as already mentioned in #240) though, is:

  • a general PageType.get_Region() (regardless of type)
  • a region iterator along the (recursive) reading order

Especially reading order handling is hard to get right, because of the many options in PAGE-XML (ordered vs unordered inside ordered vs unordered, recursive). If we want to get correct+complete implementations for this (not just top-level ordered group), we should offer a simple to use abstraction soon. For an example of (how I believe) how it should be done, see here.

@bertsky
Copy link
Collaborator

bertsky commented Jan 22, 2020

Additional demand:

  • something like obj2el() (either on the root PcGtsType or on any type) for easy lookup of elements by IDs without having to (recursively) search the tree

(A first use-case for this would be ocrd-segment-repair with correct-coords going through the page_validator.CoordinateValidityError and page_validator.CoordinateConsistencyError cases, trying to repair them one by one. These exceptions don't reference the actual element objects, but only their IDs.)

@kba kba added this to the Final workshop milestone Jan 22, 2020
@bertsky
Copy link
Collaborator

bertsky commented Jan 22, 2020

A first use-case for this would be ocrd-segment-repair with correct-coords going through the page_validator.CoordinateValidityError and page_validator.CoordinateConsistencyError cases, trying to repair them one by one. These exceptions don't reference the actual element objects, but only their IDs.

Or should we rather enrich these exception classes by references to the layout element objects?

@kba
Copy link
Member Author

kba commented Jan 22, 2020

I've implemented a rudimentary extension of the generateDS code to allow

  • getting etree elements for generateDS objects and vice versa
  • getting etree elements and generateDS objects by xml:id
  • iterating over all page:*RegionType elements

The API will have to change but at least this gives developers the possibility to use xpath and get elements by ID. Once we can update generateDS, we should invest the time to create a proper integration by subclassing the generated classes.

@bertsky Can you live with this for the time being and does this allow you to move forward?

@kba kba requested review from bertsky, cneud and wrznr January 22, 2020 18:28
@@ -61,6 +67,14 @@

from .constants import NAMESPACES

def di(ID):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generated code, monkey patch and then this - this PR clearly presents a path to the dark side...

More seriously, please convince me this is necessary yet safe enough to use here and not encounter side effects as also mentioned in the SO post comments. A very positive remark from the pilot libraries testing was that there didn't occur any crashes really.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My goodness, this is becoming serious! I agree with @cneud this might be a Pandorra's box and cost us stability and debuggability.

On the other hand, I am unqualified to suggest any alternatives for getting back to object references from etree nodes. This seems indeed to be the missing link. And get_obj_by_id does look like the ultimate prize for writing better processors.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could patch generateDS to keep actual references instead of memory adress srepresentations as the mapping key (i.e. replace the id(obj) calls with just obj). The intention with id() is to not prevent GC from freeing memory of references to generateDS classes that would be kept around as long as the mapping dict remains in scope. However, this is only in one direction, the reverse mapping uses etree Elements as key. And if anything changes in the PAGE via API, these references should be invalidated because the Elements do not reflect the state of the PAGE document anymore. If no dict is passed for the mapping_ parameter, it uses a method-local dict to hold the references and this should not be a problem since the references would go out of scope immediately.

For the "search-by-id" and "iterate-over-all-region-types" operation, walking through the object tree every time nstead of using an index that can become invalidated all the time might be a worthy tradeoff. Essentially, reimplementing ET.findall.

In any case, monkey-patching is a bad idea, obviously, it's brittle and becomes invalid for every change to the document.

self.mapping_obj2el = {}
self.root_el = self.root_obj.to_etree(None, name_='PcGts', mapping_=self.mapping_obj2el)
self.mapping_el2obj = dict(((v, k) for k, v in self.mapping_obj2el.items()))
LOG.debug("OcrdPageExt.update_mappings took %ss", perf_counter()-t0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious whether you tested how this affects performance - thinking of e.g. large documents like newspapers or pages with polygonal glyph segmentation that can have thousands of elements.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update_mappings will degrade performance, I cannot say by how much. But certainly enough not to call it every time the document changes.

Copy link
Collaborator

@bertsky bertsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there were any other way...

I recommend to postpone this for our 3.0 milestone, so we can test thoroughly.

In the meantime, for coordinate validation/repair and reading order iterators/repair, we should patch our classes with simple stupid id2obj dicts as needed.

@@ -61,6 +67,14 @@

from .constants import NAMESPACES

def di(ID):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My goodness, this is becoming serious! I agree with @cneud this might be a Pandorra's box and cost us stability and debuggability.

On the other hand, I am unqualified to suggest any alternatives for getting back to object references from etree nodes. This seems indeed to be the missing link. And get_obj_by_id does look like the ultimate prize for writing better processors.

@kba kba modified the milestones: Final workshop, 3.0.0 Jan 23, 2020
@kba
Copy link
Member Author

kba commented Jun 9, 2020

I think we should close this. We have a region iterator now and the get_AllAlternativeImages user method can be used as a basis for methods that require an etree document. The solution here is an interesting proof-of-concept but not production-ready. Instead of a generalized etree/generateDS mapping solution with all its pitfalls, I recommend we extend the generateDS API with methods for specific use cases, like #510.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants