[WIP] Generateds with etree #313
Codecov Report
@@            Coverage Diff             @@
##           master     #313      +/-   ##
==========================================
- Coverage   91.18%   82.44%    -8.74%
==========================================
  Files          30       39       +9
  Lines        1610     2353     +743
  Branches      308      424     +116
==========================================
+ Hits         1468     1940     +472
- Misses        107      340     +233
- Partials       35       73      +38
|
Very interesting! I can see a
Splendid! Yes, maybe we could look for examples of APIs for similar schemas? The original Java API will probably not help, because it is too far away from the XSD identifiers. But a bare etree API is inadequate, too (e.g.
I am afraid I do not understand that really. My original idea (in #240) was to use generateDS's user methods to supplement the code, not the runtime lxml injection. |
Say you have This could be accomplished with generateDS user methods, so you could do e.g. The other option would be to extend/monkey-patch just the
p = page_from_file(...)
# traverse with PAGE API until TextLineType level
e = p.getElement(t)
t = p.getObject(e)
# or with a function:
e = getElement(t)
t = getObject(e, p)
If you can do that in both directions, would that not be enough? |
Oh, okay, so this was already about how to avoid runtime injection, great.
Exactly my question earlier above (from the etree to the DOM). I cannot believe lxml would not offer this.
All 3 options to get the parent look viable (although I would prefer the first). I even think we should not expose the etree unnecessarily, it would be sufficient to have But we do not need the other direction, because we have What we do still need (as already mentioned in #240) though, is:
|
Was not intended but is going to stay. Do you need any API beyond what generateDS provides with |
Fabulous!
Yes (but not quite as urgent): as mentioned above...
Reading order handling in particular is hard to get right, because PAGE-XML allows many combinations (ordered vs. unordered groups, nested inside ordered vs. unordered groups, recursively). If we want correct and complete implementations of this (not just the top-level ordered group), we should offer a simple-to-use abstraction soon. For an example of how I believe it should be done, see here. |
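To make the nesting problem concrete, here is a minimal sketch of such an abstraction. It does not use the generated PAGE classes: `RegionRef` and `Group` are hypothetical stand-ins, and the assumption is that children of an ordered group carry an explicit index while children of an unordered group are taken in document order.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Union


@dataclass
class RegionRef:
    """Hypothetical stand-in for a region reference in the reading order."""
    region_ref: str
    index: Optional[int] = None  # only meaningful inside an ordered group


@dataclass
class Group:
    """Hypothetical stand-in for an ordered or unordered group."""
    ordered: bool
    children: List[Union["Group", RegionRef]] = field(default_factory=list)
    index: Optional[int] = None  # set when the group sits inside an ordered group


def iter_reading_order(group: Group) -> Iterator[str]:
    """Yield region IDs depth-first, sorting by index inside ordered groups."""
    children = group.children
    if group.ordered:
        # Ordered groups follow the explicit index, not the position in the file.
        children = sorted(children, key=lambda c: c.index if c.index is not None else 0)
    for child in children:
        if isinstance(child, Group):
            yield from iter_reading_order(child)  # groups may nest arbitrarily
        else:
            yield child.region_ref


# An ordered top-level group containing an unordered group at index 1.
ro = Group(ordered=True, children=[
    RegionRef("r3", index=2),
    Group(ordered=False, index=1, children=[RegionRef("r5"), RegionRef("r4")]),
    RegionRef("r1", index=0),
])
print(list(iter_reading_order(ro)))  # ['r1', 'r5', 'r4', 'r3']
```

A processor would then only consume the flat sequence of IDs and would not need to know which kind of group each region came from.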
Additional demand:
(A first use-case for this would be ocrd-segment-repair with correct-coords going through the |
Or should we rather enrich these exception classes by references to the layout element objects? |
I've implemented a rudimentary extension of the generateDS code to allow
The API will have to change but at least this gives developers the possibility to use xpath and get elements by ID. Once we can update generateDS, we should invest the time to create a proper integration by subclassing the generated classes. @bertsky Can you live with this for the time being and does this allow you to move forward? |
@@ -61,6 +67,14 @@

from .constants import NAMESPACES

def di(ID):
Generated code, monkey patch and then this - this PR clearly presents a path to the dark side...
More seriously, please convince me that this is necessary yet safe enough to use here without running into the side effects mentioned in the SO post comments. A very positive remark from the pilot libraries' testing was that there were really no crashes.
My goodness, this is becoming serious! I agree with @cneud that this might be a Pandora's box and cost us stability and debuggability.
On the other hand, I am unqualified to suggest any alternatives for getting back to object references from etree nodes. This does indeed seem to be the missing link. And get_obj_by_id does look like the ultimate prize for writing better processors.
We could patch generateDS to keep actual references instead of memory address representations as the mapping key (i.e. replace the id(obj) calls with just obj). The intention behind id() is to avoid preventing the GC from freeing generateDS objects, which would otherwise be kept alive for as long as the mapping dict remains in scope. However, this holds in one direction only: the reverse mapping uses etree Elements as keys. And if anything changes in the PAGE via the API, these references should be invalidated, because the Elements no longer reflect the state of the PAGE document. If no dict is passed for the mapping_ parameter, it uses a method-local dict to hold the references, and this should not be a problem since the references would go out of scope immediately.
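The difference between the two kinds of key can be seen in a small, self-contained sketch. `TextLine` is a hypothetical stand-in for a generated class, and the immediate freeing in the first half relies on CPython's reference counting.

```python
import weakref


class TextLine:
    """Hypothetical stand-in for a generated PAGE class."""


# Variant 1: key by id(obj). The key is a plain int and holds no reference.
line_a = TextLine()
alive_a = weakref.ref(line_a)     # lets us observe whether the object still exists
by_id = {id(line_a): "<element>"}
del line_a
print(alive_a() is None)          # expected True on CPython: nothing keeps the object alive

# Variant 2: key by the object itself. The dict now owns a strong reference.
line_b = TextLine()
alive_b = weakref.ref(line_b)
by_obj = {line_b: "<element>"}
del line_b
print(alive_b() is None)          # False: still reachable through the dict key
```

Note that the stale integer left in `by_id` may later coincide with the id() of an unrelated new object, which is one more reason such a mapping has to be discarded once the document changes.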
For the "search-by-id" and "iterate-over-all-region-types" operations, walking through the object tree every time, instead of using an index that can become invalid at any time, might be a worthwhile tradeoff. Essentially, this means reimplementing ET.findall.
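As a rough illustration of the walk-every-time variant, the sketch below searches a tree of plain objects without any index. `Node` is a hypothetical stand-in for the generated classes; the only assumptions are that children live in list-valued attributes and that the structure is a tree (no cycles).

```python
from typing import Any, Iterator, Optional


class Node:
    """Hypothetical stand-in for a generated class: an id plus child lists."""

    def __init__(self, ident: str, **children: list):
        self.id = ident
        for name, value in children.items():
            setattr(self, name, value)


def iter_objects(obj: Any) -> Iterator[Any]:
    """Depth-first walk over an object and everything in its list attributes."""
    yield obj
    for value in vars(obj).values():
        if isinstance(value, list):
            for child in value:
                if hasattr(child, "__dict__"):
                    yield from iter_objects(child)


def find_by_id(root: Any, wanted: str) -> Optional[Any]:
    """Linear search on every call: slower than an index, but never stale."""
    return next((o for o in iter_objects(root) if getattr(o, "id", None) == wanted), None)


line = Node("l1")
page = Node("p1", TextRegion=[Node("r1", TextLine=[line])], ImageRegion=[Node("r2")])
print(find_by_id(page, "l1") is line)  # True
```

The cost is one full traversal per lookup, but the result always reflects the current state of the object tree.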
In any case, monkey-patching is obviously a bad idea: it is brittle, and it becomes invalid with every change to the document.
self.mapping_obj2el = {}
self.root_el = self.root_obj.to_etree(None, name_='PcGts', mapping_=self.mapping_obj2el)
self.mapping_el2obj = dict(((v, k) for k, v in self.mapping_obj2el.items()))
LOG.debug("OcrdPageExt.update_mappings took %ss", perf_counter()-t0)
Curious whether you tested how this affects performance - thinking of e.g. large documents like newspapers or pages with polygonal glyph segmentation that can have thousands of elements.
update_mappings will degrade performance; I cannot say by how much. But certainly enough not to call it every time the document changes.
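One way to get a feeling for the cost is to time a full rebuild on synthetic data. The sketch below is not the PR's update_mappings: it uses the standard library's ElementTree instead of lxml, a hypothetical `Glyph` class instead of the generated classes, and keys the forward mapping by the object itself. The absolute numbers will differ for real PAGE documents.

```python
from time import perf_counter
import xml.etree.ElementTree as ET


class Glyph:
    """Hypothetical stand-in for a generated class."""

    def __init__(self, ident):
        self.id = ident


def build_mappings(objs):
    """Create one element per object and return (root, obj2el, el2obj)."""
    root = ET.Element("Page")
    obj2el = {}
    for obj in objs:
        obj2el[obj] = ET.SubElement(root, "Glyph", id=obj.id)
    # Invert the forward mapping, as the PR does for mapping_el2obj.
    el2obj = {el: obj for obj, el in obj2el.items()}
    return root, obj2el, el2obj


# A page with polygonal glyph segmentation can easily reach this order of magnitude.
glyphs = [Glyph("g%d" % i) for i in range(10000)]
t0 = perf_counter()
root, obj2el, el2obj = build_mappings(glyphs)
elapsed = perf_counter() - t0
print("rebuilt mappings for %d objects in %.4fs" % (len(obj2el), elapsed))
```

Since the work grows with the number of elements, running such a rebuild after every single modification would multiply that cost by the number of edits.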
If there were any other way...
I recommend postponing this until our 3.0 milestone, so we can test thoroughly.
In the meantime, for coordinate validation/repair and reading order iterators/repair, we should patch our classes with simple stupid id2obj dicts as needed.
I think we should close this. We have a region iterator now and the |
@bertsky Here's a quick shot at getting etree references into the generated PAGE API, so you can access parent elements/objects, search the tree with xpath etc.
This is only a proof of concept of how to set this up; we still need to discuss an API.
We could dig into generateDS some more and find out how to customize the element classes (get_element) or add methods on the root object (get_element_for_object). I'd prefer the latter because it would also allow an easy get_object_for_element.
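A minimal sketch of the second option, with both lookup directions on a root object, could look as follows. `PageRoot`, `register` and `Obj` are hypothetical names for illustration; only get_element_for_object and get_object_for_element are taken from the description above. The standard library's ElementTree stands in for lxml so that the example is self-contained.

```python
import xml.etree.ElementTree as ET


class Obj:
    """Hypothetical stand-in for a generated PAGE object."""

    def __init__(self, ident):
        self.id = ident


class PageRoot:
    """Hypothetical root wrapper that owns both lookup directions."""

    def __init__(self):
        self._obj2el = {}
        self._el2obj = {}

    def register(self, obj, element):
        # Keyed by the objects themselves; both dicts must be rebuilt
        # whenever the document changes.
        self._obj2el[obj] = element
        self._el2obj[element] = obj

    def get_element_for_object(self, obj):
        return self._obj2el[obj]

    def get_object_for_element(self, element):
        return self._el2obj[element]


# Build a tiny tree: Page > TextRegion > TextLine
page_el = ET.Element("Page")
region_el = ET.SubElement(page_el, "TextRegion", id="r1")
line_el = ET.SubElement(region_el, "TextLine", id="l1")

region_obj, line_obj = Obj("r1"), Obj("l1")
root = PageRoot()
root.register(region_obj, region_el)
root.register(line_obj, line_el)

# object -> element -> parent element (via a path query) -> parent object
el = root.get_element_for_object(line_obj)
parent_el = page_el.find(".//TextLine[@id='%s']/.." % el.get("id"))
print(root.get_object_for_element(parent_el) is region_obj)  # True
```

With lxml the parent step would be a direct call on the element, but the round trip from object to element and back is the same.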