parse fails to validate result of to_xml #269
Comments
This is a real showstopper. It effectively breaks all further processing of OCR results. And ocrd_tesserocr master is now dependent on b11...
NB: JPageViewer 1.3 does render the file correctly after replacing 2019 with 2018 and removing
@wrznr Have you experienced anything similar yet?
BTW, it does help to manually remove all
Sorry about that, will try to fix ASAP. I updated generateDS before regenerating the page API, maybe something changed about how the @conf attribute is parsed...
I have the same problem, using ocrd-tesserocr. Workaround:
The pertinent diff in the generated code:
-            try:
-                self.conf = float(value)
-            except ValueError as exp:
-                raise ValueError('Bad float/double attribute (conf): %s' % exp)
+            self.conf = value
+            self.validate_ConfSimpleType(self.conf)    # validate type ConfSimpleType
There is no more casting to float in the current code. Hence all of
set_conf("1")
set_conf(int(1))
set_conf(1.0)
are accepted and stored as
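The effect of dropping the cast can be illustrated with a minimal sketch. The three functions below are hypothetical stand-ins written for this illustration, not the actual generateDS output, and the validator assumes that ConfSimpleType is a float restricted to the range 0 to 1.

```python
# Hypothetical stand-ins for the generated attribute handling;
# these are NOT the real generateDS functions.

def parse_conf_old(value):
    """Mimics the older generated code: cast the attribute to float or fail."""
    try:
        return float(value)
    except ValueError as exp:
        raise ValueError('Bad float/double attribute (conf): %s' % exp)


def parse_conf_new(value):
    """Mimics the newer generated code: keep the raw value, no cast."""
    return value


def validate_conf(value):
    """Simplified check, assuming ConfSimpleType is a float in [0, 1]."""
    return isinstance(value, float) and 0.0 <= value <= 1.0


raw = "0.95"  # XML attribute values always arrive as strings

old = parse_conf_old(raw)  # a float, passes the check
new = parse_conf_new(raw)  # still a str, fails the check

print(type(old).__name__, validate_conf(old))
print(type(new).__name__, validate_conf(new))
```

Under these assumptions, a string that survives parsing unchanged is exactly what a type-checking validator would reject, which is consistent with the reported failure. Casting explicitly (for example `float(value)`) before storing the value avoids the mismatch in this sketch.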
The problem first appeared in the 2.31.1 release. I could not find a setting to make this configurable, so for now I'll revert generateDS to 2.30.11 and publish another beta 12 that is the same except for how the PAGE API is generated.
I see lots of fixes for conversion between
I've regenerated the PAGE API in #437 with generateDS 2.35.13 and the type issues are fixed. I've tried to recreate your initial problem and could not with test-269.zip. @bertsky Can you try #437, and/or do you have any pointers on what I should test for to avoid future regressions?
This reverts commit 3a0a3a8. Conflicts: tests/model/test_ocrd_page.py
@bertsky can this be closed?
I am afraid the current version now (due to the missing NS prefix) mixes elements with prefix (unchanged from input) and without (new elements), which our validator checks fine but PageViewer rejects. Open a new issue?
But in fact these are invalid, because no prefix is only allowed when you have an
PageViewer is okay with core-generated PAGE-XML when I add a default xmlns.
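Why the mixed output is a problem can be seen with a small standard-library sketch. The two XML fragments are made up for illustration (they are not taken from the attached files), and `region_tag` is a hypothetical helper; only the namespace URI comes from the issue.

```python
import xml.etree.ElementTree as ET

NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Prefixed elements mixed with an unprefixed one, no default namespace declared.
mixed = (
    '<pc:PcGts xmlns:pc="%s">'
    '<pc:Page><TextRegion id="r1"/></pc:Page>'
    '</pc:PcGts>' % NS
)

# Same document, but with a default xmlns added on the root element.
with_default = (
    '<pc:PcGts xmlns:pc="%s" xmlns="%s">'
    '<pc:Page><TextRegion id="r1"/></pc:Page>'
    '</pc:PcGts>' % (NS, NS)
)


def region_tag(xml_text):
    """Return the fully qualified tag of the first child of the Page element."""
    root = ET.fromstring(xml_text)
    page = root[0]
    return page[0].tag


mixed_tag = region_tag(mixed)           # no namespace at all
default_tag = region_tag(with_default)  # resolved into the PAGE namespace

print(mixed_tag)
print(default_tag)
```

Without a default namespace declaration, the unprefixed element belongs to no namespace, so a strict consumer has grounds to reject it. With the default xmlns in place, both spellings resolve to the same qualified name.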
Also, I cannot revert to 2.5.1 because there have not been git tags (only GH releases) since 2.5.0 ...
OK, I'm looking into it. Namespace prefixes be damned.
That is strange. Are you sure you did
Oh sorry – you're right of course. I did not. (I was under the impression that they are fetched automatically, and I have to disable that via
Solved by #474 (but hopefully also upstream in generateDS some day).
I get a regression with 1.0.0b11: The call to page_from_file fails at ocrd_models_generateds.parse on a file previously generated by ocrd_models.ocrd_page.to_xml. (It complains in validate_ConfSimpleType that the value is a str instead of a number.)
This is what I did:
ocrd-asv-ann-evaluate -m $mets -I OCR-D-GT-SEG-LINE,OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP
where all the OCR file grps are from a previous recognize processor in a long chain that runs through ok. See here for what the processor does.
This is what happens:
The offending PAGE-XML is OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP_0001.xml.gz. It validates fine under http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15.