parse fails to validate result of to_xml #269

Closed
bertsky opened this issue Aug 2, 2019 · 17 comments

@bertsky
Collaborator

bertsky commented Aug 2, 2019

I get a regression with 1.0.0b11: the call to page_from_file fails in ocrd_models.ocrd_page_generateds.parse on a file previously generated by ocrd_models.ocrd_page.to_xml. (It complains in validate_ConfSimpleType that the value is a str instead of a number.)

This is what I did:

ocrd-asv-ann-evaluate -m $mets -I OCR-D-GT-SEG-LINE,OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP

where all the OCR file grps are from a previous recognize processor in a long chain that runs through ok. See here for what the processor does.

This is what happens:

16:05:16.373 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0001
16:05:16.375 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.378 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.381 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.383 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.385 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.387 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0002
16:05:16.389 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.391 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.393 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.396 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.399 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.401 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0003
16:05:16.402 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.405 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.407 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.410 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.412 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.415 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0004
16:05:16.417 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.419 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.422 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.424 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.427 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.430 INFO processor.EvaluateLines - processing page phys_0001
16:05:16.431 INFO processor.EvaluateLines - INPUT FILE for OCR-D-GT-SEG-LINE: OCR-D-GT-SEG-LINE_0001
16:05:16.465 INFO processor.EvaluateLines - INPUT FILE for OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP: OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP_0001
Traceback (most recent call last):
  File "/home/xbert/unsortiert/arbeit/heyer/tools/ocrd_tesserocr/env3/bin/ocrd-asv-ann-evaluate", line 11, in <module>
    load_entry_point('ocrd-cor-asv-ann', 'console_scripts', 'ocrd-asv-ann-evaluate')()
  File "click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/xbert/unsortiert/arbeit/heyer/ocr-d/cor-asv-ann/.gitworktree-master/ocrd_cor_asv_ann/wrapper/cli.py", line 16, in ocrd_cor_asv_ann_evaluate
    return ocrd_cli_wrap_processor(EvaluateLines, *args, **kwargs)
  File "ocrd/decorators.py", line 38, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "ocrd/processor/base.py", line 65, in run_processor
    processor.process()
  File "/home/xbert/unsortiert/arbeit/heyer/ocr-d/cor-asv-ann/.gitworktree-master/ocrd_cor_asv_ann/wrapper/evaluate.py", line 71, in process
    pcgts = page_from_file(self.workspace.download_file(input_file))
  File "ocrd_modelfactory/__init__.py", line 71, in page_from_file
    return parse(input_file.local_filename, silence=True)
  File "ocrd_models/ocrd_page_generateds.py", line 11222, in parse
    rootObj.build(rootNode)
  File "ocrd_models/ocrd_page_generateds.py", line 1069, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 1084, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 2406, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 2544, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 11073, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 11155, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 3057, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 3122, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 3446, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 3499, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 3776, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 3837, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 4013, in build
    self.buildAttributes(node, node.attrib, already_processed)
  File "ocrd_models/ocrd_page_generateds.py", line 4030, in buildAttributes
    self.validate_ConfSimpleType(self.conf)    # validate type ConfSimpleType
  File "ocrd_models/ocrd_page_generateds.py", line 3934, in validate_ConfSimpleType
    if value < 0:
TypeError: '<' not supported between instances of 'str' and 'int'

The incriminated PAGE-XML is OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP_0001.xml.gz. It validates fine under http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15.

@bertsky bertsky added the bug label Aug 2, 2019
@bertsky
Collaborator Author

bertsky commented Aug 2, 2019

This is a real showstopper. It effectively breaks all further processing of OCR results. And ocrd_tesserocr master is now dependent on b11...

@bertsky
Collaborator Author

bertsky commented Aug 2, 2019

NB: JPageViewer 1.3 does render the file correctly after replacing 2019 with 2018 and removing Page/@orientation.

@wrznr Have you experienced anything similar yet?

@bertsky
Collaborator Author

bertsky commented Aug 2, 2019

BTW, it does help to manually remove all TextEquiv/@conf.

@kba
Member

kba commented Aug 2, 2019

Sorry about that, will try to fix ASAP. I updated generateDS before regenerating the page API, maybe something changed about how the @conf attribute is parsed...

@mikegerber
Contributor

I have the same problem, using ocrd-tesserocr. Workaround:

xmlstarlet ed --inplace \
  -N 'page=http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15' \
  -d '//page:TextEquiv/@conf' OCR-D-OCR-TESS/*
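
For anyone without xmlstarlet, the same workaround can be sketched with the Python standard library. This is only a rough equivalent, not part of OCR-D core: the function name strip_textequiv_conf is made up for illustration, and ElementTree may rewrite namespace prefixes when serializing.

```python
import xml.etree.ElementTree as ET

# Namespace URL taken from the xmlstarlet command above.
PAGE_NS = 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15'


def strip_textequiv_conf(xml_text):
    """Remove the conf attribute from every TextEquiv element.

    Hypothetical helper: returns the serialized document and the
    number of attributes that were removed.
    """
    root = ET.fromstring(xml_text)
    removed = 0
    for elem in root.iter('{%s}TextEquiv' % PAGE_NS):
        # pop() returns None if the attribute was not present
        if elem.attrib.pop('conf', None) is not None:
            removed += 1
    return ET.tostring(root, encoding='unicode'), removed
```

To apply it to files, read each PAGE-XML file, pass its content through the function, and write the result back.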

@kba
Member

kba commented Aug 5, 2019

The pertinent diff in the generated code:

-            try:
-                self.conf = float(value)
-            except ValueError as exp:
-                raise ValueError('Bad float/double attribute (conf): %s' % exp)
+            self.conf = value
+            self.validate_ConfSimpleType(self.conf)    # validate type ConfSimpleType

There is no more casting to float in the current code. Hence all of

set_conf("1")
set_conf(int(1))
set_conf(1.0)

are accepted and stored as-is (as str, int and float, respectively), but only the third one is valid. I am investigating in which generateDS version between 2.30.11 and 2.33.1 this changed and whether the cast can be re-enabled.
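
The effect on parsing can be reproduced without OCR-D at all. The following is a minimal sketch, not the generated code: build_conf_old, build_conf_new and validate_conf are hypothetical stand-ins, and only the `value < 0` comparison and the float cast are taken from the traceback and diff above.

```python
def validate_conf(value):
    # Stand-in for validate_ConfSimpleType; only the `value < 0`
    # comparison is known from the traceback, the rest is assumed.
    if value < 0:
        return False
    return value <= 1


def build_conf_old(raw):
    # Old behaviour: the attribute string is cast to float first.
    try:
        return float(raw)
    except ValueError as exp:
        raise ValueError('Bad float/double attribute (conf): %s' % exp)


def build_conf_new(raw):
    # New behaviour: the raw attribute value is stored and validated as-is.
    conf = raw
    validate_conf(conf)
    return conf
```

XML attribute values always arrive as str, so build_conf_new("0.9") ends in the same TypeError as in the original report, while build_conf_old("0.9") yields a float.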

@kba
Member

kba commented Aug 5, 2019

Problem first appeared in the 2.31.1 release. I could not find a setting to make this configurable, so for now I'll revert generateDS to 2.30.11 and publish another beta 12 that is the same except for how the PAGE API is generated.

@kba
Member

kba commented Jan 22, 2020

I see lots of fixes for conversion between xsd: types and python primitives in generateDS 2.35.9. I won't update the generated code now because regressions from this are the last thing we need at the moment but we will revisit and fix this as soon as the final workshop is over.

kba added a commit to kba/ocrd-core that referenced this issue Feb 13, 2020
@kba
Member

kba commented Feb 13, 2020

I've regenerated the PAGE API in #437 with generateDS 2.35.13 and the type issues are fixed. I've tried to recreate your initial problem and could not with test-269.zip. @bertsky Can you try #437 and/or have any pointers what I should test for to avoid future regressions?

kba added a commit to kba/ocrd-core that referenced this issue Feb 18, 2020
kba added a commit that referenced this issue Mar 12, 2020
This reverts commit 3a0a3a8.

Conflicts:
	tests/model/test_ocrd_page.py
@kba
Member

kba commented Apr 29, 2020

@bertsky can this be closed?

@bertsky
Collaborator Author

bertsky commented Apr 29, 2020

I am afraid the current version now (due to the missing NS prefix) mixes elements with prefix (unchanged from input) and without (new elements), which our validator checks fine but PageViewer rejects. Open a new issue?

@bertsky
Collaborator Author

bertsky commented Apr 29, 2020

which our validator checks fine

But in fact these are invalid, because no prefix is only allowed when you have an xmlns=DEFAULT-NS-URL in the header.

but PageViewer rejects

PageViewer is okay with core-generated PAGE-XML when I add a default xmlns.
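
To illustrate the point about prefixes, here is a small sketch using only the Python standard library. The two documents are made-up minimal examples, not actual core output, and element_tags is a hypothetical helper; it lists the namespace-qualified tag of every element.

```python
import xml.etree.ElementTree as ET

PAGE_NS = 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15'

# Prefixed elements mixed with an unprefixed one, no default namespace.
MIXED = (
    '<pc:PcGts xmlns:pc="%s"><pc:Page><TextRegion/></pc:Page></pc:PcGts>'
    % PAGE_NS
)

# Same document, but with a default xmlns declared on the root.
WITH_DEFAULT = (
    '<pc:PcGts xmlns:pc="%s" xmlns="%s"><pc:Page><TextRegion/></pc:Page></pc:PcGts>'
    % (PAGE_NS, PAGE_NS)
)


def element_tags(xml_text):
    """Return the tags of all elements in document order (hypothetical helper)."""
    return [elem.tag for elem in ET.fromstring(xml_text).iter()]
```

In MIXED, the unprefixed TextRegion ends up in no namespace at all, so it is not a PAGE element. In WITH_DEFAULT, all three elements resolve to the PAGE namespace.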

@bertsky
Collaborator Author

bertsky commented Apr 29, 2020

Also, I cannot revert to 2.5.1 because there have not been git tags (only GH releases) since 2.5.0 ...

@bertsky
Collaborator Author

bertsky commented Apr 29, 2020

@kba Since #443 is already merged, this is urgent.

@kba
Member

kba commented Apr 29, 2020

@kba Since #443 is already merged, this is urgent.

OK, I'm looking into it. Namespace prefixes be damned.

Also, I cannot revert to 2.5.1 because there have not been git tags (only GH releases) since 2.5.0 ...

That is strange. Are you sure you did git pull --tags? Our releases are always based on a tag.

@bertsky
Collaborator Author

bertsky commented Apr 29, 2020

That is strange. Are you sure you did git pull --tags? Our releases are always based on a tag.

Oh sorry – you're right of course. I did not. (I was under the impression that they are fetched automatically, and I have to disable that via --no-tags. Turns out these are different 'kinds' of tag. Stupid git interfaces – I used to be so happy with mercurial...)

@bertsky
Collaborator Author

bertsky commented May 15, 2020

Solved by #474 (but hopefully also upstream in generateDS some day).

@bertsky bertsky closed this as completed May 15, 2020