Processor result object #8
OCR-D/ocrd_kraken#44 is adapted.
Thanks for starting this and spotting all these typing errors!
I would prefer calling the new class `OcrdPageResult` instead of `OcrdProcessResult`, because
- this is about PcGts / OcrdPage objects primarily
- this is about the single-page function, while "processing" and "processor" in general refer to workspace operations
- we have no `OcrdProcess`
Also, I wonder if it is really necessary to raise this to `ocrd_models` – it's meant to be an internal interface between `ocrd.Processor.process_page_pcgts` and `ocrd.Processor.process_page_file`.
Instead of making `images` a list of tuples again, why not define a data class with members like `pil` / `file_id` / `file_path`?
More importantly, let's go one step further and
- replace the file ID with just the file ID suffix to be added to the PcGts file ID (that way, `process_page_pcgts` does not need to know about the output file ID at all)
- replace the image path with the reference to the generated/annotated AlternativeImageType in the resulting PcGtsType, so the calling `process_page_file` can simply set its `pathname` after writing the image file (that way, `process_page_pcgts` does not need to know the output file path in advance)
BTW, I think you forgot to add `ocrd_models.ocrd_process_result.py`.
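To make the proposal above concrete, here is a minimal sketch of what such a result object could look like. It is not the code from this PR: `AlternativeImageStub`, `PageResultImage`, `PageResult` and the two `*_sketch` functions are hypothetical stand-ins, the image is represented by plain bytes instead of a PIL image, and the attribute name `filename` on the stub is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class AlternativeImageStub:
    """Hypothetical stand-in for the PAGE AlternativeImageType."""
    comments: str = ""
    filename: Optional[str] = None  # set by the caller after the image is written


@dataclass
class PageResultImage:
    """One derived image plus the information the caller needs to save it."""
    pil: Any                                  # image object (a PIL image in practice)
    file_id_suffix: str                       # appended to the output PcGts file ID
    alternative_image: AlternativeImageStub   # reference into the result PcGts


@dataclass
class PageResult:
    """Result of the single-page step: the page plus any derived images."""
    pcgts: Any
    images: List[PageResultImage] = field(default_factory=list)


def process_page_pcgts_sketch(pcgts: Any) -> PageResult:
    """Single-page step: knows neither the output file ID nor the output path."""
    alt = AlternativeImageStub(comments="binarized")
    image = PageResultImage(pil=b"fake-pixels", file_id_suffix=".IMG-BIN",
                            alternative_image=alt)
    return PageResult(pcgts=pcgts, images=[image])


def process_page_file_sketch(output_file_id: str, output_dir: str,
                             pcgts: Any) -> PageResult:
    """Caller: derives image file IDs and paths, then fills in the references."""
    result = process_page_pcgts_sketch(pcgts)
    for image in result.images:
        image_file_id = output_file_id + image.file_id_suffix
        image_path = f"{output_dir}/{image_file_id}.png"
        # A real implementation would write image.pil to image_path here.
        image.alternative_image.filename = image_path
    return result
```

With this split, only the caller deals with file IDs and paths; the single-page function just returns images tagged with a suffix and a reference to the element that should point at them.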
No reason now, I just like that all the "dumb" data classes are in one spot. But I'll move it/reimplement it closer to
Done and OCR-D/ocrd_kraken#44 adapted accordingly. Now to change the interface.
…e_id with OcrdPageResult.file_id_suffix
OK, I think I have everything together now, interface-wise. Now adapting kraken, looking forward to simplifying binarize in particular ;)
I think we can go even further with simplifying the handling of alternative images, but I'll do that after the
Done
Excellent!
input_pcgts : List[Optional[OcrdPage]] = [None] * len(input_files)
assert isinstance(input_files[0], (OcrdFile, ClientSideOcrdFile))
Why `Optional` (also in function prototype)?
Here: Because we're instantiating a list of `None` values, which are not `OcrdPage`.
In the function signature of `process_page_pcgts`: Same situation, there might be "holes" in the list of `input_pcgts` when any of the `input_files` in `process_page_file` cannot be parsed as PAGE-XML.
And for `process_page_file`: The `input_files` can be hole-y if `workspace.download_file` fails for any of the files (beyond the first?).
But really, I was trying to make sure that static type checking had no more complaints. I tried to add `assert` statements where I know that variables must be defined or of a certain type to mitigate the "everything might be `None`" problem somewhat.
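The "holes" described above can be illustrated with a small, self-contained sketch. It uses hypothetical stand-ins (`Page`, `parse_page`, `load_pages`) rather than the real OCR-D classes, and assumes that a failed download shows up as `None` and a non-PAGE file as a `ValueError`.

```python
from typing import List, Optional


class Page:
    """Hypothetical stand-in for OcrdPage."""


def parse_page(raw: str) -> Page:
    """Hypothetical parser: raises ValueError for anything that is not PAGE-XML."""
    if not raw.startswith("<PcGts"):
        raise ValueError("not PAGE-XML")
    return Page()


def load_pages(raw_files: List[Optional[str]]) -> List[Optional[Page]]:
    """Return one slot per input; slots stay None where loading was impossible."""
    pages: List[Optional[Page]] = [None] * len(raw_files)
    for i, raw in enumerate(raw_files):
        if raw is None:
            continue  # hole: the file could not be downloaded
        try:
            pages[i] = parse_page(raw)
        except ValueError:
            pass  # hole: the file is not PAGE-XML
    return pages


pages = load_pages(["<PcGts/>", None, "plain text"])
first = pages[0]
# The assert documents what we know and narrows Optional[Page] to Page
# for a static type checker.
assert first is not None
```

Because every position in the result list may be `None`, both the list type and any function receiving it need `Optional` in their annotations.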
Oh, right, I forgot about the holes returned by zip_input_files for multiple fileGrps but incomplete PAGE-XML coverage per page!
Maybe we should document this more loudly.
input_pcgts[i] = page_from_file(input_file)
page_ = page_from_file(input_file)
assert isinstance(page_, PcGtsType)
input_pcgts[i] = page_
except ValueError as e:
except ValueError as e:
except (AssertionError, ValueError) as e:
Can this ever happen, i.e. can `page_from_file(with_etree=False)` ever return anything other than a `PcGtsType`? I think if that was ever the case, we'd want that `AssertionError` to be raised because then we'd have broken something.
You're right – it cannot happen. But then what is the assertion good for – satisfying the type checker?
First, to satisfy my curiosity that I understand the behavior correctly. But secondly, yes, the type checker ;)
But reading this again, I should have used `OcrdPage`, not `PcGtsType`, which is just an alias, but we use `OcrdPage` in the method typing.
Yes. Feel free to change in OCR-D#1240.
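The point of this thread, an `assert isinstance(...)` that exists for the type checker and whose failure should not be swallowed, can be sketched as follows. The names `Page`, `page_from_file_sketch` and `load` are hypothetical stand-ins, and the `with_tree` flag only mimics a function whose return type depends on an argument.

```python
from typing import Optional, Tuple, Union


class Page:
    """Hypothetical stand-in for OcrdPage / PcGtsType."""


def page_from_file_sketch(name: str, with_tree: bool = False
                          ) -> Union[Page, Tuple[Page, dict]]:
    """Hypothetical loader: the return type depends on with_tree."""
    if not name.endswith(".xml"):
        raise ValueError(f"cannot parse {name} as PAGE-XML")
    page = Page()
    return (page, {}) if with_tree else page


def load(name: str, with_tree: bool = False) -> Optional[Page]:
    try:
        page_ = page_from_file_sketch(name, with_tree=with_tree)
        # Narrows the Union to Page for the type checker. If this ever
        # fails, the loader's contract is broken and we want to hear about it.
        assert isinstance(page_, Page)
        return page_
    except ValueError:
        # Only the expected failure (unparsable input) becomes a hole.
        return None
```

Catching only `ValueError` keeps the expected failure quiet while letting a violated assumption surface as an `AssertionError`, which is the behavior argued for above.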
Co-authored-by: Robert Sachunsky <[email protected]>
…ive_image Co-authored-by: Robert Sachunsky <[email protected]>
… into processor-result-object
Merged into OCR-D#1240 for the
no need to – thanks!
update ocrd-cis-binarize to be compatible with bertsky/core#8
Just a quick draft, to be refined tomorrow.