Use of AlternativeImage and Regions #16

Closed
mjenckel opened this issue Oct 31, 2019 · 6 comments

Comments

@mjenckel
Collaborator

mjenckel commented Oct 31, 2019

With 44247ab we tried to add AlternativeImage functionality to "binarize", "deskew", "cropping" and "dewarp". However, there were some questions about whether we used AlternativeImage correctly. Currently each module expects the PAGE-XML output of the previous module as input, adds a new AlternativeImage plus any resulting XML annotation (orientation, border coordinates) to the PAGE-XML, and saves it as a new output. For this it expects two output folders: one for the AlternativeImage files and one for the new PAGE-XML. If they are the same there will be an error about already existing files (IMG and PAGE file will have the same fileID).
Alongside this there was also a question about AlternativeImage regions. While deskewing and cropping only has very limited application to regions, binarization and dewarping on regions might be useful. Due to the limitations of pix2pixHD, dewarping requires files on the hard drive rather than just in-memory images in PIL format. This works for AlternativeImages, but the question remains if AlternativeImage Regions also exist on the hard drive or are created from the original image as required.

@wrznr

wrznr commented Nov 4, 2019

> While deskewing and cropping only has very limited application to regions

I agree for cropping. I would even say that cropping in the sense of page frame detection (i.e. the task your processor solves very, very well) should only be applicable on page level. You should not delay yourself with possible notions of cropping on lower levels. However, deskewing has a very prominent application on region level: https://github.com/wrznr/IT-Kolloquium-2019/blob/master/img/deskewing_ex2.svg

> but the question remains if AlternativeImage Regions also exist on the hard drive

Each file referenced by an AlternativeImage instance exists on the hard drive (in the OCR-D context).

@kba
Member

kba commented Nov 4, 2019

> but the question remains if AlternativeImage Regions also exist on the hard drive

Definitely on page level, but not necessarily on region level. Otherwise those region images would have to be mentioned in METS, increasing file size and leading to confusion and inconsistency pretty quickly. If they cannot be recreated from the respective page-level AlternativeImage (as in the use case for deskewing given by @wrznr), that is a problem, and these images indeed have to be stored on the hard drive.

I would very much prefer to have every pg:Page/imageFilename or pg:AlternativeImage/filename attribute manifested but I fear that this will lead to massive I/O overhead.

@bertsky @wrznr @cneud Any ideas for a compromise?

@mjenckel
Collaborator Author

mjenckel commented Nov 5, 2019

Alternatively, if a region doesn't have a file path to an AlternativeImage, we could generate a temporary file and go from there. It's just a question of what the preferred functionality is.
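
The fallback described above could look roughly like the following sketch. It assumes a region either carries a path to an existing AlternativeImage or has to be rendered on demand. `Region`, `image_path_for` and the `render` callback are hypothetical names for illustration, not part of any OCR-D API; `render` stands in for whatever crops the region out of the page image and returns encoded image bytes.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Region:
    """Hypothetical stand-in for a PAGE region (not an OCR-D class)."""
    region_id: str
    alternative_image: Optional[str] = None  # path to an AlternativeImage, if any


def image_path_for(region: Region,
                   render: Callable[[Region], bytes],
                   tmpdir: str) -> str:
    """Return a path on disk for the region's image.

    If the region already references an existing file, use it.
    Otherwise write the rendered bytes to a temporary file in tmpdir.
    """
    if region.alternative_image and os.path.exists(region.alternative_image):
        return region.alternative_image
    fd, path = tempfile.mkstemp(prefix=region.region_id + "_",
                                suffix=".png", dir=tmpdir)
    with os.fdopen(fd, "wb") as handle:
        handle.write(render(region))
    return path
```

A tool that needs files rather than in-memory images could then be handed the returned path in either case. The caller stays responsible for cleaning up `tmpdir`.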

@wrznr

wrznr commented Nov 5, 2019

After some discussions, we can state the following: We need AlternativeImage on the sub-page level a) to reduce the complexity of the whole recognition process (i.e. it is not good if every processor computes its input image from the page-level images) and b) to allow the representation of results which (currently) can only be represented using an image (and not by some meta description, namely dewarping). Since https://ocr-d.github.io/mets#if-in-page-then-in-mets is adamant, the images have to exist on the hard drive, regardless of possibly massive I/O overhead.

@mjenckel Is this sufficient to answer your initial question?
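
As an illustration of the rule referenced above, here is a minimal sketch that lists image files referenced in a PAGE document but not referenced in the METS. It assumes that image references live in `Page/@imageFilename` and `AlternativeImage/@filename`, and that METS entries are `FLocat` elements with an `xlink:href` attribute. The function names are hypothetical, element names are matched by local name so the PAGE schema version does not matter, and the comparison is a plain string match on paths, which real workspaces may need to normalize.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def images_in_page(page_xml: str) -> set:
    """Collect all image paths referenced by a PAGE document."""
    refs = set()
    for element in ET.fromstring(page_xml).iter():
        name = local_name(element.tag)
        if name == "Page" and element.get("imageFilename"):
            refs.add(element.get("imageFilename"))
        elif name == "AlternativeImage" and element.get("filename"):
            refs.add(element.get("filename"))
    return refs


def files_in_mets(mets_xml: str) -> set:
    """Collect all file locations referenced by a METS document."""
    return {
        element.get(XLINK_HREF)
        for element in ET.fromstring(mets_xml).iter()
        if local_name(element.tag) == "FLocat" and element.get(XLINK_HREF)
    }


def missing_from_mets(page_xml: str, mets_xml: str) -> list:
    """Image paths mentioned in PAGE that the METS does not reference."""
    return sorted(images_in_page(page_xml) - files_in_mets(mets_xml))
```

An empty result means every image mentioned in the PAGE file is also referenced in the METS. Whether the files actually exist on the hard drive would be a separate check.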

@mjenckel
Collaborator Author

mjenckel commented Nov 5, 2019

Yes, thank you very much!

@mjenckel mjenckel closed this as completed Nov 5, 2019
@bertsky
Contributor

bertsky commented Nov 6, 2019

> If they are the same there will be an error about already existing files (IMG and PAGE file will have the same fileID).

@mjenckel That depends on how you make up new fileIDs. You can easily (and w.r.t. possible inconsistencies, profitably) put PAGE output and derived images in the same file group, as long as they receive systematically different IDs. (And for the annotation levels below page, you would never have a clash.)
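
One possible way to derive systematically different IDs is sketched below. The naming pattern (`_NNNN` page counter, `.IMG-` suffix) and both function names are assumptions made up for this example, not a convention taken from this repository or prescribed by OCR-D.

```python
def page_file_id(file_grp: str, page_no: int) -> str:
    """ID for the PAGE-XML output of one page (hypothetical scheme)."""
    return f"{file_grp}_{page_no:04d}"


def image_file_id(file_grp: str, page_no: int, operation: str,
                  segment_id: str = "") -> str:
    """ID for a derived image in the same file group.

    The '.IMG-<operation>' suffix keeps it distinct from the PAGE file ID.
    A segment ID (region or line) is included for images below page level.
    """
    base = page_file_id(file_grp, page_no)
    if segment_id:
        base += "_" + segment_id
    return f"{base}.IMG-{operation}"


print(page_file_id("OCR-D-DESKEW", 1))                   # OCR-D-DESKEW_0001
print(image_file_id("OCR-D-DESKEW", 1, "DESKEW"))        # OCR-D-DESKEW_0001.IMG-DESKEW
print(image_file_id("OCR-D-DESKEW", 1, "DESKEW", "r1"))  # OCR-D-DESKEW_0001_r1.IMG-DESKEW
```

With a scheme like this, the PAGE file and the page-level image never share an ID, and region-level images differ from both because they carry the segment ID.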

> I would very much prefer to have every pg:Page/imageFilename or pg:AlternativeImage/filename attribute manifested but I fear that this will lead to massive I/O overhead.

@kba IMO it's the other way round: If you have a step that can produce derived images but also describe that operation exhaustively via attributes in PAGE, which is sufficient to reproduce the same derived images later on, then not annotating those images in the producer may make it necessary for every single consumer to repeat that computation. Especially on lower hierarchy levels (e.g. deskewing a region over and over again for each text line that needs it). So we have (some) I/O overhead vs (n-fold) CPU overhead.
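
The trade-off can be made concrete with a toy count of how often the deskewing step runs. This is only a sketch: `CountingDeskewer` and `recognize_lines` are hypothetical stand-ins, and the "image" is just a string, so nothing here measures real I/O or CPU cost.

```python
class CountingDeskewer:
    """Stand-in for an expensive deskewing step; counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def deskew(self, region_id: str, angle: float) -> str:
        self.calls += 1
        return f"{region_id} rotated by {-angle} degrees"


def recognize_lines(line_ids, region_id, angle, deskewer, stored_image=None):
    """Consumer that needs the deskewed region once per text line.

    If the producer stored the derived image, reuse it.
    Otherwise recompute it from the annotated angle every time.
    """
    results = []
    for line_id in line_ids:
        if stored_image is not None:
            image = stored_image
        else:
            image = deskewer.deskew(region_id, angle)
        results.append((line_id, image))
    return results


lines = ["l1", "l2", "l3"]

# Producer only annotates the angle: the consumer deskews once per line.
recompute = CountingDeskewer()
recognize_lines(lines, "r1", 2.5, recompute)

# Producer deskews once and stores the result: the consumer just reads it.
store = CountingDeskewer()
stored = store.deskew("r1", 2.5)
recognize_lines(lines, "r1", 2.5, store, stored_image=stored)

print(recompute.calls, store.calls)
```

For three lines this prints `3 1`: one write and three reads on the stored side versus three full computations on the other, and the gap grows with every additional consumer.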

@bertsky bertsky mentioned this issue Nov 6, 2019