Fix workspace handling of local files #342

Closed
wrznr opened this issue Nov 6, 2019 · 2 comments

@wrznr
Contributor

wrznr commented Nov 6, 2019

As shown in the Lobby, I encounter problems when trying to create a workspace from existing (local) files:

$ ocrd workspace init .
$ ocrd workspace add 00009p.xml -G GT -i 00009_gt -g 00009 -m 'application/vnd.prima.page+xml'
$ ocrd workspace add 00009.tif -G IMG -i 00009_img -g 00009 -m 'image/tiff'

Running ocrd-tesserocr-binarize on this workspace then fails with the following traceback:

$ ocrd-tesserocr-binarize -I GT -O BIN -p '{"operation_level": "line"}'
10:46:23.324 INFO processor.TesserocrBinarize - No output file group for images specified, falling back to 'OCR-D-IMG-BIN'
10:46:23.442 INFO processor.TesserocrBinarize - INPUT FILE 0 / 00009
Traceback (most recent call last):
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/ocrd/workspace.py", line 109, in download_file
    f.url = self.resolver.download_to_directory(self.directory, f.url, subdir=f.fileGrp, basename=basename)
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/ocrd/resolver.py", line 77, in download_to_directory
    raise FileNotFoundError("File path passed as 'url' to download_to_directory does not exist: %s" % url)
FileNotFoundError: File path passed as 'url' to download_to_directory does not exist: 00009.tif

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kmw/Documents/Work/OCR-D/env/bin/ocrd-tesserocr-binarize", line 10, in <module>
    sys.exit(ocrd_tesserocr_binarize())
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/ocrd_tesserocr/cli.py", line 45, in ocrd_tesserocr_binarize
    return ocrd_cli_wrap_processor(TesserocrBinarize, *args, **kwargs)
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/ocrd/decorators.py", line 66, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/ocrd/processor/base.py", line 56, in run_processor
    processor.process()
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/ocrd_tesserocr/binarize.py", line 82, in process
    page, page_id)
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/ocrd/workspace.py", line 320, in image_from_page
    page_image = self._resolve_image_as_pil(page.imageFilename)
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/ocrd/workspace.py", line 237, in _resolve_image_as_pil
    image_filename = self.download_file(f).local_filename
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/ocrd/workspace.py", line 112, in download_file
    raise Exception("No baseurl defined by workspace. Cannot retrieve '%s'" % f.url)
Exception: No baseurl defined by workspace. Cannot retrieve '00009.tif'

In addition, a file 00009_gt.xml is created in the GT directory.

00009.zip

@wrznr wrznr assigned kba Nov 6, 2019
@wrznr wrznr added the bug label Nov 6, 2019
@wrznr wrznr closed this as completed Nov 6, 2019
@kba
Member

kba commented Nov 6, 2019

The problem remains that the input image should not be copied at all; this needs investigating.

@kba
Member

kba commented Jan 13, 2020

This is another flaw in the logic of download_to_directory: it SHOULD recognize that the source files are already in the workspace, but it does not, which leads to copies of all input files.
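
For illustration only, the check described here could look roughly like the sketch below. This is not the actual ocrd code or the fix that was eventually committed; copy_unless_in_workspace and is_inside are hypothetical names. It also assumes that a relative path should be interpreted relative to the workspace directory rather than the current working directory.

```python
import shutil
from pathlib import Path


def is_inside(directory: Path, candidate: Path) -> bool:
    """Return True if `candidate` resolves to a location under `directory`."""
    try:
        candidate.resolve().relative_to(directory.resolve())
        return True
    except ValueError:
        # relative_to() raises ValueError when candidate is not below directory
        return False


def copy_unless_in_workspace(workspace_dir, src, subdir, basename=None):
    """Hypothetical helper, not part of ocrd.

    Returns the file's path relative to the workspace directory.
    Assumption: a relative `src` is relative to the workspace directory.
    """
    workspace_dir = Path(workspace_dir)
    src_path = Path(src)
    if not src_path.is_absolute():
        src_path = workspace_dir / src_path
    if not src_path.exists():
        raise FileNotFoundError("Source file does not exist: %s" % src_path)
    if is_inside(workspace_dir, src_path):
        # Already part of the workspace: reuse it in place, do not copy.
        return str(src_path.resolve().relative_to(workspace_dir.resolve()))
    # Only files from outside the workspace are copied into the file group subdir.
    dst = workspace_dir / subdir / (basename or src_path.name)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(str(src_path), str(dst))
    return str(dst.relative_to(workspace_dir))
```

With a check like this, a file such as 00009.tif that was added from the workspace root would be referenced where it is, and only files from outside the workspace would end up copied into a file group directory.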

kba added a commit to kba/ocrd-core that referenced this issue Jan 13, 2020
kba added a commit to kba/ocrd-core that referenced this issue Jan 14, 2020
kba added a commit to kba/ocrd-core that referenced this issue Jan 14, 2020
kba added a commit to kba/ocrd-core that referenced this issue Jan 14, 2020
@kba kba closed this as completed in 6e4f633 Jan 14, 2020