image file is truncated (21 bytes not processed) #655

Closed
gabays opened this issue Nov 1, 2024 · 1 comment

gabays commented Nov 1, 2024

Hello,

Some images caused problems during dataset compilation with ketos, and the script crashed:

Extracting lines ━━━━━━━━━━━━━━━━              58% 374794/647835 0:31:07 0:20:44
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/ebsofts/Python/3.11.5-GCCcore-13.2.0/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages/kraken/lib/arrow_dataset.py", line 111, in _extract_line


  File "/home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages/kraken/lib/arrow_dataset.py", line 85, in _extract_line
    if is_bitonal(im):
       ^^^^^^^^^^^^^^
  File "/home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages/kraken/lib/util.py", line 57, in is_bitonal
    return im.getcolors(2) is not None and len(im.getcolors(2)) == 2
           ^^^^^^^^^^^^^^^
  File "/home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages/PIL/Image.py", line 1438, in getcolors
    self.load()
  File "/home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages/PIL/ImageFile.py", line 297, in load
    raise OSError(msg)
OSError: image file is truncated (16 bytes not processed)
"""

The above exception was the direct cause of the following exception:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/users/g/gabays/build_modern_v2/kraken-env/bin/ketos:8 in <module>      │
│                                                                              │
│   5 from kraken.ketos import cli                                             │
│   6 if __name__ == '__main__':                                               │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])     │
│ ❱ 8 │   sys.exit(cli())                                                      │
│   9                                                                          │
│                                                                              │
│ /home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages │
│ /click/core.py:1157 in __call__                                              │
│                                                                              │
│ /home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages │
│ /click/core.py:1078 in main                                                  │
│                                                                              │
│ /home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages │
│ /click/core.py:1688 in invoke                                                │
│                                                                              │
│ /home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages │
│ /click/core.py:1434 in invoke                                                │
│                                                                              │
│ /home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages │
│ /click/core.py:783 in invoke                                                 │
│                                                                              │
│ /home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages │
│ /click/decorators.py:33 in new_func                                          │
│                                                                              │
│ /home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages │
│ /kraken/ketos/dataset.py:92 in compile                                       │
│                                                                              │
│    89 │   │   │   │   progress.start_task(extract_task)                      │
│    90 │   │   │   progress.update(extract_task, total=total, advance=advance │
│    91 │   │                                                                  │
│ ❱  92 │   │   arrow_dataset.build_binary_dataset(ground_truth,               │
│    93 │   │   │   │   │   │   │   │   │   │      output,                     │
│    94 │   │   │   │   │   │   │   │   │   │      format_type,                │
│    95 │   │   │   │   │   │   │   │   │   │      workers,                    │
│                                                                              │
│ /home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages │
│ /kraken/lib/arrow_dataset.py:299 in build_binary_dataset                     │
│                                                                              │
│   296 │   │   │   │   if num_workers and num_workers > 1:                    │
│   297 │   │   │   │   │   logger.info(f'Spinning up processing pool with {nu │
│   298 │   │   │   │   │   with Pool(num_workers) as pool:                    │
│ ❱ 299 │   │   │   │   │   │   for page_lines, im_mode in pool.imap_unordered │
│   300 │   │   │   │   │   │   │   if page_lines:                             │
│   301 │   │   │   │   │   │   │   │   line_cache.extend(page_lines)          │
│   302 │   │   │   │   │   │   │   │   # comparison RGB(A) > L > 1            │
│                                                                              │
│ /opt/ebsofts/Python/3.11.5-GCCcore-13.2.0/lib/python3.11/multiprocessing/poo │
│ l.py:873 in next                                                             │
│                                                                              │
│   870 │   │   success, value = item                                          │
│   871 │   │   if success:                                                    │
│   872 │   │   │   return value                                               │
│ ❱ 873 │   │   raise value                                                    │
│   874 │                                                                      │
│   875 │   __next__ = next                    # XXX                           │
│   876                                                                        │
╰──────────────────────────────────────────────────────────────────────────────╯
OSError: image file is truncated (16 bytes not processed)
srun: error: cpu132: task 0: Exited with exit code 1
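
The traceback shows the error surfacing in `self.load()` (reached through `getcolors`), not in `Image.open`. Pillow parses only the header on open and decodes pixel data lazily, so a truncated file can open without complaint and fail later. A minimal sketch of a pre-flight scan that forces the decode up front is given below; it assumes Pillow is installed, and `find_unreadable_images` is a hypothetical helper name, not part of kraken or Pillow.

```python
from PIL import Image


def find_unreadable_images(paths):
    """Return (path, error message) pairs for images that cannot be fully decoded.

    Hypothetical helper for illustration; not part of kraken or Pillow.
    """
    bad = []
    for path in paths:
        try:
            with Image.open(path) as im:
                # Image.open() reads only the header. load() forces a full
                # decode, which is the step where truncation is reported.
                im.load()
        except OSError as exc:
            # OSError also covers FileNotFoundError and Pillow's
            # UnidentifiedImageError, which are both subclasses of it.
            bad.append((str(path), str(exc)))
    return bad
```

Running such a scan over the image paths before compiling gives a list of files to repair or exclude. Pillow also has a module-level switch, `ImageFile.LOAD_TRUNCATED_IMAGES = True`, which makes it decode whatever data is present instead of raising; whether partially decoded pages are acceptable as training data is a separate decision.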

I had to add a try/except around the body of _extract_line() in arrow_dataset.py:

def _extract_line(xml_record, skip_empty_lines: bool = True, legacy_polygons: bool = False):
    lines = []
    try:
        # Image.open() only reads the header, so a truncated file can still open.
        im = Image.open(xml_record.imagename)
    except (FileNotFoundError, UnidentifiedImageError):
        # Two values, matching the (page_lines, im_mode) unpacking in the caller.
        return lines, None
    try:
        # is_bitonal() forces a full decode; truncated files raise OSError here.
        if is_bitonal(im):
            im = im.convert('1')
        for idx, rec in enumerate(xml_record.lines):
            seg = Segmentation(text_direction='horizontal-lr',
                               imagename=xml_record.imagename,
                               type=xml_record.type,
                               lines=[rec],
                               regions=None,
                               script_detection=False,
                               line_orders=[])
            try:
                line_im, line = next(extract_polygons(im, seg, legacy=legacy_polygons))
            except KrakenInputException:
                logger.warning(f'Invalid line {idx} in {xml_record.imagename}')
                continue
            except Exception as e:
                logger.warning(f'Unexpected exception {e} from line {idx} in {xml_record.imagename}')
                continue
            if not line.text and skip_empty_lines:
                continue
            fp = io.BytesIO()
            line_im.save(fp, format='png')
            lines.append({'text': line.text, 'im': fp.getvalue()})
    except Exception as e:
        # Record the offending file and keep going instead of crashing the run.
        with open('debug_error.txt', 'a') as debug_error_f:
            debug_error_f.write(str(e)+'\n')
            debug_error_f.write(str(xml_record.imagename)+'\n')
        logger.error(f'Unexpected exception {e} in {xml_record.imagename}')
    # Lines extracted before the failure (possibly none) are still returned.
    return lines, im.mode

I guess my hack is far from perfect, but maybe it could lead to a better way of dealing with the problem in the future.
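
One possible direction, sketched under assumptions rather than taken from kraken: wrap the per-page worker so that it returns an error record instead of raising, and let the parent process collect the failures. That avoids several workers appending to the same debug file. The names `safe_call`, `collect` and `fake_extract` are hypothetical and exist only for this illustration.

```python
import functools


def safe_call(func, item):
    """Run func(item) and return (item, result, None) or (item, None, error text)."""
    try:
        return item, func(item), None
    except Exception as exc:
        # Deliberately broad: one unreadable input should not stop the whole run.
        return item, None, f'{type(exc).__name__}: {exc}'


def collect(outcomes):
    """Split an iterable of safe_call() outcomes into results and failures."""
    results, failures = [], []
    for item, result, error in outcomes:
        if error is None:
            results.append(result)
        else:
            failures.append((item, error))
    return results, failures


def fake_extract(name):
    """Hypothetical stand-in for the real per-page extraction function."""
    if name.endswith('.broken'):
        raise OSError('image file is truncated')
    return name.upper()


# The built-in map() keeps the demonstration single-process.
outcomes = map(functools.partial(safe_call, fake_extract),
               ['a.png', 'b.broken', 'c.png'])
results, failures = collect(outcomes)
print(results)
print(failures)
```

Because `safe_call` and the wrapped function are module-level, the same `functools.partial` object can be pickled and passed to `Pool.imap_unordered` in place of the built-in `map`, and the failure list can be logged once at the end of the run.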

Best,

Simon

Owner

mittagessen commented Nov 4, 2024 via email
