`max() arg is an empty sequence` error on PDFs with blank pages, is there a skip empty page option? #179

aflip · 2020-08-10T12:17:20Z

Camelot is great! Thank you.
I'm a physician using Camelot to extract IDSP data, to see whether an epidemic calendar can be built from it.
When a file has empty pages after the tables, Camelot does not skip the page or move on to the next one; it aborts the run and raises `ValueError: max() arg is an empty sequence`.

The PDF that triggers this is also attached.
5.pdf

When `row_tol` is not specified, it reports an error but still parses the file and extracts the other tables. Once `row_tol` is set, however, it doesn't give me the other tables.

So a feature that lets me skip empty pages would help. Some PDFs have a few empty pages between the tables, and when I'm processing thousands of PDFs, it's impossible to keep changing the parameters for each one.
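Until there is a built-in option, one possible workaround is to read one page at a time and skip any page that raises `ValueError`. This is only a sketch: `read_pages_skipping_failures` and its `read_page` argument are hypothetical names, not part of Camelot, and the list of page numbers has to be supplied by the caller. Note that it skips on any `ValueError`, not only the empty-page one, so the skipped list is returned for inspection.

```python
def read_pages_skipping_failures(read_page, page_numbers):
    """Call read_page(n) for each page; collect tables and record skipped pages.

    read_page is any callable that returns an iterable of tables for one page
    and may raise ValueError (for example, on an empty page).
    """
    tables = []
    skipped = []
    for page in page_numbers:
        try:
            tables.extend(read_page(page))
        except ValueError as exc:
            # Remember which page failed and why, then move on.
            skipped.append((page, str(exc)))
    return tables, skipped


# Possible use with Camelot (assumes camelot is installed and the page
# range is known in advance):
#
#   import camelot
#   read_page = lambda n: camelot.read_pdf(
#       "data/2-end/5.pdf", pages=str(n), flavor="stream", row_tol=55
#   )
#   tables, skipped = read_pages_skipping_failures(read_page, range(2, 10))
```

Passing the reader in as a callable keeps the same parameters for every file, so nothing has to be tuned per PDF.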

To reproduce:

Use the `stream` flavor with `row_tol` (or other parameters) on a PDF that contains an empty page.

System:

Linux-5.4.0-42-generic-x86_64-with-debian-bullseye-sid
Python 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08) 
[GCC 7.5.0]
NumPy 1.1.1
OpenCV 4.4.0
Camelot 0.8.2

Full error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-437e7b144dac> in <module>
----> 1 tables= cm.read_pdf('data/2-end/5.pdf', pages='2-end', flavor = 'stream', row_tol=55)

~/miniconda3/envs/ex/lib/python3.7/site-packages/camelot/io.py in read_pdf(filepath, pages, password, flavor, suppress_stdout, layout_kwargs, **kwargs)
    115             suppress_stdout=suppress_stdout,
    116             layout_kwargs=layout_kwargs,
--> 117             **kwargs
    118         )
    119         return tables

~/miniconda3/envs/ex/lib/python3.7/site-packages/camelot/handlers.py in parse(self, flavor, suppress_stdout, layout_kwargs, **kwargs)
    170             for p in pages:
    171                 t = parser.extract_tables(
--> 172                     p, suppress_stdout=suppress_stdout, layout_kwargs=layout_kwargs
    173                 )
    174                 tables.extend(t)

~/miniconda3/envs/ex/lib/python3.7/site-packages/camelot/parsers/stream.py in extract_tables(self, filename, suppress_stdout, layout_kwargs)
    455             sorted(self.table_bbox.keys(), key=lambda x: x[1], reverse=True)
    456         ):
--> 457             cols, rows = self._generate_columns_and_rows(table_idx, tk)
    458             table = self._generate_table(table_idx, cols, rows)
    459             table._bbox = tk

~/miniconda3/envs/ex/lib/python3.7/site-packages/camelot/parsers/stream.py in _generate_columns_and_rows(self, table_idx, tk)
    346             # calculate mode of the list of number of elements in
    347             # each row to guess the number of columns
--> 348             ncols = max(set(elements), key=elements.count)
    349             if ncols == 1:
    350                 # if mode is 1, the page usually contains not tables

ValueError: max() arg is an empty sequence
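The last frame takes the mode of `elements` with `max(set(elements), key=elements.count)`, which raises when `elements` is empty, as it presumably is for a blank page. A minimal sketch of a guard follows; `guess_ncols` is a hypothetical helper name, returning 0 for the empty case is an assumption, and this is not necessarily how the library handles it.

```python
def guess_ncols(elements):
    """Return the most common row length, or 0 when there are no rows."""
    if not elements:
        # Nothing was detected on the page, so there is no mode to compute.
        return 0
    return max(set(elements), key=elements.count)


# The unguarded expression from the traceback fails on an empty list:
empty = []
try:
    max(set(empty), key=empty.count)
    raised = False
except ValueError:
    raised = True
```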


vinayak-mehta · 2020-08-25T17:34:19Z

@aflip Thanks for the detailed issue report! This should be fixed in the next release. 💚 💙 💜 💛 ❤️

aflip changed the title ~~[Feature Request ] Skip empty page option to avoid max() arg is an empty sequence error~~ max() arg is an empty sequence error on PDFs with blank pages, is there a skip empty page option? Aug 11, 2020

pevisscher mentioned this issue Aug 25, 2020

Prevent taking the max of an empty set #187

Closed

vinayak-mehta mentioned this issue Aug 25, 2020

[MRG] Prevent taking max of an empty set #189

Merged

vinayak-mehta closed this as completed in #189 Aug 25, 2020
