I'm using Camelot to extract IDSP data. I am a physician, trying to see if an epidemic calendar can be made using this data.
When a file has empty pages after the tables, Camelot does not skip the page or move on to the next one. Instead it aborts the run and raises ValueError: max() arg is an empty sequence
The PDF that triggers this is also attached. 5.pdf
When row_tol is not specified, it throws an error but still parses the file and extracts the other tables.
Once row_tol is set, however, it doesn't give me the other tables.
So a feature that lets me skip the empty pages would help.
In some PDFs there are a few empty pages between the tables, and when I'm processing thousands of PDFs it's impossible to keep changing the parameters for each one.
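Until there is such an option, one possible workaround is to read each page on its own and catch the error, so that a blank page only costs that page and not the whole run. This is a rough sketch, not a Camelot feature: the helper name read_pages_individually and its read_page argument are made up for illustration, and the caller has to supply the page numbers because the sketch does not expand '2-end'.

```python
def read_pages_individually(path, pages, read_page=None, **kwargs):
    """Read each page separately and skip pages that raise ValueError.

    `read_page` defaults to camelot.read_pdf; it is a parameter only so the
    helper can be tried out without Camelot installed (hypothetical design).
    """
    if read_page is None:
        import camelot  # imported lazily, only when the default reader is used
        read_page = camelot.read_pdf

    tables = []   # tables collected from pages that parsed
    skipped = []  # (page, error message) for pages that failed

    for page in pages:
        try:
            found = read_page(path, pages=str(page), **kwargs)
        except ValueError as exc:
            # e.g. "max() arg is an empty sequence" on a blank page
            skipped.append((page, str(exc)))
            continue
        tables.extend(found)

    return tables, skipped


# Example call (the page range here is an assumption about the file):
# tables, skipped = read_pages_individually(
#     'data/2-end/5.pdf', range(2, 10), flavor='stream', row_tol=55
# )
```

Catching ValueError this broadly can also hide unrelated problems, so it is worth looking at the skipped list afterwards rather than discarding it.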
To reproduce:
Use stream with row_tol or other parameters on an empty page in the pdf.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-2-437e7b144dac> in <module>
----> 1 tables= cm.read_pdf('data/2-end/5.pdf', pages='2-end', flavor = 'stream', row_tol=55)
~/miniconda3/envs/ex/lib/python3.7/site-packages/camelot/io.py in read_pdf(filepath, pages, password, flavor, suppress_stdout, layout_kwargs, **kwargs)
115 suppress_stdout=suppress_stdout,
116 layout_kwargs=layout_kwargs,
--> 117 **kwargs
118 )
119 return tables
~/miniconda3/envs/ex/lib/python3.7/site-packages/camelot/handlers.py in parse(self, flavor, suppress_stdout, layout_kwargs, **kwargs)
170 for p in pages:
171 t = parser.extract_tables(
--> 172 p, suppress_stdout=suppress_stdout, layout_kwargs=layout_kwargs
173 )
174 tables.extend(t)
~/miniconda3/envs/ex/lib/python3.7/site-packages/camelot/parsers/stream.py in extract_tables(self, filename, suppress_stdout, layout_kwargs)
455 sorted(self.table_bbox.keys(), key=lambda x: x[1], reverse=True)
456 ):
--> 457 cols, rows = self._generate_columns_and_rows(table_idx, tk)
458 table = self._generate_table(table_idx, cols, rows)
459 table._bbox = tk
~/miniconda3/envs/ex/lib/python3.7/site-packages/camelot/parsers/stream.py in _generate_columns_and_rows(self, table_idx, tk)
346 # calculate mode of the list of number of elements in
347 # each row to guess the number of columns
--> 348 ncols = max(set(elements), key=elements.count)
349 if ncols == 1:
350 # if mode is 1, the page usually contains not tables
ValueError: max() arg is an empty sequence
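The last frame of the traceback takes the mode of the per-row element counts with max(set(elements), key=elements.count). If a blank page leaves elements empty (my reading of the traceback, not something confirmed against the Camelot source), max() has nothing to choose from and raises. A small standalone sketch of that failure and of a guard that would avoid it; guess_ncols is a hypothetical name, not a Camelot function:

```python
def guess_ncols(elements):
    """Return the most common row length, or None when there are no rows."""
    if not elements:
        # Nothing on the page: signal "no table" instead of raising.
        return None
    return max(set(elements), key=elements.count)


# The unguarded expression from the traceback, applied to an empty list.
raised = False
try:
    empty = []
    max(set(empty), key=empty.count)
except ValueError:
    raised = True
```

With the guard in place, a caller could treat None as "no table on this page" and continue with the next page.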
aflip changed the title from "[Feature Request ] Skip empty page option to avoid max() arg is an empty sequence error" to "max() arg is an empty sequence error on PDFs with blank pages, is there a skip empty page option?" on Aug 11, 2020