core.py set_edges IndexError: list index out of range #217

rvalyi · 2018-12-04T23:01:20Z

Hello. First, thanks for this terrific library. I get much better results than with Tabula, and this will be quite decisive for implementing the open source Brazilian localization of the Odoo ERP (in the OCA foundation).

But when extracting pages 200 to 210 from this PDF (Brazilian fiscal stuff), http://sped.rfb.gov.br/arquivo/download/2322, using the latest GitHub version (rev 7bdd9a3), I get this stack trace:

extracting pages 200 to 210...
Traceback (most recent call last):
  File "./extract_csv.py", line 25, in <module>
    extract_csv('efd_icms_ipi', 262)
  File "./extract_csv.py", line 18, in extract_csv
    pages='%s-%s' % (i, limit), line_size_scaling=80)
  File "/home/rvalyi/.local/lib/python3.6/site-packages/camelot/io.py", line 101, in read_pdf
    tables = p.parse(flavor=flavor, **kwargs)
  File "/home/rvalyi/.local/lib/python3.6/site-packages/camelot/handlers.py", line 154, in parse
    t = parser.extract_tables(p)
  File "/home/rvalyi/.local/lib/python3.6/site-packages/camelot/parsers/lattice.py", line 364, in extract_tables
    table = self._generate_table(table_idx, cols, rows, v_s=v_s, h_s=h_s)
  File "/home/rvalyi/.local/lib/python3.6/site-packages/camelot/parsers/lattice.py", line 304, in _generate_table
    table = table.set_edges(v_s, h_s, joint_close_tol=self.joint_close_tol)
  File "/home/rvalyi/.local/lib/python3.6/site-packages/camelot/core.py", line 460, in set_edges
    self.cells[L][J].bottom = True
IndexError: list index out of range

As a naive and brutal countermeasure, I changed line 459 to:

                    while J < K and L <= len(self.cells) and J <= len(self.cells[L]):
                        self.cells[L][J].bottom = True

So far, it looks like this makes it extract the data properly...
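For readers who want to see the idea behind that guard in isolation, here is a minimal, self-contained sketch. It is not Camelot's code: `Cell` and `mark_bottom` are hypothetical stand-ins, and the shape of the loop (advance `J` up to `K` while setting `bottom`) is assumed from the traceback. Note that a bounds check needs strict `<` comparisons, since an index equal to the length is already out of range, and that a guard like this only hides the symptom. The fix that closed this issue was a variable name correction (#221).

```python
class Cell:
    """Hypothetical stand-in for a table cell; only the flag from the traceback."""

    def __init__(self):
        self.bottom = False


def mark_bottom(cells, L, J, K):
    """Set bottom=True on cells[L][J] .. cells[L][K-1], stopping at the grid edge.

    Returns the number of cells that were marked.
    """
    # A row index equal to len(cells) is already out of range, hence strict '<'.
    if not 0 <= L < len(cells):
        return 0
    marked = 0
    # Same reasoning for the column index: stop before len(cells[L]).
    while J < K and J < len(cells[L]):
        cells[L][J].bottom = True
        J += 1
        marked += 1
    return marked


# Usage: a 2x3 grid where the requested span runs past the last column.
grid = [[Cell() for _ in range(3)] for _ in range(2)]
print(mark_bottom(grid, 1, 1, 5))
```

The return value makes it easy to notice when a span was clipped, which can help when hunting for the index that was computed incorrectly upstream.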


vinayak-mehta · 2018-12-05T12:48:57Z

@rvalyi Thanks for the report! Let me also try to reproduce the error.

vinayak-mehta · 2018-12-07T12:04:48Z

@rvalyi Can you also share the code that produced this error? I tried camelot --format csv --output legendado.csv -p 200-210 lattice legendado.pdf and it worked without any errors.

EDIT: Nvm, I found the advanced settings you used in the traceback. It happens with -scale 80.
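In Python API terms, the settings visible in the traceback would look roughly like the sketch below. The helper names `read_kwargs` and `reproduce` are made up for illustration, the `lattice` flavor is taken from the CLI command and the traceback, and `line_size_scaling` is the option name as it appears in this thread (other Camelot versions may name it differently). The `camelot` import is deferred so that the keyword-building helper runs without the library installed.

```python
def read_kwargs(first_page, last_page, scale=80):
    """Build the keyword arguments seen in the traceback (hypothetical helper)."""
    return {
        "pages": "%s-%s" % (first_page, last_page),
        "flavor": "lattice",
        "line_size_scaling": scale,  # option name as used in this thread
    }


def reproduce(pdf_path, first_page=200, last_page=210):
    """Run the extraction that triggered the IndexError (requires camelot)."""
    # Imported lazily so read_kwargs() can be used without camelot installed.
    import camelot

    return camelot.read_pdf(pdf_path, **read_kwargs(first_page, last_page))


print(read_kwargs(200, 210))
```

Calling `reproduce("legendado.pdf")` on the affected revision should hit the same code path as the CLI command with -scale 80.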

rvalyi · 2018-12-07T16:40:55Z

Indeed, I used the line_size_scaling=80 option in the Python API. Thanks for the fix, I will test again soon.

I also have other tables where Camelot produces no such stack trace but introduces line breaks and blank cells. I can live with that by implementing heuristics later in my code, but are you interested in such new bug reports?

vinayak-mehta · 2018-12-07T17:37:04Z

Thanks for reporting the issue!

Yes, please report other issues that you've experienced. I'm fixing an issue that introduces line breaks for v0.5.0, along with some other text-to-cell assignment behaviors. You can check https://github.com/socialcopsdev/camelot/milestone/3 for the complete list. Before reporting, please check that your bug isn't already in that list or in the issue tracker.

vinayak-mehta added the bug label Dec 7, 2018

vinayak-mehta added this to the v0.5.0 milestone Dec 7, 2018

vinayak-mehta mentioned this issue Dec 7, 2018

[MRG] Fix variable name #221

Merged

vinayak-mehta closed this as completed in #221 Dec 7, 2018
