core.py set_edges IndexError: list index out of range #217

Closed
rvalyi opened this issue Dec 4, 2018 · 4 comments

rvalyi commented Dec 4, 2018

Hello. First, thanks for this terrific library. I get much better results than with Tabula, and this will be quite decisive for implementing the open source Brazilian localization of the Odoo ERP (in the OCA foundation).

But when extracting pages 200 to 210 from this PDF (Brazilian fiscal stuff), http://sped.rfb.gov.br/arquivo/download/2322, using the latest GitHub version (rev 7bdd9a3), I get this stack trace:

extracting pages 200 to 210...
Traceback (most recent call last):
  File "./extract_csv.py", line 25, in <module>
    extract_csv('efd_icms_ipi', 262)
  File "./extract_csv.py", line 18, in extract_csv
    pages='%s-%s' % (i, limit), line_size_scaling=80)
  File "/home/rvalyi/.local/lib/python3.6/site-packages/camelot/io.py", line 101, in read_pdf
    tables = p.parse(flavor=flavor, **kwargs)
  File "/home/rvalyi/.local/lib/python3.6/site-packages/camelot/handlers.py", line 154, in parse
    t = parser.extract_tables(p)
  File "/home/rvalyi/.local/lib/python3.6/site-packages/camelot/parsers/lattice.py", line 364, in extract_tables
    table = self._generate_table(table_idx, cols, rows, v_s=v_s, h_s=h_s)
  File "/home/rvalyi/.local/lib/python3.6/site-packages/camelot/parsers/lattice.py", line 304, in _generate_table
    table = table.set_edges(v_s, h_s, joint_close_tol=self.joint_close_tol)
  File "/home/rvalyi/.local/lib/python3.6/site-packages/camelot/core.py", line 460, in set_edges
    self.cells[L][J].bottom = True
IndexError: list index out of range

As a naive, brute-force countermeasure, I changed line 459 to:

                    while J < K and L <= len(self.cells) and J <= len(self.cells[L]):
                        self.cells[L][J].bottom = True

So far, it looks like this makes it extract the data properly...
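
For anyone trying the same workaround, one caveat about the guard quoted above: with `<=`, the condition still evaluates `self.cells[L]` when `L == len(self.cells)`, so it can raise the same IndexError. A bounds check needs a strict `<`. Below is a minimal, self-contained sketch of the idea. It is not Camelot's code: `Cell` and `mark_bottom` are hypothetical names used only for illustration, and only the `bottom` flag from the traceback is modeled.

```python
class Cell:
    """Hypothetical stand-in for a table cell; only `bottom` is modeled."""

    def __init__(self):
        self.bottom = False


def mark_bottom(cells, row, start_col, end_col):
    """Set `bottom` on cells[row][start_col:end_col], skipping bad indices.

    Returns the number of cells that were actually marked.
    """
    # Strict comparison: row == len(cells) is already out of range.
    if not 0 <= row < len(cells):
        return 0
    # Clamp the column span to the columns that really exist in this row.
    first = max(start_col, 0)
    last = min(end_col, len(cells[row]))
    marked = 0
    for col in range(first, last):
        cells[row][col].bottom = True
        marked += 1
    return marked


# A 2-row, 3-column grid of cells.
grid = [[Cell() for _ in range(3)] for _ in range(2)]

# Column span runs past the end of the row: clamped to the 3 real columns.
in_range = mark_bottom(grid, 1, 0, 5)

# Row index equal to len(grid): skipped instead of raising IndexError.
out_of_range = mark_bottom(grid, 2, 0, 3)
```

Note that clamping like this only hides the symptom. If the detected line segments point outside the grid, the underlying cause is probably in how the rows and columns were detected, which is a guess on my part and not something this sketch addresses.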

@vinayak-mehta
Contributor

@rvalyi Thanks for the report! Let me also try to reproduce the error.

vinayak-mehta added this to the v0.5.0 milestone Dec 7, 2018

vinayak-mehta commented Dec 7, 2018

@rvalyi Can you also share the code that produced this error? I tried camelot --format csv --output legendado.csv -p 200-210 lattice legendado.pdf and it worked without any errors.

EDIT: Nvm, I found the advanced settings you used in the traceback. It happens with -scale 80.


rvalyi commented Dec 7, 2018

Indeed, I used the line_size_scaling=80 option in the Python API. Thanks for the fix; I will test again soon.
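
For reference, a minimal sketch of how the failing call could be reproduced from the Python API, pieced together from the traceback (the `pages` string and `line_size_scaling=80` both appear there). The helper names `page_range` and `reproduce` and the file name `legendado.pdf` in the usage comment are assumptions for illustration, not the original script. Later Camelot releases may have renamed some keyword arguments, so check the documentation for the installed version.

```python
def page_range(first, last):
    """Build the 'first-last' page string that read_pdf accepts for `pages`."""
    return "%s-%s" % (first, last)


def reproduce(pdf_path, first=200, last=210):
    """Run the lattice parser with the scaling value from the traceback.

    camelot is imported lazily so this module loads without it installed.
    """
    import camelot

    return camelot.read_pdf(
        pdf_path,
        pages=page_range(first, last),
        flavor="lattice",          # the parser shown in the traceback
        line_size_scaling=80,      # the setting that triggers the error
    )


# Hypothetical usage, assuming the PDF was saved locally:
# tables = reproduce("legendado.pdf")
```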

I also have other tables where Camelot produces no such stack trace but introduces line breaks and blank cells. I can live with that by implementing heuristics later in my code, but are you interested in such new bug reports?

@vinayak-mehta

Thanks for reporting the issue!

Yes, please report other issues that you've experienced. I'm fixing an issue that introduces line breaks for v0.5.0, along with some other text-to-cell assignment behaviors. You can check https://github.com/socialcopsdev/camelot/milestone/3 for the complete list. Before filing, please check that your bug isn't already in that list or in the issue tracker.
