Add table extraction capabilities #231
Comments
PyPDF2's text extraction capabilities are somewhat primitive at the moment, though I want to make enhancing them a priority. Consider PDFMiner or PDFBox for your immediate needs; they offer much more sophisticated text extraction. It's not simple to recognize a table structure in a PDF, but these libraries might be able to help.
@mstamy2 Any progress on this? I guess getting all the text and its position might be sufficient to cover most use cases already.
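To illustrate why text plus position could already go a long way, here is a minimal sketch that groups positioned text fragments into rows and orders each row left to right. The input format (a list of `(x, y, text)` tuples), the function name `group_into_rows`, and the `y_tolerance` parameter are assumptions made for this example, not something PyPDF2 provides; how the positions are obtained from the PDF is not shown. It also assumes PDF-style coordinates where y grows upward.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows, ordered top to bottom.

    Hypothetical helper: fragments whose y coordinate lies within
    y_tolerance of the first fragment of the current row are treated
    as belonging to that row.
    """
    rows = []  # list of (row_y, [(x, text), ...])
    # PDF coordinates usually grow upward, so a larger y means higher on the page.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, order the cells left to right by x.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


fragments = [
    (50, 700, "Name"), (200, 700.5, "Qty"),
    (50, 680, "Apple"), (200, 679.8, "3"),
]
table = group_into_rows(fragments)
```

This only recovers rows; deciding which rows actually form a table, and handling merged or wrapped cells, would need further heuristics.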
CC @sims1253
@mstamy2 Hello, I'm interested in the table side too. As this is still open, I suppose no progress has been made in this direction. Is that right?
Table extraction is super hard. There are libraries that attempt to do just that (I think "Tabula" and "Excalibur" were the names).
I'm closing this for now, as it distracts from other topics that seem more important. If anybody has an idea how to approach this in a reasonable way, I'm open to discussion and can re-open :-) For people looking for solutions: the best I've got for you is https://pypi.org/project/camelot-py/ or developing something on your own, e.g. using a layout-preserving extraction (e.g. pdftotext -layout) plus some heuristics.
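As a rough sketch of the "layout-preserving extraction plus heuristics" idea: assuming the text has already been extracted with its layout preserved (for example with `pdftotext -layout`), one simple heuristic is to treat consecutive lines that split into several columns on runs of two or more spaces as a table block. The function name `find_table_blocks` and the parameters `min_columns` and `min_rows` are hypothetical, chosen for this example; real documents will need more robust rules.

```python
import re


def find_table_blocks(text, min_columns=2, min_rows=2):
    """Return (start, end) line-index pairs (end exclusive) of table-like blocks.

    Heuristic only: a line counts as a table row if splitting it on runs
    of two or more spaces yields at least min_columns pieces.
    """
    blocks = []
    start = None
    lines = text.splitlines()
    # The trailing empty string is a sentinel that flushes a block ending at EOF.
    for i, line in enumerate(lines + [""]):
        stripped = line.strip()
        columns = len(re.split(r" {2,}", stripped)) if stripped else 0
        if columns >= min_columns:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_rows:
                blocks.append((start, i))
            start = None
    return blocks


sample = (
    "Report for Q1\n"
    "\n"
    "Item      Qty    Price\n"
    "Apple     3      1.20\n"
    "Pear      10     0.80\n"
    "\n"
    "Total is listed above."
)
blocks = find_table_blocks(sample)
```

The returned index pairs mark where each candidate table starts and ends in the extracted text, which can then be sliced out for post-processing. Justified prose or lines with incidental double spaces can produce false positives, so the thresholds would need tuning per document.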
@MartinThoma
Good idea @pubpub-zz 👍 See #1181
Hello,
I am converting a PDF file into a text file. The table's text is extracted as is, but in the extracted text file I cannot tell where the table starts and ends. I want to know those positions so that I can do some post-processing on it.
Below is my code:
Kindly suggest.