Add table extraction capabilities #231

bonsonsm · 2015-10-14T06:11:26Z

Hello,
I am converting a PDF file into a text file. The text of the table is extracted as is, but in the extracted text file I cannot tell where the table starts and ends. I want to know this so that I can do some post-processing on it.

Below is my code:

import PyPDF2

def extractTextFromPDF(strDownloadDirectory, fileName, txtFilePath):
    filePathName = strDownloadDirectory + fileName
    print(fileName)
    # Drop the 4-character extension (e.g. ".pdf") to build the output name.
    txtFilePath = txtFilePath + fileName[:-4] + '.txt'
    with open(filePathName, 'rb') as pdfFileObj, \
            open(txtFilePath, 'w', encoding='utf-8') as target_file:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        intPages = pdfReader.getNumPages()
        print(intPages)
        for i in range(intPages):
            strText = pdfReader.getPage(i).extractText().rstrip()
            # Replace non-breaking spaces and collapse runs of whitespace.
            strText = " ".join(strText.replace(u"\xa0", " ").split())
            print(strText)
            # Write inside the loop so every page is saved, not only the last.
            target_file.write(strText + "\n")

Kindly suggest.

mstamy2 · 2016-01-04T21:13:12Z

PyPDF2's text extraction capabilities are somewhat primitive at the moment, though I want to make enhancing them a priority.

Consider PDFMiner or PDFBox for your immediate needs; they feature much more sophisticated text extraction resources.

It's not simple to recognize a table structure in PDF, but these libraries might be able to help.

sils · 2016-11-07T12:38:51Z

@mstamy2 any progress on this? I guess getting all text and its position might be sufficient to cover most use cases already.
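
To illustrate why text plus position could go a long way, here is a minimal sketch that groups positioned fragments into rows. It does not use any PyPDF2 API: the `(x, y, text)` tuples, the `group_into_rows` name, and the `y_tolerance` parameter are all assumptions made for illustration, and it assumes PDF-style coordinates where `y` grows upward.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Group hypothetical (x, y, text) fragments into rows of cell texts.

    Fragments whose y value lies within y_tolerance of the first fragment
    in the current row are treated as belonging to that row.
    """
    rows = []  # each entry: (y of the row's first fragment, [(x, text), ...])
    # Assumption: y grows upward, so sort by descending y to read top-down.
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Order the cells of each row from left to right.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


fragments = [
    (50, 700, "Name"), (200, 700.5, "Qty"),
    (50, 680, "Apple"), (200, 680, "3"),
]
print(group_into_rows(fragments))
```

A run of consecutive rows with the same number of cells and similar x positions would then be a candidate table, though real documents will need more careful heuristics than this.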

sils · 2016-11-07T12:39:11Z

CC @sims1253

emanuelevivoli · 2021-10-12T14:20:00Z

@mstamy2 hello, I'm interested in the table side too. As this is still open, I suppose no progress has been made in this direction. Is that right?
Please let me know :)
Thanks

MartinThoma · 2022-06-06T13:03:44Z

Table extraction is super hard. There are libraries that attempt to do just that (I think "Tabula" and "excalibur" were the names).

MartinThoma · 2022-06-17T05:39:32Z

I've just noticed "TABLE 10.29 Standard layout attributes" with

MartinThoma · 2022-07-29T16:53:37Z

I'm closing this for the moment as it distracts from other topics that seem more important at the moment.

If anybody has an idea how to approach this in a reasonable way: I'm open for discussions and I can re-open :-)

For people looking for solutions: The best I've got for you is https://pypi.org/project/camelot-py/ or developing something on your own, e.g. using a layout-preserving extraction (e.g. pdftotext -layout) + some heuristics.
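
As a rough sketch of the "layout-preserving extraction + some heuristics" idea, the snippet below looks for where a table might start and end in text that has already been extracted with its layout preserved (for example by `pdftotext -layout`). The function name `find_table_blocks`, its parameters, and the rule "a table row is a line with at least two cells separated by two or more spaces" are assumptions for illustration, not part of any library.

```python
import re


def find_table_blocks(text, min_columns=2, min_rows=2):
    """Return inclusive (start, end) line indices of table-like blocks.

    Heuristic (an assumption, not a general solution): a line counts as a
    table row if splitting it on runs of two or more spaces yields at
    least min_columns cells. A block needs at least min_rows such lines.
    """
    lines = text.splitlines()
    blocks = []
    start = None
    for i, line in enumerate(lines):
        cells = [c for c in re.split(r" {2,}", line.strip()) if c]
        is_row = len(cells) >= min_columns
        if is_row and start is None:
            start = i
        elif not is_row and start is not None:
            if i - start >= min_rows:
                blocks.append((start, i - 1))
            start = None
    # Close a block that runs to the end of the text.
    if start is not None and len(lines) - start >= min_rows:
        blocks.append((start, len(lines) - 1))
    return blocks


sample = (
    "Intro paragraph here.\n"
    "Name     Qty   Price\n"
    "Apple    3     1.20\n"
    "Pear     5     0.80\n"
    "Closing remarks."
)
print(find_table_blocks(sample))
```

This only works if the extraction keeps column spacing intact; collapsing whitespace, as in the code from the original post, removes exactly the signal the heuristic relies on.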

pubpub-zz · 2022-07-29T17:39:31Z

@MartinThoma
Perhaps you should open a discussion listing all the features that have been set aside for the moment, so that they can be found directly.

MartinThoma · 2022-07-29T18:01:25Z

Good idea @pubpub-zz 👍 See #1181

mstamy2 added the workflow-text-extraction label (From a user's perspective, text extraction is the affected feature/workflow) May 19, 2016

MartinThoma added the is-feature label (A feature request) Jun 10, 2022

MartinThoma changed the title ~~Unable to identify the tables~~ Add table extraction capabilities Jun 17, 2022

MartinThoma closed this as completed Jul 29, 2022
