Unable to extract text from PDFs #829

azeemhusain811 · 2021-02-15T11:18:23Z

Describe the bug
I am using the file conversion code snippet here. It reports "File not found", but the file is in the same folder.

Error message

Expected behavior
It should work smoothly.

I just reinstalled Haystack with pip.

pip install Farm-Haystack

xpatronum · 2021-02-16T11:29:23Z

Hi, @azeemhusain811!
It uses xpdfreader under the hood.

command = ["pdftotext", "-enc", encoding, str(file_path), "-"]

Are you sure you have pdftotext installed correctly? Could you confirm that running it directly from the terminal gives you the result you expect?
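The same check can be done from Python. The following is a minimal sketch, not Haystack's actual code: `find_pdftotext`, `build_command`, and `extract_text` are hypothetical helper names, and the default encoding `"UTF-8"` is an assumption. It also suggests one likely reason for the confusing message: when the executable itself is missing, `subprocess` raises `FileNotFoundError`, which can look as if the PDF were missing.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def find_pdftotext() -> Optional[str]:
    """Return the full path of the pdftotext executable, or None if it is not on PATH."""
    return shutil.which("pdftotext")


def build_command(file_path: Path, encoding: str = "UTF-8") -> List[str]:
    """Build the same command line as shown above; "-" sends the text to stdout."""
    return ["pdftotext", "-enc", encoding, str(file_path), "-"]


def extract_text(file_path: Path, encoding: str = "UTF-8") -> str:
    """Run pdftotext on one PDF, separating 'tool missing' from 'PDF missing'."""
    if find_pdftotext() is None:
        raise FileNotFoundError("pdftotext executable not found on PATH")
    if not file_path.is_file():
        raise FileNotFoundError(f"PDF not found: {file_path}")
    result = subprocess.run(
        build_command(file_path, encoding), capture_output=True, check=True
    )
    # Assumes `encoding` is also a valid Python codec name (true for "UTF-8").
    return result.stdout.decode(encoding, errors="replace")
```

If `find_pdftotext()` returns `None`, the problem is the installation or the PATH, not the PDF.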

azeemhusain811 · 2021-02-16T12:11:20Z

Hi @thenewera-ru
Thank you for your response.
I just checked, but no luck. It does not work from the terminal either.

I am on Windows.

xpatronum · 2021-02-17T08:06:55Z

> Hi @thenewera-ru
> Thank you for your response.
> I have just checked but no luck. It is not working from the terminal also.
>
> I am on windows

It looks like you don't have pdftotext installed. Please install it first here. Then restart your machine so that Windows can find pdftotext via the PATH system variable.

azeemhusain811 · 2021-02-17T14:39:47Z

Hi @thenewera-ru,
It seems the issue is with pdftotext itself. I tried installing it from PyPI.
I will also try to install from the link which you mentioned, but an alternative way could be to use PyPDF2.

azeemhusain811 · 2021-02-18T07:53:26Z

There was something wrong with pdftotext. After going through pdftotext's documentation, I found that it depends on Poppler.

First install Poppler: conda install -c conda-forge poppler
Then install pdftotext: pip install pdftotext

Reference: jalan/pdftotext#16 (comment)

It worked for me.

The solution is for Windows users only.
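To confirm that both steps took effect, a small check like the one below may help. It is only a sketch: `check_pdftotext_install` is a hypothetical helper name. It reports separately whether a `pdftotext` executable is on PATH (which the command quoted earlier in this thread calls) and whether the `pdftotext` Python module is importable (which `pip install pdftotext` provides).

```python
import importlib.util
import shutil
from typing import Dict


def check_pdftotext_install() -> Dict[str, bool]:
    """Report which pdftotext flavours are visible from this Python environment."""
    return {
        # Command-line tool, looked up on PATH like any other executable.
        "cli_on_path": shutil.which("pdftotext") is not None,
        # Python binding installed with `pip install pdftotext`.
        "python_module": importlib.util.find_spec("pdftotext") is not None,
    }


if __name__ == "__main__":
    for name, available in check_pdftotext_install().items():
        print(f"{name}: {'found' if available else 'missing'}")
```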

@tholor please have a look, if we can add it somewhere in the documentation.

tholor · 2021-02-18T08:56:11Z

Thanks for reporting your solution here, @azeemhusain811.
We could add this to the error message here. We already have advice for Linux and macOS, so we could add one line for Windows.
Would you be interested in creating a pull request with these changes?
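As a rough sketch of what such a change could look like (this is not the existing Haystack error message): `INSTALL_HINTS` and `pdftotext_install_hint` are hypothetical names, the Linux and macOS commands are illustrative assumptions, and the Windows line is the command reported above in this thread.

```python
import platform
from typing import Optional

# Keys match the values returned by platform.system().
INSTALL_HINTS = {
    "Linux": "sudo apt-get install poppler-utils",  # assumption: Debian/Ubuntu
    "Darwin": "brew install poppler",  # assumption: Homebrew is available
    "Windows": "conda install -c conda-forge poppler",  # from this thread
}


def pdftotext_install_hint(system: Optional[str] = None) -> str:
    """Return an error message with an install hint for the given or current OS."""
    system = system or platform.system()
    message = "pdftotext is not installed or not on PATH."
    hint = INSTALL_HINTS.get(system)
    if hint is None:
        return message
    return f"{message} On {system}, try: {hint}"
```

Keeping the hints in one dictionary means adding a platform is a one-line change.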

azeemhusain811 · 2021-03-02T19:15:56Z

Hi @tholor,
I am pretty busy with other things at the moment.
I will definitely fix it when I have time.

Thanks for your quick response, and sorry for responding late. 😄

azeemhusain811 added the type:bug (Something isn't working) label Feb 15, 2021

azeemhusain811 closed this as completed Feb 15, 2021

azeemhusain811 reopened this Feb 15, 2021

azeemhusain811 closed this as completed Feb 18, 2021
