Unable to extract text from PDFs #829

Closed
azeemhusain811 opened this issue Feb 15, 2021 · 7 comments
Labels
type:bug Something isn't working

Comments

@azeemhusain811

Describe the bug
I am using the file conversion code snippet here. It reports "File not found", even though the file is in the same folder.

Error message
image

Expected behavior
It should work smoothly.

I just reinstalled Haystack from pip.

pip install Farm-Haystack

azeemhusain811 added the type:bug label Feb 15, 2021

@xpatronum
Contributor

Hi, @azeemhusain811!
It uses xpdfreader under the hood:

command = ["pdftotext", "-enc", encoding, str(file_path), "-"]

Are you sure you have pdftotext installed correctly? Could you confirm that running it directly from the terminal gives you the result you expect?
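
The same check can be done from Python. The sketch below uses only the standard library; `find_pdftotext`, `build_command` and `extract_text` are hypothetical helper names, not Haystack functions, and the default encoding `UTF-8` is an assumption. The argument list mirrors the command quoted above. If the executable is missing, `subprocess` itself raises `FileNotFoundError`, which may be where a "File not found" message comes from even when the PDF exists.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def find_pdftotext() -> Optional[str]:
    """Return the full path of the pdftotext executable, or None if it is not on PATH."""
    return shutil.which("pdftotext")


def build_command(file_path: Path, encoding: str = "UTF-8") -> List[str]:
    """Build the same argument list as the command quoted above."""
    return ["pdftotext", "-enc", encoding, str(file_path), "-"]


def extract_text(file_path: Path, encoding: str = "UTF-8") -> str:
    """Run pdftotext on file_path and return the extracted text."""
    if find_pdftotext() is None:
        # Distinguish a missing executable from a missing PDF file.
        raise FileNotFoundError("pdftotext executable was not found on PATH")
    if not file_path.is_file():
        raise FileNotFoundError(f"PDF file does not exist: {file_path}")
    result = subprocess.run(build_command(file_path, encoding), capture_output=True, check=True)
    # Assumption: the pdftotext encoding name is also a valid Python codec name.
    return result.stdout.decode(encoding, errors="ignore")
```

Calling `find_pdftotext()` in the same environment that runs Haystack shows whether that process can see the executable at all.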

@azeemhusain811
Author

azeemhusain811 commented Feb 16, 2021

Hi @thenewera-ru
Thank you for your response.
I just checked, but no luck. It does not work from the terminal either.

I am on Windows.

image

@xpatronum
Contributor

It looks like you don't have pdftotext installed. Please install it first here. Then restart your machine so that Windows can find pdftotext through the PATH system variable.

@azeemhusain811
Author

azeemhusain811 commented Feb 17, 2021

Hi @thenewera-ru,
It seems the issue is with pdftotext itself; I tried installing it from PyPI.
I will also try installing it from the link you mentioned, but an alternative could be to use PyPDF2.

image

@azeemhusain811
Author

azeemhusain811 commented Feb 18, 2021

There was something wrong with pdftotext. After going through pdftotext's documentation, I found that it has a dependency.

First install Poppler: conda install -c conda-forge poppler
Then install pdftotext: pip install pdftotext

Reference: jalan/pdftotext#16 (comment)

It worked for me.

The solution is for Windows users only.
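
After installing, one way to confirm that the tool is reachable is to ask it for its version. This is a minimal sketch using only the standard library; `pdftotext_version` is a hypothetical helper name, and it assumes the installed pdftotext accepts the `-v` flag and prints its version banner to stderr or stdout.

```python
import shutil
import subprocess
from typing import Optional


def pdftotext_version() -> Optional[str]:
    """Return the first line of `pdftotext -v`, or None if the tool is not on PATH."""
    exe = shutil.which("pdftotext")
    if exe is None:
        return None
    # No check=True: the exit code of `-v` is not relied upon here.
    result = subprocess.run([exe, "-v"], capture_output=True, text=True, errors="replace")
    output = (result.stderr or result.stdout).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    version = pdftotext_version()
    print(version if version is not None else "pdftotext is not on PATH")
```

A `None` result after installation usually means the shell or IDE needs to be restarted so that it picks up the updated PATH.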

@tholor, please have a look and see whether we can add this somewhere in the documentation.

@tholor
Member

tholor commented Feb 18, 2021

Thanks for reporting your solution here, @azeemhusain811.
We could add this to the error message here. We already have advice for Linux and macOS; we could add one line for Windows.
Would you be interested in creating a pull request with these changes?
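
One possible shape for such a platform-aware message is sketched below. The function name `pdftotext_install_hint` and the wording are hypothetical, not Haystack's actual code. The Windows line comes from the solution reported above; the Linux and macOS advice is not quoted in this thread, so the sketch only points to it.

```python
import platform
from typing import Optional


def pdftotext_install_hint(system: Optional[str] = None) -> str:
    """Return an installation hint for the given OS name (defaults to the current OS)."""
    system = system or platform.system()
    if system == "Windows":
        # Steps as reported in this thread for Windows users.
        return (
            "pdftotext is not installed. On Windows, install Poppler first "
            "(conda install -c conda-forge poppler), then run: pip install pdftotext"
        )
    # Placeholder: the existing Linux and macOS advice is not reproduced here.
    return "pdftotext is not installed. See the existing installation advice for Linux and macOS."
```

Passing the OS name explicitly, as in `pdftotext_install_hint("Windows")`, makes the message easy to test on any machine.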

@azeemhusain811
Author

Hi @tholor,
I am rather busy with other things at the moment.
I will definitely fix it as soon as I am available.

Thanks for your quick response, and sorry for responding late. 😄
