Line returns missing in text_extraction() #2138
Comments
I've removed the whitespace tag: this issue deals with line returns.
Whitespace includes newlines. I just edited the description of the tag to make that explicit. To me, those space/newline issues look related: I think we touch similar parts of the code, and the types of issues users have are similar. Am I wrong about that?
The issue comes from cm being modified at the "same time" as Tm: to get the actual text position, we need to compare tm.cm with tm_prev.cm_prev (cm_prev is currently not saved). The big point is the change merged from #2060: we are passing tm_prev, but cm_matrix, which is not consistent.
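To illustrate the point about cm and Tm, here is a minimal sketch, not pypdf's actual code. It assumes PDF matrices are six-element lists `[a, b, c, d, e, f]` and that the position of a piece of text is the translation part of the product of the text matrix and the current transformation matrix. `mult` and `position` are hypothetical helpers, and the numbers are made up.

```python
def mult(m, n):
    """Multiply two PDF matrices given as [a, b, c, d, e, f]."""
    return [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]


def position(tm, cm):
    """Return the (x, y) translation of tm applied inside cm."""
    combined = mult(tm, cm)
    return combined[4], combined[5]


# Illustrative values: the text matrix stays the same, but cm moves
# the second line 20 units down.
tm_prev = [1, 0, 0, 1, 10, 700]
cm_prev = [1, 0, 0, 1, 0, 0]
tm = [1, 0, 0, 1, 10, 700]
cm = [1, 0, 0, 1, 0, -20]

# Consistent comparison: previous position uses tm_prev together with cm_prev.
correct_dy = position(tm, cm)[1] - position(tm_prev, cm_prev)[1]  # -20

# Inconsistent comparison: tm_prev combined with the *current* cm.
mixed_dy = position(tm, cm)[1] - position(tm_prev, cm)[1]  # 0
```

With the consistent pair the vertical move is visible, so a line return can be detected; with the mixed pair the move cancels out and the two lines look as if they sit at the same height.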
@yonglee7015, the PR is now OK if you want to test it.
Yes, how can I test it?
#2142 is the PR
I think I should document this somewhere 🤔
@MartinThoma does this trick work?
Yes! I completely forgot about that!
By the way: could you please rename it from PyPDF2 to pypdf? It might be confusing to others if they see PyPDF2.
Oops. I didn't know how to do it before.
Thanks 😊 I will test it.
Hi @pubpub-zz, yes, it works. Thanks for your help. Can you also test this PDF file, page 3? 1. The first line of text in the PDF goes to the last line in the output text. 2. The order of text in the table is not correct. Can you fix this? I tried another library, tika-python,
You have reached the limit of pypdf's current implementation. Sorry, there is no solution for the moment with pypdf. 😞
Oh no. That's a pity. Thank you.
I have not tested it, but shouldn't a visitor be able to fix the order on the user side in this case? https://pypdf.readthedocs.io/en/latest/user/extract-text.html#using-a-visitor
It is more complex: you need to know whether there are columns, what their coordinates are... Maybe AI could help...
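For the simplest case only, the visitor idea could look like the rough sketch below. It assumes a single-column page and makes no attempt to detect columns or table cells, so it does not answer the objection above. `make_collector`, `reorder`, `extract_reordered`, and the `line_tolerance` value are hypothetical names chosen for illustration; only `PdfReader` and the `visitor_text` argument of `extract_text()` come from pypdf.

```python
def make_collector():
    """Return (fragments, visitor); the visitor records (x, y, text)."""
    fragments = []

    def visitor_text(text, cm, tm, font_dict, font_size):
        if not text.strip():
            return
        # Position of the text: translation part of tm combined with cm.
        x = tm[4] * cm[0] + tm[5] * cm[2] + cm[4]
        y = tm[4] * cm[1] + tm[5] * cm[3] + cm[5]
        fragments.append((x, y, text))

    return fragments, visitor_text


def reorder(fragments, line_tolerance=2.0):
    """Group fragments into lines by y (top first), then sort each line by x."""
    lines = []  # each entry: [y of the line, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    return "\n".join(
        " ".join(t.strip() for _, t in sorted(items, key=lambda it: it[0]))
        for _, items in lines
    )


def extract_reordered(pdf_path, page_index):
    """Extract one page with pypdf and reorder its fragments by position."""
    from pypdf import PdfReader  # third-party; imported only when called

    fragments, visitor = make_collector()
    PdfReader(pdf_path).pages[page_index].extract_text(visitor_text=visitor)
    return reorder(fragments)
```

Calling `extract_reordered("AEO.1172.pdf", 2)` would try this on page 3 of the attached file; whether the result is usable for that document has not been checked here.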
PDF file:
https://github.com/py-pdf/pypdf/files/12483807/AEO.1172.pdf
Can you also test the page.extract_text() function? It always seems to combine sentences that span multiple lines without adding a space.
See the first page of my attached file.
Originally posted by @yonglee7015 in #2135 (reply in thread)