v2.1 extract_text() misses newline characters #957

S1SYPHOS · 2022-06-07T09:16:18Z

Hey there,
after updating from v2.0 to v2.1, extracted text that used to be separated by whitespace is now glued together (see the example below).

Environment

Machine: Linux-5.17.5-76051705-generic-x86_64-with-glibc2.34
PyPDF2: 2.1.0

Code

This is a minimal, complete example that shows the issue:

import PyPDF2

# First page
page = PyPDF2.PdfReader('tests/fixtures/test.pdf').pages[0]

print(page.extract_text())

Now, output with v2.0 was like this:

Staatsanwaltschaft Freiburg
Berliner Allee 1, 79114 Freiburg im Breisgau
Tel.:07612050
FreiburgimBreisgau,19.11.2021
Sitzungsplan der Staatsanwaltschaft
Zeitraum
29.11.2021
-
03.12.2021
Eildienst
26.11.2021
-
29.11.2021
Plattner, Adalbert , EOAA

Using v2.1, I get this:

Staatsanwaltschaft FreiburgBerliner Allee 1, 79114 Freiburg im BreisgauTel.: 0761 20 50Freiburg im Breisgau, 19.11.2021

Sitzungsplan der Staatsanwaltschaft



Zeitraum



29.11.2021-03.12.2021Eildienst26.11.2021-29.11.2021Plattner, Adalbert , EOAA

PDF

The PDF file from the example can be found here. The names were redacted, so it contains no personal information, despite appearances.


MartinThoma · 2022-06-07T11:23:29Z

Thank you for sharing and for taking the time to write an awesome bug report!
We will investigate it.

In the meantime, you can use _extract_text_old to get the pre-2.1 behavior.
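A minimal sketch of that workaround. `_extract_text_old` is the private method named above, so it may be missing in other releases; the helper name `extract_text_compat` is made up for illustration, and the helper falls back to the public method when the private one is absent:

```python
def extract_text_compat(page):
    """Prefer the pre-2.1 extractor when the page object still has it."""
    # _extract_text_old is private API mentioned in this thread; it is not
    # guaranteed to exist in every version, so look it up defensively.
    old_extractor = getattr(page, "_extract_text_old", None)
    if callable(old_extractor):
        return old_extractor()
    # Otherwise use the public extractor of whatever version is installed.
    return page.extract_text()
```

With the page from the report this would be called as extract_text_compat(PyPDF2.PdfReader('tests/fixtures/test.pdf').pages[0]).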

S1SYPHOS · 2022-06-07T12:02:07Z

Yeah, for the time being I constrained its version like 'PyPDF2==2.0.0', but you are right!

MartinThoma · 2022-06-07T12:08:05Z

2.0 had issues with spaces between words, e.g. "FreiburgimBreisgau".
2.1 has issues with newlines, e.g. "Staatsanwaltschaft FreiburgBerliner Allee 1".
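Both failure modes can be caught by the same kind of check. The following is only a sketch of a regression test helper, not something from the PyPDF2 test suite; the function name `lines_not_standalone` is hypothetical, and the sample strings are copied from the outputs in the original report:

```python
def lines_not_standalone(extracted, expected_lines):
    """Return the expected lines that do not appear as a line of their own."""
    actual = {line.strip() for line in extracted.splitlines()}
    return [line for line in expected_lines if line.strip() not in actual]


expected = ["Staatsanwaltschaft Freiburg", "Freiburg im Breisgau, 19.11.2021"]

# Excerpt of the v2.0 output: newlines are present, spaces are missing.
v20_output = (
    "Staatsanwaltschaft Freiburg\n"
    "Berliner Allee 1, 79114 Freiburg im Breisgau\n"
    "Tel.:07612050\n"
    "FreiburgimBreisgau,19.11.2021"
)

# First line of the v2.1 output: spaces are present, newlines are missing.
v21_output = (
    "Staatsanwaltschaft FreiburgBerliner Allee 1, 79114 Freiburg im Breisgau"
    "Tel.: 0761 20 50Freiburg im Breisgau, 19.11.2021"
)

missing_v20 = lines_not_standalone(v20_output, expected)
# -> ['Freiburg im Breisgau, 19.11.2021']
missing_v21 = lines_not_standalone(v21_output, expected)
# -> ['Staatsanwaltschaft Freiburg', 'Freiburg im Breisgau, 19.11.2021']
```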

MartinThoma · 2022-06-07T12:19:56Z

Might be related to #591

pubpub-zz · 2022-06-07T19:32:42Z

I've started to look at the file, and the PDF shows cases I would never have guessed. The Tm matrix is inverted, which means that the document is filled upside down.

Correction is under analysis...
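For readers unfamiliar with Tm: the operator sets the text matrix [a b c d e f], and a negative d means that "up" in text space points down once mapped. A rough sketch of why that can confuse newline detection follows. It ignores the cm matrix that a real page also applies, and the function names and the y-comparison heuristic are illustrative assumptions, not pypdf's actual code:

```python
def text_space_up_direction(tm):
    """Map the text-space direction (0, 1) through a Tm matrix [a b c d e f]."""
    a, b, c, d, e, f = tm
    # A point (x, y) maps to (a*x + c*y + e, b*x + d*y + f); a direction
    # vector ignores the translation part (e, f), so (0, 1) maps to (c, d).
    return (c, d)


def is_flipped_vertically(tm):
    """True when moving up in text space moves down after the mapping."""
    return text_space_up_direction(tm)[1] < 0


def starts_new_line(prev_y, cur_y, flipped):
    """Hypothetical heuristic: the next line lies below the previous one.

    On an unflipped page "below" means a smaller y; on a flipped page the
    comparison has to be reversed, otherwise line breaks go undetected.
    """
    return cur_y > prev_y if flipped else cur_y < prev_y
```

For example, is_flipped_vertically([1, 0, 0, -1, 50, 700]) is true, while the identity-like matrix [1, 0, 0, 1, 50, 700] is not flipped.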

S1SYPHOS · 2022-06-07T21:14:28Z

Glad it's an edge case, never would've guessed ;)

MartinThoma · 2022-06-19T08:26:03Z

I just confirmed that this is still an issue with the current master (soon PyPDF2==2.3.0) 😢

pubpub-zz · 2022-06-19T08:36:38Z

The fix has not been released yet; still working on it...

pubpub-zz · 2022-07-09T21:04:23Z

improved by PR #1084

* ENH : extract width from CIDFontType0/2
* ENH : improve cr/lf and space extraction
* BUG : fix error in decoding #1075
* FIX: in ToUnicode ignore comments (starting with %)
* FIX: extend utf16 for min of 4 characters

Improves #234
Improves #957
Closes #1003
Closes #1019

Used https://tug.ctan.org/info/symbols/comprehensive/symbols-a4.pdf for testing

creepiepanda · 2023-03-06T14:39:07Z

Not sure if this is related. I was using 2.8.1 and everything worked perfectly but any version above (2.9.0 and higher) had the same issue for me. With 2.9.0 and higher the output for reader.pages[0].extract_text() came entirely without newlines.

pubpub-zz · 2023-03-06T17:38:07Z

@creepiepanda Can you confirm that this was detected with the same PDF? If so, can you provide it?

creepiepanda · 2023-03-06T17:42:04Z

Yes, it's the same PDF every time. I switched pypdf versions multiple times while trying with the same file.
crazy_bean.pdf

pubpub-zz · 2024-04-08T20:43:21Z

extract_text() now has a layout extraction mode (extraction_mode="layout").
This resolves this very old issue.

MartinThoma · 2024-04-08T20:53:09Z

Just for reference

import pypdf
from pypdf import PdfReader

print(f"pypdf=={pypdf.__version__}")
print(PdfReader("test.pdf").pages[0].extract_text())

gives:

pypdf==4.1.0
Staatsanwaltschaft Freiburg
Berliner Allee 1, 79114 Freiburg im Breisgau
Tel.: 0761 20 50 Freiburg im Breisgau, 19.11.2021
Sitzungsplan der Staatsanwaltschaft
Zeitraum 29.11.2021 -03.12.2021
Eildienst 26.11.2021 -29.11.2021 Plattner, Adalbert , EOAA
Eildienst 29.11.2021 -03.12.2021 Häberling, Max , StA
Eildienst 03.12.2021 -06.12.2021 Dr. Eisner, Julien , StA
Tag
Gericht-SpruchkörperSaal,
GebäudeZeitF+Aktenzeichen Sitzungsvertreter
Anfahrt
Montag 29.11.2021
LG Freiburg im Breisgau , -
Strafkammer XVI-IV, 2. OG 09:00F+668 Js 13432/21 Lederer, Chloe , StA'in
Böhm , Lara, StA'in
AG Emmendingen , - Schöffengericht
-09:00 429 Js 16713/21 Pistorius, Adalbert , OAA
AG Freiburg im Breisgau , - Abt. 24 - EG 09:00 214 Js 38759/19 Herrmann(640), Lia, Ref'in
AG Freiburg im Breisgau , - Abt. 26 - IX 09:15 580 Js 39067/19 Mayer (650), Leo, Ref
AG Freiburg im Breisgau , - Abt. 32 - EG 09:00F216 Js 29564/19 Dr. Eisner, Julien , StA
AG Freiburg im Breisgau , - Abt. 34 - XI, 1. OG 13:00+277 Js 26886/19 Braunke, Kurt , StA
AG Freiburg im Breisgau , - Abt. 17 - 09:00 168 Js 24544/22 Bader, Alina-Karla , StA'in
10:30 537 Js 24193/22 Bader, Alina-Karla , StA'in
VIII, Holzmarkt 6 11:00 678 Js 23165/22 Bader, Alina-Karla , StA'in
11:45 116 Js 45639/19 Bader, Alina-Karla , StA'in
AG Freiburg im Breisgau , - Abt. 18 - VII 09:00 171 Js 33617/22
10:45 132 Js 38282/21
13:00 141 Js 45191/20
13:30 169 Js 50915/22
14:15 583 Js 18397/22
AG Freiburg im Breisgau , - Abt. 20 - III, EG 09:00 617 Js 44625/19 Raineke, Patrick , EStA
16:30+635 Js 46988/19 Hofmann, Krause, OStA
Raineke, Patrick , EStA
AG Freiburg im Breisgau , - Abt. 21 - IV, 1. OG 09:00+248 Js 53757/21 Häberling, Max , StA
AG Kenzingen , Strafabteilung 4, EG 09:00 637 Js 20701/21 Sägezahn, Ida , StA'in
10:00 168 Js 52660/20 Sägezahn, Ida , StA'in
11:30 187 Js 19607/20 Sägezahn, Ida , StA'in
AG Müllheim , - Strafabteilung - 09:00 345 Js 50760/22 Bauer (210), Joel, Ref
AG Staufen im Breisgau , -
Strafabteilung -09:00+474 Js 50679/19 Freygang, Ole , EStA
Dienstag 30.11.2021
LG Freiburg im Breisgau , -
Strafkammer II -IV 09:00F512 Js 40456/20 Brodesser, Boris , EStA
LG Freiburg im Breisgau , -
Strafkammer V -09:00 340 Js 22587/19 Luhmann, Jasmin, StA'in
LG Freiburg im Breisgau , -
Strafkammer XIV -09:00 289 Js 21296/22 Knorzig , Kathleen , StA'in
09:00 273 Js 55642/21 Knorzig , Kathleen , StA'in
Seite 1

and print(PdfReader("test.pdf").pages[0].extract_text(extraction_mode="layout"))

gives:

Staatsanwaltschaft Freiburg
Berliner Allee 1, 79114 Freiburg im Breisgau
Tel.: 0761 20 50                                                                                                                         Freiburg im Breisgau, 19.11.2021

Sitzungsplan der Staatsanwaltschaft
Zeitraum                    29.11.2021     - 03.12.2021


Eildienst                   26.11.2021     - 29.11.2021           Plattner, Adalbert , EOAA
Eildienst                   29.11.2021     - 03.12.2021           Häberling, Max , StA
Eildienst                   03.12.2021     - 06.12.2021           Dr. Eisner, Julien , StA


Tag                                          Saal,                     Zeit     F+  Aktenzeichen               Sitzungsvertreter
Gericht-Spruchkörper                         Gebäude                                                           Anfahrt

Montag                 29.11.2021
LG Freiburg im Breisgau , -                  IV, 2. OG                 09:00    F+  668 Js 13432/21            Lederer, Chloe , StA'in
Strafkammer XVI-                                                                                               Böhm , Lara, StA'in
AG Emmendingen , - Schöffengericht                                     09:00        429 Js 16713/21            Pistorius, Adalbert , OAA
-
AG Freiburg im Breisgau , - Abt. 24 -EG                                09:00        214 Js 38759/19            Herrmann(640), Lia, Ref'in
AG Freiburg im Breisgau , - Abt. 26 -IX                                09:15        580 Js 39067/19            Mayer (650), Leo, Ref
AG Freiburg im Breisgau , - Abt. 32 -EG                                09:00    F   216 Js 29564/19            Dr. Eisner, Julien , StA
AG Freiburg im Breisgau , - Abt. 34 -XI, 1. OG                         13:00     +  277 Js 26886/19            Braunke, Kurt , StA
AG Freiburg im Breisgau , - Abt. 17 -                                  09:00        168 Js 24544/22            Bader, Alina-Karla , StA'in
                                                                       10:30        537 Js 24193/22            Bader, Alina-Karla , StA'in
                                             VIII, Holzmarkt 6         11:00        678 Js 23165/22            Bader, Alina-Karla , StA'in
                                                                       11:45        116 Js 45639/19            Bader, Alina-Karla , StA'in
AG Freiburg im Breisgau , - Abt. 18 -VII                               09:00        171 Js 33617/22
                                                                       10:45        132 Js 38282/21
                                                                       13:00        141 Js 45191/20
                                                                       13:30        169 Js 50915/22
                                                                       14:15        583 Js 18397/22
AG Freiburg im Breisgau , - Abt. 20 -III, EG                           09:00        617 Js 44625/19            Raineke, Patrick , EStA
                                                                       16:30     +  635 Js 46988/19            Hofmann, Krause, OStA
                                                                                                               Raineke, Patrick , EStA
AG Freiburg im Breisgau , - Abt. 21 -IV, 1. OG                         09:00     +  248 Js 53757/21            Häberling, Max , StA
AG Kenzingen , Strafabteilung                4, EG                     09:00        637 Js 20701/21            Sägezahn, Ida , StA'in
                                                                       10:00        168 Js 52660/20            Sägezahn, Ida , StA'in
                                                                       11:30        187 Js 19607/20            Sägezahn, Ida , StA'in
AG Müllheim , - Strafabteilung -                                       09:00        345 Js 50760/22            Bauer (210), Joel, Ref
AG Staufen im Breisgau , -                                             09:00     +  474 Js 50679/19            Freygang, Ole , EStA
Strafabteilung -


Dienstag               30.11.2021
LG Freiburg im Breisgau , -                  IV                        09:00    F   512 Js 40456/20            Brodesser, Boris , EStA
Strafkammer II -
LG Freiburg im Breisgau , -                                            09:00        340 Js 22587/19            Luhmann, Jasmin, StA'in
Strafkammer V -
LG Freiburg im Breisgau , -                                            09:00        289 Js 21296/22            Knorzig , Kathleen , StA'in
Strafkammer XIV -
                                                                       09:00        273 Js 55642/21            Knorzig , Kathleen , StA'in




                                                                                                                                              Seite 1
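Since layout mode separates columns with runs of spaces, individual rows can be split back into cells afterwards. A minimal sketch, assuming that two or more consecutive spaces always mark a column boundary (the helper name `split_layout_columns` is made up for illustration and is not part of pypdf):

```python
import re


def split_layout_columns(line):
    """Split one layout-mode line into cells on runs of two or more spaces."""
    stripped = line.strip()
    if not stripped:
        return []
    # Single spaces stay inside a cell, e.g. "Plattner, Adalbert , EOAA".
    return re.split(r" {2,}", stripped)


row = (
    "Eildienst                   26.11.2021     - 29.11.2021"
    "           Plattner, Adalbert , EOAA"
)
cells = split_layout_columns(row)
# -> ['Eildienst', '26.11.2021', '- 29.11.2021', 'Plattner, Adalbert , EOAA']
```

Note that this drops empty cells, so rows with blank columns (such as the continuation rows in the table above) lose their column positions; slicing by character offset would be needed to keep them aligned.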

S1SYPHOS added the is-bug label Jun 7, 2022

S1SYPHOS assigned MartinThoma Jun 7, 2022

MartinThoma added the workflow-text-extraction label Jun 7, 2022

MartinThoma changed the title from "v2.1 output of extract_text() glued together" to "v2.1 extract_text() misses newline characters" Jun 11, 2022

pubpub-zz mentioned this issue Jun 11, 2022: improved ExtractText(3) #969 (merged)

MartinThoma added the soon label Jun 19, 2022

MartinThoma removed the soon label Jun 19, 2022

MartinThoma mentioned this issue Jul 11, 2022: ENH: Extract Text Enhancement (whitespaces) #1084 (merged)

MartinThoma added the whitespace label Jan 14, 2023

stefan6419846 unassigned MartinThoma Feb 20, 2024

pubpub-zz closed this as completed Apr 8, 2024
