Unnecessary line breaks inserted #508

rubenvanerk · 2022-01-19T18:06:51Z

When I parse this file, many line breaks are inserted:

                                Marshalltown High SchoolHy-Tek's MEET MANAGER  1:04 PM  6/21/2021  Page 1
2021 Live Healthy Iowa Kids State Track Meet - 6/19/2021
Marshalltown High School	
Results	
 	Long Jump STANDING LJ 8 & Under Division Girls Age	
   Team	Name Finals
Finals Sibley
  7
Sandersfeld, Aubree
1 J5-00.00
DeWitt
  8
Borgen, Gracie
2 J5-00.00
Forest City
  8

This can be fixed by disabling this check:

pdfparser/src/Smalot/PdfParser/PDFObject.php

Line 274 in ddb14ae

if (((float) $x <= 0) ||

The output then becomes:

                                Marshalltown High SchoolHy-Tek's MEET MANAGER  1:04 PM  6/21/2021  Page 1
2021 Live Healthy Iowa Kids State Track Meet - 6/19/2021
Marshalltown High School	
Results	
 	Long Jump STANDING LJ 8 & Under Division Girls Age	
   Team	Name Finals
Finals Sibley  7Sandersfeld, Aubree 1 J5-00.00
DeWitt   8Borgen, Gracie 2 J5-00.00
Forest City   8Jenkins, Cleo 3 J5-00.00

I implemented a possible solution which allows someone to disable the check with a configuration value: rubenvanerk@27b9a26

I could create a PR for this but:

I don't know what to call this configuration value
I don't know if this is the best solution
It does seem like a very specific problem: no other issue seems to mention it, and none of them looks fixable with the proposed solution
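To make the idea concrete, here is a minimal, self-contained sketch of gating such a check behind a flag. It is not the code from the linked commit or from pdfparser itself: the function names, the `$checkEnabled` parameter, and the x offsets in the sample data are all hypothetical, and only the `(float) $x <= 0` comparison is taken from the line quoted above.

```php
<?php
// Hypothetical sketch; none of these names exist in pdfparser.

// Decide whether a text fragment should start on a new line.
// Mirrors only the first half of the quoted condition: a non-positive
// horizontal offset is treated as the start of a new line.
function shouldInsertLineBreak(string $x, bool $checkEnabled = true): bool
{
    return $checkEnabled && ((float) $x <= 0);
}

// Concatenate [xOffset, text] pairs, optionally applying the check.
function joinFragments(array $fragments, bool $checkEnabled = true): string
{
    $text = '';
    foreach ($fragments as [$x, $fragment]) {
        // Never emit a break before the very first fragment.
        if ('' !== $text && shouldInsertLineBreak($x, $checkEnabled)) {
            $text .= "\n";
        }
        $text .= $fragment;
    }

    return $text;
}

// Made-up offsets, chosen so that the check fires on the last two fragments.
$fragments = [
    ['10', 'Sibley'],
    ['-4.5', '  7'],
    ['0', 'Sandersfeld, Aubree'],
];

echo joinFragments($fragments, true), "\n";  // three separate lines
echo joinFragments($fragments, false), "\n"; // Sibley  7Sandersfeld, Aubree
```

With the flag on, each fragment with a non-positive offset lands on its own line, as in the first output above; with it off, the fragments are joined as in the second output. Defaulting the flag to `true` would keep the existing behaviour for everyone who does not opt out.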


Reqrefusion · 2022-01-19T20:23:33Z

I have run into this problem as well, though not often, and I did not know it could be solved this way. It is a huge problem, especially with tables.
I did not save the files where I saw it before, but I tested today's issue of the official gazette and reproduced it. The problem shows up at least once a day in these PDFs.
Sample:

MEVKİİ ADA
NO PARSEL
NO YÜZÖLÇÜMÜ

(m²) HAZİNE
HİSSESİ
CİNSİ

İMAR
DURUMU

TAHMİNİ
BEDEL(TL) GEÇİCİ
TEMİNAT
(TL) İHALE
GÜNÜ İHALE
SAATİ
1 34370103926
Beylikdüzü
Kavaklı (Beylikdüzü)

Related file: https://www.resmigazete.gov.tr/ilanlar/eskiilanlar/2022/01/20220119-3-5.pdf

Of course, I am not sure that your fix would solve the problem for this file. Still, it shows that this is not a very specific problem.
If you are in doubt, you can expose the setting as a modifiable value in the config. After all, a solution is a solution, and it is always better to give people more choices. I have nothing more to add, since you have already implemented this.

k00ni added the labels question, bug, help wanted, missing or incomplete functionality, parsing fail and removed the label question · Jan 20, 2022

GreyWyvern mentioned this issue Aug 10, 2023

PdfParser does not consider the entire document stream #628

Closed

GreyWyvern mentioned this issue Aug 18, 2023

Major Update to PDFObject.php + Ancillary #634

Merged

k00ni closed this as completed in #634 Nov 7, 2023
