Unnecessary line breaks inserted #508

Closed
rubenvanerk opened this issue Jan 19, 2022 · 1 comment · Fixed by #634
Labels: bug, help wanted, missing or incomplete functionality, parsing fail

Comments

@rubenvanerk
Contributor

When I parse this file, many line breaks are inserted:

                                Marshalltown High SchoolHy-Tek's MEET MANAGER  1:04 PM  6/21/2021  Page 1
2021 Live Healthy Iowa Kids State Track Meet - 6/19/2021
Marshalltown High School	
Results	
 	Long Jump STANDING LJ 8 & Under Division Girls Age	
   Team	Name Finals
Finals Sibley
  7
Sandersfeld, Aubree
1 J5-00.00
DeWitt
  8
Borgen, Gracie
2 J5-00.00
Forest City
  8

This can be fixed by disabling this check:

if (((float) $x <= 0) ||

The output then becomes:

                                Marshalltown High SchoolHy-Tek's MEET MANAGER  1:04 PM  6/21/2021  Page 1
2021 Live Healthy Iowa Kids State Track Meet - 6/19/2021
Marshalltown High School	
Results	
 	Long Jump STANDING LJ 8 & Under Division Girls Age	
   Team	Name Finals
Finals Sibley  7Sandersfeld, Aubree 1 J5-00.00
DeWitt   8Borgen, Gracie 2 J5-00.00
Forest City   8Jenkins, Cleo 3 J5-00.00

I implemented a possible solution which allows someone to disable the check with a configuration value: rubenvanerk@27b9a26

I could create a PR for this but:

  • I don't know what to call this configuration value
  • I don't know if this is the best solution
  • It does seem like a very specific problem: no other issues seem to mention it, and none of them could be fixed with the proposed solution
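
To make the proposal concrete, here is a minimal, self-contained sketch of what a configurable version of such a check could look like. It is not the library's code: the function names, the `$breakOnNonPositiveX` flag, and the fragment format are all hypothetical, and the vertical test (a nonzero `ty`) is a stand-in chosen for illustration, since the issue only shows the first half of the real condition.

```php
<?php
declare(strict_types=1);

// Illustrative sketch only. Every name below is hypothetical and none of
// this is taken from the library's source.

/**
 * Pick the separator to emit before the text that follows a positioning move.
 *
 * $breakOnNonPositiveX is a hypothetical configuration flag. When true, any
 * move with tx <= 0 starts a new line (the behaviour reported in the issue).
 * When false, that check is skipped.
 */
function separatorForMove(float $tx, float $ty, bool $breakOnNonPositiveX): string
{
    if ($breakOnNonPositiveX && $tx <= 0) {
        return "\n";
    }
    // Stand-in vertical test (an assumption): a nonzero ty means a new line.
    if ($ty != 0.0) {
        return "\n";
    }
    // Same line: a move to the right becomes a space, otherwise nothing.
    return $tx > 0 ? ' ' : '';
}

/**
 * Join [tx, ty, text] fragments using the separator rule above.
 */
function joinFragments(array $fragments, bool $breakOnNonPositiveX = true): string
{
    $text = '';
    foreach ($fragments as $i => [$tx, $ty, $chunk]) {
        if ($i > 0) {
            $text .= separatorForMove((float) $tx, (float) $ty, $breakOnNonPositiveX);
        }
        $text .= $chunk;
    }
    return $text;
}

// Made-up offsets loosely modelled on one row of the sample output.
$row = [
    [0, 0, 'Sibley'],
    [-40, 0, '7'],
    [0, 0, 'Sandersfeld, Aubree'],
    [30, 0, '1'],
    [0, -12, 'DeWitt'],
];

echo joinFragments($row, true), "\n---\n", joinFragments($row, false), "\n";
```

With the flag enabled, every fragment whose horizontal offset is zero or negative lands on its own line. With it disabled, only the vertical move before `DeWitt` produces a break. Keeping the flag enabled by default would preserve the current behaviour for existing users.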
@Reqrefusion

I have run into this problem too, although not often, and I did not know it could be solved this way. It is a serious problem, especially with tables.
I did not keep the files where I saw it before, but I tested today's issue of the official gazette and reproduced it. The problem shows up at least once a day in these PDFs.
Sample:

MEVKİİ ADA
NO PARSEL
NO YÜZÖLÇÜMÜ

(m²) HAZİNE
HİSSESİ
CİNSİ

İMAR
DURUMU

TAHMİNİ
BEDEL(TL) GEÇİCİ
TEMİNAT
(TL) İHALE
GÜNÜ İHALE
SAATİ
1 34370103926
Beylikdüzü
Kavaklı (Beylikdüzü)

Related file: https://www.resmigazete.gov.tr/ilanlar/eskiilanlar/2022/01/20220119-3-5.pdf

For this particular file, I am not sure the proposed fix solves the problem. Still, it shows that the problem is not specific to a single file.
If you are in doubt, you can expose the check as a configurable value in the config. A solution is a solution, and giving people more choices is always better. I have nothing more to add, since you have already implemented it.

@k00ni added the question, bug, help wanted, missing or incomplete functionality, and parsing fail labels and removed the question label on Jan 20, 2022
@k00ni closed this as completed in #634 on Nov 7, 2023