Spaces not respected when dealing with numbers in a borderless table #568

Closed
jagritiinnohealth opened this issue Jan 2, 2023 · 8 comments · Fixed by #634

jagritiinnohealth commented Jan 2, 2023

Description:

The PDF contains a borderless table with numeric data. The parser does not preserve the spaces between the numbers.

PDF input

PDF Attached

Expected output & actual output

Expected Output
Patient report Bio-Rad DATE: 12/29/2022 D-10 TIME: 01:45 PM S/N: #DJ0G520325 Software version: 4.30-2 Sample ID: 35803 Injection date 12/29/2022 12:21 PM Injection #: 21 Method: HbA2/F Rack #: --- Rack position: 2 Peak table - ID: 35803
Peak R.time Height Area Area %
A1a 0.20 2955 16154 1.2
A1b 0.28 2911 13155 0.9
F 0.46 530 6306 < 0.8 *
LA1c/CHb-1 0.64 964 7119 0.5
LA1c/CHb-2 0.73 1442 9759 0.7
A1c 0.88 5093 50721 5.5
P3 1.53 8531 67945 4.9
A0 1.76 259075 1181768 85.2
A2 3.38 2307 34548 2.8
Total Area: 1387477 Concentration: %F < 0.8 *A1c 5.5 A2 2.8

Actual Output
Patient report Bio-Rad DATE: 12/29/2022 D-10 TIME: 01:45 PM S/N: #DJ0G520325 Software version: 4.30-2 Sample ID: 35803 Injection date 12/29/2022 12:21 PM Injection #: 21 Method: HbA2/F Rack #: --- Rack position: 2Peak table - ID: 35803
Peak R.time Height Area Area %
A1a 0.202955161541.2
A1b 0.282911131550.9
F 0.465306306< 0.8 *
LA1c/CHb-1 0.6496471190.5
LA1c/CHb-2 0.73144297590.7
A1c 0.885093507215.5
P3 1.538531679454.9
A0 1.76259075118176885.2
A2 3.382307345482.8
Total Area: 1387477 Concentration: %F < 0.8 *A1c 5.5A2 2.8

Code

$parser = new \Smalot\PdfParser\Parser();
$PDFfile = $fileEntity->getFileUri();
$PDF = $parser->parseFile($PDFfile);
$PDFContent = $PDF->getText();
echo $PDFContent;

k00ni added the bug label Jan 3, 2023

Collaborator

k00ni commented Jan 3, 2023

Sorry to hear that. Can you please try again with the custom option setFontSpaceLimit?

https://github.com/smalot/pdfparser/blob/master/doc/CustomConfig.md#option-setfontspacelimit


amitsedai commented Jan 3, 2023

Dear k00ni,

Thank you for responding. I tried setting setFontSpaceLimit to -60 and to 0, but the output did not change. Is there anything else that I can try?


$config = new \Smalot\PdfParser\Config();
$config->setFontSpaceLimit(-60);
// $config->setFontSpaceLimit(0);
$parser = new \Smalot\PdfParser\Parser([], $config);

Collaborator

k00ni commented Jan 3, 2023

If I remember correctly, this library has a few problems with whitespace. You could also try the other whitespace-related options mentioned here: https://github.com/smalot/pdfparser/blob/master/doc/CustomConfig.md#config-options-overview

amitsedai commented

Dear k00ni,

Thank you for cross checking.

I have tried setHorizontalOffset, setting it to either a single space or a tab character. Unfortunately, it did not work. Do you think setting different values for setPdfWhitespaces will work? Can you please guide us on which configurations we could choose from? Also, do you know of other libraries we could try, to check whether they work better?

IMO, the library mostly fails to preserve whitespace between numbers. In the header row of the same table, where alphanumeric values are present, it does recognize and keep the spaces.

Thanks.

amitsedai commented

Dear k00ni,

Interestingly, the data matrix returned by getDataTm() seems to capture the numbers separately. Reading and analyzing the values from that data matrix is working for us now. Thanks.

$data = $pdfObject->getPages()[0]->getDataTm();
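
For anyone landing here with the same problem, here is a rough sketch of how the getDataTm() result could be turned back into table rows. It assumes each entry is a pair of a six-element text matrix and the text itself, with the x position at index 4 and the y position at index 5, and that y grows upward on the page. The helper name groupDataTmIntoRows and the $yTolerance parameter are made up for this example and are not part of the library. It needs PHP 7.4 or newer.

```php
<?php
// Hypothetical helper, not part of smalot/pdfparser.
// Groups [matrix, text] pairs (the shape getDataTm() is assumed to return)
// into rows by their y position, then orders each row from left to right.
function groupDataTmIntoRows(array $dataTm, float $yTolerance = 2.0): array
{
    $rows = [];
    foreach ($dataTm as $entry) {
        [$tm, $text] = $entry;   // any extra elements in the entry are ignored
        $x = (float) $tm[4];     // assumed: horizontal position
        $y = (float) $tm[5];     // assumed: vertical position

        // Attach the fragment to an existing row with (almost) the same y.
        $placed = false;
        foreach ($rows as &$row) {
            if (abs($row['y'] - $y) <= $yTolerance) {
                $row['cells'][] = ['x' => $x, 'text' => $text];
                $placed = true;
                break;
            }
        }
        unset($row);

        if (!$placed) {
            $rows[] = ['y' => $y, 'cells' => [['x' => $x, 'text' => $text]]];
        }
    }

    // Assumed: larger y means higher on the page, so sort rows top to bottom.
    usort($rows, fn ($a, $b) => $b['y'] <=> $a['y']);

    $lines = [];
    foreach ($rows as $row) {
        // Sort the cells of a row from left to right and join them with spaces.
        usort($row['cells'], fn ($a, $b) => $a['x'] <=> $b['x']);
        $lines[] = implode(' ', array_column($row['cells'], 'text'));
    }

    return $lines;
}
```

With the $data array from above, calling groupDataTmIntoRows($data) would give one string per table row, with the numbers separated by single spaces. The tolerance of 2.0 is only a guess and probably has to be tuned per document.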


sabit12 commented Mar 23, 2023

Please please please help me with this.

In my PDF table, whenever there is a single number or character, it gets merged with the next value.
I have tried setHorizontalOffset, setting it to either a single space or a tab character, and I have also tried setFontSpaceLimit, but neither worked at all. Please help me.

Expected Output:
31/01/2023 20,000,000 0 0 STD No
Getting Output:
31/01/2023 20,000,000 00STD No

Expected Output:
30/11/2022 875 7 SS
Getting Output:
30/11/2022 875 7SS


sabit12 commented Apr 3, 2023

@k00ni: can you please help me out?

Collaborator

k00ni commented Apr 3, 2023

Sorry, but I don't have the time right now. But you could try out https://github.com/tecnickcom/TCPDF and https://github.com/tecnickcom/tc-lib-pdf to see if they give you better results.
