Spaces not respected when dealing with numbers in a borderless table #568
Comments
Sorry to hear. Can you please try with custom option https://github.com/smalot/pdfparser/blob/master/doc/CustomConfig.md#option-setfontspacelimit
Dear k00ni, Thank you for responding. I tried setting setFontSpaceLimit to -60 and to 0, but there is no change in the output. Is there anything else that I can try?
If I remember correctly, this library has a few problems with whitespaces. You could also try the other whitespace-related options mentioned here https://github.com/smalot/pdfparser/blob/master/doc/CustomConfig.md#config-options-overview
Dear k00ni, Thank you for cross-checking. I tried setHorizontalOffset, setting it to either a single space or a tab character. Unfortunately, it did not work. Do you think setting different values for setPdfWhitespaces will work? Can you please guide us on which configurations we could choose from? Also, do you know of other libraries we could try to see whether they work better? IMO it mostly looks like the library is not respecting whitespace between numbers. In the same table's header, where alphanumeric values are present, it recognizes the spaces and maintains the whitespace. Thanks.
Dear k00ni, Interestingly, the data matrix seems to have the numbers captured separately. Capturing and analyzing the values from data matrix seems to be working for us now. Thanks.
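The workaround described above (reading the individually positioned text fragments instead of the flattened text) can be sketched roughly as follows. This is only an illustration in Python, not the library's API: the `(x, y, text)` fragment shape, the `group_rows` helper, the `y_tolerance` value, and the coordinates are all assumptions made for the example. Only the cell values are taken from the table in this issue. The "data matrix" mentioned in the comment is probably the output of the library's `getDataTm()` page method, but the thread does not say so explicitly.

```python
from typing import List, Tuple

# Hypothetical fragment shape: (x, y, text). Coordinates below are invented.
Fragment = Tuple[float, float, str]


def group_rows(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Join fragments that share roughly the same y into space-separated rows."""
    rows: List[Tuple[float, List[Fragment]]] = []
    # PDF y coordinates grow upwards, so sort by descending y to read top to bottom.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - frag[1]) <= y_tolerance:
            rows[-1][1].append(frag)  # same visual line as the previous fragment
        else:
            rows.append((frag[1], [frag]))  # start a new line
    # Within each line, order cells left to right and separate them with a space.
    return [
        " ".join(text for _, _, text in sorted(cells, key=lambda f: f[0]))
        for _, cells in rows
    ]


sample: List[Fragment] = [
    (240.0, 700.0, "16154"), (50.0, 700.0, "A1a"), (120.0, 700.0, "0.20"),
    (180.0, 700.0, "2955"), (300.0, 700.0, "1.2"),
    (50.0, 685.0, "A1b"), (120.0, 685.5, "0.28"), (180.0, 685.0, "2911"),
    (240.0, 685.0, "13155"), (300.0, 685.0, "0.9"),
]
rows = group_rows(sample)
print(rows)
```

Because each number arrives as its own fragment, the separator is added explicitly when joining, so the result does not depend on how the parser infers whitespace from glyph spacing.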
Please please please help me with this. In my PDF table, whenever there is a single number or character, it gets merged with the next value. Expected Output:
@k00ni: can you please help me out?
Sorry, but I don't have the time right now. But you could try out https://github.com/tecnickcom/TCPDF and https://github.com/tecnickcom/tc-lib-pdf to see if they give you better results.
35803_2-21-29-12-2022-R.pdf
Description:
The PDF has a borderless table with numeric data. The parser does not respect the spaces between numbers.
PDF input
PDF Attached
Expected output & actual output
Expected Output
Patient report Bio-Rad DATE: 12/29/2022 D-10 TIME: 01:45 PM S/N: #DJ0G520325 Software version: 4.30-2 Sample ID: 35803 Injection date 12/29/2022 12:21 PM Injection #: 21 Method: HbA2/F Rack #: --- Rack position: 2 Peak table - ID: 35803 Peak R.time Height Area Area % A1a 0.20 2955 16154 1.2 A1b 0.28 2911 13155 0.9 F 0.46 530 6306 < 0.8 * LA1c/CHb-1 0.64 964 7119 0.5 LA1c/CHb-2 0.73 1442 9759 0.7 A1c 0.88 5093 50721 5.5 P3 1.53 8531 67945 4.9 A0 1.76 259075 1181768 85.2 A2 3.38 2307 34548 2.8 Total Area: 1387477 Concentration: %F < 0.8 *A1c 5.5 A2 2.8
Actual Output
Patient report Bio-Rad DATE: 12/29/2022 D-10 TIME: 01:45 PM S/N: #DJ0G520325 Software version: 4.30-2 Sample ID: 35803 Injection date 12/29/2022 12:21 PM Injection #: 21 Method: HbA2/F Rack #: --- Rack position: 2Peak table - ID: 35803 Peak R.time Height Area Area % A1a 0.202955161541.2 A1b 0.282911131550.9 F 0.465306306< 0.8 * LA1c/CHb-1 0.6496471190.5 LA1c/CHb-2 0.73144297590.7 A1c 0.885093507215.5 P3 1.538531679454.9 A0 1.76259075118176885.2 A2 3.382307345482.8 Total Area: 1387477 Concentration: %F < 0.8 *A1c 5.5A2 2.8
Code