Inserting white spaces between letters #72

ricardobrg · 2015-07-04T04:23:47Z

When I try to extract text from this file (http://billybala.brgweb.com.br/tmp/1435983113.pdf), it inserts white space between letters:

The code:

   $fulltext = 'Full text: ';
   $parser = new \Smalot\PdfParser\Parser();
   $pdfsource = $parser->parseFile($pdf);
   $pages = $pdfsource->getPages();
   $pagecount = count($pages);
   $output .= "Total pages: $pagecount<br>";
   // Loop over each page to extract text.
   foreach ($pages as $page) {
       $fulltext .= utf8_decode($page->getText());
   }
   echo $fulltext;

The output is:

Full text: C 3 9 9 0 0 9 N T O U D e a r E d i t o r : W i t h g r o w i n g p o p u l a t i o n a n d e v e r m o r e a d v a n c e d t e c h n o l o g i e s , n a t u r a l r e s o u r c e s p r o v i d e d b y l a n d c a n n o l o n g e r f u l f i l l t h e i n c r e a s i n g d e m a n d s o f t h e h u m a n p o p u l a t i o n . I n a n a t t e m p t t o e n s u r e t h e i r m a r i n e r i g h t s a n d i n t e r e s t s , m a n y c o u n t r i e s h a v e t a k e n t h e s t e p t o e s t a b l i s h c o m p e t e n t m a r i n e a u t h o r i t i e s . T h e s e a u t h o r i t i e s a r e a s s i g n e d t h e m i s s i o n t o i n t e g r a t e o c e a n p o l i c i e s , a n d h a v e t h e r e s p o n s i b i l i t y t o o v e r s e e v a r i o u s m a r i n e a f f a i r s . T h i s p a p e r g i v e s i n s i g h t i n t o t h e s c o p e o f m a r i n e a f f a i r s , a n d s u m m a r i z e s t h e p r e s e n t s t a t e o f m a r i n e a u t h o r i t y e s t a b l i s h m e n t i n a n u m b e r o f o c e a n s t a t e s i n c l u d i n g t h e U n i t e d S t a t e s , C a n a d a , C h i n a , J a p a n a n d K o r e a . I t g o e s o n t o d i s c u s s t h e h i s t o r y a n d p r o c e s s o f e s t a b l i s h i n g c o m p e t e n t m a r i n e a u t h o r i t i e s i n T a i w a n . T h e T a i w a n e s e G o v e r n m e n t h a s c o n f i r m e d t h e e s t a b l i s h m e n t o f T h e T a s k F o r c e f o r M a r i t i m e A f f a i r s ( T h e T a s k F o r c e ) . I t i s r e s p o n s i b l e f o r t h e i n t e g r a t i o n o f v a r i o u s m a r i n e a u t h o r i t i e s i n T a i w a n a n d t h e n e w l y e s t a b l i s h e d c o m p e t e n t a u t h o r i t y i s s c h e d u l e d t o c o m e i n t o o p e r a t i o n i n J a n u a r y 2 0 1 2 . 
T h e T a s k F o r c e w i l l s e r v e a s a u s e f u l s o u r c e o f r e f e r e n c e f o r m a n y s c h o l a r s w o r k i n g i n t h e m a r i n e a n d o c e a n s c i e n c e d i s c i p l i n e s . F u r t h e r m o r e , t h e i n s i g h t f u l a n a l y s i s p r e s e n t e d i n t h i s p a p e r w i l l e n a b l e y o u r r e a d e r s t o a c q u i r e a b e t t e r u n d e r s t a n d i n g o f t h e s c o p e o f m a r i n e a f f a i r s , t h e s t a t u s o f c o m p e t e n t m a r i n e a u t h o r i t i e s i n c e r t a i n c o u n t r i e s , a n d t h e h i s t o r y a n d p r o c e s s i n t h e e s t a b l i s h m e n t o f s u c h a u t h o r i t i e s i n T a i w a n . F o r t h e r e a s o n s s t a t e d , I f e e l t h i s p a p e r p r o v i d e s h i g h l y v a l u a b l e i n f o r m a t i o n s u i t a b l e f o r p u b l i c a t i o n i n y o u r j o u r n a l . S h o u l d t h e e d i t o r h a v e a n y s u g g e s t i o n o r c o m m e n t p l e a s e d o n o t h e s i t a t e t o c o n t a c t u s , w e s h a l l r e s p o n d i m m e d i a t e l y .

huuhungus · 2015-11-25T10:41:22Z

Have you found a workaround for this issue yet?
I have exactly the same problem as you.
I am trying to solve it now.

ricardobrg · 2015-11-25T10:47:37Z

No. I had problems with Chinese characters too, so I changed the script. This project was meant to convert .doc and .docx files to PDF and extract the text to a variable. Now I extract the text straight from the MS Word files before converting.

huuhungus · 2015-11-25T11:10:17Z

I found a fix. I'm not sure it's correct, but it worked for me ^_^
In src/Smalot/PdfParser/Object.php, line 275, comment out this line:
//$text .= ' ';
It doesn't fix the problem completely, but the result is acceptable. I guess we need to add some more conditions there.

lode · 2015-12-23T19:14:40Z

I've had the same issue for a few days, since the PDFs I'm working with changed. @huuhungus' change works for me as well, but doesn't feel completely right. Could this be caused by an Adobe PDF update? It would be great to have this fixed, as it renders parsing quite useless.

I could provide example PDFs with and without this issue if that helps.

jstrobel · 2016-02-10T16:58:31Z

The fix from @huuhungus works for me, too. I have PDFs which are generated via Qt 4.8.6 (content: wkhtmltopdf 0.12.1).

oliver-ni · 2016-08-16T16:43:13Z

The fix from @huuhungus works for me as well.

farjad-hasan · 2016-10-26T12:18:18Z

I found a fix for this. There is a method in pdfparser-master/src/Smalot/PdfParser/Font.php on line 338:

    /**
     * @return int
     */
    protected function getFontSpaceLimit()
    {
        return -50;
    }

Decrease this value to -60, or to whatever value is more suitable for your PDF text.

Explanation:
There is a check in the code that compares the pixel position of a character with this default font space limit. If the value is smaller than the limit, the parser increments the character position, which leaves a space between characters. It is a font character-spacing issue: some fonts have larger character spacing, and the PDF parser is not able to adjust to it.
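
To make the explanation above concrete, here is a minimal sketch of that kind of threshold check. It is a simplified illustration and not the library's actual code: the function name joinTjArray and the sample numbers are hypothetical. It assumes text arrives as a list of strings interleaved with numeric position adjustments, and that an adjustment below the limit is treated as a word gap.

```php
<?php
// Simplified illustration only; NOT the code from Font.php.
// joinTjArray() and the sample data are hypothetical.

/**
 * Joins text fragments, starting a new word whenever a numeric
 * adjustment is smaller than the space limit.
 *
 * @param array $items      strings interleaved with numeric adjustments
 * @param float $spaceLimit adjustments below this value count as a word gap
 */
function joinTjArray(array $items, float $spaceLimit = -50): string
{
    $words = [''];
    foreach ($items as $item) {
        if (is_string($item)) {
            // Text fragment: append it to the current word.
            $words[count($words) - 1] .= $item;
        } elseif ($item < $spaceLimit) {
            // Adjustment beyond the limit: treat it as a word boundary.
            $words[] = '';
        }
    }
    return implode(' ', $words);
}

// A font with wide per-glyph spacing (-55) and a real word gap (-300).
$glyphs = ['D', -55, 'e', -55, 'a', -55, 'r', -300, 'E', -55, 'd'];

echo joinTjArray($glyphs, -50), "\n"; // every glyph becomes its own word
echo joinTjArray($glyphs, -60), "\n"; // only the large gap splits words
```

With a limit of -50 the per-glyph spacing of -55 is mistaken for a word gap, which matches the symptom in this issue. Lowering the limit only helps when the spaces come from this kind of check, which may explain why it has no effect for some files.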

davejtoews · 2017-09-01T17:14:07Z

This is an old thread, but the problem persists.

As of v0.10.0, the fix mentioned above by @huuhungus now needs to be done on line 295.

For the files I am working with the fix mentioned by @nuttyprogrammer has no effect whatsoever.

robneu · 2018-01-13T05:06:47Z

+1 for this still being an issue. Above fix also worked in my situation.

lmasforne · 2018-01-20T18:09:05Z

+1 for this still being an issue.
In PDFObject, line 295, comment out this line:
//$text .= ' ';
Works for me.

kingafrojoe · 2018-04-03T16:05:17Z

+1, still an issue.
@huuhungus' solution worked.
I commented out lines 295, 308 and 338 inside the getText() method of class PDFObject.

jee7 · 2019-03-16T01:25:31Z

@lmasforne Is there a reason why this committed fix is not a PR?

I'm parsing a PDF exported from Google Docs and it adds arbitrary spaces inside the words. Like this:

th e a va ila ble in fo rm atio n a nd p re se nts it in a fo rm o f a s u rv e y

While it should be like this:

the available information and presents it in a form of a survey.

It works correctly after commenting out the mentioned line 295.

ricardobrg · 2019-05-03T13:37:05Z

Wow, this is still open...

Did anybody try this fix? @nuttyprogrammer, is it working?

I found a fix for this. There is a method in pdfparser-master/src/Smalot/PdfParser/Font.php on line 338:

    /**
     * @return int
     */
    protected function getFontSpaceLimit()
    {
        return -50;
    }

Decrease this value to -60, or to whatever value is more suitable for your PDF text.

Explanation:
There is a check in the code that compares the pixel position of a character with this default font space limit. If the value is smaller than the limit, the parser increments the character position, which leaves a space between characters. It is a font character-spacing issue: some fonts have larger character spacing, and the PDF parser is not able to adjust to it.

Although @huuhungus' fix works, it looks like a quick patch rather than a solution. I believe we should take a look at @nuttyprogrammer's solution and try to find out why the default value is -50 and not -60, and whether changing it would solve the problem. That would lead us to a solution and possibly a PR to close this issue.

davejtoews · 2019-05-03T15:02:06Z

@ricardobrg it's been a while since I looked at this but when I tried @nuttyprogrammer's solution it had no effect for me. Only @huuhungus's fix worked.

jee7 · 2019-05-03T18:45:01Z

Doesn't work for me either. I created a new Google Docs document with the following content:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
The quick brown fox jumps over the lazy dog

When I get the PDF from that and use the parser:

$parser = new Parser();
$pdf = $parser->parseContent($rawPDF);
dd($pdf->getText());

I get this:

"""
Lore m ip su m d olo r s it a m et, c o nse cte tu r a dip is cin g e lit , s e d d o e iu sm od te m po r in cid id unt u t\n
\n
la bore e t d olo re m agna a liq ua.\n
\n
The q uic k b ro w n fo x ju m ps o ve r t h e la zy d og
"""

I tried changing the getFontSpaceLimit() return value to -60, -100 and 0. It did not change anything.
Notice that, in addition to the spaces, there are also too many \n symbols. What works for me is commenting out the vertical and horizontal offset lines in PDFObject.php (289, 295):

 if ((floatval($x) <= 0) ||
     ($current_position_td['y'] !== false && floatval($y) < floatval($current_position_td['y']))
 ) {
     // vertical offset
     //$text .= "\n";
 } elseif ($current_position_td['x'] !== false && floatval($x) > floatval(
         $current_position_td['x']
     )
 ) {
     // horizontal offset
     //$text .= ' ';
 }

Then I get this:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. The quick brown fox jumps over the lazy dog"

Yes, I have now also lost the real line breaks (in addition to the fake ones), but my goal is to get the word count, which now is correct. Although I guess I could just replace the double line breaks with single ones later... The line breaks are not that much of an issue, but I can not get rid of the arbitrary spaces between the words later on.
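
As a rough sketch of the post-processing mentioned above (collapsing the doubled line breaks and then counting words), something like the following could work. The helper names normalizeLineBreaks and countWords are hypothetical and not part of the parser. Note that this only cleans up line breaks; it cannot repair spaces that were inserted inside words.

```php
<?php
// Hypothetical helpers for cleaning up parser output; not part of pdfparser.

/** Collapses every run of two or more "\n" characters into a single one. */
function normalizeLineBreaks(string $text): string
{
    // preg_replace() returns null on error, so fall back to the input.
    return preg_replace('/\n{2,}/', "\n", $text) ?? $text;
}

/** Counts words by splitting on any run of whitespace. */
function countWords(string $text): int
{
    $words = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return $words === false ? 0 : count($words);
}

$raw = "Lorem ipsum dolor sit amet\n\nThe quick brown fox";
$clean = normalizeLineBreaks($raw);

echo $clean, "\n";
echo countWords($clean), "\n";
```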

rf1234 · 2019-06-05T08:01:17Z

None of the solutions above works for me. Surprisingly I have no issues running the parser locally using XAMPP. The problem only occurs on my Linux Server. Weird.

huuhungus · 2019-06-05T17:01:03Z

@rf1234 what PHP version do you run on the server and locally?

rf1234 · 2019-06-10T08:21:37Z

Sorry for being so late with my response! I have a workaround running now. The workaround is to eliminate all white space from the parsed text, so that it becomes basically illegible. That doesn't matter for me because I only need it for full-text search, and nobody sees it. I also eliminate all white space from the search strings, and then the search works fine. Since I also search in texts where spaces are not eliminated because they don't get parsed, I search with a logical OR (either the string with blanks or the string without blanks needs to find a match in the texts).

Locally I run PHP 7.2.1 and on the server it is 7.3.6
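
The whitespace-insensitive search described above could be sketched roughly as follows. The function names stripWhitespace and matchesFullText are hypothetical, and this version strips white space at query time rather than storing a stripped copy, which is an assumption about the setup. Be aware that matching without spaces can produce false positives across word boundaries.

```php
<?php
// Hypothetical sketch of a whitespace-insensitive full-text match.

/** Removes all white space (spaces, tabs, line breaks) from a string. */
function stripWhitespace(string $s): string
{
    // preg_replace() returns null on error, so fall back to the input.
    return preg_replace('/\s+/u', '', $s) ?? $s;
}

/**
 * Returns true if the query is found either as typed (with blanks)
 * OR with all white space removed from both text and query.
 */
function matchesFullText(string $text, string $query): bool
{
    $compactQuery = stripWhitespace($query);
    if ($compactQuery === '') {
        return false; // nothing to search for
    }
    return stripos($text, $query) !== false
        || stripos(stripWhitespace($text), $compactQuery) !== false;
}

var_dump(matchesFullText('D e a r E d i t o r', 'Dear Editor')); // broken text
var_dump(matchesFullText('Dear Editor', 'dear editor'));         // normal text
var_dump(matchesFullText('Dear Editor', 'marine'));              // no match
```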

vishal0520 · 2020-03-03T04:46:27Z

Hi, I am still facing this issue. Can you please help?

huuhungus · 2020-03-03T07:22:14Z

Did you try my suggestion? It is not a beautiful solution, but it worked.

vishal0520 · 2020-03-03T07:33:51Z

Yes, I tried your solution. I am using this package in Laravel, but it didn't change the result for me.

huuhungus · 2020-03-04T07:28:06Z

I just downloaded the newest version, and it appears the source code has changed.
File to change: PDFObject.php
Line: 295
Text to search for:
// horizontal offset
$text .= ' ';   << comment this out

benairs3 · 2021-07-05T12:25:02Z

The solution of @huuhungus worked for me.
The file is now at src/Smalot/PdfParser/PDFObject.php.
I commented out lines 316 and 329.
Now the parser works perfectly.

Thanks.

Issue: smalot#72

eric-lukyamuzi · 2021-10-11T01:25:59Z

Guys, I am using v.0.11 and it worked for me after the change below.

File edited: /vendor/smalot/src/PDFObject.php

In the code block below I changed $text .= ' '; to $text .= ''; and it worked perfectly for all types of PDFs:

                case 'Td':
                    $args = preg_split('/\s/s', $command[self::COMMAND]);
                    $y    = array_pop($args);
                    $x    = array_pop($args);
                    if ((floatval($x) <= 0) ||
                        ($current_position_td['y'] !== false && floatval($y) < floatval($current_position_td['y']))
                    ) {
                        // vertical offset
                        $text .= "\n";
                    } elseif ($current_position_td['x'] !== false && floatval($x) > floatval(
                            $current_position_td['x']
                        )
                    ) {
                        // horizontal offset
                        $text .= '';
                    }

ChrisSantiago82 · 2022-01-10T14:39:47Z

The solution from @eric-lancelot worked for me. Can we integrate this into the code?

k00ni · 2022-01-11T09:15:51Z

Would changes from #505 help you here?

ChrisSantiago82 · 2022-01-11T09:17:23Z

Yes, it would :)
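
For readers landing here later: with the change from #505, the horizontal offset can be set through the parser's Config object instead of editing PDFObject.php. The sketch below assumes a smalot/pdfparser release that includes #505 and that Composer's autoloader has already been loaded. The wrapper function name extractTextWithoutOffset and the file path are hypothetical.

```php
<?php
// Assumes smalot/pdfparser (a release containing #505) is installed and
// vendor/autoload.php has been required by the calling script.
// extractTextWithoutOffset() is a hypothetical wrapper for illustration.

function extractTextWithoutOffset(string $pdfPath): string
{
    $config = new \Smalot\PdfParser\Config();
    // An empty string replaces the ' ' that used to be hard-coded
    // for horizontal offsets, so words are no longer broken up.
    $config->setHorizontalOffset('');

    $parser = new \Smalot\PdfParser\Parser([], $config);
    $pdf = $parser->parseFile($pdfPath);

    return $pdf->getText();
}

// Usage (path is a placeholder):
// echo extractTextWithoutOffset('document.pdf');
```

This has the same effect as the manual edits described earlier in the thread, but it survives library updates. If real word gaps disappear with an empty string, a different offset value may suit the document better.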

lmasforne added a commit to lmasforne/pdfparser that referenced this issue Jan 20, 2018

- Quick fix to this : smalot#72

58fa3da

fycben mentioned this issue Apr 8, 2020

getText() returns text breaking up words with spaces #201

Closed

rubenvanerk mentioned this issue Jun 26, 2020

Space issues in parsing text #314

Closed

PaulBehrendtVentoro mentioned this issue Jul 8, 2020

respect space width when using "Move text position" stream operator #318

Closed

panique added a commit to panique/pdfparser that referenced this issue Nov 21, 2020

fix for random spaces problem (issue smalot#72).patch

521e676

k00ni added bug help wanted labels Jul 6, 2021

dpassola added a commit to dpassola/pdfparser that referenced this issue Jul 28, 2021

[FIXED] Fixed additional spaces from PDF content

e482ded

Issue: smalot#72

k00ni mentioned this issue Jan 11, 2022

Make horizontal offset configurable #505

Merged

k00ni linked a pull request Jan 11, 2022 that will close this issue

Make horizontal offset configurable #505

Merged

DaiND1902 mentioned this issue Jan 12, 2022

Random spaces in text #494

Closed

k00ni removed a link to a pull request Jan 17, 2022

Make horizontal offset configurable #505

Merged

k00ni linked a pull request Jan 17, 2022 that will close this issue

Make horizontal offset configurable #505

Merged

k00ni closed this as completed in #505 Jan 17, 2022
