Inserting white spaces between letters #72

Closed
ricardobrg opened this issue Jul 4, 2015 · 27 comments · Fixed by #505

Comments

@ricardobrg

When I try to extract text from this file (http://billybala.brgweb.com.br/tmp/1435983113.pdf), it inserts white space between letters.

The code:

   // $pdf is the path to the PDF file; $output is assumed to be initialized earlier.
   $fulltext = 'Full text: ';
   $parser = new \Smalot\PdfParser\Parser();
   $pdfsource = $parser->parseFile($pdf);
   $pages = $pdfsource->getPages();
   $pagecount = count($pages);
   $output .= "Total pages: $pagecount<br>";
   // Loop over each page to extract text.
   foreach ($pages as $page) {
       $fulltext .= utf8_decode($page->getText());
   }
   echo $fulltext;

The output is:

Full text: C 3 9 9 0 0 9 N T O U D e a r E d i t o r : W i t h g r o w i n g p o p u l a t i o n a n d e v e r m o r e a d v a n c e d t e c h n o l o g i e s , n a t u r a l r e s o u r c e s p r o v i d e d b y l a n d c a n n o l o n g e r f u l f i l l t h e i n c r e a s i n g d e m a n d s o f t h e h u m a n p o p u l a t i o n . I n a n a t t e m p t t o e n s u r e t h e i r m a r i n e r i g h t s a n d i n t e r e s t s , m a n y c o u n t r i e s h a v e t a k e n t h e s t e p t o e s t a b l i s h c o m p e t e n t m a r i n e a u t h o r i t i e s . T h e s e a u t h o r i t i e s a r e a s s i g n e d t h e m i s s i o n t o i n t e g r a t e o c e a n p o l i c i e s , a n d h a v e t h e r e s p o n s i b i l i t y t o o v e r s e e v a r i o u s m a r i n e a f f a i r s . T h i s p a p e r g i v e s i n s i g h t i n t o t h e s c o p e o f m a r i n e a f f a i r s , a n d s u m m a r i z e s t h e p r e s e n t s t a t e o f m a r i n e a u t h o r i t y e s t a b l i s h m e n t i n a n u m b e r o f o c e a n s t a t e s i n c l u d i n g t h e U n i t e d S t a t e s , C a n a d a , C h i n a , J a p a n a n d K o r e a . I t g o e s o n t o d i s c u s s t h e h i s t o r y a n d p r o c e s s o f e s t a b l i s h i n g c o m p e t e n t m a r i n e a u t h o r i t i e s i n T a i w a n . T h e T a i w a n e s e G o v e r n m e n t h a s c o n f i r m e d t h e e s t a b l i s h m e n t o f T h e T a s k F o r c e f o r M a r i t i m e A f f a i r s ( T h e T a s k F o r c e ) . I t i s r e s p o n s i b l e f o r t h e i n t e g r a t i o n o f v a r i o u s m a r i n e a u t h o r i t i e s i n T a i w a n a n d t h e n e w l y e s t a b l i s h e d c o m p e t e n t a u t h o r i t y i s s c h e d u l e d t o c o m e i n t o o p e r a t i o n i n J a n u a r y 2 0 1 2 . 
T h e T a s k F o r c e w i l l s e r v e a s a u s e f u l s o u r c e o f r e f e r e n c e f o r m a n y s c h o l a r s w o r k i n g i n t h e m a r i n e a n d o c e a n s c i e n c e d i s c i p l i n e s . F u r t h e r m o r e , t h e i n s i g h t f u l a n a l y s i s p r e s e n t e d i n t h i s p a p e r w i l l e n a b l e y o u r r e a d e r s t o a c q u i r e a b e t t e r u n d e r s t a n d i n g o f t h e s c o p e o f m a r i n e a f f a i r s , t h e s t a t u s o f c o m p e t e n t m a r i n e a u t h o r i t i e s i n c e r t a i n c o u n t r i e s , a n d t h e h i s t o r y a n d p r o c e s s i n t h e e s t a b l i s h m e n t o f s u c h a u t h o r i t i e s i n T a i w a n . F o r t h e r e a s o n s s t a t e d , I f e e l t h i s p a p e r p r o v i d e s h i g h l y v a l u a b l e i n f o r m a t i o n s u i t a b l e f o r p u b l i c a t i o n i n y o u r j o u r n a l . S h o u l d t h e e d i t o r h a v e a n y s u g g e s t i o n o r c o m m e n t p l e a s e d o n o t h e s i t a t e t o c o n t a c t u s , w e s h a l l r e s p o n d i m m e d i a t e l y .

@huuhungus

Do you have a workaround for this issue yet?
I have exactly the same problem as you.
I am trying to solve it now.

@ricardobrg
Author

No. I had problems with Chinese characters too, so I changed the script. This project was meant to convert .doc and .docx files to PDF and extract the text into a variable. Now I extract the text straight from the MS Word files before converting.

@huuhungus

I found a fix. I'm not sure if it's correct, but it worked for me ^_^
In src/Smalot/PdfParser/Object.php, line 275, comment out this line:
//$text .= ' ';
This does not fix it completely, but the result is acceptable. I guess we need to add some more conditions there.

@lode

lode commented Dec 23, 2015

I've had the same issue for a few days, since the PDFs I'm working with changed. @huuhungus' change works for me as well, but it doesn't feel completely right. Could this be an Adobe PDF update? It would be great to have this fixed, as it renders parsing quite useless.

I could provide example PDFs both without and with this issue if that would help.

@jstrobel

The fix from @huuhungus works for me, too. I have PDFs which are generated via Qt 4.8.6 (content: wkhtmltopdf 0.12.1).

@oliver-ni

The fix from @huuhungus works for me as well.

@farjad-hasan

I found a fix for this. There is a method in pdfparser-master/src/Smalot/PdfParser/Font.php on line 338:

    /**
     * @return int
     */
    protected function getFontSpaceLimit()
    {
        return -50;
    }

Decrease this value to -60, or to whatever value is more suitable for your PDF text.

Explanation:
Basically, there is a check in the code that compares the character position (in pixels) with this default font space limit, and if it is smaller, it increments the character position, which leaves a space between characters. It is a font character spacing issue: some fonts have larger character spacing, and the PDF parser is not able to adjust to it.
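
To make the idea of a spacing threshold concrete, here is a minimal sketch. It is an illustration only and is not the code in Font.php: the function joinWithSpaceLimit() and the sample array are hypothetical. It assumes input shaped like the operand of a PDF TJ operator, where strings alternate with numeric position adjustments and a negative number widens the gap before the next glyph.

```php
<?php
// Illustrative sketch only; this is NOT the code in Font.php.
// $elements mimics the operand of a PDF TJ operator: strings mixed with
// numeric adjustments (thousandths of a text space unit; negative values
// move the next glyph to the right, i.e. widen the gap).
function joinWithSpaceLimit(array $elements, float $spaceLimit = -50): string
{
    $text = '';
    foreach ($elements as $element) {
        if (is_string($element)) {
            $text .= $element;
        } elseif ($element < $spaceLimit) {
            // The gap is wider than the limit: treat it as a word break.
            $text .= ' ';
        }
    }
    return $text;
}

// Hypothetical sample data: a small kerning gap and a real word gap.
$elements = ['Th', -60, 'e', -300, 'quick'];
echo joinWithSpaceLimit($elements, -50), "\n";  // prints "Th e quick"
echo joinWithSpaceLimit($elements, -100), "\n"; // prints "The quick"
```

With a limit of -50 the small kerning gap of -60 is mistaken for a space; a lower limit such as -100 ignores it. If a document positions every glyph with a separate operator instead, a threshold of this kind would never be consulted, which may be why changing the value has no effect for some files.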

@davejtoews

This is an old thread, but the problem persists.

As of v0.10.0, the fix mentioned above by @huuhungus now needs to be done on line 295.

For the files I am working with, the fix mentioned by @nuttyprogrammer has no effect whatsoever.

@robneu

robneu commented Jan 13, 2018

+1 for this still being an issue. Above fix also worked in my situation.

@lmasforne

+1 for this still being an issue.
In PDFObject, line 295, comment out this line:
//$text .= ' ';
Works for me.

lmasforne added a commit to lmasforne/pdfparser that referenced this issue Jan 20, 2018
@kingafrojoe

+1, still an issue.
@huuhungus's solution worked.
I commented out lines 295, 308, and 338 inside the getText() method of class PDFObject.

@jee7
Contributor

jee7 commented Mar 16, 2019

@lmasforne Is there a reason why this committed fix is not a PR?

I'm parsing a PDF exported from Google Docs and it adds arbitrary spaces inside the words. Like this:

th e a va ila ble in fo rm atio n a nd p re se nts it in a fo rm o f a s u rv e y

While it should be like this:

the available information and presents it in a form of a survey.

It works correctly after commenting out the mentioned line 295.

@ricardobrg
Author

Wow, this is still open...

Did anybody try this fix? @nuttyprogrammer is it working?

I found a fix for this. There is a method in pdfparser-master/src/Smalot/PdfParser/Font.php on line 338:

    /**
     * @return int
     */
    protected function getFontSpaceLimit()
    {
        return -50;
    }

Decrease this value to -60, or to whatever value is more suitable for your PDF text.

Explanation:
Basically, there is a check in the code that compares the character position (in pixels) with this default font space limit, and if it is smaller, it increments the character position, which leaves a space between characters. It is a font character spacing issue: some fonts have larger character spacing, and the PDF parser is not able to adjust to it.

Although @huuhungus's fix works, it looks like a quick patch rather than a solution. I believe we should take a look at @nuttyprogrammer's solution and try to find out why the default value is -50 and not -60, and whether changing it would solve the problem. That would lead us to a solution and possibly a PR to close this issue.

@davejtoews

@ricardobrg it's been a while since I looked at this, but when I tried @nuttyprogrammer's solution, it had no effect for me. Only @huuhungus's fix worked.

@jee7
Contributor

jee7 commented May 3, 2019

It doesn't work for me either. I created a new Google Docs document with the following content:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
The quick brown fox jumps over the lazy dog

When I get the PDF from that and use the parser:

$parser = new Parser();
$pdf = $parser->parseContent($rawPDF);
dd($pdf->getText());

I get this:

"""
Lore m ip su m d olo r s it a m et, c o nse cte tu r a dip is cin g e lit , s e d d o e iu sm od te m po r in cid id unt u t\n
\n
la bore e t d olo re m agna a liq ua.\n
\n
The q uic k b ro w n fo x ju m ps o ve r t h e la zy d og
"""

I tried changing the return value of getFontSpaceLimit() to -60, -100 and 0. It did not change anything.
You can also see that, in addition to the spaces, there are too many \n symbols. What works for me is commenting out the vertical and horizontal offset rows in PDFObject.php (289, 295):

    if ((floatval($x) <= 0) ||
        ($current_position_td['y'] !== false && floatval($y) < floatval($current_position_td['y']))
    ) {
        // vertical offset
        //$text .= "\n";
    } elseif ($current_position_td['x'] !== false && floatval($x) > floatval(
            $current_position_td['x']
        )
    ) {
        // horizontal offset
        //$text .= ' ';
    }

Then I get this:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. The quick brown fox jumps over the lazy dog"

Yes, I have now also lost the real line breaks (in addition to the fake ones), but my goal is to get the word count, which is now correct. I guess I could instead keep the line breaks and just replace the double line breaks with single ones later. The line breaks are not that much of an issue, but I cannot get rid of the arbitrary spaces inside the words later on.
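
The idea of replacing the double line breaks afterwards could look roughly like the sketch below. It assumes the vertical offset line is left active and only the horizontal one is commented out; normalizeLineBreaks() is a hypothetical helper and not part of the library.

```php
<?php
// Hypothetical post-processing helper; not part of smalot/pdfparser.
function normalizeLineBreaks(string $text): string
{
    // Unify Windows and old Mac line endings first.
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    // Collapse each run of two or more line breaks into a single one.
    // preg_replace() returns null on failure, so fall back to the input.
    return preg_replace('/\n{2,}/', "\n", $text) ?? $text;
}

// Sample text with the doubled line breaks seen in the parser output.
$parsed = "labore et dolore magna aliqua.\n\nThe quick brown fox";
$clean = normalizeLineBreaks($parsed);

echo $clean, "\n";
echo str_word_count($clean), "\n"; // prints 9
```

This only repairs the line breaks. It does nothing about spaces inserted inside words, which still have to be prevented at parse time.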

@rf1234

rf1234 commented Jun 5, 2019

None of the solutions above works for me. Surprisingly I have no issues running the parser locally using XAMPP. The problem only occurs on my Linux Server. Weird.

@huuhungus

@rf1234 what are your PHP versions on the server and locally?

@rf1234

rf1234 commented Jun 10, 2019

Sorry for being so late with my response! I have a workaround running now. The workaround is to eliminate all white space from the parsed text, so that it becomes basically illegible. That doesn't matter for me because I only need it for full-text search, and nobody sees it. I also eliminate all white space from the search strings, and then the search works fine. Since I also search in texts where spaces are not eliminated because they don't get parsed, I search with a logical OR: either the string with blanks in it or the string without blanks in it needs to find a match in the texts.
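
A minimal sketch of this whitespace-insensitive search follows. The function names are hypothetical and this is not the code running on my server. Unlike the setup described above, it strips the white space at search time instead of storing the stripped text.

```php
<?php
// Hypothetical helpers illustrating the workaround; not part of smalot/pdfparser.
function stripWhitespace(string $text): string
{
    // Remove spaces, tabs and line breaks; fall back to the input if
    // preg_replace() fails (for example on invalid UTF-8).
    return preg_replace('/\s+/u', '', $text) ?? $text;
}

function matchesLoosely(string $haystack, string $needle): bool
{
    $compactNeedle = stripWhitespace($needle);
    if ($compactNeedle === '') {
        return false; // nothing to search for
    }
    // Logical OR: match the query as typed, or with all white space
    // removed from both the text and the query.
    return stripos($haystack, $needle) !== false
        || stripos(stripWhitespace($haystack), $compactNeedle) !== false;
}

var_dump(matchesLoosely('T h e q u i c k b r o w n f o x', 'quick brown')); // bool(true)
var_dump(matchesLoosely('The quick brown fox', 'quick brown'));             // bool(true)
var_dump(matchesLoosely('The quick brown fox', 'lazy dog'));                // bool(false)
```

The trade-off is that matches can span word boundaries: a query like "he quick" also matches "The quick", which is acceptable for a full-text search but not for exact phrase matching.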

Locally I run PHP 7.2.1 and on the server it is 7.3.6

@vishal0520

Hi, I am still facing this issue. Can you please help?

@huuhungus

huuhungus commented Mar 3, 2020 via email

@vishal0520

vishal0520 commented Mar 3, 2020 via email

@huuhungus

huuhungus commented Mar 4, 2020 via email

@benairs3

benairs3 commented Jul 5, 2021

The solution from @huuhungus worked for me.
The file is now at: src/Smalot/PdfParser/PDFObject.php
I commented out lines 316 and 329.
Now the parser works perfectly.

Thanks.

dpassola added a commit to dpassola/pdfparser that referenced this issue Jul 28, 2021
@eric-lukyamuzi

Guys, I am using v.0.11 and the following worked for me.

File edited: /vendor/smalot/src/PDFObject.php

In the code block below, I changed $text .= ' '; to $text .= ''; and it worked perfectly for all types of PDFs:

                case 'Td':
                    $args = preg_split('/\s/s', $command[self::COMMAND]);
                    $y    = array_pop($args);
                    $x    = array_pop($args);
                    if ((floatval($x) <= 0) ||
                        ($current_position_td['y'] !== false && floatval($y) < floatval($current_position_td['y']))
                    ) {
                        // vertical offset
                        $text .= "\n";
                    } elseif ($current_position_td['x'] !== false && floatval($x) > floatval(
                            $current_position_td['x']
                        )
                    ) {
                        // horizontal offset
                        $text .= '';
                    }

@ChrisSantiago82

The solution from @eric-lancelot worked for me. Can we integrate this into the code?

@k00ni
Collaborator

k00ni commented Jan 11, 2022

Would changes from #505 help you here?

@ChrisSantiago82

Yes, it would :)

@k00ni k00ni linked a pull request Jan 11, 2022 that will close this issue
@k00ni k00ni removed a link to a pull request Jan 17, 2022
@k00ni k00ni linked a pull request Jan 17, 2022 that will close this issue