Inserting white spaces between letters #72
Comments
Do you have solutions to work around this issue yet ? |
No. I had problems with Chinese characters too, so I changed the script. This project was meant to convert .doc and .docx files to PDF and extract the text to a variable. Now I extract the text straight from the MS Word files before converting. |
I found a fix, not sure if it's correct but it worked for me ^_^ |
I've had the same issue for a few days, since the PDFs I'm working with changed. @huuhungus' change works for me as well, but doesn't feel completely right. Could this be an Adobe PDF update? It would be great to have this fixed, as it renders parsing quite useless. I could provide example PDFs without and with this issue if that would help. |
The fix from @huuhungus works for me, too. I have PDFs which are generated via Qt 4.8.6 (content: wkhtmltopdf 0.12.1). |
The fix from @huuhungus works for me as well. |
I found a fix for this: there is a method in pdfparser-master/src/Smalot/PdfParser/Font.php, on line 338.
Decrease this value to -60, or to whatever value better suits your PDF text. Explanation: |
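A minimal sketch of why such a threshold matters, assuming the value is compared against the positioning numbers in a PDF TJ array (where negative numbers widen the gap between glyphs). The function name joinTjParts, the sample data, and the -50 default are assumptions for illustration only, not the library's actual code:

```php
<?php
// Hypothetical helper, NOT part of smalot/pdfparser.
// Joins the parts of a TJ-style array: strings are text fragments,
// numbers are horizontal adjustments (negative = wider gap).
function joinTjParts(array $parts, float $threshold = -50.0): string
{
    $text = '';
    foreach ($parts as $part) {
        if (is_string($part)) {
            $text .= $part;
        } elseif ($part < $threshold) {
            // Adjustment is more negative than the threshold:
            // treat the gap as a word break.
            $text .= ' ';
        }
        // Any other adjustment is treated as kerning and ignored.
    }
    return $text;
}

// Assumed sample data: -55 is kerning inside a word, -250 is a real word gap.
$parts = ['He', -55, 'llo', -250, 'world'];

echo joinTjParts($parts), "\n";        // prints "He llo world"
echo joinTjParts($parts, -60.0), "\n"; // prints "Hello world"
```

With this kind of logic, a more negative threshold means fewer adjustments are read as spaces, which would explain why the right value depends on how a given PDF was generated.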
This is an old thread, but the problem persists. As of v0.10.0, the fix mentioned above by @huuhungus now needs to be done on line 295. For the files I am working with the fix mentioned by @nuttyprogrammer has no effect whatsoever. |
+1 for this still being an issue. Above fix also worked in my situation. |
+1 for this still being an issue. |
+1 still an issue. |
@lmasforne Is there a reason why this committed fix is not a PR? I'm parsing a PDF exported from Google Docs and it adds arbitrary spaces inside the words. Like this:
While it should be like this:
It works correctly after commenting out the mentioned line 295. |
Wow, this is still open... Did anybody try this fix? @nuttyprogrammer, is it working?
Although @huuhungus' fix works, it looks like a quick patch rather than a solution. I believe we should take a look at @nuttyprogrammer's solution and try to find out why the default value is 50 and not 60, and whether changing it would solve the problem. That would lead us to a solution and possibly a PR to close this issue. |
@ricardobrg it's been a while since I looked at this but when I tried @nuttyprogrammer's solution it had no effect for me. Only @huuhungus's fix worked. |
Doesn't work for me either. I created a new Docs with the following content:
When I get the PDF from that and use the parser:
$parser = new Parser();
$pdf = $parser->parseContent($rawPDF);
dd($pdf->getText());
I get this:
Tried changing the code to this, with both $text .= lines commented out:
if ((floatval($x) <= 0) ||
    ($current_position_td['y'] !== false && floatval($y) < floatval($current_position_td['y']))
) {
    // vertical offset
    //$text .= "\n";
} elseif ($current_position_td['x'] !== false && floatval($x) > floatval($current_position_td['x'])) {
    // horizontal offset
    //$text .= ' ';
}
Then I get this:
Yes, I have now also lost the real line breaks (in addition to the fake ones), but my goal is to get the word count, which is now correct. Although I guess I could just replace the double line breaks with single ones later... The line breaks are not much of an issue, but I cannot get rid of the arbitrary spaces inside the words later on. |
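A rough sketch of the post-processing mentioned above, assuming the parsed text is already in a string. The function names normalizeLineBreaks and countWords are made up for this example and are not part of the parser:

```php
<?php
// Hypothetical helpers, NOT part of smalot/pdfparser.

// Collapse any run of line breaks (\r\n, \r or \n) into a single "\n".
function normalizeLineBreaks(string $text): string
{
    return preg_replace('/(?:\r\n|\r|\n)+/', "\n", $text);
}

// Count whitespace-separated words; extra spaces and line breaks are ignored.
function countWords(string $text): int
{
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return count($words);
}

$text = "Hello  world\n\n\nsecond line\n";

echo json_encode(normalizeLineBreaks($text)), "\n"; // prints "Hello  world\nsecond line\n"
echo countWords($text), "\n";                       // prints 4
```

Note that this only cleans up line breaks. It cannot repair spaces that were wrongly inserted inside words, since those look the same as real word separators.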
None of the solutions above works for me. Surprisingly I have no issues running the parser locally using XAMPP. The problem only occurs on my Linux Server. Weird. |
@rf1234 what PHP versions are you running on the server and locally? |
Sorry for being so late with my response! I have a workaround running now: I eliminate all white space from the parsed text, so that it becomes basically illegible. That doesn't matter for me because I only need it for full-text search, and nobody sees it. I also eliminate all white space from the search strings, and then the search works fine. Since I also search in texts where spaces are not eliminated (because they don't get parsed), I search with a logical OR: either the string with blanks in it or the string without blanks in it needs to find a match in the texts. Locally I run PHP 7.2.1 and on the server it is 7.3.6. |
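The workaround described above could look roughly like this. It is only a sketch: the function names stripWhitespace and matchesFullText are invented for illustration, and a real setup would probably store the stripped text in the search index rather than strip it on every query:

```php
<?php
// Hypothetical helpers, NOT part of smalot/pdfparser.

// Remove all ASCII whitespace from a string.
function stripWhitespace(string $s): string
{
    return preg_replace('/\s+/', '', $s);
}

// True if $needle occurs in $haystack as-is, OR if it occurs
// once whitespace has been removed from both sides.
function matchesFullText(string $haystack, string $needle): bool
{
    $stripped = stripWhitespace($needle);
    if ($stripped === '') {
        return false; // nothing meaningful to search for
    }
    if (stripos($haystack, $needle) !== false) {
        return true; // plain match (text with normal spacing)
    }
    return stripos(stripWhitespace($haystack), $stripped) !== false;
}

var_dump(matchesFullText('m a r i n e a u t h o r i t i e s', 'marine authorities')); // bool(true)
var_dump(matchesFullText('marine authorities in Taiwan', 'marine authorities'));      // bool(true)
var_dump(matchesFullText('m a r i n e', 'ocean'));                                    // bool(false)
```

One trade-off of matching without whitespace is that a search term can match across word boundaries, so results may include a few false positives.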
Hi, I am still facing this issue. Can you please help? |
Did you try my suggestion? It is not a pretty solution, but it worked.
|
Yes, I tried your solution. I am using this package in Laravel, but it didn't change the result for me.
|
I just tried downloading the newest version. It appears the source code has already changed.
Filename to change: PDFObject.php, line 295
Text to search:
// horizontal offset
$text .= ' '; << comment this out
|
The solution of @huuhungus worked for me. Thanks. |
Guys, I am using v0.11 and it worked for me by doing the following. File edited: /vendor/smalot/src/PDFObject.php. In the code block below I changed $text .= ' '; to $text .= ''; and it worked perfectly for all types of PDFs.
|
The solution from @eric-lancelot worked for me. Can we integrate this into the code? |
Would changes from #505 help you here? |
Yes, it would :) |
When I try to extract text from this file (http://billybala.brgweb.com.br/tmp/1435983113.pdf), it inserts white space between letters:
The code:
The output is:
Full text: C 3 9 9 0 0 9 N T O U D e a r E d i t o r : W i t h g r o w i n g p o p u l a t i o n a n d e v e r m o r e a d v a n c e d t e c h n o l o g i e s , n a t u r a l r e s o u r c e s p r o v i d e d b y l a n d c a n n o l o n g e r f u l f i l l t h e i n c r e a s i n g d e m a n d s o f t h e h u m a n p o p u l a t i o n . I n a n a t t e m p t t o e n s u r e t h e i r m a r i n e r i g h t s a n d i n t e r e s t s , m a n y c o u n t r i e s h a v e t a k e n t h e s t e p t o e s t a b l i s h c o m p e t e n t m a r i n e a u t h o r i t i e s . T h e s e a u t h o r i t i e s a r e a s s i g n e d t h e m i s s i o n t o i n t e g r a t e o c e a n p o l i c i e s , a n d h a v e t h e r e s p o n s i b i l i t y t o o v e r s e e v a r i o u s m a r i n e a f f a i r s . T h i s p a p e r g i v e s i n s i g h t i n t o t h e s c o p e o f m a r i n e a f f a i r s , a n d s u m m a r i z e s t h e p r e s e n t s t a t e o f m a r i n e a u t h o r i t y e s t a b l i s h m e n t i n a n u m b e r o f o c e a n s t a t e s i n c l u d i n g t h e U n i t e d S t a t e s , C a n a d a , C h i n a , J a p a n a n d K o r e a . I t g o e s o n t o d i s c u s s t h e h i s t o r y a n d p r o c e s s o f e s t a b l i s h i n g c o m p e t e n t m a r i n e a u t h o r i t i e s i n T a i w a n . T h e T a i w a n e s e G o v e r n m e n t h a s c o n f i r m e d t h e e s t a b l i s h m e n t o f T h e T a s k F o r c e f o r M a r i t i m e A f f a i r s ( T h e T a s k F o r c e ) . I t i s r e s p o n s i b l e f o r t h e i n t e g r a t i o n o f v a r i o u s m a r i n e a u t h o r i t i e s i n T a i w a n a n d t h e n e w l y e s t a b l i s h e d c o m p e t e n t a u t h o r i t y i s s c h e d u l e d t o c o m e i n t o o p e r a t i o n i n J a n u a r y 2 0 1 2 . 
T h e T a s k F o r c e w i l l s e r v e a s a u s e f u l s o u r c e o f r e f e r e n c e f o r m a n y s c h o l a r s w o r k i n g i n t h e m a r i n e a n d o c e a n s c i e n c e d i s c i p l i n e s . F u r t h e r m o r e , t h e i n s i g h t f u l a n a l y s i s p r e s e n t e d i n t h i s p a p e r w i l l e n a b l e y o u r r e a d e r s t o a c q u i r e a b e t t e r u n d e r s t a n d i n g o f t h e s c o p e o f m a r i n e a f f a i r s , t h e s t a t u s o f c o m p e t e n t m a r i n e a u t h o r i t i e s i n c e r t a i n c o u n t r i e s , a n d t h e h i s t o r y a n d p r o c e s s i n t h e e s t a b l i s h m e n t o f s u c h a u t h o r i t i e s i n T a i w a n . F o r t h e r e a s o n s s t a t e d , I f e e l t h i s p a p e r p r o v i d e s h i g h l y v a l u a b l e i n f o r m a t i o n s u i t a b l e f o r p u b l i c a t i o n i n y o u r j o u r n a l . S h o u l d t h e e d i t o r h a v e a n y s u g g e s t i o n o r c o m m e n t p l e a s e d o n o t h e s i t a t e t o c o n t a c t u s , w e s h a l l r e s p o n d i m m e d i a t e l y .