tesseract for Arabic Text in Python - Encoding Issue #2562

Closed
ali-cognitro opened this issue Jul 13, 2019 · 1 comment

Comments

@ali-cognitro

Hi,

I'm trying to use tesseract to recognize Arabic text in PDF files by converting the PDF pages into images and then running OCR on the images.

I'm using the following code and getting an encoding error: `UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined>`

My Code:

```python
# Import libraries
from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os

# Path of the pdf
PDF_file = "109645-36.pdf"

'''
Part #1 : Converting PDF to images
'''

# Store all the pages of the PDF in a variable
pages = convert_from_path(PDF_file, 500)

# Counter to store images of each page of PDF to image
image_counter = 1

# Iterate through all the pages stored above
for page in pages:

    # Declaring filename for each page of PDF as JPG
    # For each page, filename will be:
    # PDF page 1 -> page_1.jpg
    # PDF page 2 -> page_2.jpg
    # PDF page 3 -> page_3.jpg
    # ....
    # PDF page n -> page_n.jpg
    filename = "page_"+str(image_counter)+".jpg"

    # Save the image of the page in system
    page.save(filename, 'JPEG')

    # Increment the counter to update filename
    image_counter = image_counter + 1

'''
Part #2 - Recognizing text from the images using OCR
'''

# Variable to get count of total number of pages
filelimit = image_counter-1

# Creating a text file to write the output
outfile = "out_text_1.txt"

# Open the file in append mode so that
# all contents of all images are added to the same file
f = open(outfile, "a")

# Iterate from 1 to total number of pages
for i in range(1, filelimit + 1):

    # Set filename to recognize text from
    # Again, these files will be:
    # page_1.jpg
    # page_2.jpg
    # ....
    # page_n.jpg
    filename = "page_"+str(i)+".jpg"

    # Recognize the text as string in image using pytesseract
    text = str(((pytesseract.image_to_string(Image.open(filename),lang='ara'))))

    # The recognized text is stored in variable text
    # Any string processing may be applied on text
    # Here, basic formatting has been done:
    # In many PDFs, at line ending, if a word can't
    # be written fully, a 'hyphen' is added.
    # The rest of the word is written in the next line
    # Eg: This is a sample text this word here GeeksF-
    # orGeeks is half on first line, remaining on next.
    # To remove this, we replace every '-\n' to ''.
    text = text.replace('-\n', '')

    # Finally, write the processed text to the file.
    f.write(str(text))

# Close the file after writing all the text.
f.close()
```

And I get this error:

```
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
in ()
     39
     40 # Finally, write the processed text to the file.
---> 41 f.write(str(text))
     42
     43 # Close the file after writing all the text.

C:\ProgramData\Anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined>
```

I applied this code to the attached PDF file as a sample:
109645-36.pdf
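
For readers trying to narrow this down, the traceback points at `encodings\cp1252.py`, which suggests the failure happens when Python encodes the text for the output file, not during OCR. The following is a minimal sketch that reproduces that step without tesseract. The sample string and the helper `can_encode` are hypothetical and are not part of the original script.

```python
# Hypothetical sample; any Arabic string should behave the same way,
# because cp1252 (a Western European code page) has no Arabic letters.
sample = "مرحبا"


def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding`, else False."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


cp1252_ok = can_encode(sample, "cp1252")  # the codec named in the traceback
utf8_ok = can_encode(sample, "utf-8")     # UTF-8 covers all of Unicode

print("cp1252:", cp1252_ok)
print("utf-8:", utf8_ok)
```

If the cp1252 check fails while the UTF-8 check passes, the OCR result itself is probably fine and only the file encoding needs to change.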

@zdenop
Contributor

zdenop commented Jul 13, 2019

Please respect the guidelines for posting issues: we do not provide support for third-party projects.
And from your description, this is not even a tesseract issue but a pure Python one.
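
As a pointer for anyone landing here with the same error, here is a minimal sketch of the Python-side fix, assuming the cause is that `open(outfile, "a")` falls back to the Windows locale encoding (cp1252 in the traceback). Passing `encoding="utf-8"` to `open` makes the output file independent of the locale. The helper `append_text` and the temporary demo path are hypothetical; in the script above the same change would be `f = open(outfile, "a", encoding="utf-8")`.

```python
import os
import tempfile


def append_text(path, text):
    """Append `text` to `path` as UTF-8, regardless of the system locale."""
    # Explicit encoding avoids the locale default (e.g. cp1252 on Windows).
    with open(path, "a", encoding="utf-8") as f:
        f.write(text)


# Hypothetical demo: write an Arabic sample to a temporary file
# and read it back with the same encoding.
demo_path = os.path.join(tempfile.mkdtemp(), "out_text_1.txt")
append_text(demo_path, "مرحبا\n")

with open(demo_path, encoding="utf-8") as f:
    result = f.read()

print(result == "مرحبا\n")
```

The file then has to be read back as UTF-8 as well; an editor that assumes a legacy code page may still display the text incorrectly.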

@zdenop zdenop closed this as completed Jul 13, 2019