tesseract for Arabic Text in Python - Encoding Issue #2562

ali-cognitro · 2019-07-13T10:00:07Z

Hi,

I'm trying to use tesseract to recognize Arabic text in PDF files by converting the PDF pages into images and then into text.

I'm using the following code and getting an encoding error: "UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined>"

My Code:

`# Import libraries
from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os

# Path of the pdf
PDF_file = "109645-36.pdf"

'''
Part #1 : Converting PDF to images
'''

# Store all the pages of the PDF in a variable
pages = convert_from_path(PDF_file, 500)

# Counter to store images of each page of PDF to image
image_counter = 1

# Iterate through all the pages stored above
for page in pages:

    # Declaring filename for each page of PDF as JPG
    # For each page, filename will be:
    # PDF page 1 -> page_1.jpg
    # PDF page 2 -> page_2.jpg
    # PDF page 3 -> page_3.jpg
    # ....
    # PDF page n -> page_n.jpg
    filename = "page_"+str(image_counter)+".jpg"

    # Save the image of the page in system
    page.save(filename, 'JPEG')

    # Increment the counter to update filename
    image_counter = image_counter + 1

'''
Part #2 - Recognizing text from the images using OCR
'''

# Variable to get count of total number of pages
filelimit = image_counter-1

# Creating a text file to write the output
outfile = "out_text_1.txt"

# Open the file in append mode so that
# all contents of all images are added to the same file
f = open(outfile, "a")

# Iterate from 1 to total number of pages
for i in range(1, filelimit + 1):

    # Set filename to recognize text from
    # Again, these files will be:
    # page_1.jpg
    # page_2.jpg
    # ....
    # page_n.jpg
    filename = "page_"+str(i)+".jpg"

    # Recognize the text as string in image using pytesseract
    text = str(((pytesseract.image_to_string(Image.open(filename),lang='ara'))))

    # The recognized text is stored in variable text
    # Any string processing may be applied on text
    # Here, basic formatting has been done:
    # In many PDFs, at line ending, if a word can't
    # be written fully, a 'hyphen' is added.
    # The rest of the word is written in the next line
    # Eg: This is a sample text this word here GeeksF-
    # orGeeks is half on first line, remaining on next.
    # To remove this, we replace every '-\n' to ''.
    text = text.replace('-\n', '')

    # Finally, write the processed text to the file.
    f.write(str(text))

# Close the file after writing all the text.
f.close() `

And I get this error:

`---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
in ()
39
40 # Finally, write the processed text to the file.
---> 41 f.write(str(text))
42
43 # Close the file after writing all the text.

C:\ProgramData\Anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
17 class IncrementalEncoder(codecs.IncrementalEncoder):
18 def encode(self, input, final=False):
---> 19 return codecs.charmap_encode(input,self.errors,encoding_table)[0]
20
21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined>
`

I applied this code to the attached PDF file as a sample:
109645-36.pdf


zdenop · 2019-07-13T15:38:47Z

Please respect the guidelines for posting issues: we do not provide support for 3rd-party projects.
And from your description, this is not even a tesseract issue but a plain Python one.
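
For readers who land here with the same traceback: it points at `cp1252.py`, which suggests the output file was opened with the Windows default encoding, and that encoding has no Arabic letters. A minimal sketch of the usual Python-side remedy follows, assuming the OCR result is an ordinary `str`. The sample string and the temporary output path are stand-ins chosen for illustration, not taken from the report.

```python
import os
import tempfile

# Stand-in for OCR output; in the script above this would come from
# pytesseract.image_to_string(..., lang='ara').
text = "مرحبا بالعالم\n"

# Reproduce the failure without Tesseract: cp1252 cannot encode Arabic letters.
try:
    text.encode("cp1252")
    cp1252_failed = False
except UnicodeEncodeError:
    cp1252_failed = True

# Hypothetical output location (a temp directory, so the sketch is self-contained).
out_path = os.path.join(tempfile.mkdtemp(), "out_text_1.txt")

# Passing encoding="utf-8" makes the write independent of the platform default.
with open(out_path, "a", encoding="utf-8") as f:
    f.write(text)

# Read the file back with the same encoding to confirm the round trip.
with open(out_path, encoding="utf-8") as f:
    round_trip = f.read()
```

In the script from the report, the equivalent change would be `f = open(outfile, "a", encoding="utf-8")`. Whether that is the only adjustment needed depends on the environment, which the report does not fully describe.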

zdenop closed this as completed Jul 13, 2019
