I'm trying to use tesseract to recognize Arabic text from PDF files by converting the PDF pages into images and then into text.
I'm using the following code and getting an encoding error: "UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined>"
My Code:
`# Import libraries
from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os

# Path of the pdf
PDF_file = "109645-36.pdf"

'''
Part #1 : Converting PDF to images
'''

# Store all the pages of the PDF in a variable
pages = convert_from_path(PDF_file, 500)

# Counter to store images of each page of PDF to image
image_counter = 1

# Iterate through all the pages stored above
for page in pages:
    # Declaring filename for each page of PDF as JPG
    # For each page, filename will be:
    # PDF page 1 -> page_1.jpg
    # PDF page 2 -> page_2.jpg
    # PDF page 3 -> page_3.jpg
    # ....
    # PDF page n -> page_n.jpg
    filename = "page_"+str(image_counter)+".jpg"

    # Save the image of the page in system
    page.save(filename, 'JPEG')

    # Increment the counter to update filename
    image_counter = image_counter + 1

'''
Part #2 - Recognizing text from the images using OCR
'''

# Variable to get count of total number of pages
filelimit = image_counter-1

# Creating a text file to write the output
outfile = "out_text_1.txt"

# Open the file in append mode so that
# all contents of all images are added to the same file
f = open(outfile, "a")

# Iterate from 1 to total number of pages
for i in range(1, filelimit + 1):
    # Set filename to recognize text from
    # Again, these files will be:
    # page_1.jpg
    # page_2.jpg
    # ....
    # page_n.jpg
    filename = "page_"+str(i)+".jpg"

    # Recognize the text as string in image using pytesseract
    text = str(pytesseract.image_to_string(Image.open(filename), lang='ara'))

    # The recognized text is stored in variable text
    # Any string processing may be applied on text
    # Here, basic formatting has been done:
    # In many PDFs, at line ending, if a word can't
    # be written fully, a 'hyphen' is added.
    # The rest of the word is written in the next line
    # Eg: This is a sample text this word here GeeksF-
    # orGeeks is half on first line, remaining on next.
    # To remove this, we replace every '-\n' to ''.
    text = text.replace('-\n', '')

    # Finally, write the processed text to the file.
    f.write(str(text))

# Close the file after writing all the text.
f.close() `
And getting this error:
`---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
in ()
39
40 # Finally, write the processed text to the file.
---> 41 f.write(str(text))
42
43 # Close the file after writing all the text.
C:\ProgramData\Anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
17 class IncrementalEncoder(codecs.IncrementalEncoder):
18 def encode(self, input, final=False):
---> 19 return codecs.charmap_encode(input,self.errors,encoding_table)[0]
20
21 class IncrementalDecoder(codecs.IncrementalDecoder):
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined>
`
I applied this code to the attached sample PDF file, 109645-36.pdf
Please respect the guidelines for posting issues: we do not provide support for third-party projects.
And from your description, this is not even a tesseract issue but purely a Python one.
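To illustrate why this looks like a Python issue rather than a tesseract one: the traceback fails inside `encodings\cp1252.py`, which suggests that `open(outfile, "a")` picked up the Windows default encoding, and cp1252 has no mapping for Arabic letters. A minimal sketch of one likely fix follows, assuming the output file is meant to be UTF-8. The `sample_text` string is a hypothetical stand-in for the value returned by `pytesseract.image_to_string`, and the temporary directory is used only to keep the example self-contained.

```python
import os
import tempfile

# Hypothetical stand-in for the OCR result of an Arabic page.
sample_text = "مرحبا بالعالم\n"

# Reproduce the failure: cp1252 (the codec named in the traceback)
# cannot represent Arabic characters.
try:
    sample_text.encode("cp1252")
    cp1252_failed = False
except UnicodeEncodeError:
    cp1252_failed = True

# Possible fix: state the encoding explicitly instead of relying on
# the platform default when opening the output file.
out_path = os.path.join(tempfile.mkdtemp(), "out_text_1.txt")
with open(out_path, "a", encoding="utf-8") as f:
    f.write(sample_text)

# Read the file back with the same encoding to confirm the round trip.
with open(out_path, encoding="utf-8") as f:
    round_trip = f.read()
```

In the script from the question, the corresponding change would be `f = open(outfile, "a", encoding="utf-8")`; the rest of the loop can stay as it is.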