
How to use with languages with non-Latin characters? #219

Open
nerixim opened this issue Jan 13, 2025 · 3 comments
Labels
bug Something isn't working question Further information is requested

Comments


nerixim commented Jan 13, 2025

Hi.
When I try to process a PDF in Japanese, it produces gibberish like the following:

E -
6
) -
18 BE E# #B
BE
B
5E
A
471 64123
- 5

I've tried setting the language based on code I found inside this repo.

import os

from langchain_openai import ChatOpenAI
# NOTE: the import paths for ParseFileConfig and ParserBuilder may differ
# depending on the megaparse version you have installed
from megaparse import MegaParse
from megaparse.config import ParseFileConfig
from megaparse.parser.builder import ParserBuilder

model = ChatOpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))  # type: ignore
parser_config = ParseFileConfig(
    llm_model_name="gpt-4o",
    # method=method,
    # strategy=strategy,
    model=model,
    language="ja",
    # parsing_instruction=parsing_instruction,
)

parser_builder = ParserBuilder()
parser = parser_builder.build(parser_config)
megaparse = MegaParse(parser)
response = megaparse.load("./document.pdf")
print(response)
megaparse.save("./document.md")

probably related to #92

dosubot bot commented Jan 13, 2025

Hello @nerixim! I'm here to help you with any bugs, questions, or contributions. Let's work together to solve the issue you're facing.

The issue you're experiencing with processing Japanese PDFs in MegaParse might be related to the library's handling of non-Latin characters. The DoctrParser class in the MegaParse codebase is responsible for text extraction from PDFs and includes language-specific configurations through the TextDetConfig and TextRecoConfig classes [1]. However, it seems that there might be limitations or issues with the current support for non-Latin scripts, as highlighted in the related GitHub issue you mentioned.

To address this, you might want to ensure that the language settings in your configuration are correctly set to Japanese. Additionally, check if there are any updates or patches in the repository that address this issue. If the problem persists, it might be worth exploring alternative text extraction libraries that have robust support for Japanese and other non-Latin scripts, or contributing to the discussion in the GitHub issue to see if there are any workarounds or planned improvements.


@dosubot dosubot bot added bug Something isn't working labels Jan 13, 2025
@chloedia
Collaborator

Hi @nerixim, are you using the PyPI Python package megaparse or the cloned repository? (The current codebase is not yet released, so there are a lot of changes.) Using MegaParse with the UnstructuredParser should work with the PyPI package!

@dosubot dosubot bot added the question Further information is requested label Jan 14, 2025

ISKTG commented Jan 27, 2025

I have a similar problem; I get this error message:
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 327: ordinal not in range(256)
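A minimal reproduction of this error, independent of megaparse, shows the crash comes from encoding text containing the Euro sign (U+20AC) as Latin-1, not from PDF parsing as such:

```python
# The Euro sign (U+20AC) has no representation in the Latin-1 character
# set, so encoding it raises UnicodeEncodeError; UTF-8 handles it fine.
text = "Preis: 100 \u20ac"

try:
    text.encode("latin-1")
except UnicodeEncodeError as e:
    print(e)  # 'latin-1' codec can't encode character '\u20ac' ...

# Encoding as UTF-8 round-trips without loss, which is why opening the
# output file with encoding="utf-8" works around the crash.
assert text.encode("utf-8").decode("utf-8") == text
```

On Windows, `open()` without an explicit encoding defaults to a legacy locale code page rather than UTF-8, which may be where the `latin-1` default comes from here (an assumption, not verified against the megaparse source).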

I have tried both the PyPI package and the cloned repository, and both ended in the same error. This is the script I am using with the PyPI package currently:

from megaparse import MegaParse
from langchain_openai import ChatOpenAI
from megaparse.parser.unstructured_parser import UnstructuredParser

parser = UnstructuredParser()
megaparse = MegaParse(parser)
response = megaparse.load("./example.pdf")
megaparse.save("./output.md")

The example.pdf I am using is in German and contains a table and some images (I don't want to parse the images, only text and tables).

By substituting

megaparse.save("./output.md")

with

with open("./output.md", "w", encoding="utf-8") as f:
    f.write(megaparse.last_parsed_document)

I get an output at least; however, the quality is relatively low. For example, ü is interpreted as ii, and there are other issues like that. So my question still stands: is there a better, cleaner way to solve this so that the parser can recognize and save German letters?
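One possible partial fix, offered as an assumption rather than a confirmed cause: if the extractor emits umlauts in decomposed form (a base letter plus a combining diaeresis), some tools display or match them incorrectly, and Unicode NFC normalization recombines them into single characters. This will not help if the OCR step genuinely produced the ASCII letters ii:

```python
import unicodedata

# "u" + COMBINING DIAERESIS (U+0308) renders like "ü" but is two code
# points; NFC normalization merges them into the single character U+00FC.
decomposed = "Gru\u0308n"  # displays as "Grün" but has length 5
combined = unicodedata.normalize("NFC", decomposed)

print(combined == "Gr\u00fcn")         # True
print(len(decomposed), len(combined))  # 5 4
```

Applying this normalization to `megaparse.last_parsed_document` before writing the file costs nothing and rules out one class of umlaut corruption.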

3 participants