
How to use with languages with non-Latin characters? #219

Open
nerixim opened this issue Jan 13, 2025 · 3 comments
Labels
bug Something isn't working question Further information is requested

Comments


nerixim commented Jan 13, 2025

Hi.
When I try to process a PDF in Japanese, it produces gibberish like the following:

E -
6
) -
18 BE E# #B
BE
B
5E
A
471 64123
- 5

I've tried setting the language based on code I found inside this repo.

import os

from langchain_openai import ChatOpenAI
# NOTE: the import paths for ParseFileConfig and ParserBuilder may differ
# depending on the megaparse version you have installed
from megaparse import MegaParse
from megaparse.config import ParseFileConfig
from megaparse.parser.builder import ParserBuilder

model = ChatOpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))  # type: ignore
parser_config = ParseFileConfig(
    llm_model_name="gpt-4o",
    # method=method,
    # strategy=strategy,
    model=model,
    language="ja",
    # parsing_instruction=parsing_instruction,
)

parser_builder = ParserBuilder()
parser = parser_builder.build(parser_config)
megaparse = MegaParse(parser)
response = megaparse.load("./document.pdf")
print(response)
megaparse.save("./document.md")

probably related to #92

dosubot bot commented Jan 13, 2025

Hello @nerixim! I'm here to help you with any bugs, questions, or contributions. Let's work together to solve the issue you're facing.

The issue you're experiencing with processing Japanese PDFs in MegaParse might be related to the library's handling of non-Latin characters. The DoctrParser class in the MegaParse codebase is responsible for text extraction from PDFs and includes language-specific configurations through the TextDetConfig and TextRecoConfig classes [1]. However, it seems that there might be limitations or issues with the current support for non-Latin scripts, as highlighted in the related GitHub issue you mentioned.

To address this, you might want to ensure that the language settings in your configuration are correctly set to Japanese. Additionally, check if there are any updates or patches in the repository that address this issue. If the problem persists, it might be worth exploring alternative text extraction libraries that have robust support for Japanese and other non-Latin scripts, or contributing to the discussion in the GitHub issue to see if there are any workarounds or planned improvements.


@dosubot dosubot bot added bug Something isn't working labels Jan 13, 2025
@chloedia
Collaborator

Hi @nerixim, are you using the PyPI Python package megaparse or the cloned repository? (The current codebase is not yet released, so there are a lot of changes.) Using MegaParse with the UnstructuredParser should work with the PyPI package!

@dosubot dosubot bot added the question Further information is requested label Jan 14, 2025

ISKTG commented Jan 27, 2025

I have a similar problem; I get this error message:
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 327: ordinal not in range(256)
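A minimal reproduction of this error, independent of megaparse, shows the crash comes from encoding text containing the Euro sign (U+20AC) as Latin-1, not from PDF parsing as such:

```python
# The Euro sign (U+20AC) has no representation in the Latin-1 character
# set, so encoding it raises UnicodeEncodeError; UTF-8 handles it fine.
text = "Preis: 100 \u20ac"

try:
    text.encode("latin-1")
except UnicodeEncodeError as e:
    print(e)  # 'latin-1' codec can't encode character '\u20ac' ...

# Encoding as UTF-8 round-trips without loss, which is why opening the
# output file with encoding="utf-8" works around the crash.
assert text.encode("utf-8").decode("utf-8") == text
```

On Windows, `open()` without an explicit encoding defaults to a legacy locale code page rather than UTF-8, which may be where the `latin-1` default comes from here (an assumption, not verified against the megaparse source).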

I have tried both the PyPI package and the cloned repository, and both ended in the same error. This is the script I am using with the PyPI package currently:

from megaparse import MegaParse
from langchain_openai import ChatOpenAI
from megaparse.parser.unstructured_parser import UnstructuredParser

parser = UnstructuredParser()
megaparse = MegaParse(parser)
response = megaparse.load("./example.pdf")
megaparse.save("./output.md")

The example.pdf I am using is in German and contains a table and some images (I don't want to parse the images, only text and tables).

By substituting

megaparse.save("./output.md")

with

with open("./output.md", "w", encoding="utf-8") as f:
    f.write(megaparse.last_parsed_document)

I get an output at least; however, the quality is relatively low. For example, ü is interpreted as ii, and there are other issues like that. So my question still stands: is there a better, cleaner way to solve this so that the parser can recognize and save German letters?
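One possible partial fix, offered as an assumption rather than a confirmed cause: if the extractor emits umlauts in decomposed form (a base letter plus a combining diaeresis), some tools display or match them incorrectly, and Unicode NFC normalization recombines them into single characters. This will not help if the OCR step genuinely produced the ASCII letters ii:

```python
import unicodedata

# "u" + COMBINING DIAERESIS (U+0308) renders like "ü" but is two code
# points; NFC normalization merges them into the single character U+00FC.
decomposed = "Gru\u0308n"  # displays as "Grün" but has length 5
combined = unicodedata.normalize("NFC", decomposed)

print(combined == "Gr\u00fcn")         # True
print(len(decomposed), len(combined))  # 5 4
```

Applying this normalization to `megaparse.last_parsed_document` before writing the file costs nothing and rules out one class of umlaut corruption.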

3 participants