How to use with languages with non-Latin characters? #219
Comments
Hello @nerixim! I'm here to help you with any bugs, questions, or contributions. Let's work together to solve the issue you're facing. The issue you're experiencing with processing Japanese PDFs in MegaParse might be related to the library's handling of non-Latin characters. To address this, you might want to ensure that the language settings in your configuration are correctly set to Japanese. Additionally, check whether there are any updates or patches in the repository that address this issue. If the problem persists, it might be worth exploring alternative text extraction libraries that have robust support for Japanese and other non-Latin scripts, or contributing to the discussion in the GitHub issue to see if there are any workarounds or planned improvements.
Hi @nerixim, are you using the PyPI Python package megaparse or the cloned repository? (The current codebase is not yet released, so there are a lot of changes.) Using MegaParse with the Unstructured Parser should work with the PyPI package!
I have a similar problem. I get this error message: I have tried both the PyPI package and the cloned repository, and both ended in the same error. This is the script I am using with the PyPI package currently:
The example.pdf I am using is in German and contains a table and some images (I don't want to parse the images, only text and tables). By substituting
I get an output at least; however, the quality is relatively low. For example, ü is interpreted as ii, and other things like that. So my question still stands: is there a cleaner way to solve this, so that the parser can recognize and preserve German letters?
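A symptom like ü coming out as ii often points to an OCR step running with an English-only model. The thread does not confirm which OCR backend is involved, so the following is only a minimal sketch under the assumption that it is Tesseract, which expects three-letter language names joined with `+` (for example `deu+eng`). The helper name `tesseract_lang_string` and the mapping table are hypothetical and are not part of MegaParse.

```python
# Hypothetical helper, not part of MegaParse: translate ISO 639-1 codes
# into the three-letter language names that Tesseract uses.
ISO_TO_TESSERACT = {
    "en": "eng",
    "de": "deu",
    "ja": "jpn",
}


def tesseract_lang_string(codes):
    """Build a Tesseract language string such as 'deu+eng' from ISO 639-1 codes."""
    names = []
    for code in codes:
        try:
            names.append(ISO_TO_TESSERACT[code.lower()])
        except KeyError:
            # Fail loudly instead of silently falling back to English.
            raise ValueError(f"no Tesseract mapping known for {code!r}") from None
    return "+".join(names)
```

If the OCR is done through pytesseract, the resulting string could be passed as the `lang` argument of `pytesseract.image_to_string`. The matching language data also has to be installed on the machine; whether MegaParse forwards such a setting to its parser is not established in this thread.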
Hi.
When I try to process a PDF in Japanese, it produces gibberish like the following.
I've tried setting the language based on code I found inside this repo.
Probably related to #92.
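To compare parser settings on the same Japanese PDF, it can help to put a rough number on how much of the output is actually Japanese. The following is a small sketch using only the standard library. The function name `japanese_ratio` is hypothetical, and it only counts Hiragana, Katakana, and the main CJK Unified Ideographs block, so punctuation and Latin text lower the score.

```python
def japanese_ratio(text):
    """Return the fraction of non-whitespace characters that are kana or CJK ideographs."""

    def is_japanese(ch):
        cp = ord(ch)
        return (
            0x3040 <= cp <= 0x309F      # Hiragana
            or 0x30A0 <= cp <= 0x30FF   # Katakana
            or 0x4E00 <= cp <= 0x9FFF   # CJK Unified Ideographs
        )

    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_japanese(ch) for ch in chars) / len(chars)
```

A value close to zero on a document known to be Japanese suggests the text layer or OCR language is wrong, rather than a layout or table-parsing problem.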