[BUG] Encoding still being overridden even after fix to #371. #377
Comments
@xuxoramos Thanks for reporting this. Can you paste the actual code and the full error message without trimming? I can't reproduce the error on my end. Also, can you tell me how you installed tabula-py? Please share me
Here is my result: I tried to parse the PDF you provided, and no error occurred.
>>> import tabula
>>> tabula.read_pdf("tmp.pdf", java_options="-Dfile.encoding=ISO-8859-1", pandas_options={"encoding":"ISO-8859-1"}, encoding="ISO-8859-1", pages="all")
[ Activity
0 Activity Code Name and Definition Code
1 002 Self-Service AJCC Employment and Workforce...
2 NaN
3 This activity is system generated when an indi...
4 workforce information available in CalJOBS. Wo...
5 as: local performance, availability of support...
6 compensation, and performance and program cost...
7 NaN
...snip...
This is the entire error output:
I installed
Hmm, that sounds weird. As far as I can see, conda-forge's latest version is still v2.7.0: https://anaconda.org/conda-forge/tabula-py
Anyway, your log shows that you are using the subprocess, not jpype. Hence, #371 is unrelated because it is a jpype-related issue. Also, I tried Jupyter and IPython on my Windows machine, but I can't reproduce the issue.
In [1]: import tabula
...:
...: tabula.read_pdf("tmp.pdf", pages="all", encoding="windows-1252", pandas_options={"encoding":"windows-1252"}, java_options=["-Dfile.encoding=windows-1252"])
Error importing jpype dependencies. Fallback to subprocess.
No module named 'jpype'
Out[1]:
[ Activity
0 Activity Code Name and Definition Code
1 002 Self-Service AJCC Employment and Workforce...
2 NaN
...snip...
Does it happen just after launching Jupyter/IPython? I guess you changed the
This suggests that, after jpype support was added to tabula-py, tabula doesn't allow the change of
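For anyone checking which path their own environment takes, here is a minimal sketch. It only tests whether the `jpype` module can be found, which is what the "No module named 'jpype'" line in the log above points at. The helper name `jpype_available` is made up for this example, and tabula-py may fall back to the subprocess for other reasons too, so treat the result as a hint rather than a guarantee.

```python
import importlib.util


def jpype_available() -> bool:
    """Return True if the 'jpype' module can be located (without importing it)."""
    return importlib.util.find_spec("jpype") is not None


# Assumption: when jpype cannot be found, tabula-py falls back to the
# subprocess, as the "Fallback to subprocess" message in the log suggests.
backend = "jpype" if jpype_available() else "subprocess"
print(f"Likely backend in this environment: {backend}")
```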
I made a potential mitigation in #378. Please try the code on the master branch and give me feedback, if any.
Summary
I am still having issues with encoding even after the fix for #371. Passing the "latin-1", "cp1252" and "ISO-8859-1" encoding options to all three of Java, tabula-py and pandas still returns an error saying the UTF-8 codec is unable to decode a byte.
Did you read the FAQ?
Did you search GitHub Discussions?
(Optional) PDF URL
https://edd.ca.gov/siteassets/files/jobs_and_training/pubs/wsd19-06att1.pdf
About your environment
What did you do when you faced the problem?
I looked at the source code, and the fix for #371 is still there. I passed the "ISO-8859-1", "latin-1", "cp1252" and "windows-1252" encoding options to all three of Java, pandas and tabula-py, both separately and all together, as follows:
tabula.read_pdf("../path/to.pdf", java_options="-Dfile.encoding=ISO-8859-1", pandas_options={"encoding":"ISO-8859-1"}, encoding="ISO-8859-1")
Code
tabula.read_pdf("../path/to.pdf", java_options="-Dfile.encoding=ISO-8859-1", pandas_options={"encoding":"ISO-8859-1"}, encoding="ISO-8859-1")
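The call above repeats the same encoding name in three places. As a rough sketch, a small helper can keep the three values in sync while trying different encodings. The name `encoding_kwargs` is hypothetical and not part of tabula-py; only the `encoding`, `pandas_options` and `java_options` arguments come from the calls quoted in this thread.

```python
def encoding_kwargs(encoding: str) -> dict:
    """Build keyword arguments that set one encoding for Java, pandas and tabula-py.

    Hypothetical helper for illustration; it only assembles a dict.
    """
    return {
        "encoding": encoding,
        "pandas_options": {"encoding": encoding},
        "java_options": [f"-Dfile.encoding={encoding}"],
    }


kwargs = encoding_kwargs("ISO-8859-1")

# Usage (requires tabula-py and Java, so it is left commented out here):
# import tabula
# tables = tabula.read_pdf("../path/to.pdf", pages="all", **kwargs)
```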
Expected behavior
Obtain all tables in the PDF
Actual behavior
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position N: invalid start byte
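To illustrate what this error means, here is a small standalone sketch that does not involve tabula-py at all. Byte 0x96 is the en dash in Windows-1252, but on its own it is not a valid UTF-8 start byte, so the error indicates that some step is still decoding the output as UTF-8. The sample bytes and the `try_decode` helper are made up for this example and are not taken from the PDF.

```python
# Hypothetical sample containing byte 0x96 (en dash in Windows-1252).
raw = b"Self-Service \x96 AJCC"


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a short error description if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"error: {exc.reason}"


utf8_result = try_decode(raw, "utf-8")         # fails: invalid start byte
cp1252_result = try_decode(raw, "cp1252")      # 0x96 becomes U+2013 (en dash)
latin1_result = try_decode(raw, "iso-8859-1")  # 0x96 becomes control char U+0096

print(repr(utf8_result))
print(repr(cp1252_result))
print(repr(latin1_result))
```

Note that ISO-8859-1 decodes the byte without an error but maps it to a control character, so "cp1252" or "windows-1252" is likely the closer match if the PDF text contains en dashes.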
Related issues
#371