Noncompliant JSON files tripping up Windows #3779

Open
khemarato opened this issue Nov 7, 2024 · 4 comments
@khemarato commented Nov 7, 2024

Charles Patton is having some trouble with the BilaraIO scripts parsing the JSON on his machine. I took the liberty of running jsonlint across all the JSON files in the repository, and indeed a few of them are not standards-compliant, namely (at the time of this writing):

./variant/lzh/sct/sutta/ma/ma5_variant-lzh-sct.json
./variant/lzh/sct/sutta/ma/ma200_variant-lzh-sct.json
./variant/lzh/sct/sutta/ma/ma7_variant-lzh-sct.json
./variant/lzh/sct/sutta/ma/ma31_variant-lzh-sct.json
./.scripts/bilara-html-tsv/package-lock.json
./root/lzh/sct/sutta/ma/ma7_root-lzh-sct.json
./root/lzh/sct/sutta/ma/ma31_root-lzh-sct.json
./root/lzh/sct/sutta/ma/ma5_root-lzh-sct.json
./root/lzh/sct/sutta/ma/ma8_root-lzh-sct.json
./root/lzh/sct/sutta/ma/ma9_root-lzh-sct.json
./.helpers/gaiji/gaiji.json
./html/pli/ms/sutta/sn/sn46/sn46.53_html.json
./html/pli/ms/sutta/an/an7/an7.63_html.json

In particular, non-BMP characters (such as U+28114) seem to be causing the default json parser on Windows to choke during import.
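For reference, the JSON spec gives a non-BMP character like U+28114 two valid representations: the raw UTF-8 character, or an escaped UTF-16 surrogate pair. A minimal sketch with the stdlib parser (this illustrates what compliant encoding looks like, not the specific defect in the files listed above):

```python
import json

# U+28114 lies outside the Basic Multilingual Plane. Per the JSON spec it
# may appear either as a raw UTF-8 character or as an escaped surrogate
# pair (\ud860\udd14). Both forms are standards-compliant and equivalent:
escaped = '{"text": "\\ud860\\udd14"}'  # surrogate-pair escape
literal = '{"text": "\U00028114"}'      # raw non-BMP character

assert json.loads(escaped) == json.loads(literal)
```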

There are a number of possible solutions here. The simplest would be to replace the standard json library with something more robust (ujson?). @winterdharma - can you try that and let us know if it solves your issue?

Failing that, we'll have to figure out the best way to make these json files compliant (escaping non-BMP characters?)
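If we go the escaping route, one sketch of the idea (the helper name and in-place rewrite are illustrative, not an existing script) is to round-trip each file through `json.dump` with `ensure_ascii=True`, which rewrites every non-ASCII character, including non-BMP ones, as `\uXXXX` escapes:

```python
import json, os, tempfile

def make_ascii_safe(path):
    # Hypothetical helper: load the file as UTF-8, then rewrite it with
    # every non-ASCII character escaped, so even parsers that mishandle
    # raw non-BMP characters can read it.
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=True, indent=2)

# Demo on a throwaway file containing the problem character U+28114:
demo = os.path.join(tempfile.mkdtemp(), "ma5_demo.json")
with open(demo, "w", encoding="utf-8") as f:
    json.dump({"ma5:1.1": "\U00028114"}, f, ensure_ascii=False)

make_ascii_safe(demo)
raw = open(demo, "rb").read()  # now pure ASCII; U+28114 survives as \ud860\udd14
```

The downside is diff noise: every non-ASCII character in the repo's text would become an escape sequence, so it may be preferable to apply this only to the files a picky parser actually rejects.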

@winterdharma (Collaborator)

Hello.

So. I attempted to implement @khemarato's suggestion. Unfortunately, a series of technical difficulties ensued when I installed ujson. pip asked me to upgrade it, but upgrading broke my Python install. I then spent an hour reinstalling Python, etc. Now, however, I haven't been able to get the Python script's output to go to VS Code's terminal window; instead it runs in a command window. So ... I don't know what the crash report is, but the export.py script is still choking on the Chinese texts in Bilara when I attempt to create a .tsv file to test. I can export a Pali text like MN, but not MA.

So, I believe we have a problem with those dreaded multi-byte Unicode characters that represent rare Chinese characters. But it's just a hunch.

@winterdharma (Collaborator) commented Nov 12, 2024

Okay, I realized I could run it in a Windows command prompt. So here is the new crash report:

Traceback (most recent call last):
  File "C:\Users\cdpat\Desktop\SuttaCentral\bilara-data\.scripts\bilara-io\common.py", line 78, in json_load
    return json.load(f)
ujson.JSONDecodeError: Expected object or value

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\cdpat\Desktop\SuttaCentral\bilara-data\.scripts\bilara-io\sheet_export.py", line 54, in <module>
    save_sheet(rows, args.out)
  File "C:\Users\cdpat\Desktop\SuttaCentral\bilara-data\.scripts\bilara-io\sheet_export.py", line 29, in save_sheet
    writer.writerows(rows)
  File "C:\Users\cdpat\Desktop\SuttaCentral\bilara-data\.scripts\bilara-io\get_data.py", line 32, in yield_rows
    file_data = json_load(file)
  File "C:\Users\cdpat\Desktop\SuttaCentral\bilara-data\.scripts\bilara-io\common.py", line 79, in json_load
    except json.decoder.JSONDecodeError as e:
AttributeError: module 'ujson' has no attribute 'decoder'. Did you mean: 'decode'?
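The AttributeError at the end is separate from the parse failure: the `except` clause in common.py references `json.decoder.JSONDecodeError`, a stdlib-specific module path that ujson doesn't have. Since both libraries' decode errors subclass `ValueError`, one hedged fix (a sketch of what `json_load` could look like, not the actual file contents) is to catch that instead:

```python
import json  # swap in `import ujson as json` once installed; the handler below works with both

def json_load(f):
    # Sketch: `json.JSONDecodeError` and `ujson.JSONDecodeError` both
    # subclass ValueError, so this clause is portable across the two
    # libraries, unlike `json.decoder.JSONDecodeError`.
    try:
        return json.load(f)
    except ValueError as e:
        raise SystemExit(f"Could not parse {getattr(f, 'name', '<stream>')}: {e}")
```

Note this only fixes the error reporting; the underlying "Expected object or value" parse failure on the MA files would still need to be diagnosed.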

@winterdharma (Collaborator) commented Nov 12, 2024

Just as some information for anyone trying to debug this:

Rare Chinese characters have been added to Unicode outside the Basic Multilingual Plane, beyond the range that fits in the shorter encodings. In UTF-8, ordinary CJK characters encode to three bytes each, while these rarer supplementary-plane characters take four. So a Chinese UTF-8 text ends up mostly three-byte sequences with a few four-byte sequences sprinkled in (for fun, I guess!).

Code that assumes a fixed number of bytes per character DOES NOT know this about UTF-8. A naive iterator happily takes its fixed byte count, decodes it, and moves on. When it hits a longer sequence, it consumes the first bytes as one character, then grabs the leftover byte together with the start of the next character, and from then on every read lands on the wrong byte boundary. The result is a bad read of the string, and garbled gobbledegook if it writes the result out.

So, for this script, I would assume it's iterating over the tab- or comma-separated strings looking for the separator character. A four-byte character could mess all of that up if the iteration works at the byte level.

I'm not sure if this is accounted for in the latest version of Python, but ten years ago I had to write a special function that watched for these longer sequences as it iterated over a Chinese Buddhist UTF-8 string. The nice thing was that their lead byte served as a flag that said, "Hey, I'm special!" But anyone writing code for this problem should check the Unicode extensions to confirm that, since more than one lead-byte value may be in use.
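A quick way to test the hunch (a sketch, assuming Python 3; the byte counts follow from the UTF-8 encoding rules). Notably, Python 3 iterates strings by code point, not by byte, so any misread would have to come from byte-level code, not from ordinary string splitting:

```python
# Byte lengths in UTF-8: ordinary BMP CJK characters take three bytes,
# supplementary-plane characters like U+28114 take four, and the lead
# byte (0xF0-0xF4) is the "Hey, I'm special!" flag:
common = "中"            # BMP CJK character
rare = "\U00028114"      # supplementary-plane character (U+28114)

assert len(common.encode("utf-8")) == 3
assert len(rare.encode("utf-8")) == 4
assert rare.encode("utf-8")[0] == 0xF0

# Once a row has been decoded to str, splitting on a tab cannot land in
# the middle of a character, because iteration is per code point:
row = "ma1:1.1\t" + rare + "看\tcomment"
assert row.split("\t")[1] == rare + "看"
```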

What I'm considering as a solution is to write a script that creates the JSON files for Bilara directly from my local DB. Edit: But before I do that, I'll try a workaround of removing the rare characters to see if bilara-io can process the files, then add them back in manually.

@winterdharma (Collaborator) commented Nov 13, 2024

My experiment to test that theory has so far only shown that Chinese Unicode characters appear to be involved. I've tried exporting and importing MA 1 as a .tsv file, and it fails on the import with an encoding error:

Traceback (most recent call last):
  File "C:\Users\cdpat\Desktop\SuttaCentral\bilara-data\.scripts\bilara-io\sheet_import.py", line 214, in <module>
    json.dump(merged_data, f, ensure_ascii=False, indent=2)
  File "C:\Users\cdpat\AppData\Local\Programs\Python\Python313\Lib\json\__init__.py", line 180, in dump
    fp.write(chunk)
  File "C:\Users\cdpat\AppData\Local\Programs\Python\Python313\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1-4: character maps to <undefined>

Update: When I try to export and import MN 1, I get the same crash. So I take back everything I've speculated about the Chinese text. It's probably just a Windows-related issue.
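That reading fits the traceback: the cp1252 frame means the output file was opened without an explicit encoding, so on Windows Python fell back to the legacy locale codec, which can't represent CJK characters or Pali diacritics, which would explain why MN 1 crashes the same way as MA 1. A hedged sketch of the likely fix (illustrative data and filename; the real change would go where sheet_import.py opens its output file):

```python
import json

data = {
    "mn1:1.1": "mūlapariyāyasutta",  # Pali diacritics: also outside cp1252
    "ma1:1.1": "\U00028114",         # rare CJK character
}

# Passing encoding="utf-8" explicitly keeps Windows from defaulting to
# cp1252, so json.dump(..., ensure_ascii=False) can write any character:
with open("out_demo.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
```

Alternatively, setting the environment variable `PYTHONUTF8=1` (Python 3.7+) forces UTF-8 as the default for file I/O on Windows without touching the scripts.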
