Noncompliant JSON files tripping up Windows #3779
Hello. So, I attempted to implement @khemarato's suggestion. Unfortunately, a series of technical difficulties ensued when I installed ujson. Pip asked me to upgrade it, but upgrading it broke my Python install, and I then spent an hour reinstalling Python. Now, however, I haven't been able to get the Python script's output to go to VS Code's terminal window; instead it runs in a command window. So I don't know what the crash report is, but the export.py script is still choking on the Chinese texts in Bilara when I attempt to create a .tsv file to test. I can export a Pali text like MN, but not MA. So, I believe we have a problem with those dreaded triple-byte Unicode characters that represent rare Chinese characters. But it's just a hunch.
Okay, I realized I could run it in a Windows command prompt. So here is the new crash report:
Just as some information for anyone trying to debug this: rare Chinese characters have been added to Unicode that spill over the normal limit of characters that can be encoded in two bytes of data. These characters have an extra byte added at their start to make a three-byte character. When they are added to UTF-8 text, we end up with mostly double-byte characters with a few triple-byte characters sprinkled in (for fun, I guess!).

Normal string iterators DO NOT know this about UTF-8. They happily take two bytes for every character, decode them, and then process them. When they hit a triple-byte character, they do this: the iterator takes the first two bytes, decodes them, and processes them as a character. Then it takes the third byte of the triple-byte character and the first byte of the next character, decodes those, and processes them as a character. After that, the iterator keeps taking the wrong bytes as it passes over the string. The result is a bad read of the string, and garbled gobbledegook if it writes the result out.

So, for this script, I would assume it's iterating over the tab- or comma-separated strings looking for the separating character. A triple-byte character will mess all of that up. I'm not sure whether this is accounted for in the latest version of Python, but ten years ago I had to write a special function that watched for triple-byte characters as it iterated over a Chinese Buddhist UTF-8 string. The nice thing about it was that the first byte was the same for these characters. It served as a flag that said, "Hey, I'm special!" But anyone trying to write code for this problem will need to investigate the Unicode extensions to make sure that that's true. For all I know, there's more than one of these leading bytes in use.

What I'm considering as a solution is to write a script that creates the JSON files for Bilara directly from my local DB.

Edit: But before I do that, I'll try a workaround of removing the rare characters and see if bilara-io can process the files, then add them back in manually.
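For reference, a minimal Python 3 sketch (with a hypothetical sample string) that flags characters above U+FFFF. In Python 3 a str is a sequence of code points, so such a character counts as a single character in the string, though it takes four bytes in UTF-8 and a surrogate pair in UTF-16:

```python
# Sketch: flag characters outside the Basic Multilingual Plane (code
# points above U+FFFF), the kind jsonlint flagged (e.g. U+28114).
def find_non_bmp(text: str):
    """Return (index, code point) for each character above U+FFFF."""
    return [(i, hex(ord(ch))) for i, ch in enumerate(text) if ord(ch) > 0xFFFF]

sample = "法" + chr(0x28114) + "藏"          # hypothetical test string
print(find_non_bmp(sample))                   # [(1, '0x28114')]
print(len(chr(0x28114).encode("utf-8")))      # 4 bytes in UTF-8
print(len(chr(0x28114).encode("utf-16-le")))  # 4 bytes (one surrogate pair) in UTF-16
```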
My experiment to see whether the issue is what I think it is has only shown that the Chinese Unicode characters appear to be the problem. I've tried exporting and importing MA 1 as a .tsv file, and it fails on the import with an encoding error:
Update: When I try to export and import MN 1, I get the same crash. So, I take back everything I've speculated about the Chinese text. It's probably just a Windows-related issue.
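One possible Windows-specific culprit: Python's open() falls back to the system's ANSI code page (such as cp1252) when no encoding is given, and that code page cannot represent CJK characters, which produces exactly this kind of UnicodeEncodeError/UnicodeDecodeError. A minimal sketch, assuming the export/import scripts read and write files without an explicit encoding (the file names here are hypothetical):

```python
import json

# Passing encoding="utf-8" makes file I/O behave the same on Windows and
# Linux; without it, Windows uses the locale code page, which cannot
# encode CJK text.
with open("mn1_translation.json", encoding="utf-8") as f:   # hypothetical path
    data = json.load(f)

with open("mn1_export.tsv", "w", encoding="utf-8", newline="") as f:
    for key, value in data.items():
        f.write(f"{key}\t{value}\n")
```

Alternatively, running the scripts with `python -X utf8` (or setting the PYTHONUTF8=1 environment variable) enables Python's UTF-8 mode globally without touching the code.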
Charles Patton is having some trouble with the BilaraIO scripts parsing the JSON on his machine. I took the liberty of running `jsonlint` across all the JSON files in the repository, and indeed a few of them are not standards compliant, namely (at the time of this writing):

In particular, non-BMP characters (such as U+28114) seem to be causing the default Windows `import json` parser to choke.

There are a number of possible solutions here. The simplest would be to replace the standard json library with something more robust (ujson?). @winterdharma - can you try that and let us know if it solves your issue?
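For what it's worth, a minimal sketch of that swap, assuming ujson's basic load/dump API (which mirrors the standard library for this use) and a hypothetical file path:

```python
# Prefer ujson when it's installed; fall back to the standard library
# so the scripts still run where it isn't.
try:
    import ujson as json
except ImportError:
    import json

with open("some_bilara_file.json", encoding="utf-8") as f:  # hypothetical path
    data = json.load(f)
print(len(data), "segments loaded")
```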
Failing that, we'll have to figure out the best way to make these JSON files compliant (escaping the non-BMP characters?).
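If escaping turns out to be the route, a minimal sketch of re-serializing a file so every non-ASCII character, including the non-BMP ones, becomes a \uXXXX escape (the path is hypothetical):

```python
import json

# ensure_ascii=True (the default) escapes non-ASCII characters; non-BMP
# ones come out as surrogate-pair escapes, e.g. U+28114 -> "\ud860\udd14",
# which any compliant JSON parser can read back.
path = "ma1_root.json"  # hypothetical path
with open(path, encoding="utf-8") as f:
    data = json.load(f)
with open(path, "w", encoding="utf-8", newline="\n") as f:
    json.dump(data, f, ensure_ascii=True, indent=2)
```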