UTF-8 with BOM and UTF-16 LE encoded files produce read error #13
I had forgotten about the prevalence of UTF-16 on Windows, although I can't say I've ever seen UTF-8 files with a BOM. I suspect this would be relatively easy to detect and support. Out of curiosity, do you happen to know what utilities generated the UTF-8 w/ BOM files?
Turns out it's invalid to include a BOM in a JSON document. I'll make this case fail with a better error message, but it doesn't make sense for a JSON validator to accept invalid JSON 😉. As for the other UTF-* encodings, that post offers a simple approach to detect them based on the first few bytes.
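The byte-sniffing approach mentioned above can be sketched in Go (yajsv's implementation language). The function name and return values here are purely illustrative, not yajsv's actual code:

```go
package main

import (
	"bytes"
	"fmt"
)

// detectEncoding inspects the leading bytes of a file for a byte order
// mark and reports the encoding it implies. A hypothetical helper for
// illustration only.
func detectEncoding(data []byte) string {
	switch {
	case bytes.HasPrefix(data, []byte{0xEF, 0xBB, 0xBF}):
		return "UTF-8 BOM"
	case bytes.HasPrefix(data, []byte{0xFF, 0xFE}):
		return "UTF-16 LE"
	case bytes.HasPrefix(data, []byte{0xFE, 0xFF}):
		return "UTF-16 BE"
	default:
		// No BOM: assume plain UTF-8, the JSON default.
		return "UTF-8"
	}
}

func main() {
	fmt.Println(detectEncoding([]byte{0xEF, 0xBB, 0xBF, '{', '}'})) // UTF-8 BOM
	fmt.Println(detectEncoding([]byte(`{"a":1}`)))                  // UTF-8
}
```

Note this sketch ignores UTF-32, whose LE BOM (FF FE 00 00) shares a prefix with UTF-16 LE; a fuller implementation would check the four-byte marks first.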
Further complicating matters, YAML does support BOMs, and they are required for UTF-16 detection.
@neilpa - There are/were certain situations where older PowerShell versions will encode UTF-8 with a BOM when "piping" the output of a text value to a file. As for the UTF-16 LE encoding, I can't say where those files are coming from, except that some of our developers are using Macs, so maybe the default encoding for them is different. It could also just be a matter of editor choice by the developers.
FYI: The RFC cited was obsoleted by this one: https://tools.ietf.org/html/rfc8259. In both, the following text is present: "Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error." So, it's really up to you whether you want your schema validation utility to be able to parse what might be considered "invalid" JSON data. I understand if you do, but you could also consider it a warning condition and maybe add a switch to ignore warnings, leaving the default to treat them as errors.
Thanks for looking more closely at the RFCs. Obviously I didn't read those carefully enough last night. Given that second sentence it seems reasonable to allow it.
Right now files are bucketed into pass, fail, or error, the last being invalid JSON. I'd say these should still default to an "error" but with a better message about the BOM. The rationale is that the moment they are served as-is over the internet they become invalid. As for the switch, I'm thinking along the lines of -allow-bom.
@neilpa - As a potential future feature request: Have you considered adding a "debug" switch like -d, where you would send useful information to standard out, such as the text encoding? It could work with the switch you suggested, -allow-bom, to help those of us who are implementing your utility in a pipeline to figure out potential issues. Your thoughts?
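For reference, the two switches floated in this thread (-allow-bom and -d) could be wired up with Go's standard flag package roughly as below. The parseOpts helper is hypothetical; yajsv's actual CLI may differ:

```go
package main

import (
	"flag"
	"fmt"
)

// parseOpts parses the two switches discussed in this thread. The flag
// names come from the discussion; this is a sketch, not yajsv's code.
func parseOpts(args []string) (allowBOM, debug bool, err error) {
	fs := flag.NewFlagSet("yajsv", flag.ContinueOnError)
	a := fs.Bool("allow-bom", false, "treat a leading BOM as a warning instead of an error")
	d := fs.Bool("d", false, "print diagnostics such as the detected text encoding")
	err = fs.Parse(args)
	return *a, *d, err
}

func main() {
	a, d, _ := parseOpts([]string{"-allow-bom", "-d"})
	fmt.Println(a, d) // true true
}
```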
@GlenAppleton I posted a v1.4.0 release with support for UTF-16 files. I also added the switch discussed above. Let me know if you have any issues. I haven't tried this manually on Windows, but the large test matrix I added passes the GitHub CI build on Windows.
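For readers curious how UTF-16 LE input can be handled in Go before JSON parsing, a minimal standard-library-only sketch (not yajsv's actual code) strips the BOM and transcodes to UTF-8:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// decodeUTF16LE strips a leading BOM (if present) and converts
// UTF-16 LE bytes to a UTF-8 Go string. Illustrative helper only.
func decodeUTF16LE(b []byte) string {
	if len(b) >= 2 && b[0] == 0xFF && b[1] == 0xFE {
		b = b[2:] // drop the BOM
	}
	// Reassemble 16-bit code units from little-endian byte pairs.
	u16 := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		u16 = append(u16, binary.LittleEndian.Uint16(b[i:]))
	}
	// utf16.Decode also handles surrogate pairs correctly.
	return string(utf16.Decode(u16))
}

func main() {
	// "{}" encoded as UTF-16 LE with a BOM
	raw := []byte{0xFF, 0xFE, '{', 0x00, '}', 0x00}
	fmt.Println(decodeUTF16LE(raw)) // prints {}
}
```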
I decided not to do this for now but made a note about the possibility in #12.
@neilpa - Thanks! I'll give it a try once I finish my current project task.
I've been running some tests using your yajsv utility (Windows 64-bit version), and I've found that both UTF-8 (BOM) and UTF-16 LE encoded files throw the following errors:
UTF-8 with BOM : error: validate: invalid character 'ï' looking for beginning of value
UTF-16 LE : error: validate: invalid character 'ÿ' looking for beginning of value
For my purposes, implementing this in a PowerShell script, it's not a big deal since I can re-encode the files to standard UTF-8 and it works, but I thought you'd like to know about the encoding limitations.
Great utility, BTW!