
Issue with json::parse decoding codepoints #3142

Closed
gtrevi opened this issue Nov 16, 2021 · 5 comments
Labels: solution: proposed fix (a fix for the issue has been proposed and waits for confirmation)

Comments


gtrevi commented Nov 16, 2021

I narrowed down a basic issue with json::parse when decoding escaped characters. The encoding is done in "bare metal C" by embedded software, using a standard UTF-8 encoding algorithm, and I need to read that JSON string back on a PC:

  • JSON to be encoded: "{\"property\":\"Temperature is 22°C\"}"
  • JSON encoded by the embedded SW: "{\"property\":\"Temperature is 22\\uc2b0C\"}" (° is encoded as 0xc2 + 0xb0)

When I test-serialize the parsed JSON with:

std::string json_string = "{\"property\":\"Temperature is 22\\uc2b0C\"}";
json j = json::parse(json_string);
std::stringstream sb;
sb << j << std::endl;

The content of sb is:

"{\"property\":\"Temperature is 22슰C\"}\n"

The problem is that extra character, since json::parse treats the escaped character as 3 bytes instead of 2. This is the portion of the code where it does this:

//line 7000 of version 3.10.3

// result of the above calculation yields a proper codepoint
JSON_ASSERT(0x00 <= codepoint && codepoint <= 0x10FFFF);

// translate codepoint into bytes
if (codepoint < 0x80)                            
{
  // 1-byte characters: 0xxxxxxx (ASCII)
  add(static_cast<char_int_type>(codepoint));
}
else if (codepoint <= 0x7FF)
{
  // 2-byte characters: 110xxxxx 10xxxxxx
  add(static_cast<char_int_type>(0xC0u | (static_cast<unsigned int>(codepoint) >> 6u)));
  add(static_cast<char_int_type>(0x80u | (static_cast<unsigned int>(codepoint) & 0x3Fu)));
}
// ...

Thanks for any hint!

gtrevi changed the title from "json::parse does not decode codepoints correctly" to "Issue with json::parse decoding codepoints" on Nov 16, 2021
@nlohmann (Owner)

According to Wikipedia, the codepoint for the degree sign ° is U+00B0, and

{"property":"Temperature is 22\u00b0C"}

roundtrips correctly:

#include <iostream>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

int main() {
    // parse with \u00b0
    std::string json_string = R"({"property":"Temperature is 22\u00b0C"})";
    json j = json::parse(json_string);
    std::cout << j << std::endl;

    // parse with Unicode and dump with "ensure_ascii=true"
    std::string json_string2 = R"({"property":"Temperature is 22°C"})";
    json j2 = json::parse(json_string2);
    std::cout << j2.dump(-1, ' ', true) << std::endl;
}

nlohmann added the label solution: proposed fix on Nov 17, 2021
@nlohmann (Owner)

I think you mixed up the code point U+00B0 with the UTF-8 byte sequence 0xc2 0xb0; see https://www.fileformat.info/info/unicode/char/b0/index.htm.


gtrevi commented Nov 17, 2021

Thank you for the prompt response and proposed fix.

Correct, I am using UTF-8 encoding on the embedded system. I read here that the library only supports UTF-8, so I thought that was the way to go:

Only UTF-8 encoded input is supported which is the default encoding for JSON according to RFC 8259.

Since I get the string as bytes from the embedded system (via TCP/IP), processing it on the PC cannot rely on compile-time literal prefixes such as L"…" or R"…". So the sample JSON string above isn't actually a constant; it represents a flat C byte array that I need to load programmatically into a std::string in the PC's C++ code (for the moment I'm using std::codecvt_utf8).

What would you suggest as the best encoding practice for a smooth two-way exchange of content between your library and the bare-metal embedded system using UTF-8 in C? i.e.:

  • What encoding to use for the JSON char-array text on the embedded system (bare-metal C only). This is UTF-8 at the moment.
  • The best way to dump the JSON object into an std::string on the PC side (C++), and which decoding needs to be applied on the embedded side.

Thanks for any advice!

@nlohmann (Owner)

The Unicode codepoint U+00B0 can be expressed in two ways in JSON:

  • As a UTF-8 encoded string.
  • As escape sequence \u00b0.

That said, the library expects UTF-8 encoding. So a string needs to have the bytes 0xc2 0xb0 which you can achieve with code like

std::string s;
s.push_back(0xc2);
s.push_back(0xb0);

In any case, you should not mix the \uxxxx escaping and the UTF-8 bytes.


gtrevi commented Nov 19, 2021

I'm now feeding flat (non-escaped) UTF-8 bytes (° is 0xc2 0xb0), and the dump outputs look OK:

std::string json_from_embedded_system = "{\"property\":\"Temperature is 22°C\"}";
nlohmann::json j = nlohmann::json::parse(json_from_embedded_system);
std::cout << j.dump() << std::endl; // returns `{"property":"Temperature is 22°C"}`
std::cout << j.dump(-1, ' ', true) << std::endl; // returns `{"property":"Temperature is 22\u00b0C"}`

I think this can be closed on my side, thank you for clarifying the input & output encoding options of the library!
