
Issue with json::parse decoding codepoints #3142

Closed
gtrevi opened this issue Nov 16, 2021 · 5 comments
Labels: solution: proposed fix (a fix for the issue has been proposed and waits for confirmation)

Comments


gtrevi commented Nov 16, 2021

I narrowed down a basic issue with json::parse when decoding escaped characters. The encoding is done in "bare metal C" by embedded software, using a standard UTF-8 encoding algorithm, and I need to read that JSON string back on a PC:

  • JSON to be encoded: "{\"property\":\"Temperature is 22°C\"}"
  • JSON encoded by the embedded SW: "{\"property\":\"Temperature is 22\\uc2b0C\"}" (° is encoded as 0xc2 + 0xb0)

When I test-serialize the parsed JSON with:

std::string json_string = "{\"property\":\"Temperature is 22\\uc2b0C\"}";
json j = json::parse(json_string);
std::stringstream sb;
sb << j << std::endl;

The content of sb is:

"{\"property\":\"Temperature is 22슰C\"}\n"

The problem is that extra character, since json::parse treats the escaped character as 3 bytes instead of 2. This is the portion of the code where it does this:

//line 7000 of version 3.10.3

// result of the above calculation yields a proper codepoint
JSON_ASSERT(0x00 <= codepoint && codepoint <= 0x10FFFF);

// translate codepoint into bytes
if (codepoint < 0x80)                            
{
  // 1-byte characters: 0xxxxxxx (ASCII)
  add(static_cast<char_int_type>(codepoint));
}
else if (codepoint <= 0x7FF)
{
  // 2-byte characters: 110xxxxx 10xxxxxx
  add(static_cast<char_int_type>(0xC0u | (static_cast<unsigned int>(codepoint) >> 6u)));
  add(static_cast<char_int_type>(0x80u | (static_cast<unsigned int>(codepoint) & 0x3Fu)));
}
// ...

Thanks for any hint!

gtrevi changed the title from "json::parse does not decode codepoints correctly" to "Issue with json::parse decoding codepoints" on Nov 16, 2021
@nlohmann (Owner)

According to Wikipedia, the codepoint for the degree sign ° is U+00B0, and

{"property":"Temperature is 22\u00b0C"}

roundtrips correctly:

#include <iostream>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

int main() {
    // parse with \u00b0
    std::string json_string = R"({"property":"Temperature is 22\u00b0C"})";
    json j = json::parse(json_string);
    std::cout << j << std::endl;

    // parse with Unicode and dump with "ensure_ascii=true"
    std::string json_string2 = R"({"property":"Temperature is 22°C"})";
    json j2 = json::parse(json_string2);
    std::cout << j2.dump(-1, ' ', true) << std::endl;
}

nlohmann added the label solution: proposed fix on Nov 17, 2021
@nlohmann (Owner)

I think you mixed up the code point U+00B0 with the UTF-8 byte sequence 0xc2 0xb0; see https://www.fileformat.info/info/unicode/char/b0/index.htm.


gtrevi commented Nov 17, 2021

Thank you for the prompt response and proposed fix.

Correct, I am using UTF-8 encoding on the embedded system. I read here that the library only supports UTF-8, so I thought that was the way to go:

Only UTF-8 encoded input is supported which is the default encoding for JSON according to RFC 8259.

Since I get the string as bytes from the embedded system (via TCP/IP), processing it on the PC cannot rely on compile-time literal prefixes such as L"…" or R"…". So the sample JSON string above isn't actually a constant; it represents a flat C byte array that I need to load programmatically into a std::string in the PC's C++ code (for the moment I'm using std::codecvt_utf8).

What would you suggest as the best encoding practice for a smooth two-way exchange of content between your library and the bare-metal embedded system using UTF-8 in C? i.e.:

  • What encoding to use for the JSON char-array text on the embedded system (bare-metal C only). This is UTF-8 at the moment.
  • The best way to dump the JSON object into an std::string on the PC side (C++), and which decoding needs to be applied on the embedded side.

Thanks for any advice!

@nlohmann (Owner)

The Unicode codepoint U+00B0 can be expressed in two ways in JSON:

  • As a UTF-8 encoded string.
  • As escape sequence \u00b0.

That said, the library expects UTF-8 encoding. So a string needs to have the bytes 0xc2 0xb0 which you can achieve with code like

std::string s;
s.push_back(0xc2);
s.push_back(0xb0);

In any case, you should not mix the \uxxxx escaping and the UTF-8 bytes.


gtrevi commented Nov 19, 2021

I'm now feeding flat (non-escaped) UTF-8 bytes (° is 0xc2 0xb0), and the dump outputs look OK:

std::string json_from_embedded_system = "{\"property\":\"Temperature is 22°C\"}";
nlohmann::json j = nlohmann::json::parse(json_from_embedded_system);
std::cout << j.dump() << std::endl; // returns `{"property":"Temperature is 22°C"}`
std::cout << j.dump(-1, ' ', true) << std::endl; // returns `{"property":"Temperature is 22\u00b0C"}`

I think this can be closed on my side, thank you for clarifying the input & output encoding options of the library!
