Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MessageToJson outputs the wrong type for uint64 and int64 in Python #2954

Closed
mortada opened this issue Apr 7, 2017 · 7 comments
Closed

MessageToJson outputs the wrong type for uint64 and int64 in Python #2954

mortada opened this issue Apr 7, 2017 · 7 comments

Comments

@mortada
Copy link

mortada commented Apr 7, 2017

Here's a very simple proto file:

$ cat test.proto
syntax = "proto3";

message Message {
    uint64 foo1 = 1;
    uint32 foo2 = 2;
    float  foo3 = 3;
    double foo4 = 4;
    bool   foo5 = 5;
    int64  foo6 = 6;
    int32  foo7 = 7;
}

and I created a protobuf message in Python as follows:

In [1]: import test_pb2
In [2]: message = test_pb2.Message()

In [3]: message.foo1 = 11

In [4]: message.foo2 = 22

In [5]: message.foo3 = 33

In [6]: message.foo4 = 44

In [7]: message.foo5 = True

In [8]: message.foo6 = 66

In [9]: message.foo7 = 77

In [11]: message
Out[11]:
foo1: 11
foo2: 22
foo3: 33
foo4: 44
foo5: true
foo6: 66
foo7: 77

In [12]: from google.protobuf.json_format import MessageToJson

In [13]: print(MessageToJson(message))
{
  "foo6": "66",
  "foo4": 44,
  "foo7": 77,
  "foo5": true,
  "foo2": 22,
  "foo1": "11",
  "foo3": 33
}

Note that foo6 and foo1 should not have quotes as they are integers and not strings. Using the protobuf_to_dict module does not have this problem

In [15]: from protobuf_to_dict import protobuf_to_dict

In [16]: protobuf_to_dict(message)
Out[16]:
{'foo1': 11,
 'foo2': 22,
 'foo3': 33.0,
 'foo4': 44.0,
 'foo5': True,
 'foo6': 66,
 'foo7': 77}

In [17]: import json

In [18]: json.dumps(protobuf_to_dict(message))
Out[18]: '{"foo6": 66, "foo4": 44.0, "foo7": 77, "foo5": true, "foo2": 22, "foo1": 11, "foo3": 33.0}'

My protoc version is 3.2.0 and python version is 3.5.2

$ protoc --version
libprotoc 3.2.0

$ python --version
Python 3.5.2 :: Anaconda custom (x86_64)
@xfxyjwf
Copy link
Contributor

xfxyjwf commented Apr 7, 2017

As per proto3 JSON spec, uint64/int64 fields should be printed as decimal strings. See:
https://developers.google.com/protocol-buffers/docs/proto3#json

The reason is that uint64/int64 is not part of JSON spec and many JSON libraries only support double precision. To prevent precision loss our proto3 JSON spec requires serializers to put int64/uint64 values in strings.

@xfxyjwf xfxyjwf closed this as completed Apr 7, 2017
@kirpit
Copy link

kirpit commented Nov 19, 2017

I wanted ask the same question for MessageToDict here before creating another issue.

I respect MessageToJson converting int64 and uint64 into string for some reason even though it is hard to understand, why a long / integer value (without precision?) should be translated into string when there is no limit on a JSON integer by the specs.

Regardless to this unexpected behavior, I am assuming only by the name MessageToDict has nothing to do with JSON specs or whatnot, and all it is expected to do is to take a message object and give back a Python dictionary without touching its data types. Unfortunately, it's behavior is the same with MessageToJson, which makes us impossible to use these utility functions who deal with int64 data types.

The question is, should MessageToDict return these values in Python dictionary without touching their data types?

@xfxyjwf
Copy link
Contributor

xfxyjwf commented Nov 20, 2017

Isn't MessageToDict part of the json_format package? It should honor the same proto3 JSON spec as with MessageToJson.

@WloHu
Copy link

WloHu commented Mar 14, 2019

It's all nice that json_format package follows the spec but it's laughable that I can't do this:

m = test_pb2.Message.FromString(serialized_proto_bytes)
d = MessageToDict(m)
copy = test_pb2.Message(**d)

because of type inconsistency.
I know I'm using wrong package for this but if you went that far to produce Python dict which is JSON spec compatible and you allow to create message instances out of unpacked nested dict structures then why we have to use another Python package to add the missing link for proper dict serialization?

@proximous
Copy link

I'll add my vote to @WloHu that the code fragment above should work (ie the process should be reversible), even if we have to specify some optional argument like:
d = MessageToDict(m, preserve_int64_as_int=True)
Is something like that open for consideration or is there an existing recommended solution to this?

lichenran1234 added a commit to mlflow/mlflow that referenced this issue Nov 11, 2021
…SON (#5010)

* Fix an issue with the timestamps in the endpoint response

Signed-off-by: Chenran Li <[email protected]>

According to issue #4037, the returned JSON of the endpoints has `creation_timestamp` and `last_updated_timestamp` as strings, not numbers. It's different from what was documented in the [official doc](https://www.mlflow.org/docs/latest/rest-api.html#mlflowregisteredmodel).

The reason is we are calling Google's `MessageToJson` API to convert protobuf to json, which implicitly converts int64/fixed64/unit64 fields to strings. And they claimed it's a feature not a bug (see the [discussion](protocolbuffers/protobuf#2954 (comment))).

According to the bug reporter, this bug doesn't exist in Azure ML mlflow server (which is essentially our Databricks mlflow server). That's because we are using ScalaPB's `ToJson()` API for all the Databricks endpoints, and it doesn't convert int64 to string.

There is no way to let `MessageToJson` API not convert int64 to strings. Nor are there any other good Python proto-to-json libraries. So to fix this bug, we have to choose from:
* (what I'm doing in this PR) manually converting the int64/uint64/fixed64 fields back to numbers after calling `MessageToJson`
* (too risky so I chose not to do) writing our customized `MessageToJson` API
@sany2k8
Copy link

sany2k8 commented Jun 2, 2022

For my case MessageToDict makes units to string because of int64 type (google.type.Money) I did process it manually, is there any further good soln regarding this issue? Actually, I don't want to use this solution. I was looking something like d = MessageToDict(m, preserve_int64_as_int=True).

data = {"roomNights":[{"night":{"year":2022,"month":6,"day":2},"price":{"currencyCode":"USD","units":"100","nanos":10},"tax":{"currencyCode":"USD","units":35,"nanos":10}}],"fees":[{"title":"State Tax","price":{"currencyCode":"USD","units":"8","nanos":10}},{"title":"Occupancy Tax","price":{"currencyCode":"USD","units":"2","nanos":10}},{"title":"Resort Fees","price":{"currencyCode":"USD","units":"25","nanos":10}}],"totalCost":{"currencyCode":"USD","units":"135","nanos":10},"totalFees":{"currencyCode":"USD","units":"35","nanos":10},"totalBeforeFees":{"currencyCode":"USD","units":"100","nanos":10}}

def cast_units_string_to_int(cost_data: dict, key: str):
    for k, v in cost_data.items():
        if isinstance(v, dict):
            for kk, vv in v.items():
                if isinstance(vv, dict):
                    cast_units_string_to_int(vv, key)
                else:
                    if kk == key:
                        cost_data[k][kk] = int(vv)
                    else:
                        cost_data[k][kk] = vv 
        elif isinstance(v, list):
            for _, vvv in enumerate(v):
                cast_units_string_to_int(vvv, key)
                        
    return cost_data
                
print(data)
updated_units = cast_units_string_to_int(data, "units")
print(updated_units)

@Sschumac
Copy link

Sschumac commented May 1, 2023

Even if it is part of the json_format package, it is not formatting to json when calling MessageToDict. So why on earth keep this JSON convention as the default behavior?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants