Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Map and Object type problems #261

Closed
jroper opened this issue Jul 2, 2018 · 25 comments
Closed

Map and Object type problems #261

jroper opened this issue Jul 2, 2018 · 25 comments
Assignees

Comments

@jroper
Copy link
Contributor

jroper commented Jul 2, 2018

This is related to #184. Currently the spec introduces a lot of confusion around Maps and Objects when it comes to JSON.

So, the data attribute is an Object, which means it can be Map, String or Binary. Map is a map of name to Object pairs. So, if I have a data attribute of:

"data": {"foo": true}

It's not a Map. Why? Because true is not a valid Object according to the CloudEvenvts spec (a valid object must be a Map, String or Binary), and the CloudEvents spec says that Maps are mappings of Strings to Objects. So the above is actually invalid, if you wanted to pass the above, you would have to encode it like this:

"data": "{\"foo\": true}"

Now the data is a String, and so it's a valid CloudEvents Object, and it can be decoded to JSON. However this contradicts the JSON encoding spec, which says that a JSON object can be directly embedded into the Event json, which would effectively make the above event unencodable as soon as you tried to convert it from the JSON encoding to another encoding.

I think there are a number of points in the spec where JSON and the Map type are confused, particularly in the way JSON events are encoded. Either, the CloudEvents spec should support JSON as a top level concept, replacing Map with JSON, or, the Object type should be redefined to be either String or Binary, but not Map, and force JSON to be encoded as a String. If this doesn't happen, the CloudEvents implementations I think will forever be plagued with issues arising from ambiguities of when to treat something as JSON and when to treat it as a String, the result being things like something going in as JSON, but coming out as a String, or vice versa, especially when events go through different encodings, for example, through JSON encoding, then HTTP, then MQTT, etc.

@deissnerk
Copy link
Contributor

@jroper I agree. I get confused by this, too. I also wonder, how someone receiving a CloudEvent in JSON format is supposed to distinguish, if the event data is String or Binary. For Binary I would need to base64 decode, before passing it on in HTTP binary mode. Just deriving this from the content type, might be very error prone.

@duglin
Copy link
Collaborator

duglin commented Oct 30, 2018

ping @clemensv

@duglin duglin added the v0.2 label Oct 31, 2018
@duglin
Copy link
Collaborator

duglin commented Nov 1, 2018

@jroper see: #184 (comment)

@clemensv
Copy link
Contributor

clemensv commented Nov 6, 2018

We've meanwhile renamed Object to Any, which at least takes some of the conceptual confusion out of the model.

The "data" example above is compliant with the JSON mapping, because it falls under the rules of 3.1. which effectively allows free-form JSON.

For all other attributes, including their structured contents, it is correct that a plain Boolean isn't supported. We don't have that type.

@duglin
Copy link
Collaborator

duglin commented Nov 8, 2018

@jroper any thoughts on this?

@duglin
Copy link
Collaborator

duglin commented Nov 21, 2018

@jroper ^^

@duglin duglin removed the v0.2 label Nov 29, 2018
@alanconway
Copy link
Contributor

alanconway commented Dec 19, 2018

I'm having lots of problems in this area trying to decode a JSON structured event so I can forward it as a binary event on another transport:

  • impossible to determine if a JSON Map value is String or Binary
  • difficult to determine whether the data attribute is String or Binary
  • difficult to determine whether a binary event message needs encoding or not.

Details:

Spec: "CloudEvents type can be determined by inference using the rules from the mapping table, whereby the only potentially ambiguous JSON data type is string"

The ambiguity is fatal. Given a JSON event { ..., "data": "buzzkill } is the data:

  1. The text String "buzzkill"
  2. The base64-encoded Binary value [110 236 243 146 41 101]

There's no safe way to know.

Spec: "whether a Base64-encoded string in the data attribute is treated as Binary or as a String is also determined by the contenttype value."

Fine if content type starts with "text", but not if it starts "application". My event bridge only needs to know whether or not to base64 decode the data in order to forward a binary event. How does it know that application/zip is binary but application/javascript is text? Does it need a table of every media type?

Suggestion:

  1. Restrict Any to Map, Integer and String. That is a lowest-common-denominator that can be rendered even in JSON. Applications need to know how to interpret their own map entries anyway, so if the app needs to base64 or otherwise encode a map value, it can.

  2. Add an optional attribute "binary" to the event context. If present with any value the data is binary, if not the data is text.

The originator of the message sets the binary flag - they provide the data so are in a position to know whether it is binary or text. Intermediaries don't need to interpret the contenttype to function.

@duglin
Copy link
Collaborator

duglin commented Jun 8, 2019

@jroper @alanconway is this still an issue?

@duglin duglin added the v1.0 label Jun 8, 2019
@jroper
Copy link
Contributor Author

jroper commented Jun 13, 2019

I think this is still an issue.

So firstly, while Map can be rendered as JSON, JSON cannot be rendered as Map. So, if I consume a cloud event using the JSON format, and it has a data of {"foo": true}, I can't express that as a Map. For this reason, I don't think Map is useful as a possible type for data.

I think the best way to deal with this is the same way everything else does. Make data just a single type, by itself it's an opaque blob that only the domain can understand, and when combined with the datacontenttype, then further reasoning can be made about what it contains by things that are not privy to the domain, eg, if the content type is application/json, then we can know that it can be parsed and processed as JSON.

So, effectively, the type is Binary. But different encodings can use the datacontenttype to be smarter about it. So when encoding to JSON, we already have most of the rules in place, if the content type is JSON (or +json or any content type that is known to be JSON), then it gets embedded directly as a JSON value. The only thing that needs to be added is that if the content type is text/* or any other content type that is known to be text, the value can be encoded as a JSON string. Otherwise, it's a base64ed JSON string.

Note that currently there are two different sources of information for the type of the data, there is the cloud events type information, and there's datacontenttype. What if these disagree? What if I have cloud event and I make the data a Binary, but I set the datacontentype to text/plain? If I encode this to JSON, do my bytes end up in a base64ed string? If I send it over HTTP, do I end up with binary at the other end, or does the content type take precedence? I can't see anywhere in the spec where this is specified, but these unanswered questions are likely to cause real world problems when different implementations have different ideas about how the spec should be interpreted. By making data only have one type of Binary, and then allowing datacontenttype to specify how that binary data might be handled (if a higher level abstraction wants to represent it as something else, eg as a JSON AST), we remove the potential conflict. And it solves the map and object type problems.

@duglin
Copy link
Collaborator

duglin commented Jun 13, 2019

@jroper the spec, at some point, was changed for the definition of maps:
Map - String-indexed dictionary of Any-typed values.
which I believe means that {"foo": true} is a valid CloudEvent map, right?

I think the best way to deal with this is the same way everything else does. Make data just a single type, by itself it's an opaque blob that only the domain can understand...

Do we not do this today by saying data's type is Any ?

@jroper
Copy link
Contributor Author

jroper commented Jun 13, 2019

I don't think so, because here is the current definition of Any:

Any - Either a Binary, Integer, Map or String.

So, booleans, arrays and null can't be expressed in that definition of Any or Map. Or put it this way, I'm not sure what programming languages you're familiar with, but I assume you're familiar with Go. Imagine modelling the Map datatype with Go. It might look something like this (excuse my Go, I'm very new to the language):

type CloudEventsAny interface {
  typeName() string
}

type CloudEventsMap struct {
  map[string]CloudEventsAny
}

func (m CloudEventsMap) typeName() string {  
    return "Map"
}

type CloudEventsString struct {
  string
}

func (s CloudEventsString) typeName() string {  
    return "String"
}

type CloudEventsNumber struct {
  Decimal
}

func (n CloudEventsNumber) typeName() string {  
    return "Number"
}

type CloudEventsBinary struct {
  Buffer
}

func (b CloudEventsBinary) typeName() string {  
    return "Binary"
}

If I've got my go right, the above could be one way to represent the CloudEvents Any AST in Go. Now, let's say you receive a Cloud Event formatted in JSON that looks like this:

{
  "data": {
    "foo": true,
    "bar": [1, 2, 3]
  }
}

And you want to parse that data attribute into the above format, so that you're working with it in a way that is agnostic from the wire binding. How do you do it? You can't. It's impossible. Unless you represent the whole JSON as a String or Binary. then it works. But then what is the point of saying that data can be a Map? There is no point. It's not useful.

@duglin
Copy link
Collaborator

duglin commented Jun 13, 2019

See: https://github.com/cloudevents/spec/pull/437/files

All of this makes me wonder why we're bothering to say what "Any" is at all. Why not just say it's "any data type" or "unspecified" - and as long as the sender can serialize it in the chosen encoding format then why should the CE spec care?

@jroper
Copy link
Contributor Author

jroper commented Jun 13, 2019

That's essentially what I'm saying. It's a blob, opaque to the spec. Implementations can use content type to gather more meaning from it. BUT it is actually more complex than that because then you get to the JSON encoding spec, and it does have some very concrete things to say about its type. And that's a good thing, because if the event is JSON, then it makes sense to have it just as an embedded JSON object, rather than a base64 encoded representation of the JSON. But the most important thing I think is that when you round trip from one encoding and back to the abstract, to another encoding and back, you always end up with the same data. And to do that, I think it needs to be opaque, a blob. The spec as it currently stands you could easily end up with the following payload:

{
  "data": {
    "foo": [1, 2, 3]
  }
}

Ending up being encoded as:

{
  "data": "{\"foo\":[1,2,3]}"
}

ie, the actual payload has changed, and is now corrupt, and I think this could easily happen with different implementations of different encodings of the current spec after a few round trips between them.

So, the specific changes that I would say is that the type of data should be left unspecified, with a note saying many implementations may choose to represent this as a byte buffer, or decode it into higher level formats based on content type. Then, the JSON encoding spec needs to be updated with precise wording on exactly when the data should be a JSON value, or a String, or a base64 encoded binary. Additionally, the JSON encoding needs to be updated to say exactly how the type information it contains should be handled when converting to an abstract representation of teh cloud event, eg, if you have a data that is a String, and no datacontenttype is present, then when converting to the abstract representation, datacontenttype should default to text/plain;charset=utf-8, and the string should be encoded/handled as such.

@deissnerk
Copy link
Contributor

@duglin I don't think the type system is the problem. If I understand @jroper correctly, it is applying this type system to the data attribute. I would even go further and state, that data should not be regarded as attribute at all. It becomes one in the JSON format, but even in the JSON format a whole lot of extra rules have been necessary to cover data.
To better reflect the conceptual model of CloudEvents in JSON, you would have:

{
  "context": {
    "id": "abc",
    "type": "my.event.type",
    ...
  }
  "data": ...
}

I haven't checked all SDKs, but for the go-sdk you have exactly this:

// Event represents the canonical representation of a CloudEvent.
type Event struct {
	Context     EventContext
	Data        interface{}
	DataEncoded bool
}

@duglin
Copy link
Collaborator

duglin commented Jun 13, 2019

TBH, I don't use the go sdk, I just have something like this in my code:

type Event struct {
    SpecVersion     string            `json:"specversion,omitempty"`
    Type            string            `json:"type,omitempty"`
    Source          string            `json:"source,omitempty"`
    Subject         string            `json:"subject,omitempty"`
    ID              string            `json:"id,omitempty"`
    Time            string            `json:"time,omitempty"`
    SchemaURL       string            `json:"schemaurl,omitempty"`
    DataContentType string            `json:"datacontenttype,omitempty"`
    Extensions      map[string]string `json:"extensions,omitempty"`
    Data            json.RawMessage   `json:"data,omitempty"`
}

The extra "context" wrapper is a valid choice, but it is just a choice.

But I do agree that it seems like we should just stop trying to over-specify data. And, if we drop the structured format (only 1/2 joking) we wouldn't need to have data at all - we just define the extra metadata to add to an already existing event/serialization.

@alanconway
Copy link
Contributor

alanconway commented Jun 13, 2019 via email

@duglin
Copy link
Collaborator

duglin commented Jun 13, 2019

Anyone want to volunteer to create a PR?

@duglin
Copy link
Collaborator

duglin commented Jun 14, 2019

@jroper do you think you could create a PR with what you're thinking? It would be great to have something to consider for next week's call :-)

@jroper
Copy link
Contributor Author

jroper commented Jun 15, 2019

I'll take a stab on Monday.

@jroper
Copy link
Contributor Author

jroper commented Jun 17, 2019

Starting to look at this now. First thing I think to do is work out what to call the data, if it's not going to be called an attribute (which I think is a good idea to not call it an attribute). I went through the spec (I did skim over a few of the binding specs and there was nothing too notable there) to work out all the different names that we currently use to refer to the data. Here they are:

  • Data
  • payload
  • data attribute
  • data content
  • data payload
  • data field
  • event data
  • event "data"
  • data
  • data
  • data items

Just to be clear, I'm not saying that all of the above need to be changed to all use the same name, sometimes using a different name does make sense, some of the names above are referring are intentionally referring to the data in a slightly different context and the different name is necessary to convey that. But I do think some of the names above need to be consolidated.

The challenge with the term data is that of course, everything is data. Each attribute by itself is data, the attributes an aggregate are data, there is data that is not part of the event, etc. So I think this rules out using the term "data" by itself, because it is not precise enough, there's too much chance for ambiguity, and I think we should free up the spec (and bindings etc) to be able to use the data term in a generic sense without the chance of it being confused for the data payload.

So one option is to use Data (ie, with a capital D). That already appears in the definitions, and I think it makes it unambiguous. The problem with Data is that it's not actually a proper noun, it's not like we've created a trademark term for this, it is just ordinary data, and we've used the term data because that's what it is. So it doesn't really work grammatically. Also, I don't think it would read that well, it's just a bit odd having capital D data all throughout the spec. That said, I think it works in the definitions section, so I think we should leave that as is, and when referring specifically to the definitions (as the Data Attribute section currently does), we can use Data. I think though the definitions section should define what term will be used to refer to the Data in the rest of the spec.

So, I'm torn between two names.

The first is "data payload". In most contexts, payload works really well, it's a commonly used term in the context of metadata and data (used in many programmatic APIs as well as protocols). The definition of Data even currently says "(i.e. the payload)".

The other is "data content". The reason I like this is because of the naming of the attributes "datacontenttype" and "datacontentencoding". Since they call it content, perhaps the entire spec should call it content.

I like the name payload better, but I think content is more consistent. I'll start with payload, and we can bikeshed the names in this issue, PR and other calls until we come to a resolution.

@jroper
Copy link
Contributor Author

jroper commented Jun 17, 2019

With this - can we remove the Any type? The reason I ask this, currently many of the encodings are unable to correctly round trip the Any type. JSON can't, for example, if you have a Binary, it gets encoded to a base64ed string, but decoding, there's no way certain way to know if the the value should be a String as is, or base64 decoded into a Binary. HTTP is even worse - when decoding an Any typed attribute from an HTTP header, if it only contains digits, it could be an Integer, String or Binary. No way to know which. So there's no way to correctly round trip and Any to and from HTTP encoding.

Now that data is not an attribute, the Any type is only used by Map, and Map doesn't have to use the Any type. In fact, I think I'd like to propose the removal of Map too, since nothing that I can see (not the spec itself, nor any existing extensions) uses it. Better to keep things simple if it's not needed.

The biggest problem with maps comes from the HTTP encoding - HTTP encoding requires the Map keys to be put in the header name, but that places severe constraints that don't otherwise exist on what valid keys can be (pretty much limits it to alphanumeric ascii characters). Also, they lose case sensitivity. And, nested maps can't be encoded. And integers can't be encoded. A round trip of a Map to/from HTTP encoding is going to very frequently corrupt the event.

@duglin
Copy link
Collaborator

duglin commented Jun 17, 2019

Lots going on here :-)

  • I don't mind changing how to refer to data. It does seem a little odd to call it an attribute, but on the other hand, in the structured format it does appear like the other CE attributes - so I suspect that's why it started out that way.
  • re: maps - see: HTTP binary mode header mapping and header count limits #367 (comment) if we do 2, I'd think some of your concerns go away
  • re: Any - given our recent change to the type system where we have a "String" encoding, I wonder if we couldn't just abstract it and remove our type system and allow each attribute definition define the constraints on its values (e.g. must look like an RFC3339 timestamp, must be a positive int, etc...). Then it's up to the impl to decide how to map that attribute into/from a string.

jroper added a commit to jroper/spec that referenced this issue Jun 18, 2019
Fixes cloudevents#261

This change causes data to not be thought of as an attribute, rather to
be considered an opaque blob described by the content type. The name
given is data payload. The wording in the JSON encoding has been
modified to reflect this.
jroper added a commit to jroper/spec that referenced this issue Jun 18, 2019
Fixes cloudevents#261

This change causes data to not be thought of as an attribute, rather to
be considered an opaque blob described by the content type. The name
given is data payload. The wording in the JSON encoding has been
modified to reflect this.

Signed-off-by: James Roper <[email protected]>
jroper added a commit to jroper/spec that referenced this issue Jun 18, 2019
Fixes cloudevents#261

This change causes data to not be thought of as an attribute, rather to
be considered an opaque blob described by the content type. The name
given is data payload. The wording in the JSON encoding has been
modified to reflect this.

Signed-off-by: James Roper <[email protected]>
@clemensv
Copy link
Contributor

@duglin on your type system comment: The point of having it is that you have one reference to constraint. There needs to be a rule for how to express dates that's valid for the one instance we have in the core, but also for all extensions. Otherwise you end up with the CloudEventsTimeAttributeType, because each attribute can be different.

Maps are well-defined as JSON objects today.

For 'Any': If we allow for polymorphic maps, we arguably need Any for consistency. That said, since Maps don't appear in the abstract, I don't see this being a practical issue.

@clemensv
Copy link
Contributor

Regarding the original objection of the issue: The content of "data" is specifically addressed in the JSON encoding (section 3.3) and any valid JSON is valid for that attribute.

If the datacontenttype value is either "application/json" or any media type with a structured +json suffix, the implementation MUST translate the data attribute value into a JSON value, and set the data attribute of the envelope JSON object to this JSON value.

@duglin
Copy link
Collaborator

duglin commented Aug 15, 2019

With #484 merged I believe we can close this - if I'm mistaken please let me know and we can reopen it.

@duglin duglin closed this as completed Aug 15, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants