proto3 and unknown fields #272

joshuarubin · 2015-04-07T23:54:59Z

I know that unknown fields have been removed from proto3, but I am trying to get an explanation about why this change was made and if there is any way to replicate that behavior in proto3.

Thanks so much.

referred from golang/protobuf#25

dhendry · 2015-05-06T19:36:19Z

I too am wondering about this. I am looking into migrating what is essentially a messaging system to gRPC (where proto3 seems to be recommended). In my case, clients send messages (text plus rendering information) to each other via a server where the server needs to understand the text and certain parts of the rendering info. I want to allow client developers to experiment with new features (pre-release) without having to deploy server code for every change.

Essentially, it's a case where I want a shared proto definition between the client(s) and server, but don't want to require the server proto definition to be the latest to process requests.

solicomo · 2015-07-23T07:03:37Z

I'd like to hear about the explanation, too.

The behavior of proto2 makes sense to me.

jeremyong · 2016-03-14T18:23:11Z

I have a lot of concerns about silently deleting data upon deserialization, to the point that even though we have internally been using proto3 for several months, I am considering changing things back to proto2. This change would be a lot easier to stomach if there was a message option to allow serialization and deserialization of unknown fields instead of discarding them.

jeremyong · 2016-03-14T21:13:57Z

Being unable to add unknown fields that persist is also unacceptable for us. Reading the code, it's pretty clear the decision to omit unknown fields happens at compile time rather than at runtime (based on the generated code), so it seems proto3 is a no-go. Personally, I very much liked most of the changes to the new version except this one. Changing the default behavior alone might have been ok, especially given that the new behavior is well-documented, but doing so without a way to restore old behavior seems like a misstep. Supporting a plugin that reverts that behavior seems too expensive relative to the cost of just using proto2 with restrictions (optional only, etc).

dhendry · 2016-04-19T20:06:04Z

Still no answers to this? This is a fundamental issue which is seriously hindering our adoption of protobuf in many areas.

jeremyong · 2016-04-19T20:22:42Z

+1 proto2 is a permanent fixture for us. Changing default behavior is one thing but changing it in a way that doesn't let the user even control it is a strict loss in my opinion. What I foresee moving forward is a huge fragmentation in the client ecosystem. Maintaining support for both proto2 and proto3 semantics is too much to chew for most developers, and I'm already seeing some client libraries do this awkward dance where they have some proto2 properties and some proto3 properties. The easiest example of this causing a problem in history is the move from Python2 to Python3. One possible solution might be a file level option that informs the protobuf compiler not to strip unknown fields.

liujisi · 2016-04-20T20:31:17Z

The proto3 spec doesn't forbid preserving unknown fields. Instead, it allows each implementation to choose whether to preserve unknowns. The current C++ and Java implementations chose to drop the unknowns, though. We are currently looking into the issue and will keep this thread posted.

jeremyong · 2016-04-20T20:40:15Z

Thanks @pherl for providing the update. FWIW, I think it is worth considering how the behavior might be standardized, for the same reason people argue against undefined behavior in C or C++. Undefined behavior should really only exist because of a lack of foresight; for something like this, we might as well come up with an actual solution, since we're already aware of the problem.

joshuarubin · 2016-04-20T20:51:53Z

Thanks for keeping this issue alive. I'd just like to add that we are interested in support for Go, but that might need to be addressed in golang/protobuf.

jeremyong · 2016-05-06T18:44:41Z

@pherl Any progress on this front?

gfecher · 2016-06-12T09:42:03Z

+1 for preserving unknown fields.

I accept that you can not trivially maintain compatibility with the JSON format (at least as long as you want to marshal fields with their names), but I think a lot of shops would be happy to pay this price for not having to release their low-level infrastructure in lock step with their newest clients.

In fact Kenton seems to wonder himself (https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-sbe.html): "Apparently, version 3 of Protocol Buffers, aka “proto3”, removes this feature. I honestly don’t know what they’re thinking. This feature has been absolutely essential in many of Google’s internal systems."

In my opinion the right approach would be to make this an option of the proto compiler on compiling the proto: this way everybody can decide for themselves whether the benefits outweigh the downsides.

For now I have overridden the PreserveUnknownFields function in both cpp_helpers.h and java_helpers.h in the compiler code to always return true and this seems to work, but I would appreciate it if someone from google could confirm.

xfxyjwf · 2016-06-12T18:29:54Z

Some updates: we tried to gather data to prove "unknown fields are essential for Google systems", but the result is not so convincing (the experiment is done in a Google sub-system, not the whole of Google).

For those of you who are interested in adding back unknown fields in proto3, could you describe your use case in more detail and explain why unknown fields are required (e.g., can the same use case be supported using some other proto3 features)? We need to prove unknown fields are needed in some common use cases in order to add them back.

jeremyong · 2016-06-12T19:07:50Z

Here is a use case I developed internally that makes heavy usage of unknown fields:

In addition to the message itself, we often annotate the message before sending it over the wire with metadata indicating if a field was deleted or not, if it was set to a default field, etc. Internally, we use a diff-ing scheme to create a protobuf message "diff" which handles maps, fields, and messages (recursively applied). The application of the diff itself is associative, so many diffs can accumulate into one, and this makes for a fairly elegant scheme for updating state for a particular message across many clients that may or may not be online.

Generalizing this use case, any protobuf message that is derived from the reflection API must necessarily leverage the unknown field set, since by definition, we cannot know the shape of the message a priori. Think of this as a "higher order message" whereas messages that are schema defined are first order messages.

gfecher · 2016-06-13T18:18:20Z

Hi,

We have a use case with a mixture of data validation/data transformation and storage.
Our infrastructure component understands certain bits of the schema that it validates/changes, but it is oblivious to the rest of the payload. It does store it, however, and clients running on the new schema expect the newly introduced fields to be returned intact.

In general, any component that needs only a partial understanding of the message could benefit from preserving unknown fields, especially where the bits the component does care about do not change often but the rest of the schema does. I can think of routing, storage, certain types of data transformation, etc.

I would be interested in knowing how you managed to solve these use cases (which I'm sure you have internally at google) without preserving unknown fields.

InfinitiesLoop · 2016-06-18T00:51:07Z

We need unknown fields, because it's one of the ways we know on the server-side that our proto definition is out of date, and needs to be re-synchronized. Without unknown fields, we would have to resort to polling or some other less authoritative way of detecting when the client has added fields.

Also while I understand trying to reduce feature surface area, unknown fields don't exactly cause a problem, do they? Dropping them has more negatives than positives, please add them back to proto3.

JesseChisholm · 2016-06-18T01:38:16Z

If the proto3 way were to set some option, like option (ProtoOptions).preserveUnknownFields = True; that would let those of us who need it keep it, and those who don't need it do without it.

Best of both worlds. :)

dhendry · 2016-06-28T14:20:05Z

I would absolutely want the ability to preserve or strip unknown fields at runtime. Some levels of our system are deployed regularly, are kept up to date, and should validate the well-known schema (and strip unknown fields). Other internal layers are deployed far less frequently and are not directly exposed to clients or potentially malicious actors; there, preserving unknown fields is highly desirable so we don't have to do full and extensive deploys for every little change.

rohitsaboo · 2016-07-14T02:10:12Z

Hey guys,

We would love to have this feature, too :) During my relatively long time at Google, I was aware of many services that relied on this behavior from proto2.

Essentially, think of any set of three or more services where A talks to C via B, and we don't want to redeploy B when a proto that is being passed between A and C gets a new field added to it. (I also posted this as a question on stackoverflow.)

Would be great to have an update for supporting this feature and/or an alternative mechanism that you believe can solve this problem for us.

Thanks,
Rohit

jeremyong · 2016-08-04T18:03:17Z

Still no word on what the original justification was, either.

Kaiserchen · 2016-08-24T09:22:53Z

The use-case we have is the following:

We use stream processors, namely kafka-streams, that rearrange protobuf messages. For example, we have two streams of protobuf messages that we join with each other. The join just outputs a joined message with the two others as fields. Sometimes we also aggregate streams into lists of messages from previous streams. The stream processors only know about the fields relevant to them (join fields, group-by fields, ...); all the other fields are carried along as unknown fields.

This allows the stream processor to continue working even when upstream schema changes happen, we do not need to redeploy our stream processing application, and the new fields end up in the output for free.

To add some drama: I think losing the unknown fields will force us to move to Avro.

matthewrj · 2016-08-31T02:12:35Z

This is a bit of a deal breaker for us too. We have the same use case where A sends data to B which reads some fields and forwards the message to C. We don't want to have to constantly update B when the schema changes even though it doesn't read any of the new fields. The current behaviour is quite dangerous since C can't tell if one of the new fields was set to the default value or if B is just out of date and lost data.

InfinitiesLoop · 2016-08-31T17:57:15Z

Would really appreciate an update on the feedback here. Whether Proto3 is going to ever support unknown fields can impact decisions being made even for folks still on Proto2, because if it isn't, we may need to invent other ways of solving our problems in order to avoid rearchitecting things when/if we move to proto3.

chmod007 · 2016-10-05T20:45:18Z

I have two use cases, both of which have sub-optimal workarounds:

- Include a signature in the same protobuf as the payload to be signed. To verify the signature, I deserialize, extract and remove the signature, reserialize and verify the signature. This breaks if the signed message contains any new fields unknown to the process verifying the signature. The workaround is to serialize in two levels, with the inner (signed) message serialized as bytes in the outer message.
- A server is the ultimate source of small update packets that are then routed peer-to-peer. Unserializing and reserializing before passing the message on to other peers strips out unknown fields. The workaround is for peers to share the original bytes instead of deserializing and reserializing.
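
The two-level workaround for the signature case can be sketched as follows. This is only an illustration under stated assumptions, not code from any protobuf runtime: it hand-encodes the wire format with the Python standard library, uses an HMAC as a stand-in for whatever signature scheme is really in use, and the envelope layout (field 1 = payload bytes, field 2 = signature) and the names sign_envelope and open_envelope are hypothetical.

```python
import hashlib
import hmac


def encode_varint(value):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def bytes_field(field_number, data):
    """Encode one length-delimited field (wire type 2)."""
    tag = (field_number << 3) | 2
    return encode_varint(tag) + encode_varint(len(data)) + data


def sign_envelope(key, inner):
    """Wrap already-serialized inner bytes (assumed layout: 1 = payload, 2 = MAC)."""
    mac = hmac.new(key, inner, hashlib.sha256).digest()
    return bytes_field(1, inner) + bytes_field(2, mac)


def open_envelope(key, envelope):
    """Verify the MAC over the raw payload bytes and return them unparsed."""
    fields = {}
    pos = 0
    while pos < len(envelope):
        tag, pos = decode_varint(envelope, pos)
        if tag & 7 != 2:
            raise ValueError("envelope holds only length-delimited fields")
        length, pos = decode_varint(envelope, pos)
        fields[tag >> 3] = envelope[pos:pos + length]
        pos += length
    expected = hmac.new(key, fields[1], hashlib.sha256).digest()
    if not hmac.compare_digest(expected, fields[2]):
        raise ValueError("signature mismatch")
    return fields[1]
```

Because the verifier checks the payload as opaque bytes and never re-serializes it, fields it does not know about inside the payload cannot change the outcome of verification.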

acozzette · 2016-10-07T17:22:25Z

One thing to keep in mind is that proto2 is not going away. We are still actively improving it and plan to keep doing so indefinitely, so proto2 is still a good choice if you have a use case that depends on unknown fields. The one main drawback is that a few languages (such as C# and Ruby) are currently proto3-only, but if you're not using those languages then that's not a problem.

@chmod007 , have you thought about using proto2 for your two use cases? Is that possible or do your schemas have to be proto3 for another reason?

Xorlev · 2016-11-18T03:57:18Z

I'll add a few usecases.

- We have a gRPC service proxying RPC traffic. It would be awfully nice to not have a hard requirement to deploy the proxy first upon schema changes in any of the services it proxies.
- We also maintain stream processing services which are processing protos from other parts of the organization. If they add a field, I'd prefer that field doesn't disappear unexpectedly just by flowing through our stream processor. There's some pretty awful documentation / tooling / coupling implications of needing to redeploy stream jobs any time upstream producers evolve their schema. Depending on any cycles in data flows, there may be no topological order that produces valid schema updates without doing a 2-step deploy: 1) upgrade proto schema, redeploy all the (many) things that might rely on it 2) update producer to fill in field, deploy producer. Pray all the systems were updated.

re: proto2 vs. proto3, it's kind of annoying to mix and match. It's pretty counterintuitive to only use proto2 to maintain unknown fields, but have proto3 definitions for gRPC servers. I agree with most of the design choices in proto3 (e.g. removing optional/required fields, map types), but not this.

I'd actually been unaware proto3 removed unknown field support until I expected it to maintain an unknown field and it didn't (and came to report it as an issue). I'd touted unknown field support as a huge selling point for protobufs when we'd first implemented them.

The protobuf website originally recommended that new projects use proto3, which is why we'd adopted it, but this is a pretty huge issue for us. We'll likely be forking the compiler similarly to @gfecher as the proto3 ship has long since sailed and this behavior is very important to helping us produce robust infrastructure.

stevvooe · 2016-11-18T22:47:10Z

@pherl @xfxyjwf Do you have suggestions for how to work around this with proto3? If this was removed, what techniques were used to avoid requiring this pattern within Google?

As far as I see it, this was the chief benefit of protobuf:

+----------+                        +----------+
|          |   +----------------+   |          |
|          |   |                |   |          |
| Producer +--->  Intermediate  +---> Consumer |
|          |   |                |   |          |
|          |   +----------------+   |          |
+----------+                        +----------+

Producer and Consumer could be updated with new fields, while intermediate can remain on the same version. If intermediate is a proxy of sorts, then this is important.
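
The producer/intermediate/consumer flow above can be illustrated at the wire-format level. The following is a rough sketch rather than how any protobuf runtime is implemented: it hand-parses top-level fields with the Python standard library only, copies raw field bytes instead of decoding values, and the names split_fields and forward are hypothetical.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def split_fields(buf):
    """Yield (field_number, raw_bytes) per top-level field; raw_bytes includes the tag."""
    pos = 0
    while pos < len(buf):
        start = pos
        tag, pos = read_varint(buf, pos)
        wire_type = tag & 7
        if wire_type == 0:        # varint
            _, pos = read_varint(buf, pos)
        elif wire_type == 1:      # fixed 64-bit
            pos += 8
        elif wire_type == 2:      # length-delimited
            length, pos = read_varint(buf, pos)
            pos += length
        elif wire_type == 5:      # fixed 32-bit
            pos += 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if pos > len(buf):
            raise ValueError("truncated field")
        yield tag >> 3, buf[start:pos]


def forward(buf, known_fields, preserve_unknown):
    """Re-emit a message as an intermediate that only understands known_fields."""
    out = bytearray()
    for number, raw in split_fields(buf):
        if number in known_fields or preserve_unknown:
            out += raw
    return bytes(out)
```

With a producer message holding field 1 (known to the intermediate) and field 2 (added later), forward(msg, {1}, True) hands the consumer the same bytes, while forward(msg, {1}, False) silently loses field 2, which is the behavior this thread objects to.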

jeremyong · 2016-11-18T22:53:45Z

@stevvooe We've been continuing to use proto2 for the intermediate proxy type of thing, since the two are binary compatible. Throughout our codebase, we've been propagating proto2 everywhere, since it's really annoying to maintain two different semantics for the proto definitions themselves, but if you wanted, the producer and consumer could use proto3.

I do have some plans eventually to do a separate C++ compiler entirely that consumes proto3 syntax but retains the API of the unknown fields unless someone else gets to it first. I want to do other changes like using more STL containers (vectors and maps) as the backing in-memory storage and fix the oddities with the arenas we've been seeing.

vozbu · 2017-09-07T11:56:53Z

@pherl, the pattern "save unknown fields and then discard them" seems excessive to me. Isn't it better just to pass a flag to the parsing function telling it whether or not to save unknown fields while parsing? That would save memory and CPU when you don't need these fields, while retaining all the desired benefits. In our workflows we sometimes have most of the fields in a message as unknown, and I'm afraid that parsing them will degrade our performance.

Actually, I would like to have such a flag in proto2 too.

liujisi · 2017-09-07T18:15:48Z

@vozbu what language are you using? We do have an API to skip unknown fields in Java. Other languages chose to provide a discard-unknown-fields API that runs after parsing is finished, mostly to reduce implementation complexity.

vozbu · 2017-09-08T06:33:34Z

@pherl, I'm talking about C++. I haven't looked at the implementation, so I can't judge it. I'm just giving my thoughts as a user.

jbolla · 2017-09-13T23:32:36Z

@pherl, the doc you shared states "3.4 release (ETA: Q3 2017): Google protobuf implementation for each language will provide APIs to explicitly drop or preserve unknowns for proto3. A temporary flag will be introduced for the default parsing behavior - default to drop unknowns."

3.4 is released. Did that actually make it in? I'm using Java and I see the flag for retaining unknowns, explicitDiscardUnknownFields in CodedInputStream, but the parsing code I see is using:
final boolean shouldDiscardUnknownFieldsProto3() { return explicitDiscardUnknownFields ? true : proto3DiscardUnknownFieldsDefault; }
So even if you don't set that flag you get proto3DiscardUnknownFieldsDefault, which defaults to false and appears not to have any way for external users to change.

liujisi · 2017-09-14T00:24:30Z

The plan is only to provide APIs to explicitly drop unknowns, for those who depend on that behavior. The default is for testing only. In 3.5 we will flip the default.

liujisi · 2017-12-11T20:55:55Z

All languages will be fixed in 3.5.x releases.

leighmcculloch · 2018-07-17T18:31:23Z

@liujisi Now that the direction has changed and support for preserving unknown fields has been added to some implementations, will this recommendation in the official proto3 documentation be changing?

Proto3 implementations can parse messages with unknown fields successfully, however, implementations may or may not support preserving those unknown fields. You should not rely on unknown fields being preserved or dropped. For most Google protocol buffers implementations, unknown fields are not accessible in proto3 via the corresponding proto runtimes, and are dropped and forgotten at deserialization time.

Ref: https://developers.google.com/protocol-buffers/docs/proto3#unknowns

acozzette · 2018-07-17T21:11:31Z

@leighmcculloch Good catch, I'll update that documentation to say that unknown fields are now preserved for proto3 messages as of version 3.5.

MalteJ · 2018-08-26T10:59:11Z

Is there a public method to detect if a deserialized message has unknown fields?

This would be useful to check a message which is coming from an untrusted source.
I do not want to relay the message to other services if I am not sure it complies with my proto format. Also, in my case I cannot reserialize it because the serialized message bytes are cryptographically signed (the serializer is not deterministic across different protobuf implementations).

I'm about to replace protobuf with JWT for this :(
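
One way to approach this without touching the signed bytes is to scan the serialized message for field numbers outside the expected schema before relaying it. The following is only a sketch under stated assumptions, not an API of any protobuf runtime: it uses the Python standard library alone, inspects top-level fields only (nested messages would need the same check applied recursively), and the name unknown_field_numbers is hypothetical.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def unknown_field_numbers(buf, known_fields):
    """Return the set of top-level field numbers in buf not listed in known_fields."""
    unknown = set()
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 7
        if wire_type == 0:        # varint
            _, pos = read_varint(buf, pos)
        elif wire_type == 1:      # fixed 64-bit
            pos += 8
        elif wire_type == 2:      # length-delimited
            length, pos = read_varint(buf, pos)
            pos += length
        elif wire_type == 5:      # fixed 32-bit
            pos += 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if pos > len(buf):
            raise ValueError("truncated field")
        if number not in known_fields:
            unknown.add(number)
    return unknown
```

A relay could reject any message for which this returns a non-empty set, or for which it raises ValueError, and forward the original bytes untouched otherwise.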

MalteJ · 2018-08-26T11:33:12Z

There are methods to get a list of unknown fields. But:

In Go, the field name suggests it should not be used ("XXX_unrecognized").
And the C++ docs say:

Get the UnknownFieldSet for the message.

This contains fields which were seen when the Message was parsed but were not recognized according to the Message's definition. For proto3 protos, this method will always return an empty UnknownFieldSet.

https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.message#Reflection.GetUnknownFields.details

dsnet · 2018-09-04T02:08:43Z

In Go, there is not currently a reliable way to programmatically interact with unknown fields. At best, you can use proto.DiscardUnknown to recursively discard all unknown fields. However, there is no stable API to iterate and/or modify the current set of unknown fields.

Furthermore, not all unknown fields are stored in XXX_unrecognized; unknown fields in the extension ranges are stored in proto.XXX_InternalExtensions. The current state of affairs is unfortunate, and we're working on v2 of the API, which will provide a stable way to read, modify, and write unknown fields.

kditrj2d · 2019-03-18T21:51:34Z

I'm coming to this party rather late... I've just upgraded a C# application that uses protobuf from version 3.4.0 to 3.6.1. The application relies on unknown fields not being preserved. Now by default they ARE preserved, and I've seen a significant and unacceptable increase in memory consumption. (The ratio of known to unknown fields is about 1:5.) There is mention here of APIs being available to explicitly discard the unknown fields, but it's not clear to me whether these were temporary and have now been removed or still exist. What is the current situation? Do these APIs still exist in the version 3.6.1 C# distribution? If so, where can I find details?

Xorlev · 2019-03-19T01:12:12Z

From my understanding (though I don't work on protobufs, I've just been a part of this thread for a long time), these APIs are here to stay -- you will be able to keep or discard unknown fields depending on your use case.

protobuf/csharp/src/Google.Protobuf/MessageParser.cs

Lines 333 to 340 in e479410

        /// <summary>
        /// Creates a new message parser which optionally discards unknown fields when parsing.
        /// </summary>
        /// <param name="discardUnknownFields">Whether or not to discard unknown fields when parsing.</param>
        /// <returns>A newly configured message parser.</returns>
        public new MessageParser<T> WithDiscardUnknownFields(bool discardUnknownFields) =>
            new MessageParser<T>(factory, discardUnknownFields);
    }

Appears to be what you want -- applied to a MessageParser, it returns a new MessageParser which discards/doesn't discard unknown fields.

kditrj2d · 2019-03-19T08:46:40Z

Thanks for the reply. Found it, tried it, code now works again.

xfxyjwf added the question label Jan 20, 2016

prashantv mentioned this issue Oct 10, 2016

Keeping unknown fields in structs around thriftrw/thriftrw-go#42

Open

heyitsanthony mentioned this issue Oct 27, 2016

client versioning etcd-io/etcd#6579

Closed

ibrt mentioned this issue Nov 1, 2017

proposal: encoding/json: preserve unknown fields golang/go#22533

Open

liujisi added the to close label Dec 11, 2017

jtattermusch closed this as completed Dec 11, 2017

scottlamb mentioned this issue Feb 12, 2018

Codegen: unknown fields should be retained tafia/quick-protobuf#93

Open

dsnet mentioned this issue May 3, 2018

XXX_unrecognized broke code golang/protobuf#594

Closed

leighmcculloch unassigned liujisi Jul 17, 2018

dependabot-preview bot mentioned this issue Sep 1, 2018

Bump github.com/golang/protobuf from 1.0.0 to 1.2.0 risdenk/calcite-avatica-go#4

Closed

kurtisnelson mentioned this issue Feb 13, 2019

Improvements around empty data handling uber/simple-store#18

Merged

thaJeztah mentioned this issue Mar 20, 2019

Bump protobuf 1.2.0, and re-generate moby/swarmkit#2837

Closed

andreamlin mentioned this issue Mar 29, 2019

Add ConfigV2 Validator googleapis/gapic-generator#2672

Merged

dsnet mentioned this issue May 23, 2019

I think the XXX_ field is real bad idea,what happen if i remove them from the code golang/protobuf#856

Closed

mapmeld mentioned this issue Jan 31, 2022

chore: Use nullable types in PatchExperiment [DET-6486] determined-ai/determined#3497

Merged

3 tasks

halibobo1205 mentioned this issue Feb 11, 2022

proto3 remove CodedInputStream.explicitDiscardUnknownFields halibobo1205/questions#6

Closed

synzhu mentioned this issue Sep 6, 2022

Implement ReputationManager ipfs/go-bitswap#581

Closed

yordis pushed a commit to yordis/protobuf that referenced this issue Dec 8, 2024

Update generated files and improve CI message (protocolbuffers#272)

841bca4
