Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature request] scoped enums with an option #1079

Closed
goldenbull opened this issue Dec 30, 2015 · 36 comments
Closed

[feature request] scoped enums with an option #1079

goldenbull opened this issue Dec 30, 2015 · 36 comments

Comments

@goldenbull
Copy link

Currently pb3 applies c98 enum scoping rule for enums. This is inconvenient when we need same Identifier in different enums. I hope there will be an new option for enums as follow:

enum Status {
  option scoped = true;
  START = 0;
  STOP = 1;
  PAUSE = 2;
}

enum Action {
  option scoped = true;
  NOP = 0;
  START = 1; // note this entry has a different value from Status::START
  STOP = 2;
}

With the annotation option scoped = true the compiler can use enum class for C++11, providing better code, and we users will feel more comfortable to use proper identifiers for enums. And this is totally compatible with current implementation, all existing code will not be disturbed.

@philsc
Copy link

philsc commented Jan 4, 2016

👍 I would love to have something like this! Even if I have to specify it as an option to the protoc invocation I'd be happy.

@lostindark
Copy link

I like this idea. The c98 enum scoping rule is super annoying when you want to use short enum values. Now you have to add prefix which made the enum very ugly in managed languages like Java/C#.

@goldenbull
Copy link
Author

One possible constraint is that some platforms do not have scoped enums support. For example, most embedded compiler do not support C++11 well.

Just for curiosity, besides C98 and object-C, are there any other languages which do not support scoped enums? I googled for some popular languages and seems all of them have scoped syntax.

@lostindark
Copy link

The solution to languages that doesn't support scoped enum is allow customize prefix for enum values:
option unscoped_enum_prefix = "eFoo";

@gegles
Copy link

gegles commented Jan 26, 2016

👍 I would love to have something like this as well!

@warrenfalk
Copy link
Contributor

Sorry, I deleted my comment, because I think I will try to implement option scoped instead.

But I think the opposite is true. "option scoped" will break existing generators that expect enum values to be unique at the enum's parent's level whereas "option auto_strip_enum_prefixes" can be safely ignored by all generators and then only implemented later. This principle of least breakage was why I had started going that route in the first place.

I've changed my mind only because I think that C++ will be the only generator that breaks with option scoped and I think that I can probably fix it, but I'm worried that the PR won't be accepted.

The question is what the C++ code generation constraints are. Is enum class allowed? Or should the C++ generator automatically add a prefix (the enum's name for example) when option scoped is encountered?

@goldenbull
Copy link
Author

For some embeded system, C98 is still widely used, so I think protoc should generate C98 compatible code. One possible choice is that the option scoped forces protoc to generate C++11 code, and user should be aware of this behaviour.

@warrenfalk
Copy link
Contributor

Turns out that current design makes option scoped quite a bit more difficult than it would seem. The problem is that the code that does the validation and throws the error happens prior to the code that interprets the options, and between these two things is an ominous comment that says "Note that the following steps must occur in exactly the specified order". I think that the validation code can be moved to a different function that occurs after parsing, but this will be quite a bit more difficult as the logic in that context will probably have to be rewritten. See the following code:

  // Convert children.
  BUILD_ARRAY(proto, result, message_type, BuildMessage  , NULL);
  BUILD_ARRAY(proto, result, enum_type   , BuildEnum     , NULL);
  BUILD_ARRAY(proto, result, service     , BuildService  , NULL);
  BUILD_ARRAY(proto, result, extension   , BuildExtension, NULL);

The scope validation happens in the BUILD_ARRAY for enum_type above and throws the error

  // Copy options.
  if (!proto.has_options()) {
    result->options_ = NULL;  // Will set to default_instance later.
  } else {
    AllocateOptions(proto.options(), result);
  }

  // Note that the following steps must occur in exactly the specified order.

The above comment seems to rule out any re-ordering of operations

  // Cross-link.
  CrossLinkFile(result, proto);

  // Interpret any remaining uninterpreted options gathered into
  // options_to_interpret_ during descriptor building.  Cross-linking has made
  // extension options known, so all interpretations should now succeed.
  if (!had_errors_) {
    OptionInterpreter option_interpreter(this);
    for (vector<OptionsToInterpret>::iterator iter =
             options_to_interpret_.begin();
         iter != options_to_interpret_.end(); ++iter) {
      option_interpreter.InterpretOptions(&(*iter));
    }
    options_to_interpret_.clear();
  }

We don't have options until this point.

I'm still going to give it a shot, but my C++ knowledge has atrophied over many years of disuse.

@warrenfalk
Copy link
Contributor

I think that "option scoped" should only signify that the .proto writer means the enum value names to be scoped as children of the enum rather than siblings. Then it would be up to each language generator to behave in whatever way is appropriate for its language given this option. For all generators, except C++, this means just using the enum values as they are because most languages already scope the values that way.

For C++ we could generate enum class but I don't think we should unless the user has explicitly opted into C++11 generation, which should be a separate option (probably to protoc as suggested in #67 where xfxyjwf has said that it probably won't happen). So I think the C++ generator should find a way to generate C98 code when the option is present, which I think means prefixing these values with the enum name, as this is the prevailing way to handle this situation when it's handled manually. And then if an option to allow C++11 is added later, this behavior can take that into account.

@lostindark
Copy link

The way to generate a C98 code for C++ is to add another option:
option unscoped_enum_prefix = "eFoo";
The name can be cpp_enum_prefix is only C++ is affected by this naming constrain (I'm not quite sure about other languages like Ruby/Go etc).

So it looks like this:

  1. No scope and unscoped_enum_prefix.option
    Existing behavior
  2. scoped option only
    generated C++11 enum class
  3. unscoped_enum_prefix option only
    Add prefix to C++ enum.

People can choose # 2 for newer code with C++11 support, choose # 3 to be compatible with C98 c++ compiler while enjoy the short enum name in other language, and choose # 1 for max compatibility with proto2.
I agree this seems introduce too much options for a simple feature, but I haven't figured out better and easier ways.

@warrenfalk
Copy link
Contributor

Options that are non-language-specific, like scoped should have non-language-specific meanings only, which the language generators should translate into code according to their language's semantics.

The meaning of scoped is that the enum value names are scoped to the enum and not the parent of the enum - this is a language-independent meaning.

A C# or Java generator can just ignore this option as their semantics already have this behavior.

A C++ generator that is limited to C98 just has to find a way to achieve the same purpose within the constraints of C98. A C++ generator that is not limited to C98 (perhaps in the future by default or by option) could use enum class.

In other words, a Java developer that adds option scoped should not have to know that this will make his .proto unusable by embedded C++ projects.

The idea of a cpp_prefix enum-only option seems fine and if present, a C++ generator limited to C98 can add the prefix and ignore scoped.

@goldenbull
Copy link
Author

How about option scoped_or_prefixed = true?
This is a language-neutral option. C#/Java will just simply ignore it, and C98 can use enum name as the prefix.

@philsc
Copy link

philsc commented Mar 16, 2016

Would it be hard to do something akin to what option cc_enable_arenas = true; does in .proto files?

In other words, would it be easier to add something like option cc_enum_classes = true; on a per .proto basis?

@goldenbull
Copy link
Author

Agree with @philsc , this option is better to be both enum level and proto file level.
Agree with @warrenfalk , the new option must be completely compatible with all existing assets, including .proto files, generated code, user application codes, and c98 compiler.
Seems adding one option enum_add_prefix is enough. For languages like C# and Java, just ignore this option. For C++ and other C98-semantic languages, enum name are automatically add to enum member as prefix. This option can be both proto level and enum level.
How about this idea? I will see if this can work with the current protoc code.

@warrenfalk
Copy link
Contributor

I agree with @goldenbull to have an option for C98-semantic languages to accommodate and others to ignore, but I still agree with @goldenbull's original name of option scoped because it doesn't demand how the value name scoping occurs, only that it should. So C# and Java who already "scope" enum values can ignore the option, but to ignore something like option enum_add_prefix without actually adding a prefix seems to violate the principle of least surprise.

I also like a proto-file-level option in addition to the enum option but would suggest the name option scoped_enums for the name for the same reasons. However, a proto-file-level option appears to be even harder than the enum-level option because of the current design. I'm not yet sure either is feasible1.

@philsc's idea of cc_enum_classes is a great idea, but I think separate. Cleaner C++ code is only half of the issue. The other half is the .proto C98 semantics which forces C# and Java code that looks like this:

Message.Value = MyEnumWithALongName.MY_ENUM_WITH_A_LONG_NAME_VALUE;

And so I think that cc_enum_classes should be used with option scoped to effectively allow C++ to also ignore option scoped like other non-C98 languages. But I suggest this should be an argument to protoc rather than a .proto because that allows embedded C++ devs to generate code compatible with their environment without editing files.


1 The problem is that the system is rigged to look for enums by name where the name is namespace.VALUE instead of namespace.Enum.VALUE and it does this before the options are fully parsed. This makes it exceedingly difficult to have an option that changes the behavior of naming.

@xfxyjwf
Copy link
Contributor

xfxyjwf commented Mar 16, 2016

Hi, thanks for these thoughtful discussions and sorry for not chiming in sooner. As @warrenfalk already discovered, in protobuf implementation, an enum value is referenced as "namespace.VALUE" rather than "namespace.Enum.VALUE". When we discuss languages with/without scoped enum support, you probably don't think proto itself as a language, but it kind of is and most unfortunately it doesn't support scoped enums. Every element defined in a .proto file has a globally unique fully qualified name, and for enum values, the fully qualified name doesn't have the enum type's name in there. That's why enum values in two difference enum types in the same scope cannot have the same short name regardless of whichever programming language you are generating the code for, or whichever options are specified in .proto file. Changing this would be infeasible, because protocol compiler is not the only tool that parses .proto files and we actually have persistent data containing things like "namespace.VALUE". It's simple to change the protobuf implementation in this github site to do whatever we want, but we won't be able to push this change to Google's internal code base. Anyone who tries to use the same short name for two enum values within the same scope is likely to cause breakages to various parts of Google's systems here and there. It's not uncommon to see the use of a new proto option to cause breakages and has to be rollbacked. Let alone a more fundamental and incompatible change to the proto language.

I am sorry to say this, but "option scoped = true" as proposed in @goldenbull 's first post (and similar ideas to make protoc allow same short name for different names) cannot be supported.

The other ideas like "option auto_strip_enum_prefixes" is more feasible but I guess it's also far less appealing.

@warrenfalk
Copy link
Contributor

So basically an option will never be allowed to change the semantic output of the .proto file because readers need to be able to ignore options and still understand the semantic meaning of the protocol buffer definitions.

This is most unfortunate, indeed, as in retrospect this C98 scoping behavior might be the biggest flaw in the design.

It's a shame this wasn't addressed as part of proto3. Is something like this possible in any future version, such as proto4?

By the way, for anyone that cares, I was able to get option scoped working in protoc with all tests passing if it's any use to anyone. It's in the "scoped" branch on my fork, if anyone is interested in it. I will now go back to looking into auto_strip_enum_prefixes.

Also by the way, the idea to have a corresponding auto_prefix option won't work for the same reason that scoped won't. So you will have to manually prefix your enum values and then rely on the option to signal to language generators that it is OK to strip the prefixes. Note that if C++11 is enabled somehow, even the C++ generator can generate enum class and strip prefixes too.

@xfxyjwf , you suggested in #67 that you did not like the idea of giving the C++ generator permission to use C++11 features via an argument to protoc. Can you provide more insight into that? It would be very convenient for a language generator to know that the user is OK with some C++11 features. Having this would allow the generator to emit enum class in response to auto_strip_enum_prefixes for example, while not impacting those still limited to C98. I'd be interested in your thoughts on that.

@goldenbull
Copy link
Author

How about workaround this way:
For each enum member, we can define a "friendly scoped alias name" in option, so C#/Java generator can use this alias as field name

enum Status {
  Status_START = 0 [scoped_alias="Start"];
  Status_STOP = 1 [scoped_alias="Stop"];
  Status_PAUSE = 2 [scoped_alias="Pause"];
}

@goldenbull
Copy link
Author

Thanks to the heated discussion, I realized that what I actually want is just more pretty readable code. Any possible solution should not change the existing semantics, but to provide something like syntax sugar. Will the option scoped_alias break or change the proto descriptor? Will it cause trouble to have a different field/property name in C#/Java?

@warrenfalk
Copy link
Contributor

scoped_alias and auto_strip_enum_prefixes are not mutually exclusive, and I'm not opposed to having both, but scoped_alias on its own seems needlessly verbose in most cases. If you have to prefix your enum values anyway and you can prefix them predictably then why not have them stripped automatically? Personally, to me, this is far preferable.

But I'm not against allowing either option or even both at the same time. When both present, I think scoped_alias should override auto_strip_enum_prefixes.

What I know is that scoped_alias can't change the actual name of the enums, because no option can do that. What it can do is offer any generator the ability to use a different name when generating code. I'm confident that if auto_strip_enum_prefixes is allowed to do this, then scoped_alias can too, with one possible caveat being that scoped_alias can be abused to make confusing code such as:

enum Status {
  Status_START = 0 [scoped_alias = "STOP"];
  Status_STOP = 1 [scoped_alias = "START"];
}

Creating a scenario in which Status.STOP in Java means the same thing as Status_START in C++. But I'm assuming that if a developer wanted to do something like that, protobuf wouldn't stop him.

What I don't yet know is how this affects reflection. I know that C++ generated code and reflection would just use the original value name ignoring scoped_alias. I know that Java could use the scoped_alias value in code, but I don't know if it could also use it in reflection. I.e. if you use the generated reflection code, I am not sure if it could be made to see it as "Start" or would have to see it as "Status_START". Perhaps @xfxyjwf can answer that.

And in the interest of thoroughness, I see in @goldenbull's example, he also made his aliases Pascal-case (which is C# convention that I'd also like to support in the C# generator). I have been thinking about a global option to try to allow that automatically, but I expect such an option to be language specific. I also prefixed the values with the exact name of the enum Status. The code I'm working on for auto_strip_enum_prefixes will compare more forgivingly. If your enum is named MyEnum, then auto_strip_enum_prefixes will remove MyEnum, MyEnum_, My_Enum_, MY_ENUM_, etc. (case and underscores are ignored).

@goldenbull
Copy link
Author

In most cases auto_strip_enum_prefixes is enough, however scoped_alias can provide even stronger ability, for example, to define an enum member in Chinese characters. Maybe this sounds over-design, but in the projects I'm working on, it's really useful to have Chinese name for each enum member. But seems proto file is ASCII encoded only, so maybe it's impossible to have UTF8 identifiers.

@warrenfalk
Copy link
Contributor

@goldenbull, that makes sense to me. I am planning to add scoped_alias to my work on auto_strip_enum_prefixes as long as I don't encounter any resistance in code or project owners.

@goldenbull
Copy link
Author

@warrenfalk great, so I can be lazy and wait for you 😄

@warrenfalk
Copy link
Contributor

@goldenbull, I have implemented auto_strip_enum_prefixes and scoped_alias for both C# and Java, and the Java output looks good to me, but I am not able to test it until later. Is that something you'd mind testing out? It's in the auto_strip_enum_prefixes branch in my fork. If not, it's no big deal, I'll just submit the PR after I get a chance to test it.

Eventually I will add js support, too, and maybe some others.

@goldenbull
Copy link
Author

@warrenfalk great, I just can not wait to test 😄

@goldenbull
Copy link
Author

@warrenfalk Here is my testcase

syntax = "proto3";
package TestEnumPrefix;
option csharp_namespace = "NsTestEnumPrefix";
option auto_strip_enum_prefixes = true;
enum EType {
    etYpe_name1 = 0; // should be stripped to "name1"
    eT_y_pe_name2 = 1; // should be stripped to "name2"

    Name3 = 10; // should be "Name3"
    eTypeName3 = 11; // should not strip prefix, conflict with above line
    EType_Name3 = 12; // should not strip neither

    etype_Name4 = 20; // should not strip, conflict with following line
    Name4 = 21; // should be "Name4"

    this_is_a_very_very_long_enum_member_name = 30 [scoped_alias="ShortName"]; // alias = ShortName
    this_is_another_very_very_long_enum_member_name = 31 [scoped_alias="ShortName2"]; // alias = ShortName2
}

enum enum_type_test
{
    EnumTypeTest_v1 = 0; // should be "v1"
    Enum_Type_Test_v2 = 1; // should be "v2"
    enum_type_test_v3 = 2; // should be "v3"
}

message MyClass
{
    EType v1 = 1;
    enum_type_test v2 = 2;
}

And this is the generated code:

  public enum EType {
    /// <summary>
    ///  should be stripped to "name1"
    /// </summary>
    name1 = 0,
    /// <summary>
    ///  should be stripped to "name2"
    /// </summary>
    name2 = 1,
    /// <summary>
    ///  should be "Name3"
    /// </summary>
    Name3 = 10,
    /// <summary>
    ///  should not strip prefix, conflict with above line
    /// </summary>
    eTypeName3 = 11,
    /// <summary>
    ///  should not strip neither
    /// </summary>
    EType_Name3 = 12,
    /// <summary>
    ///  should not strip, conflict with following line
    /// </summary>
    etype_Name4 = 20,
    /// <summary>
    ///  should be "Name4"
    /// </summary>
    Name4 = 21,
    /// <summary>
    ///  alias = ShortName
    /// </summary>
    ShortName = 30,
    /// <summary>
    ///  alias = ShortName2
    /// </summary>
    ShortName2 = 31,
  }

  public enum enum_type_test {
    /// <summary>
    ///  should be "v1"
    /// </summary>
    v1 = 0,
    /// <summary>
    ///  should be "v2"
    /// </summary>
    v2 = 1,
    /// <summary>
    ///  should be "v3"
    /// </summary>
    v3 = 2,
  }

...
    /// <summary>Field number for the "v1" field.</summary>
    public const int V1FieldNumber = 1;
    private global::NsTestEnumPrefix.EType v1_ = global::NsTestEnumPrefix.EType.name1;
    public global::NsTestEnumPrefix.EType V1 {
      get { return v1_; }
      set {
        v1_ = value;
      }
    }

    /// <summary>Field number for the "v2" field.</summary>
    public const int V2FieldNumber = 2;
    private global::NsTestEnumPrefix.enum_type_test v2_ = global::NsTestEnumPrefix.enum_type_test.v1;
    public global::NsTestEnumPrefix.enum_type_test V2 {
      get { return v2_; }
      set {
        v2_ = value;
      }
    }
...

Seems everything works fine! Great job! 👍

@warrenfalk
Copy link
Contributor

Awesome, thanks, I'll submit the PR

@goldenbull
Copy link
Author

@warrenfalk string type in proto can not recognize non-ASCII characters, how about bytes type?

@warrenfalk
Copy link
Contributor

@goldenbull, do you mean make the scoped_alias option be of type bytes? I might be misunderstanding you; the following seems to already work. Did you have something else in mind, or a specific example?

enum MyOtherEnum {
    MyOtherEnumUnspecified = 0 [scoped_alias = "\U0000672A\U00006307\U00005B9A"];
    MyOtherEnumOne = 1;
    MyOtherEnumTwo = 2;
}

Generates Java code like this:

public enum MyOtherEnum
      implements com.google.protobuf.ProtocolMessageEnum {
    /**
     * <code>未指定 = 0</code>
     */
    未指定(0, 0),
    /**
     * <code>One = 1</code>
     */
    One(1, 1),
    /**
     * <code>Two = 2</code>
     */
    Two(2, 2),
    UNRECOGNIZED(-1, -1),
    ;
    /* ... */
}

@goldenbull
Copy link
Author

@warrenfalk yes, string is enough with \U encoding. I didn't know it before...

@goldenbull
Copy link
Author

hi @warrenfalk , could you tell me how to generate descriptor.pb.h and descriptor.pb.cc from descriptor.proto? I tried protoc but the generated descriptor.pb.h doesn't have the LIBPROTOBUF_EXPORT macro. 😞

@goldenbull
Copy link
Author

oyeah I got it

protoc.exe -I.  --cpp_out=dllexport_decl=LIBPROTOBUF_EXPORT:.  google\protobuf\descriptor.proto

@khingblue
Copy link
Contributor

To whom might be interested, I just created a new repository for C++11 feature support in Protobuf:
https://github.com/khingblue/protobuf-cpp11

The first feature is based on @warrenfalk's change to support scoped enum.

@liujisi liujisi closed this as completed Mar 6, 2017
@timotheecour
Copy link

see workaround here: #67 (comment)

@TehBakker
Copy link

So did anything came out of this and was merged ?

@git-torrent
Copy link

Any updates on this? We have 2022 and embedded compilers are likely C++11 positive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests