Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generated module names contain spurious (er, superfluous) underscores #81

Closed
nolanamy opened this issue Apr 2, 2013 · 12 comments
Closed
Assignees

Comments

@nolanamy
Copy link

nolanamy commented Apr 2, 2013

Using a package name like package greeting_api; should generate code in a module like module GreetingApi. Instead, the underscore remains and we get module Greeting_Api.

Source: https://gist.github.com/nolanamy/5290516
Generated: https://gist.github.com/nolanamy/5290512

Is a continue needed at https://github.com/localshred/protobuf/blob/master/ext/ruby_generator/RubyGenerator.h#L128?

@localshred
Copy link
Contributor

Ironically, issue #56 is almost the exact opposite of this request. We used to ignore underscores in a namespace/package, but now we support them.

You are correct on the continue line, it was the only thing changed to fix #56.

I guess one way we could approach this is to remove underscores in a package name and leave them in a message name. This kind of feels weird to me, to have a double standard on "constantizing" a token.

What do you think?

@nolanamy
Copy link
Author

nolanamy commented Apr 2, 2013

Interesting! I agree, ideally, there's only one constantizing standard. In particular, message names must be constantized the same when they're declared and when they're used (see #82). Though it's hard to say what's the right thing to standardize on... Here are some options:

  • Only convert . to ::. This will keep things consistent (without requiring constantization of class names), but non-capitalized message names will cause syntax errors when the generated file is used.
    • The Style Guide specifies CamelCase (with an initial capital) for .proto message names, so in theory there should be no need to CamelCasify them.
  • Capitalize the first letters of segments, but leave underscores in place. This will require constantizing class names as well (to resolve underscore_separated message names are constantized inconsistently #82).
  • Capitalize first letters and remove underscores. Again, the same scheme will have to be applied to class names to keep it consistent.
  • Capitalize first letters and only remove underscores when followed by a lowercase letter. More complication than it's worth, I'd bet.

@nolanamy
Copy link
Author

nolanamy commented Apr 2, 2013

For now, I'll just use package GreetingApi to get the effect I was looking for. I'm using the same protobuf file in Java, so this might cause problems there, but it looks like option java_package = ... mostly overrides the package declaration...

It turns out Java uses the file name in much the same way that Ruby uses the package declaration; I was using the package declaration (and making it the same as the file name) to get the generated Ruby code to have more or less the same structure as Java.

@ghost ghost assigned localshred Apr 2, 2013
@localshred
Copy link
Contributor

tl;dr We should use ActiveSupport#camelize style camelization as our standard and fail when multiple separate messages compile to the same class name.


I agree we need to come up with a standard. In the ruby community the standard from my perspective would be the ActiveSupport camelize and underscore methods available on String instances. Yes it is from Rails, but there really isn't a more standard library most ruby developers are familiar with. I did a quick and dirty irb session to map out source strings and their camelized equivalents and got this table:

+-----------------+------------------+
| source          | camelized        |
+-----------------+------------------+
| simple_mail     | SimpleMail       |
| simple_Mail     | SimpleMail       |
| Simple_Mail     | SimpleMail       |
| simpleMail      | SimpleMail       |
| SimpleMail      | SimpleMail       |
| SimpleMAIL      | SimpleMAIL       |
| SIMPLE_MAIL     | SIMPLEMail       |
| SIMPLEMAIL      | SIMPLEMAIL       |
| simple_mail.foo | SimpleMail::Foo  |
+-----------------+------------------+

Simply adding the continue back on underscore chars, we would support all but the SIMPLEMail example. Weirdly the regex they are using in activesupport behaves that way.

Now, at this point you probably see the problem we have just uncovered. It existed all along, but working on this I realized the issue. You'll notice that the first 5 entries in the table all have different source tokens, but the ruby equivalent camelized class name is the same.

This presents us with a conundrum of sorts. We can either a) Fail when a camelized class name has already been taken by a class in that namespace; or b) Ensure first characters of message names are upcase to conform to ruby syntax for class constants and leave the rest of the token alone.

On the one hand, the compiler/generator obviously allows each of the source cases listed above since in c++ those tokens are entirely valid within that namespace, none of them step on each others toes.

On the other hand, by enforcing the ruby way of class naming with CamelCase (say by raising a compiler error when more than one message compiles with the same ruby class name), we potentially wreck a developers ability to name things as they would like. This reason is actually why in #56 I was ok with allowing class names to contain variables.


I feel like common sense will have to rule over flexibility here. Yes the cpp compiler allows you to define a simple_mail message and a Simple_Mail message, but you should never do that, it would be too confusing. Thus having the compiler enforce this message overlap issue by raising an error is acceptable to me.

Sorry for the brain dump, but you obviously got me thinking. Do you have any thoughts or is this proposal acceptable to you?

@liveh2o
Copy link
Contributor

liveh2o commented Apr 3, 2013

My two cents: I think it's very reasonable that for the purposes of compiling to Ruby, case in package names shouldn't matter. Most people will pick one of the styles you listed and stick with it, so I think failing when a class name has already been taken is a sane approach.

@nolanamy
Copy link
Author

nolanamy commented Apr 3, 2013

I'm tempted vote for leaving the token entirely alone and raising a compiler error if the first character is lowercase.

  • requires less work on our part (no need to search for name collisions)
  • simpler for the developer
    • error just tells them that ruby doesn't support lowercase class/message names
    • ruby tokens are identical to the .proto tokens (no mysterious name changes)
  • closest to what we're currently doing (requiring fewer changes to developers' ruby code)
    • when we change how we massage tokens, we'll break developers' existing ruby code (that's how I ended up submitting this issue!)
  • any proto source that works with the current generator and doesn't raise ruby syntax errors will still work fine
    • with camelize, some protos that work now will stop working when poorly-named messages collide

The danger in this approach is that existing .proto files (with lowercase names) that work with Java/C++ will require changes to work with the ruby protobuf generator; keep in mind, though, that they've never worked in ruby before.

The only way to avoid having to change proto files to work with ruby is to massage proto tokens one way or another, at the very least capitalizing the first letter. There are two main problems with changing tokens in the generator:

  1. Breaking proto code (due to token collisions)
    • proto files that used to compile will stop working if their names collide
    • the more we massage tokens, the more likely these collisions are (see @localshred 's camelize table above)
  2. Breaking ruby code (due to token changes)
    • code that works with the current generated code will break if tokens start getting changed (using camelize would break @dekz 's ruby code re. Underscore and references #56)
    • like collisions, the more we massage, the more likely we'll break someone's code (just capitalizing the first letter won't break the Underscore and references #56 code)

So, in my opinion, the less we change tokens, the better. If we do it at all, we should only capitalize the first letter. But I think leaving tokens entirely alone is the best solution. Minimal broken ruby code, and the only proto files that don't work never worked anyway.

@localshred
Copy link
Contributor

Thanks @nolanamy and @liveh2o for your suggestions, it's nice to have a few minds thinking through the implications.

I feel like the solution proposed by @nolanamy is the right way to go, that is, no token modification whatsoever. This obviously can lead to un-runnable ruby code if message/enum/service names do not have a capital first letter. I think in this case I'll simply raise an error. The compiler won't change any tokens in your generated ruby code, but it won't let you shoot yourself in the face either by generating bad code. It will guide you to runnable ruby definitions.

There is another artifact of token changing in the current compiler: we underscore any rpc method names. So if you defined rpc SomeMethod (Foo) returns (Bar);, the method would end up being compiled to some_method. Indeed, the server/service code in the gem also enforces that the method name is underscored when a request is received. Going with the policy of zero token manipulation, I would remove this change as well (in both the compiler and runtime server code).

And finally, @nolanamy you are correct that whatever change we make will break code. As such, my preference is not to release this with the 2.7 branch, but wait to make this change when we can release version 3 (which is coming). That way we can appropriately message gem users and an expectation of possible breakage will be understood.


So, to recap:

  1. No token manipulation, period.
  2. Compiler raises an error if message, service, or enum names begin with lower case.
  3. Pushed with version 3.

Everyone cool with this?

@nolanamy
Copy link
Author

nolanamy commented Apr 8, 2013

Sounds pretty good to me.

The only thing that irks me is that RPC method names are supposed to be CamelCased in proto code but ought to be snake_cased in ruby code. For everything else - Messages, Fields, and Enums - the protobuf style guide specifies a ruby-friendly style:

Proto type Protobuf style ruby style ruby type
Message CamelCase CamelCase class
Field snake_case snake_case variable
Enum types CamelCase CamelCase class
Enum values SCREAMING_SNAKE SCREAMING_SNAKE constant
RPC method CamelCase snake_case method

@liveh2o
Copy link
Contributor

liveh2o commented Apr 8, 2013

👍

@localshred
Copy link
Contributor

Yes @nolanamy I agree about the annoyance that their style guide is in conflict with ruby's, but I'd rather opt for not modifying tokens at all. Let people define things how they will. If they are rubyists they will probably define the methods as snake case anyways.

@nolanamy
Copy link
Author

nolanamy commented Apr 8, 2013

👌

@localshred
Copy link
Contributor

Resolved under > 2.8.x.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants