Avoid unnecessary boxing with String.Concat #415

Merged
merged 1 commit into from
Feb 16, 2015

stephentoub
Member

When string concatenation encounters non-strings (e.g. path + '/', name + someInt32, etc.), it calls overloads of String.Concat that accept objects. These overloads then just call ToString on each of the objects, mapping to the C# spec which states that "any non-string argument is converted to its string representation by invoking the virtual ToString method inherited from type object." When any of the individual items being concatenated is a value type, that object first gets boxed to be passed to String.Concat as an object, only to then have ToString called on the boxed object.

This commit changes the local rewriter for string concatenation to test whether an argument is a value type, and if it is, to call ToString on it directly, rather than first boxing it. This can then affect which overload of Concat is used, as the type of the argument has changed to be String. The primary benefit of this is saving the allocation per value-type item. There are some secondary benefits, as well; for example, as there is a four-string overload of Concat but no four-object overload of Concat, if this optimization is able to force all of the items to be strings, the four-string overload can be used rather than allocating an object array for the inputs and then another string array for the resulting strings (inside of Concat).
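As a rough illustration of the paragraph above, the following sketch writes out by hand the two call shapes for `path + '/'`. The class and method names are hypothetical, and the explicit calls only approximate the lowered form; they are not the rewriter's actual output.

```csharp
using System;

static class ConcatLoweringSketch
{
    // Approximates the lowering before the change: the char is boxed so that
    // it can be passed to String.Concat(object, object).
    public static string Boxed(string path, char separator)
    {
        return string.Concat((object)path, (object)separator);
    }

    // Approximates the lowering after the change: ToString is called on the
    // char directly, so the String.Concat(string, string) overload applies.
    public static string Direct(string path, char separator)
    {
        return string.Concat(path, separator.ToString());
    }

    static void Main()
    {
        Console.WriteLine(Boxed("usr", '/'));
        Console.WriteLine(Direct("usr", '/'));
    }
}
```

Both methods produce the same text; the difference is only in the allocation for the boxed char and in which overload of Concat is selected.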

(Beyond this optimization, there's the potential for another, which converts literal char arguments to be literal string arguments, avoiding the need for ToString's allocation altogether. But as that doesn't stick to the letter of the spec as I read it, I've not done so.)

There are lots of places where such an optimization is beneficial. Just searching through the Reference Source, there are many examples where code is unnecessarily boxing due to this. Just a few examples:

Roslyn itself bumps up against this, for example:

Etc.

{
// The lowered expressions coming in will be boxing conversions if the input is a value type.
// When passed to String.Concat, the resulting object will have ToString called on it.
// We can optimize away the boxing allocation by just calling ToString on the value type directly.

It looks like a breaking change and a deviation from C# spec in some cases. If the value type is mutable and its method ToString() is mutating, then currently this mutation is not observable because it's made on a boxed copy. With this change it will directly mutate the value, that can have observable effects. This looks like a safe optimization only for known immutable value types like int, Guid, DateTime etc.
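A minimal sketch of the concern raised here, assuming a hypothetical mutable struct named `Counter` whose ToString changes its own state. It only contrasts a ToString call on a boxed copy with a call on the variable itself; it does not reproduce the compiler's lowering.

```csharp
using System;

// Hypothetical mutable struct whose ToString mutates the instance.
struct Counter
{
    public int Calls;

    public override string ToString()
    {
        Calls++;
        return "Counter";
    }
}

static class MutatingToStringSketch
{
    // Boxing copies the struct, so ToString mutates only the boxed copy.
    public static int CallsAfterBoxedToString()
    {
        Counter c = new Counter();
        object boxed = c;
        boxed.ToString();
        return c.Calls;
    }

    // Calling ToString on the variable mutates the variable itself.
    public static int CallsAfterDirectToString()
    {
        Counter c = new Counter();
        c.ToString();
        return c.Calls;
    }

    static void Main()
    {
        Console.WriteLine(CallsAfterBoxedToString());  // 0
        Console.WriteLine(CallsAfterDirectToString()); // 1
    }
}
```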

Member Author

Good point. The majority of the cases I've seen this occur with are in fact with chars, ints, bytes, and just a handful of other core types, so I think the decrease in value for the change would be fairly negligible. If you agree this is worthwhile to pursue, I'll go ahead and make that change...?

Yes, I think we should use a whitelist of most common immutable value types. Any breaking changes should be first approved by C# LDM or Roslyn Compat Council.

I'll go and add a test that asserts the current behavior.

Member

@VladimirReshetnikov do you consider it a breaking change to call ToString on the immutable value types? Or are you just referring to the current behavior which would call it on any struct?

@jaredpar
Member

CC @gafter for his feedback.

The change itself looks good modulo @VladimirReshetnikov's comment about mutations.

@stephentoub
Member Author

Thanks, guys. I've updated the PR based on @VladimirReshetnikov 's feedback.

// As such, we special case core types we know to be safe (and the most
// important ones based on usage, anyway).

switch (operand.Type.SpecialType)

I think we can safely add enums, Guid, DateTimeOffset, TimeSpan, and nullable versions of all whitelisted types.

Member Author

Do all enums map to SpecialType.System_Enum? I wasn't sure if that referred to the base Enum type or to any enum.

No, SpecialType.System_Enum indicates the abstract class System.Enum. Enums (concrete value types) have their property TypeKind == TypeKind.Enum
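The same distinction can be seen with plain reflection, which may be easier to try than the compiler's symbol API. This is only an analogy and an assumption on my part: it uses `System.Type`, not Roslyn's `SpecialType` or `TypeKind`, and the class name is hypothetical.

```csharp
using System;

static class EnumKindSketch
{
    // True only for concrete enum types such as DayOfWeek.
    public static bool IsConcreteEnum(Type type)
    {
        return type.IsEnum;
    }

    static void Main()
    {
        Console.WriteLine(IsConcreteEnum(typeof(DayOfWeek))); // True
        Console.WriteLine(IsConcreteEnum(typeof(Enum)));      // False
        Console.WriteLine(typeof(Enum).IsValueType);          // False
    }
}
```

System.Enum itself is an abstract class, so a check aimed at the base type does not match the concrete enums that derive from it.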

@stephentoub
Member Author

Updated based on the additional feedback.

/// <summary>
/// Checks if a type is an immutable value type primitive.
/// </summary>
public static bool IsImmutableValueTypePrimitive(this SpecialType specialType)
Member

Do you really need this method? If it is a value type and has any SpecialType, then it should be safe. No?

Member

@gafter That places an undocumented restriction on the future expansion of the SpecialType enumeration. While it's unlikely to ever be a problem, that's certainly a dangerous position to put yourself in when it can be avoided.

Member

Agree with @sharwell

Better to find a place down the road where we forgot to do this optimization vs. finding out we accidentally started doing it to a type where it is not safe.

Member Author

Sounds like I should leave it as is for now. Thanks.

Member

Are we really trying to "defend" against the possibility that we'll add a platform special value type that has a mutating ToString() method? That seems really, really unlikely. I'd rather see this method removed and have the simpler test in the caller.

Member Author

@gafter, I don't feel strongly about it. I removed the extension per your feedback.

@pharring
Contributor

Nice! 👍

// important ones based on usage, anyway).

TypeSymbol typeSymbol = operand.Type;
if (typeSymbol.SpecialType.IsImmutableValueTypePrimitive() || typeSymbol.TypeKind == TypeKind.Enum)

Consider including TimeSpan, DateTimeOffset, Guid, Version and Nullable<T> for immutable T.

Member Author

It's a good suggestion. I was aiming to keep this fairly lean; is it cheap to look up such arbitrary types? If so, I can do so; otherwise, I'd suggest we leave this as is for now, and it can always be updated later with more types if necessary.

@stephentoub
Member Author

Thanks, all. I rebased/squashed to address merge conflict with @VladimirReshetnikov 's additional test so that the Merge pull request button works in the GitHub UI.

@jaredpar
Member

LGTM as well.

JaredPar from a phone
http://blog.paranoidcoding.com/


From: Vladimir Reshetnikov
Sent: Wednesday, February 11, 2015 7:56:00 PM
To: dotnet/roslyn
Cc: Jared Parsons
Subject: Re: [roslyn] Avoid unnecessary boxing with String.Concat (#415)

[:+1:]


@aelij
Contributor

aelij commented Feb 12, 2015

While a bit out of scope for Roslyn, I think it might also be beneficial to have a generic version of String.Concat with up to 3-4 type parameters, which would prevent boxing as well. The same idea could be applied to String.Format.

@sharwell
Member

Does this change result in different output for code like the following?

string s = "Value: " + 3.1 + (Thread.CurrentCulture = {some other culture});

In the past, the 3.1 would be boxed, then the culture would be set, and finally the 3.1 would be converted to a string using the new culture. Following this change, the 3.1 would be converted to a string using the old culture.

@svick
Contributor

svick commented Feb 12, 2015

In expressions like path + '/', I think there is still one unnecessary allocation, even with this change: the one-character string returned by char.ToString().

Since I think such code is very common, would it make sense to add special overloads of string.Concat() just for that (e.g. string.Concat(string, char) and string.Concat(char, string))? Or is there some other way to avoid the allocation (interning?)?

@stephentoub
Member Author

@svick: Yes, I allude to that in my original PR description, at least in the case of literal chars. In your mentioned case, I think there's a limit to how many such overloads you can reasonably add to the libraries. That'd be two overloads for the two-parameter version (string/char, char/string), 6 for the three-parameter version (string/string/char, string/char/string, char/string/string, string/char/char, char/string/char, char/char/string), etc., and that's just for the case of char; similar needs arise for Int32 and so on. @aelij above mentioned adding generic overloads; that would help to avoid the boxing (at the expense of potentially needing many generic instantiations), and is something folks like @KrzysztofCwalina are considering for this and other such cases for the future, but it also wouldn't address the string allocation you allude to (unless type tests were done in the implementation to special case each of the specific cases, which would theoretically be a possibility).

@sharwell: Yes, that would change the formatting; this can be seen with the following:

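using System;
using System.Globalization;
using System.Threading;

var dot = CultureInfo.GetCultureInfo("en-US");
var comma = CultureInfo.GetCultureInfo("fr-FR");
Thread.CurrentThread.CurrentCulture = dot;

// Boxed operand: String.Concat calls ToString after the culture has been switched.
Console.WriteLine("Value: " + (object)3.1 + ((Thread.CurrentThread.CurrentCulture = comma) != null ? "" : ""));
Thread.CurrentThread.CurrentCulture = dot;
// Direct call: ToString runs before the culture is switched.
Console.WriteLine("Value: " + 3.1.ToString() + ((Thread.CurrentThread.CurrentCulture = comma) != null ? "" : ""));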

prints

Value: 3,1
Value: 3.1

So, yes, there is a potential behavioral change here. The thought of someone changing culture in the middle of string concatenation bothers me even more than the thought of someone mutating a mutable struct in the middle of a ToString call :), but nevertheless it is a behavioral change. @theoy, @pharring, thoughts? Up to you guys what we do here. I know such corner-case behavioral changes have been taken for Roslyn... thoughts on this one?

@svick
Contributor

svick commented Feb 12, 2015

@stephentoub Yeah, I agree that adding all those overloads would be way too much. But I think that having just the two two-parameter overloads would improve the common case while not adding too much.

and that's just for the case of char; similar needs arise for Int32 and so on

How come? How can you get the characters of Int32 without calling its ToString()?

@stephentoub
Member Author

How can you get the characters of Int32 without calling its ToString()?

To avoid allocations you would just need to be able to determine its resulting length, and you can determine the number of base-10 digits in an Int32 with some math, at least for invariant culture (it's possible there's some culture I'm not aware of where this wouldn't hold.)
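A minimal sketch of the "some math" mentioned above, assuming invariant-culture formatting with a plain '-' sign and no digit grouping. The class and method names are hypothetical, and this is not how String.Concat is implemented.

```csharp
using System;

static class Int32LengthSketch
{
    // Number of characters in the invariant-culture decimal form of value,
    // including a leading '-' for negative numbers.
    public static int FormattedLength(int value)
    {
        long magnitude = Math.Abs((long)value); // long avoids overflow for int.MinValue
        int length = value < 0 ? 1 : 0;
        do
        {
            length++;
            magnitude /= 10;
        } while (magnitude > 0);
        return length;
    }

    static void Main()
    {
        Console.WriteLine(FormattedLength(0));            // 1
        Console.WriteLine(FormattedLength(-42));          // 3
        Console.WriteLine(FormattedLength(int.MinValue)); // 11
    }
}
```

With the length known up front, a hypothetical overload could size the result once and write the digits into it, rather than allocating an intermediate string for the number.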

@pharring
Contributor

Goodness, changing the culture as a side-effect of a string concat!
I'm OK with the change in behavior, although it's worth adding a unit test to show that we've thought about it. @VSadov, @theoy what do you think?

To steal an expression from someone else: @sharwell just Nikoved it (in reference to @VladimirReshetnikov, who is the person most likely to come up with bizarre edge cases).

@jaredpar
Member

Bizarre indeed.

The C# spec seems to support @stephentoub change over the current behavior. It implies the ToString call will happen as a part of the concatenation operation instead of as an implementation detail of String.Concat

The operands are converted to the parameter types of the selected operator, and the type of the result is the return type of the operator.
Otherwise, any non-string argument is converted to its string representation by invoking the virtual ToString method inherited from type object.

It wouldn't be the first time we deviated from the spec though.

@jaredpar
Member

Why did we remove the method to explicitly check the SpecialType flag? That behavior is much more future safe than the current approach.

@stephentoub
Member Author

@jaredpar, @gafter's strong feedback. I'm ok with whichever approach; someone from Roslyn just needs to tell me which direction to go :) I can undo the most recent commit if that's the desired direction.

@sharwell
Member

The C# spec seems to support @stephentoub change over the current behavior.

I completely agree, and I see I failed to mention this in my reply. The unfortunate part of this is we are now following the spec, but only for a select few types. The way I interpret the spec is we should only ever call the Concat overloads which take string arguments when optimizing string concatenations. The naive method is for each element of the string concatenation:

  1. Load the value.
  2. If the value is a value type, box the value.
  3. callvirt to Object.ToString().

Then the resulting strings get passed to the appropriate overload of String.Concat.
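The naive steps above can be sketched roughly as follows. Everything here is hypothetical scaffolding (the `Traced` operand type, the `Load` helper, the log) added only to make the order of operations visible: each operand is converted to a string before the next operand is evaluated, and only strings reach Concat.

```csharp
using System;
using System.Collections.Generic;

static class NaiveConcatSketch
{
    public static readonly List<string> Log = new List<string>();

    // Hypothetical operand type that records when ToString runs.
    sealed class Traced
    {
        private readonly string name;
        public Traced(string name) { this.name = name; }
        public override string ToString()
        {
            Log.Add("ToString " + name);
            return name;
        }
    }

    // Stands in for evaluating an operand expression (step 1).
    static object Load(string name)
    {
        Log.Add("load " + name);
        return new Traced(name);
    }

    // Steps 2 and 3: the operand is already an object here, so a value type
    // would have been boxed; null converts to the empty string.
    static string ConvertOperand(object operand)
    {
        return operand == null ? string.Empty : operand.ToString();
    }

    public static string ConcatNaively()
    {
        string first = ConvertOperand(Load("a"));
        string second = ConvertOperand(Load("b"));
        return string.Concat(first, second); // only strings reach Concat
    }

    static void Main()
    {
        Console.WriteLine(ConcatNaively());        // ab
        Console.WriteLine(string.Join(", ", Log)); // load a, ToString a, load b, ToString b
    }
}
```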

💡 The optimization in this pull request should be restricted to eliminating boxing operations without also changing the order of operations. I recommend first updating the compiler to evaluate string concatenation arguments in the correct order, and then re-evaluating this optimization in that context. Edit: this will allow further optimizations such as the fact that it's always safe to evaluate ToString() on a readonly field without boxing, because the readonly constraint means the value is already being copied to a temporary variable in the current frame.

@gafter
Member

gafter commented Feb 12, 2015

👍

@VladimirReshetnikov

@stephentoub @sharwell Sam, this is an amazing example! To preserve the current behavior we would need to evaluate arguments to string.Concat from left to right, then invoke ToString() on those arguments that are immutable value types from left to right, and then invoke string.Concat. It might require some shuffling of values on the CIL evaluation stack. Interestingly, the C# spec does not specify an order in which ToString() methods of operands of a string concatenation operator are invoked. We can imagine that ToString() of one operand changes the current culture, and it affects ToString() of the other operand. Or they can set some global flags, or throw different exceptions. We likely want a test plan for this, and profiling results of real applications demonstrating a significant performance win that would warrant these changes. Any breaking changes need an approval from the Compat Council or LDM.

@VladimirReshetnikov

@jaredpar Sorry, I do not follow your argument. Do you mean that the current behavior (without proposed changes) deviates from the spec? Can you give a concrete example?

@stephentoub
Member Author

@VladimirReshetnikov, it sounds like you're saying this PR should just be closed (at least for now). Is that the consensus?

@VladimirReshetnikov

I'd suggest to first add more test coverage around string concatenation and investigate if the current behavior differs from the spec.

@stephentoub
Member Author

@VladimirReshetnikov, that's fine, but I don't have the time for that right now. If someone else would like to and has the time, great; we can leave this open pending that outcome. If not, and if that's required to move forward with this PR, then we should close it.

@jaredpar
Member

Potential compromise:

Only invoke this optimization when the arguments are side effect free. Essentially doesn't call into a method or property. It should be completely safe to do the optimization in that case.

@stephentoub
Member Author

@jaredpar, thanks, I'll take a look at doing something along that line of thought.

@VladimirReshetnikov

@jaredpar @sharwell Yes, I see the deviation from the spec. It's the same in the native C# compiler and Roslyn. But, curiously, if a concatenation is part of an expression tree that is compiled and invoked, the order of evaluation is different and it follows the spec.

@ThatRendle

Maybe C# 7 can add a pure modifier for properties and methods to help with mad corner cases like this?

@stephentoub
Member Author

@jaredpar, I spent some time on that idea, got it all coded up, and then realized that it would end up being a significantly larger change to get it working correctly. As noted in some of the comments, the existing rewriter for this is very local, to the point where in these calls we may not actually have the full set of arguments supplied by the developer. I think this would require finding all such sibling expressions, and that's a much more invasive change.

@CyrusNajmabadi
Member

Maybe i missed it, but why don't we just apply this optimization to things we know are pure? i.e. the standard primitive types like bool, char, and all the numerics? You'd get the allocation savings, and you wouldn't have to worry about any changes in behavior.

@gafter
Member

gafter commented Feb 14, 2015

@CyrusNajmabadi System.Double is not pure by this definition, since it uses the current culture to decide what character to use for the decimal point.

@stephentoub
Member Author

@CyrusNajmabadi, it's already limited as you say. The concern now is that this changes order of operations, and a subsequent expression in the concatenation could change the current culture in a way that would observably change the results.

@sharwell
Member

@CyrusNajmabadi @gafter "pure" describes an object (or operation) which does not affect other objects. In this case, we are also concerned with the other way around - i.e. objects which cannot be affected by any other object (or operation). While the primitive types are all pure, the ToString() operation for many of them is affected by the current culture of the system.

@sharwell
Member

@stephentoub Prior to resolving the ordering issue, we should be able to implement this optimization for types which are immutable, pure, and not affected by other code.

Notably:

  • bool
  • char (and this was one of the motivating types for this optimization)
  • IntPtr
  • UIntPtr
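A small sketch of why types like these are treated differently from double: it formats a bool, a char, and a double under two cultures. The helper name is hypothetical, only bool and char from the list are covered, and the double's output depends on the culture data available on the machine, so no output is claimed for it.

```csharp
using System;
using System.Globalization;
using System.Threading;

static class CultureSensitivitySketch
{
    // Formats the same three values under the given culture, then restores
    // the culture that was current before the call.
    public static string FormatUnder(string cultureName)
    {
        CultureInfo saved = Thread.CurrentThread.CurrentCulture;
        try
        {
            Thread.CurrentThread.CurrentCulture = CultureInfo.GetCultureInfo(cultureName);
            return true.ToString() + "|" + 'x'.ToString() + "|" + 3.1.ToString();
        }
        finally
        {
            Thread.CurrentThread.CurrentCulture = saved;
        }
    }

    static void Main()
    {
        // The bool and char segments are the same under both cultures;
        // the double segment may use a different decimal separator.
        Console.WriteLine(FormatUnder("en-US"));
        Console.WriteLine(FormatUnder("fr-FR"));
    }
}
```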

@stephentoub
Member Author

Thanks, @sharwell. I'd considered that and dismissed it in a moment of thinking that it'd be so limited to not be worthwhile, but after reading your comment and thinking about it again, I agree it would make sense for at least char. I'll revise for that.

@stephentoub
Member Author

Ok, take 2 :)

I took @sharwell's suggestion and limited the optimization to just the special value types that don't use culture (or any other state external to the value) in ToString, such that no subsequent expression in the Concat could affect the result of ToString. Since char is verified as such, I also added in the optimization to convert a char literal used in the Concat to be a string literal; this saves not only the boxing at run time but also the string allocation at run time, and depending on the shape of the Concat, it can also result in further constant folding. (Please let me know if there's some reason that this optimization should be backed out, and I'll do so.)
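To make the char-literal part concrete, here is a hand-written sketch of the two source shapes that the optimization treats as equivalent. The class and method names are hypothetical, and a compiler that already includes this change may well emit the same code for both methods.

```csharp
using System;

static class CharLiteralSketch
{
    // Shape as written by the developer: a char literal operand, which
    // without the optimization needs a run-time conversion to string.
    public static string WithCharLiteral(string path)
    {
        return path + '/';
    }

    // Shape the compiler can substitute: the char literal becomes a
    // one-character string literal, so no conversion is needed at run time.
    public static string WithStringLiteral(string path)
    {
        return path + "/";
    }

    static void Main()
    {
        Console.WriteLine(WithCharLiteral("usr") == WithStringLiteral("usr")); // True
    }
}
```

Because char.ToString depends on nothing but the char's value, swapping the literal at compile time cannot change the resulting string.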

@sharwell
Member

I'm glad you took the opportunity to improve the compile-time treatment of char literals as string literals here. I've watched many commits over the past few weeks do this by hand, and we'll finally be able to stop worrying about that. Further comments on the diff...

// calling ToString on subsequent arguments). For value types, it could mean mutating
// the original instead of the boxed copy. Therefore, we can only apply the
// optimization in cases where we know ToString to be non-mutating.
// - It's possible that subsequent expressions in the concatenation mutate state
Member

💡 Since the current behavior is inconsistent with the C# spec, you should include a note here that the attention to preserving this behavior is for strict preservation of the backwards-compatible behavior. Also, I still think we should wait for the "compatibility committee" (or whoever) to review the situation and determine whether or not we should change the Roslyn behavior to actually follow the spec.

❓ How long will it take for the compatibility evaluation to be complete?

Member

The compatibility council is not currently considering this issue because there is no proposal to break compatibility. If such a proposal is made it would be evaluated.

Member Author

I would like to proceed with the current, more-limited optimization, to get something in place now. If we want to submit a follow-up proposal to break compat by expanding the scope, that'll be a relatively trivial PR.

Member

@gafter Does #522 do the trick?

When string concatenation encounters non-strings (e.g. path + '/', name + someInt32, etc.), it calls overloads of String.Concat that accept objects. These overloads then just call ToString on each of the objects, mapping to the C# spec which states that "any non-string argument is converted to its string representation by invoking the virtual ToString method inherited from type object." When any of the individual items being concatenated is a value type, that object first gets boxed to be passed to String.Concat as an object, only to then have ToString called on the boxed object.

This commit changes the local rewriter for string concatenation to test whether an argument is of an appropriate type, and if it is, to call ToString on it directly, rather than first boxing it. This can then affect which overload of Concat is used, as the type of the argument has changed to be String. The primary benefit of this is saving the allocation per value-type item. There are some secondary benefits, as well; for example, as there is a four-string overload of Concat but no four-object overload of Concat, if this optimization is able to force all of the items to be strings, the four-string overload can be used rather than allocating an object array for the inputs and then another string array for the resulting strings (inside of Concat).

This commit also includes an additional optimization specific to const chars; rather than doing a ToString call at run time, we can do it at compile-time and emit a literal string instead of a literal char, saving both the boxing and the string allocation at run time.
@stephentoub
Member Author

@sharwell, I cleaned up the comment you cited a bit.

@gafter, I realized that the changes I'd made to the various 2-3-many arg rewriters were unnecessary, and I could do the change in a single place in the entry rewriter. This not only simplified the change, it also helped to improve constant folding in some cases and alleviated an array allocation in the many-arg rewriter.

I've pushed the updated commit.

I'm happy with this if others are. @gafter? @jaredpar? (@sharwell's official proposal to allow for more optimizations can then be considered separately.)

@gafter
Member

gafter commented Feb 16, 2015

👍

@jaredpar
Member

👍

@stephentoub
Member Author

Thanks, all.

Labels
Area-Compilers Tenet-Performance Regression in measured performance of the product from goals.