Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Translate non-aggregate string.Join to CONCAT_WS on SQL Server #28900

Merged
merged 1 commit into from
Jun 12, 2024

Conversation

roji
Copy link
Member

@roji roji commented Aug 26, 2022

As usual, this was a bit more tricky than it looks.

  • The translation is in SqlServerSqlTranslatingExpressionVisitor because there's an array parameter.
  • CONCAT_WS is interesting in that the type it returns has a length based on its inputs, so concatenating 3 4-letter words with a 1-char delimiter returns a varchar(14). We don't have column/parameter values, so I'm setting the return mapping to be varchar(max) or nvarchar(max) (based on whether we've seen nvarchar or not). See below for
  • There's also CONCAT which is similar to CONCAT_WS. We already have a relational translation for string.Concat in StringMethodTranslator, but that works only for the overloads with 2-4 arguments, and not for 5+ (which has an array parameter). We could translate to CONCAT instead just like for CONCAT_WS, but then we should override the relational and do it regardless of number of args (shouldn't use different translations for different arg numbers). If you agree I can do this.

Closes #28899

Interesting experiments for CONCAT_WS result type
SELECT CONCAT_WS(', ', CAST('foo' AS varchar(max)), CAST('bar' AS varchar(max)));

SELECT CONCAT_WS(', ', 'foo', 'bar'); -- varchar(8), adds lengths of arguments + delimiter as necessary
SELECT CONCAT_WS(', ', CAST('f' AS varchar(1)), CAST('b' AS varchar(1))); -- varchar(4)
SELECT CONCAT_WS(', ', CAST('f' AS varchar(1)), 'bar'); -- varchar(6)
SELECT CONCAT_WS(', ', CAST('f' AS varchar(3)), 'bar'); -- varchar(8)
SELECT CONCAT_WS(', ', 'f', CAST('bar' AS varchar(max))); -- varchar(max)

SELECT CONCAT_WS(', ', 'foo', CAST('bar' AS char(3))); -- varchar(8), char expanded to varchar
SELECT CONCAT_WS(CAST(', ' AS char(2)), CAST('foo' AS char(3)), CAST('bar' AS char(3))); -- varchar(8), even though all arguments are char

SELECT CONCAT_WS(', ', N'foo', 'bar'); -- nvarchar(16), varchar treated as nvarchar

-- Look at this thing (one for the book):
SELECT CONCAT_WS('|', REPLICATE('x', 7999), 'bar'); -- returns ...xxxxxb ('ar' is truncated)

-- To find out an expression's type:
DECLARE @what sql_variant;
SELECT @what = 'some expression';
SELECT
    SQL_VARIANT_PROPERTY(@what, 'BaseType'),
    SQL_VARIANT_PROPERTY(@what, 'Precision'),
    SQL_VARIANT_PROPERTY(@what, 'Scale'),
    SQL_VARIANT_PROPERTY(@what, 'MaxLength');

@roji roji requested a review from smitpatel August 26, 2022 15:49
@smitpatel
Copy link
Contributor

SELECT CONCAT_WS('|', REPLICATE('x', 7999), 'bar'); -- returns ...xxxxxb ('ar' is truncated)

does that mean any result string crossing 8000 (or 4000 if unicode), will be truncated?

@smitpatel
Copy link
Contributor

Is this required for 7.0?

@roji
Copy link
Member Author

roji commented Aug 26, 2022

SELECT CONCAT_WS('|', REPLICATE('x', 7999), 'bar'); -- returns ...xxxxxb ('ar' is truncated)

does that mean any result string crossing 8000 (or 4000 if unicode), will be truncated?

Yes 🙄

If there's a single varchar/nvarchar(max) in there, the result is also max, so no truncation occurs. So this only affects the case where all arguments (and the delimiter) are non-max.

Is this required for 7.0?

No, not required.. The reason I did this is that we've added string.Join translation in another context (aggregate), so it's nice to be able to just say "we now support string.Join" (in both aggregate and non-aggregate contexts). It also seems very low-risk, but if we're against it we can do it for 8.0.

@smitpatel
Copy link
Contributor

It is not very "low-risk". I prefer not to do it. Not really a frequently asked feature.

@roji
Copy link
Member Author

roji commented Aug 27, 2022

Well, it's just a new translation for something we didn't translate before, so I'm not sure in what sense the risk can be high.

If you're concerned specifically with the truncation, we can introduce a CAST to varchar/nvarchar(max) e.g. on the delimiter, which would work around that. Or we can leave it as-is as a SQL Server quirk, like how we do with trailing whitespace.

@roji
Copy link
Member Author

roji commented Aug 31, 2022

Making this a draft as we're not doing this in 7.0.

@roji
Copy link
Member Author

roji commented Apr 4, 2023

Note: CONCAT_WS exists since SQL Server 2017 (14.x). We can use the compatibility level (#30163) to determine whether to translate or not (or to throw).

@roji roji changed the base branch from release/7.0 to main June 3, 2024 22:05
Comment on lines 245 to 251
arguments[i + 1] = sqlArgument switch
{
ColumnExpression { IsNullable: false } => sqlArgument,
SqlConstantExpression constantExpression => constantExpression.Value is null
? new SqlConstantExpression(string.Empty, stringTypeMapping)
: constantExpression,
_ => Dependencies.SqlExpressionFactory.Coalesce(
sqlArgument,
Dependencies.SqlExpressionFactory.Constant(string.Empty, typeof(string)))
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to eventually ensure that the nullability processor can simplify COALESCE by dropping everything after the first known-non-null subexpression (and all of the known-null subexpressions). This would make it possible to simply use

Suggested change
arguments[i + 1] = sqlArgument switch
{
ColumnExpression { IsNullable: false } => sqlArgument,
SqlConstantExpression constantExpression => constantExpression.Value is null
? new SqlConstantExpression(string.Empty, stringTypeMapping)
: constantExpression,
_ => Dependencies.SqlExpressionFactory.Coalesce(
sqlArgument,
Dependencies.SqlExpressionFactory.Constant(string.Empty, typeof(string)))
};
arguments[i + 1] = Dependencies.SqlExpressionFactory.Coalesce(
sqlArgument,
Dependencies.SqlExpressionFactory.Constant(string.Empty, typeof(string)));

here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ranma42 thanks, that's an interesting direction... A couple of thoughts:

  • First, we can partially do this simplification at the source, i.e. improve SqlExpressionFactory.Coalesce() to not actually add the coalesce node for non-nullable columns/constants - I'm making that change in this PR. This has the small advantage of keeping the tree a bit cleaner, i.e. no unneeded coalesce node before the nullability processor.
  • But you're right that for the general case of arbitrary expressions, this would (currently) need to happen at the nullability processor, since that's the only place where we know the nullability of arbitrary expressions (anything beyond column/constant).
  • This general design is something we've discussed in the past; I've opened Consider introducing nullability on SqlExpression #33889 with some thoughts, but that's a very long-term, high-level architecture question that we can't really tackle in the near term.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that there is value in keeping the expression tree as clean as possible; OTOH duplicating the null propagation/simplification logic across the codebase has its disadvantages.
I think that an interesting approach to manage some simple cases (that are nonetheless currently repeated/spread across several files) could be a wrapper (possibly even as a collection of extension methods) that abstracts the pattern you are writing and makes it easily re-usable in several places.
I am thinking about something like CoalesceAndSimplify (or other similar factory methods) that pre-emptively handles trivial optimizations (local&cheap ones). For an example of another compiler/expression translator that does this, see LLVM and how it performs trivial constant folding while building its AST.

A nit/question: is there a reason why one of the empty strings is built as new SqlConstantExpression(string.Empty, stringTypeMapping) and the other one as Dependencies.SqlExpressionFactory.Constant(string.Empty, typeof(string))?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that an interesting approach to manage some simple cases (that are nonetheless currently repeated/spread across several files) could be a wrapper (possibly even as a collection of extension methods) that abstracts the pattern you are writing and makes it easily re-usable in several places.

Right - that's what I tried to introduce in this PR, within SqlExpressionFactory itself. SqlExpressionFactory already goes considerably beyond simply creating instances of expression types (it wouldn't be very useful if it did just that), so it seems OK to add this sort of simplification logic in there too; after all, part of the point is to not have to think about "do I need coalesce" at each and every callsite, but just have the simplification happen automatically. And yeah, that's a little bit similar to the LLVM IR builder-level optimization you're referencing - you could view SqlExpressionFactory as our "IR builder"... Let me know if this all makes sense to you.

Unfortnuately, we have various tests exercising Coalesce functionality, which are implemented over non-nullable columns; the Coalesce node would be stripped away there, and the tests would become useless. I've split this out to #33890.

A nit/question: is there a reason why one of the empty strings is built as new SqlConstantExpression(string.Empty, stringTypeMapping) and the other one as Dependencies.SqlExpressionFactory.Constant(string.Empty, typeof(string))?

Thanks - no, no reason - just that this is an old PR being revived, and there's a bit of mess. I cleaned it up.

@roji roji marked this pull request as ready for review June 4, 2024 07:09
@roji roji enabled auto-merge (squash) June 4, 2024 07:09
@roji roji disabled auto-merge June 4, 2024 07:45
@roji roji enabled auto-merge (squash) June 4, 2024 08:23
@roji
Copy link
Member Author

roji commented Jun 5, 2024

@dotnet/efteam this should be ready for review.

@roji roji merged commit 87796b9 into dotnet:main Jun 12, 2024
7 checks passed
@roji roji deleted the NonAggregateStringJoin branch June 12, 2024 12:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SQL Server: translate non-aggregate string.Join to CONCAT_WS
6 participants