-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add lambda type checking without adding lambda sql type #6966
Conversation
@confluentinc It looks like @lct45 just signed our Contributor License Agreement. 👍 Always at your service, clabot |
ksqldb-common/src/main/java/io/confluent/ksql/function/UdfIndex.java
Outdated
Show resolved
Hide resolved
ksqldb-common/src/main/java/io/confluent/ksql/function/UdfIndex.java
Outdated
Show resolved
Hide resolved
ksqldb-common/src/main/java/io/confluent/ksql/function/UdfFactory.java
Outdated
Show resolved
Hide resolved
ksqldb-common/src/main/java/io/confluent/ksql/function/UdfIndex.java
Outdated
Show resolved
Hide resolved
165896b
to
b9d1997
Compare
ksqldb-engine/src/main/java/io/confluent/ksql/function/FunctionLoaderUtils.java
Show resolved
Hide resolved
Both PRs with 1200+ LOC :P |
...p/src/main/java/io/confluent/ksql/rest/server/resources/streaming/StreamedQueryResource.java
Outdated
Show resolved
Hide resolved
ksqldb-udf/src/main/java/io/confluent/ksql/schema/ksql/types/SqlLambda.java
Show resolved
Hide resolved
ksqldb-execution/src/main/java/io/confluent/ksql/execution/util/ExpressionTypeManager.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some initial thoughts
ksqldb-execution/src/main/java/io/confluent/ksql/execution/codegen/CodeGenRunner.java
Outdated
Show resolved
Hide resolved
ksqldb-execution/src/main/java/io/confluent/ksql/execution/codegen/SqlToJavaVisitor.java
Outdated
Show resolved
Hide resolved
ksqldb-execution/src/main/java/io/confluent/ksql/execution/util/ExpressionTypeManager.java
Outdated
Show resolved
Hide resolved
ksqldb-common/src/main/java/io/confluent/ksql/function/UdfIndex.java
Outdated
Show resolved
Hide resolved
@guozhangwang both PR are overly large, the hope was to pick the type checking approach we're going with and then I can break it out into manageable PRs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a cleaner approach compared with #6967 to not introduce a LambdaType
as the returned column type.
I just have one meta question for my own understanding: I cannot fully get the difference between the TypeContext
and the ExpressionTypeContext
within ExpressionTypeManager
. More specifically, when does the generic -> sqlType population happens (inside TypeContext, before visitor.process(), or inside ExpressionTypeContext, after visitor.process())? It seems that it differs between UDFs and lambdas, but I'm not sure why it's the case. @lct45 could you elaborate a bit more to me?
final ExpressionTypeContext expressionTypeContext = new ExpressionTypeContext(); | ||
expressionTypeContext.setLambdaTypes(inputMapping); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my own understanding: when we call setLambdaTypes here, is the mapping already populated or not? If yes, then we would not need to mapInputTypes
anymore; if not, why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mapping is already set here, but it's set in either CodeGenRunner
or SqlToJavaVisitor
. this is a way to pass the mapping we've already done in a TypeContext
into a new ExpressionTypeContext
that's used to resolve ultimately resolve the types within ExpressionTypeManager
. ExpressionTypeManager
is used by both SqlToJavaVisitor
and CodeGenRunner
to do type resolution.
I think this ties into what @stevenpyzhang commented earlier, that it may make sense to combine ExpressionTypeContext
and TypeContext
, as it would make passing this information infinitely easier. I think I'll go ahead and do that unless @agavra has any objections.
Currently, the difference between the contexts is one, ExpressionTypeContext
, is just used in ExpressionTypeManager
. This existed before lambdas for passing around sql return types. TypeContext
, as the code stands now, is used by SqlToJavaVisitor
and CodeGenRunner
to store information relevant to lambdas and pass it to the relevant methods. It doesn't contain the basic sqlType
storage that ExpressionTypeManager
does. ExpressionTypeManager
doesn't currently use TypeContext
at all.
For lambdas, the generic -> sqlType population for the variables (x in x => ucase(x)) happens when the context (either TypeContext
or ExpressionTypeContext
) is passed information in the visitLambdaExpression
function in ExpressionTypeManager
, CodeGenRunner
, and SqlToJavaVisitor
. This happens after visitor.process()
for each of the individual flows through one of the three relevant classes.
Let me know if this was more confusing than enlightening, I'm happy to have another go at explaining the relationship between all these pieces
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't looked at the code yet, but from your description combining the two makes sense to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @lct45 , I think my confusion was that:
If in line 111 here, which is before we start visiting the node, the genericName -> sqlType inputMapping
we are set via expressionTypeContex.setLambdaTypes()
in is already fully populated, then inside visitLambdaExpression
(line 200) which is part of the visiting traversal, we should not need to call mapInputTypes
which is trying to build the mapping again.
Now I think I get your point, which is that the inputMapping
passed in via getExpressionSqlType
is not fully populated, i.e. it only contains the mapping entries for variables outside the lambda function, but not contain entries for variables that are only declared inside the lambda function. Is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, if we consolidate these two into a single TypeContext
, would a single TypeContext
object be used in multiple lambda expressions? I'm asking this because if that's the case, then its inputTypes
and lambdaTypeMapping
would be a global structure containing mappings for multiple expressions, and we cannot enforce inputTypes.size() == argumentList.size()
in mapInputTypes
.
EDIT: never mind, I saw your latest commit it is indeed one-per-expression.
Re-EDIT: actually I'm not very certain now, since one nested valueExpression
could actually contain multiple primaryExpression
, each being a separate lambda. In that case, the constructed TypeContext
may be shared among multiple lambdas, is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stevenpyzhang may be able to weigh in more here, but I think all the lambdas inside of a UDF should have the same number of inputs so I don't think there would be a concern if the TypeContext is shared between them. The example I'm thinking of is from the KLIP, transform_map(map, (k, v) => new_k, (k, v) => new_v)
, so having a global map of k,v
should be fine. I did just run through it though, and the context is shared. This is necessary for the input though, since in a lambda it'll be one input mapped to the lambda variables (i.e. we only see the map once even if we have 2 lamda expressions). This does raise the issue of variable names. Should users be able to do a lambda like transform_map(map, (x,y) => x + y, (k,v) => k - v)
? Because I think you're right @guozhangwang , right now I don't it would accept that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think transform_map(map, (x,y) => x + y, (k,v) => k - v)
should be supported. We could just make a copy of the TypeContext and pass a fresh instance to each argument that we visit so that we can have different mappings for each Lambda node.
b4d0985
to
05b5ddc
Compare
ksqldb-engine/src/main/java/io/confluent/ksql/engine/rewrite/AstSanitizer.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some more nits
ksqldb-execution/src/main/java/io/confluent/ksql/execution/function/UdfUtil.java
Outdated
Show resolved
Hide resolved
ksqldb-execution/src/main/java/io/confluent/ksql/execution/codegen/helpers/TriFunction.java
Outdated
Show resolved
Hide resolved
ksqldb-execution/src/main/java/io/confluent/ksql/execution/codegen/helpers/LambdaUtil.java
Outdated
Show resolved
Hide resolved
ksqldb-execution/src/test/java/io/confluent/ksql/execution/codegen/helpers/LambdaUtilTest.java
Outdated
Show resolved
Hide resolved
ksqldb-udf/src/main/java/io/confluent/ksql/function/TriFunction.java
Outdated
Show resolved
Hide resolved
ksqldb-engine/src/main/java/io/confluent/ksql/engine/EngineContext.java
Outdated
Show resolved
Hide resolved
ksqldb-execution/src/main/java/io/confluent/ksql/execution/codegen/TypeContext.java
Outdated
Show resolved
Hide resolved
ksqldb-execution/src/main/java/io/confluent/ksql/execution/codegen/TypeContext.java
Outdated
Show resolved
Hide resolved
ksqldb-execution/src/main/java/io/confluent/ksql/execution/codegen/TypeContext.java
Outdated
Show resolved
Hide resolved
aec051f
to
e4ae803
Compare
d2f2519
to
2905abb
Compare
2905abb
to
3ff3dca
Compare
ksqldb-common/src/main/java/io/confluent/ksql/function/types/ParamTypes.java
Show resolved
Hide resolved
a17e06b
to
4264da9
Compare
4264da9
to
c034b58
Compare
|
||
public SqlArgument(final SqlType type) { | ||
public SqlArgument(final SqlType type, final SqlLambda lambda) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should consider changing the sqlType and sqlLambda variables to be Optionals so we don't need to do null checks where SqlArgument is used.
In the constructor we can use Optional.ofNullable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I'm thinking about this more, we could actually have SqlArgument be a base class and two classes extend it SqlArgumentType (which has a SqlType) and SqlArgumentLambda(which has a SqlLambda). There are certain portions of the code such as the UDAF's where we now need to do some unnecessary null checks (they should only be supporting SqlArgumentType). The GenericsUtil can still take in SqlArgument and we can use instance of
checks instead of null checks/optional checks when deciding what behavior to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One idea for the optional is to have the getter functions like this
public SqlType getSqlTypeOrThrow() {
if (sqlType.isPresent()) {
return sqlType.get();
}
if (sqlLambda.isPresent()) {
throw new RuntimeException("Was expecting sql type as an argument");
}
return null;
}
public Optional<SqlLambda> getSqlLambda() {
return sqlLambda;
}
public SqlLambda getSqlLambdaOrThrow() {
if (sqlLambda.isPresent()) {
return sqlLambda.get();
}
throw new RuntimeException("Was expecting sql lambda as an argument");
}
We return null from getSqlTypeOrThrow
if both the type and lambda are missing because this is when the function argument is NULL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 this makes the contract way clearer
...p/src/main/java/io/confluent/ksql/rest/server/resources/streaming/StreamedQueryResource.java
Outdated
Show resolved
Hide resolved
SqlTypes.INTEGER); | ||
|
||
// When: | ||
final KsqlScalarFunction fun = FUNC_REG.getUdfFactory(FunctionName.of("reduce_map")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Let's just use some random generic function names for these tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you have to use an existing function name so it can find it in the function library, unless there's a way to mock that? When I changed it to lambda_func
I got io.confluent.ksql.util.KsqlException: Can't find any functions with the name 'lambda_func'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh really? That seems really odd since we'd break these tests if we changed the function name. It feels like we should have these functions mocked. If that's how it is then leaving this is fine then.
ksqldb-execution/src/main/java/io/confluent/ksql/execution/codegen/SqlToJavaVisitor.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Contents LGTM so giving the green check (though wait for Steven's +1 before merging), I left a spattering of nits here and there - nothing major. With regards to these two questions Steven's asked:
the usage of SqlArgument as it seems a little hacky with how it is currently.
I like your inline suggestion of making different classes for them and using instanceof
checks (or using polymorphism to naturally handle it)
There's a lot of repeated code between SqlToJavaVisitor, CodeGenRunner, ExpressionTypeManager with adding types to the context for use in the child nodes to map lambda inputs to types, but we're not sure if there's any way around it.
Added a suggestion inline, maybe I'm missing something?
ksqldb-common/src/main/java/io/confluent/ksql/function/GenericsUtil.java
Outdated
Show resolved
Hide resolved
ksqldb-common/src/main/java/io/confluent/ksql/function/GenericsUtil.java
Outdated
Show resolved
Hide resolved
if (argument == null) { | ||
return "null"; | ||
} else { | ||
final SqlType sqlType = argument.getSqlType(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EDIT: I see @stevenpyzhang already commented about this below, I like his suggestion! Just keeping this comment here for historical purposes
I see the following method:
public static SqlArgument of(final SqlType sqlType, final SqlLambda lambdaType) {
return new SqlArgument(sqlType, lambdaType);
}
What does it mean for a SqlArgument
to have both a SqlType
and a LambdaType
? Can we update the docs to describe what this does? If it's not possible, we should enforce that in the code (make sure at most one of them is null
).
assertThat(fun2.name(), equalTo(EXPECTED)); | ||
assertThat(e.getMessage(), containsString("Valid alternatives are:" | ||
+ lineSeparator() | ||
+ "expected(MAP<VARCHAR, VARCHAR>, LAMBDA<[VARCHAR, VARCHAR], A>)")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this is shown to users, we might want to consider toString
on lambdas to be something like (VARCHAR, VARCHAR) -> A
instead of LAMBDA<[VARCHAR, VARCHAR], A>
which is somewhat difficult to read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we should leave the LAMBDA
identifier for the new toString
? Just wondering if users will be able to easily identify if it's ((VARCHAR, VARCHAR) -> A
ksqldb-common/src/test/java/io/confluent/ksql/function/UdfIndexTest.java
Show resolved
Hide resolved
ksqldb-execution/src/main/java/io/confluent/ksql/execution/codegen/CodeGenRunner.java
Outdated
Show resolved
Hide resolved
process(argExpr, context); | ||
final SqlType newSqlType = expressionTypeManager.getExpressionSqlType(argExpr, context); | ||
// for lambdas - if we see this it's the array/map being passed in we save the type | ||
if (context.notAllInputsSeen()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not part of this PR, but I think I might be missing something
public boolean notAllInputsSeen() {
return lambdaInputTypeMapping.size() != lambdaInputTypes.size() || lambdaInputTypes.size() == 0;
}
Can this handle lambdas with 0 arguments properly? If not that can be OK but we should make sure we fail that if anyone tries to register one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... I think it would register more input types than we want it to, but I think lambdas with 0 arguments would still fail. Would that be something like array_transform(testval1, 5 => 5)
? or map_transform(MAP(5:=3, 1:=1), () => 2, () => 5)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just looked back into it, in LambdaFunctionCall
we check the argument size
if (arguments.size() == 0) { throw new IllegalArgumentException( String.format("Lambda expression must have at least 1 argument. => %s", body.toString())); }
ksqldb-execution/src/main/java/io/confluent/ksql/execution/codegen/SqlToJavaVisitor.java
Outdated
Show resolved
Hide resolved
ksqldb-execution/src/main/java/io/confluent/ksql/execution/codegen/SqlToJavaVisitor.java
Outdated
Show resolved
Hide resolved
|
||
public SqlArgument(final SqlType type) { | ||
public SqlArgument(final SqlType type, final SqlLambda lambda) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 this makes the contract way clearer
93b7479
to
a907410
Compare
4db9976
to
5049d52
Compare
5049d52
to
6a2e65d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I'm late on getting a final pass! I took another look, and had a meta question regarding the code gen traversal itself (not related to this PR specifically).
Filed #7066 to summarize my thoughts.
if (argExpr instanceof LambdaFunctionCall) { | ||
argumentTypes.add( | ||
SqlArgument.of( | ||
SqlLambda.of(context.getLambdaInputTypes(), childContext.getSqlType()))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think this is cleaner to just use resolvedArgType
as the second parameter, which will be the same as childContext.getSqlType()
. Ditto on other two classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also maybe we could consolidate the shared logic in FunctionCall
:
List<SqlArgument> argumentTypes resolveArgumentTypes(TypeContext context)
shared among CodeGenRunner
, SqlToJavaVisitor
and ExpressionTypeManager
?
And I'm even wondering, why we need to try to resolve the argument types multiple times, rather than just cache the resolved types inside the FunctionCall
node itself after done for the first time, and then re-use it in later iterations of traversals?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed @guozhangwang I think we can definitely optimize these 3 classes to avoid duplicate traversals. Thanks for filing the issue. I think for now since this was an existing issue even before lambdas, it's out of scope for the lambdas implementation and we can follow up on issue you filed in the future.
Description
Resolves unknown types in lambda's mainly through the
ExpressionTypeManager
and passes additional information in the SqlArgument to the UDF to resolve function type issues there, instead of adding a newSqlType.LAMBDA
The sequence of PR's to review is:
Testing done
Describe the testing strategy. Unit and integration tests are expected for any behavior changes.
Reviewer checklist