feat: add lambda type checking without adding lambda sql type #6966

lct45 · 2021-02-08T14:06:54Z

Description

Resolves unknown types in lambdas, mainly through the ExpressionTypeManager, and passes additional information in the SqlArgument to the UDF to resolve function type issues there, instead of adding a new SqlType.LAMBDA.

The sequence of PR's to review is:

Testing done

Describe the testing strategy. Unit and integration tests are expected for any behavior changes.

Reviewer checklist

Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
Ensure relevant issues are linked (description should include text like "Fixes #")

ghost · 2021-02-08T14:06:56Z

@confluentinc It looks like @lct45 just signed our Contributor License Agreement. 👍

Always at your service,

clabot

ksqldb-common/src/main/java/io/confluent/ksql/function/UdfIndex.java

ksqldb-common/src/main/java/io/confluent/ksql/function/UdfFactory.java

ksqldb-common/src/main/java/io/confluent/ksql/function/UdfIndex.java

ksqldb-engine/src/main/java/io/confluent/ksql/function/FunctionLoaderUtils.java

guozhangwang · 2021-02-09T05:20:51Z

Both PRs with 1200+ LOC :P

...p/src/main/java/io/confluent/ksql/rest/server/resources/streaming/StreamedQueryResource.java

ksqldb-udf/src/main/java/io/confluent/ksql/schema/ksql/SqlArgument.java

ksqldb-udf/src/main/java/io/confluent/ksql/schema/ksql/types/SqlLambda.java

ksqldb-execution/src/main/java/io/confluent/ksql/execution/util/ExpressionTypeManager.java

stevenpyzhang

Left some initial thoughts

ksqldb-execution/src/main/java/io/confluent/ksql/execution/codegen/CodeGenRunner.java

ksqldb-execution/src/main/java/io/confluent/ksql/execution/codegen/SqlToJavaVisitor.java

ksqldb-execution/src/main/java/io/confluent/ksql/execution/util/ExpressionTypeManager.java

ksqldb-common/src/main/java/io/confluent/ksql/function/UdfIndex.java

lct45 · 2021-02-09T14:25:27Z

@guozhangwang both PRs are overly large; the hope was to pick the type checking approach we're going with, and then I can break it out into manageable PRs

guozhangwang

I think this is a cleaner approach compared with #6967 to not introduce a LambdaType as the returned column type.

I just have one meta question for my own understanding: I don't fully get the difference between the TypeContext and the ExpressionTypeContext within ExpressionTypeManager. More specifically, when does the generic -> sqlType population happen (inside TypeContext, before visitor.process(), or inside ExpressionTypeContext, after visitor.process())? It seems that it differs between UDFs and lambdas, but I'm not sure why that's the case. @lct45 could you elaborate a bit more?

guozhangwang · 2021-02-10T00:16:35Z

ksqldb-execution/src/main/java/io/confluent/ksql/execution/util/ExpressionTypeManager.java

    final ExpressionTypeContext expressionTypeContext = new ExpressionTypeContext();
+    expressionTypeContext.setLambdaTypes(inputMapping);


For my own understanding: when we call setLambdaTypes here, is the mapping already populated or not? If yes, then we would not need to mapInputTypes anymore; if not, why?

The mapping is already set here, but it's set in either CodeGenRunner or SqlToJavaVisitor. This is a way to pass the mapping we've already done in a TypeContext into a new ExpressionTypeContext that's ultimately used to resolve the types within ExpressionTypeManager. ExpressionTypeManager is used by both SqlToJavaVisitor and CodeGenRunner to do type resolution.

I think this ties into what @stevenpyzhang commented earlier, that it may make sense to combine ExpressionTypeContext and TypeContext, as it would make passing this information infinitely easier. I think I'll go ahead and do that unless @agavra has any objections.

Currently, the difference between the contexts is that ExpressionTypeContext is used only in ExpressionTypeManager. It existed before lambdas, for passing around SQL return types. TypeContext, as the code stands now, is used by SqlToJavaVisitor and CodeGenRunner to store information relevant to lambdas and pass it to the relevant methods. It doesn't contain the basic sqlType storage that ExpressionTypeManager does. ExpressionTypeManager doesn't currently use TypeContext at all.

For lambdas, the generic -> sqlType population for the variables (x in x => ucase(x)) happens when the context (either TypeContext or ExpressionTypeContext) is passed information in the visitLambdaExpression function in ExpressionTypeManager, CodeGenRunner, and SqlToJavaVisitor. This happens after visitor.process() for each of the individual flows through one of the three relevant classes.

Let me know if this was more confusing than enlightening, I'm happy to have another go at explaining the relationship between all these pieces

Haven't looked at the code yet, but from your description combining the two makes sense to me!

Thanks @lct45 , I think my confusion was that:

If, at line 111 here (before we start visiting the node), the genericName -> sqlType inputMapping that we set via expressionTypeContext.setLambdaTypes() is already fully populated, then inside visitLambdaExpression (line 200), which is part of the visiting traversal, we should not need to call mapInputTypes, which tries to build the mapping again.

Now I think I get your point, which is that the inputMapping passed in via getExpressionSqlType is not fully populated, i.e. it only contains the mapping entries for variables outside the lambda function, and does not contain entries for variables that are only declared inside the lambda function. Is that right?

In that case, if we consolidate these two into a single TypeContext, would a single TypeContext object be used in multiple lambda expressions? I'm asking this because if that's the case, then its inputTypes and lambdaTypeMapping would be a global structure containing mappings for multiple expressions, and we cannot enforce inputTypes.size() == argumentList.size() in mapInputTypes.

EDIT: never mind, I saw your latest commit; it is indeed one-per-expression.

Re-EDIT: actually I'm not very certain now, since one nested valueExpression could actually contain multiple primaryExpression, each being a separate lambda. In that case, the constructed TypeContext may be shared among multiple lambdas, is that right?

@stevenpyzhang may be able to weigh in more here, but I think all the lambdas inside of a UDF should have the same number of inputs, so I don't think there would be a concern if the TypeContext is shared between them. The example I'm thinking of is from the KLIP, transform_map(map, (k, v) => new_k, (k, v) => new_v), so having a global map of k,v should be fine. I did just run through it though, and the context is shared. This is necessary for the input, since in a lambda it'll be one input mapped to the lambda variables (i.e. we only see the map once even if we have 2 lambda expressions). This does raise the issue of variable names. Should users be able to do a lambda like transform_map(map, (x,y) => x + y, (k,v) => k - v)? Because I think you're right @guozhangwang, right now I don't think it would accept that

I think transform_map(map, (x,y) => x + y, (k,v) => k - v) should be supported. We could just make a copy of the TypeContext and pass a fresh instance to each argument that we visit so that we can have different mappings for each Lambda node.
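The copy-per-argument suggestion above can be sketched roughly as follows. This is only an illustration: `LambdaTypeContextSketch`, `copyForLambda`, and `getLambdaType` are hypothetical names, SQL types are plain strings, and the real TypeContext in the PR may look quite different.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a per-lambda type context.
final class LambdaTypeContextSketch {
  // Types of the non-lambda inputs seen so far (e.g. the map's key and value types).
  private final List<String> inputTypes;
  // Lambda variable name -> resolved type, local to one lambda.
  private final Map<String, String> lambdaTypeMapping = new HashMap<>();

  LambdaTypeContextSketch(final List<String> inputTypes) {
    this.inputTypes = new ArrayList<>(inputTypes);
  }

  // Returns a fresh context that shares the input types but starts with an
  // empty variable mapping, so each lambda can use its own variable names.
  LambdaTypeContextSketch copyForLambda() {
    return new LambdaTypeContextSketch(inputTypes);
  }

  // Binds the lambda's declared variables to the input types, position by position.
  void mapInputTypes(final List<String> lambdaArguments) {
    if (inputTypes.size() != lambdaArguments.size()) {
      throw new IllegalArgumentException(
          "Was expecting " + inputTypes.size() + " arguments but found "
              + lambdaArguments.size());
    }
    for (int i = 0; i < lambdaArguments.size(); i++) {
      lambdaTypeMapping.put(lambdaArguments.get(i), inputTypes.get(i));
    }
  }

  // Returns the type bound to a lambda variable, or null if it is unknown here.
  String getLambdaType(final String variableName) {
    return lambdaTypeMapping.get(variableName);
  }
}
```

With one copy per lambda argument, (x,y) and (k,v) each get their own mapping, and the size check in mapInputTypes still holds per lambda.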

ksqldb-engine/src/main/java/io/confluent/ksql/engine/rewrite/AstSanitizer.java

stevenpyzhang

some more nits

ksqldb-execution/src/main/java/io/confluent/ksql/execution/function/UdfUtil.java

ksqldb-execution/src/main/java/io/confluent/ksql/execution/codegen/helpers/TriFunction.java

ksqldb-execution/src/main/java/io/confluent/ksql/execution/codegen/helpers/LambdaUtil.java

ksqldb-execution/src/test/java/io/confluent/ksql/execution/codegen/helpers/LambdaUtilTest.java

ksqldb-udf/src/main/java/io/confluent/ksql/function/TriFunction.java

ksqldb-engine/src/main/java/io/confluent/ksql/engine/EngineContext.java

ksqldb-execution/src/main/java/io/confluent/ksql/execution/codegen/TypeContext.java

ksqldb-common/src/main/java/io/confluent/ksql/function/types/ParamTypes.java

stevenpyzhang · 2021-02-18T19:23:01Z

ksqldb-udf/src/main/java/io/confluent/ksql/schema/ksql/SqlArgument.java


-  public SqlArgument(final SqlType type) {
+  public SqlArgument(final SqlType type, final SqlLambda lambda) {


We should consider changing the sqlType and sqlLambda variables to be Optionals so we don't need to do null checks where SqlArgument is used.

In the constructor we can use Optional.ofNullable

Now that I'm thinking about this more, we could actually have SqlArgument be a base class and have two classes extend it: SqlArgumentType (which has a SqlType) and SqlArgumentLambda (which has a SqlLambda). There are certain portions of the code, such as the UDAFs, where we now need to do some unnecessary null checks (they should only be supporting SqlArgumentType). The GenericsUtil can still take in SqlArgument, and we can use instanceof checks instead of null checks/optional checks when deciding what behavior to use.

One idea for the optional is to have the getter functions like this

    public SqlType getSqlTypeOrThrow() {
      if (sqlType.isPresent()) {
        return sqlType.get();
      }
      if (sqlLambda.isPresent()) {
        throw new RuntimeException("Was expecting sql type as an argument");
      }
      return null;
    }

    public Optional<SqlLambda> getSqlLambda() {
      return sqlLambda;
    }

    public SqlLambda getSqlLambdaOrThrow() {
      if (sqlLambda.isPresent()) {
        return sqlLambda.get();
      }
      throw new RuntimeException("Was expecting sql lambda as an argument");
    }

We return null from getSqlTypeOrThrow if both the type and lambda are missing because this is when the function argument is NULL

+1 this makes the contract way clearer
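For comparison, the base-class alternative floated earlier in this thread could look roughly like the sketch below. All class names here (`ArgumentSketch`, `TypeArgumentSketch`, `LambdaArgumentSketch`, `ArgumentDescriber`) are hypothetical, and SQL types are represented as plain strings rather than SqlType, so this is not the code that ended up in the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical base class: an argument is either a plain type or a lambda.
abstract class ArgumentSketch {
}

// Argument backed by a plain SQL type (the only kind a UDAF would accept).
final class TypeArgumentSketch extends ArgumentSketch {
  private final String sqlType;

  TypeArgumentSketch(final String sqlType) {
    this.sqlType = Objects.requireNonNull(sqlType, "sqlType");
  }

  String getSqlType() {
    return sqlType;
  }
}

// Argument backed by a lambda: its input types plus its return type.
final class LambdaArgumentSketch extends ArgumentSketch {
  private final List<String> inputTypes;
  private final String returnType;

  LambdaArgumentSketch(final List<String> inputTypes, final String returnType) {
    this.inputTypes = new ArrayList<>(Objects.requireNonNull(inputTypes, "inputTypes"));
    this.returnType = Objects.requireNonNull(returnType, "returnType");
  }

  List<String> getInputTypes() {
    return inputTypes;
  }

  String getReturnType() {
    return returnType;
  }
}

final class ArgumentDescriber {
  private ArgumentDescriber() {
  }

  // Chooses behavior with instanceof checks instead of null/Optional checks.
  // A null argument still stands for a SQL NULL literal.
  static String describe(final ArgumentSketch argument) {
    if (argument == null) {
      return "null";
    }
    if (argument instanceof TypeArgumentSketch) {
      return ((TypeArgumentSketch) argument).getSqlType();
    }
    if (argument instanceof LambdaArgumentSketch) {
      final LambdaArgumentSketch lambda = (LambdaArgumentSketch) argument;
      return "LAMBDA<" + lambda.getInputTypes() + ", " + lambda.getReturnType() + ">";
    }
    throw new IllegalArgumentException("Unknown argument kind: " + argument.getClass());
  }
}
```

The trade-off is that callers which only support plain types can declare the narrower type in their signatures, at the cost of a cast wherever both kinds are accepted.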

...p/src/main/java/io/confluent/ksql/rest/server/resources/streaming/StreamedQueryResource.java

stevenpyzhang · 2021-02-18T20:10:02Z

ksqldb-engine/src/test/java/io/confluent/ksql/function/UdfLoaderTest.java

+            SqlTypes.INTEGER);
+
+    // When:
+    final KsqlScalarFunction fun = FUNC_REG.getUdfFactory(FunctionName.of("reduce_map"))


nit: Let's just use some random generic function names for these tests

I think you have to use an existing function name so it can find it in the function library, unless there's a way to mock that? When I changed it to lambda_func I got io.confluent.ksql.util.KsqlException: Can't find any functions with the name 'lambda_func'

Oh really? That seems really odd since we'd break these tests if we changed the function name. It feels like we should have these functions mocked. If that's how it is then leaving this is fine then.

ksqldb-execution/src/main/java/io/confluent/ksql/execution/codegen/SqlToJavaVisitor.java

agavra

Contents LGTM so giving the green check (though wait for Steven's +1 before merging), I left a spattering of nits here and there - nothing major. With regards to these two questions Steven's asked:

the usage of SqlArgument as it seems a little hacky with how it is currently.

I like your inline suggestion of making different classes for them and using instanceof checks (or using polymorphism to naturally handle it)

There's a lot of repeated code between SqlToJavaVisitor, CodeGenRunner, ExpressionTypeManager with adding types to the context for use in the child nodes to map lambda inputs to types, but we're not sure if there's any way around it.

Added a suggestion inline, maybe I'm missing something?

ksqldb-common/src/main/java/io/confluent/ksql/function/GenericsUtil.java

agavra · 2021-02-19T01:35:23Z

ksqldb-common/src/main/java/io/confluent/ksql/function/UdfIndex.java

+          if (argument == null) {
+            return "null";
+          } else {
+            final SqlType sqlType = argument.getSqlType();


EDIT: I see @stevenpyzhang already commented about this below, I like his suggestion! Just keeping this comment here for historical purposes

I see the following method:

    public static SqlArgument of(final SqlType sqlType, final SqlLambda lambdaType) {
      return new SqlArgument(sqlType, lambdaType);
    }

What does it mean for a SqlArgument to have both a SqlType and a LambdaType? Can we update the docs to describe what this does? If it's not possible, we should enforce that in the code (make sure at most one of them is non-null).

agavra · 2021-02-19T03:23:32Z

ksqldb-common/src/test/java/io/confluent/ksql/function/UdfIndexTest.java

+    assertThat(fun2.name(), equalTo(EXPECTED));
+    assertThat(e.getMessage(), containsString("Valid alternatives are:"
+        + lineSeparator()
+        + "expected(MAP<VARCHAR, VARCHAR>, LAMBDA<[VARCHAR, VARCHAR], A>)"));


if this is shown to users, we might want to consider toString on lambdas to be something like (VARCHAR, VARCHAR) -> A instead of LAMBDA<[VARCHAR, VARCHAR], A> which is somewhat difficult to read

Do you think we should leave the LAMBDA identifier for the new toString? Just wondering if users will be able to easily identify what it is if it's just (VARCHAR, VARCHAR) -> A
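To make the two candidate renderings concrete, here is a small sketch. `LambdaSignatureSketch` and `toLabeledString` are hypothetical names and types are plain strings; the actual SqlLambda class may format itself differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical holder for a lambda's input types and return type.
final class LambdaSignatureSketch {
  private final List<String> inputTypes;
  private final String returnType;

  LambdaSignatureSketch(final List<String> inputTypes, final String returnType) {
    this.inputTypes = new ArrayList<>(inputTypes);
    this.returnType = returnType;
  }

  // Arrow-style rendering, e.g. "(VARCHAR, VARCHAR) -> A".
  @Override
  public String toString() {
    return inputTypes.stream().collect(Collectors.joining(", ", "(", ")"))
        + " -> " + returnType;
  }

  // Same rendering, but keeping an explicit LAMBDA marker in front.
  String toLabeledString() {
    return "LAMBDA " + this;
  }
}
```

Either form could be dropped into the "Valid alternatives are:" message; the labeled one trades a little length for an explicit hint that the parameter is a lambda.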

ksqldb-common/src/test/java/io/confluent/ksql/function/UdfIndexTest.java

ksqldb-execution/src/main/java/io/confluent/ksql/execution/codegen/CodeGenRunner.java

agavra · 2021-02-19T03:41:15Z

ksqldb-execution/src/main/java/io/confluent/ksql/execution/codegen/CodeGenRunner.java

+        process(argExpr, context);
+        final SqlType newSqlType = expressionTypeManager.getExpressionSqlType(argExpr, context);
+        // for lambdas - if we see this it's the  array/map being passed in we save the type
+        if (context.notAllInputsSeen()) {


not part of this PR, but I think I might be missing something

    public boolean notAllInputsSeen() {
      return lambdaInputTypeMapping.size() != lambdaInputTypes.size()
          || lambdaInputTypes.size() == 0;
    }

Can this handle lambdas with 0 arguments properly? If not that can be OK but we should make sure we fail that if anyone tries to register one.

Hmm... I think it would register more input types than we want it to, but I think lambdas with 0 arguments would still fail. Would that be something like array_transform(testval1, 5 => 5)? or map_transform(MAP(5:=3, 1:=1), () => 2, () => 5)?

Just looked back into it, in LambdaFunctionCall we check the argument size
    if (arguments.size() == 0) {
      throw new IllegalArgumentException(
          String.format("Lambda expression must have at least 1 argument. => %s", body.toString()));
    }
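The interaction between the two checks quoted in this thread can be illustrated with a stripped-down sketch. `InputTrackingSketch`, `addLambdaInputType`, `mapLambdaInputs`, and `requireArguments` are hypothetical names, and types are plain strings; only the boolean expression and the exception message mirror the snippets above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, minimal model of the input tracking discussed above.
final class InputTrackingSketch {
  private final List<String> lambdaInputTypes = new ArrayList<>();
  private final Map<String, String> lambdaInputTypeMapping = new HashMap<>();

  // Records the type of a non-lambda input (e.g. the array or map argument).
  void addLambdaInputType(final String type) {
    lambdaInputTypes.add(type);
  }

  // Binds lambda variable names to the recorded input types by position.
  void mapLambdaInputs(final List<String> names) {
    if (names.size() != lambdaInputTypes.size()) {
      throw new IllegalArgumentException("Argument count mismatch");
    }
    for (int i = 0; i < names.size(); i++) {
      lambdaInputTypeMapping.put(names.get(i), lambdaInputTypes.get(i));
    }
  }

  // Same expression as in the quoted method: with zero recorded inputs
  // this is always true, so it cannot signal "done" for a 0-argument lambda.
  boolean notAllInputsSeen() {
    return lambdaInputTypeMapping.size() != lambdaInputTypes.size()
        || lambdaInputTypes.size() == 0;
  }

  // Mirrors the guard quoted from LambdaFunctionCall: reject empty argument lists.
  static void requireArguments(final List<String> arguments, final String body) {
    if (arguments.size() == 0) {
      throw new IllegalArgumentException(
          String.format("Lambda expression must have at least 1 argument. => %s", body));
    }
  }
}
```

Under these assumptions, a zero-argument lambda never reaches the tracking logic because the guard rejects it first, which matches the conclusion reached in the replies above.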

ksqldb-execution/src/main/java/io/confluent/ksql/execution/codegen/SqlToJavaVisitor.java

agavra · 2021-02-19T03:52:31Z

ksqldb-udf/src/main/java/io/confluent/ksql/schema/ksql/SqlArgument.java


-  public SqlArgument(final SqlType type) {
+  public SqlArgument(final SqlType type, final SqlLambda lambda) {


+1 this makes the contract way clearer

guozhangwang

Sorry I'm late on getting a final pass! I took another look, and had a meta question regarding the code gen traversal itself (not related to this PR specifically).

Filed #7066 to summarize my thoughts.

guozhangwang · 2021-02-22T23:15:53Z

ksqldb-execution/src/main/java/io/confluent/ksql/execution/codegen/CodeGenRunner.java

+        if (argExpr instanceof LambdaFunctionCall) {
+          argumentTypes.add(
+              SqlArgument.of(
+                  SqlLambda.of(context.getLambdaInputTypes(), childContext.getSqlType())));


nit: I think this is cleaner to just use resolvedArgType as the second parameter, which will be the same as childContext.getSqlType(). Ditto on other two classes.

Also maybe we could consolidate the shared logic in FunctionCall:

List<SqlArgument> resolveArgumentTypes(TypeContext context)

shared among CodeGenRunner, SqlToJavaVisitor and ExpressionTypeManager?

And I'm even wondering, why we need to try to resolve the argument types multiple times, rather than just cache the resolved types inside the FunctionCall node itself after done for the first time, and then re-use it in later iterations of traversals?

Agreed @guozhangwang, I think we can definitely optimize these 3 classes to avoid duplicate traversals. Thanks for filing the issue. Since this was an existing issue even before lambdas, I think it's out of scope for the lambdas implementation for now, and we can follow up on the issue you filed in the future.
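As a rough picture of the caching idea raised in this thread, a node could memoize its resolved argument types after the first traversal. This is a sketch only: `CachingFunctionCallSketch` and `getResolutionCount` are hypothetical, types are plain strings, and it assumes the resolved types do not depend on which traversal asks for them, which would need to be verified for lambdas whose variable types come from the surrounding context.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical function-call node that resolves its argument types at most once.
final class CachingFunctionCallSketch {
  // Stand-in for the (potentially expensive) type resolution of the arguments.
  private final Supplier<List<String>> resolver;
  // Null until the first resolution has completed.
  private List<String> resolvedArgumentTypes;
  // Counts how often the resolver actually ran (for illustration only).
  private int resolutionCount;

  CachingFunctionCallSketch(final Supplier<List<String>> resolver) {
    this.resolver = resolver;
  }

  // Resolves on first use, then returns the cached, read-only result.
  List<String> resolveArgumentTypes() {
    if (resolvedArgumentTypes == null) {
      resolvedArgumentTypes =
          Collections.unmodifiableList(new ArrayList<>(resolver.get()));
      resolutionCount++;
    }
    return resolvedArgumentTypes;
  }

  int getResolutionCount() {
    return resolutionCount;
  }
}
```

Each of the three visitors could then call resolveArgumentTypes() on the node instead of re-deriving the list, though this sketch is not thread-safe and ignores cache invalidation.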

lct45 requested review from agavra and stevenpyzhang February 8, 2021 14:06

lct45 requested a review from a team as a code owner February 8, 2021 14:06