Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add lambda type checking without adding lambda sql type #6966

Merged
merged 6 commits into from
Feb 22, 2021

Conversation

lct45
Copy link
Contributor

@lct45 lct45 commented Feb 8, 2021

Description

Resolves unknown types in lambda's mainly through the ExpressionTypeManager and passes additional information in the SqlArgument to the UDF to resolve function type issues there, instead of adding a new SqlType.LAMBDA

The sequence of PR's to review is:

  1. feat: adding TypeContext for Lambdas #6989
  2. refactor: use SqlArgument wrapper to look up functions #7011
  3. This PR

Testing done

Describe the testing strategy. Unit and integration tests are expected for any behavior changes.

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@lct45 lct45 requested review from agavra and stevenpyzhang February 8, 2021 14:06
@lct45 lct45 requested a review from a team as a code owner February 8, 2021 14:06
@ghost
Copy link

ghost commented Feb 8, 2021

@confluentinc It looks like @lct45 just signed our Contributor License Agreement. 👍

Always at your service,

clabot

@guozhangwang
Copy link
Contributor

Both PRs with 1200+ LOC :P

Copy link
Member

@stevenpyzhang stevenpyzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some initial thoughts

@lct45
Copy link
Contributor Author

lct45 commented Feb 9, 2021

@guozhangwang both PR are overly large, the hope was to pick the type checking approach we're going with and then I can break it out into manageable PRs

Copy link
Contributor

@guozhangwang guozhangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a cleaner approach compared with #6967 to not introduce a LambdaType as the returned column type.

I just have one meta question for my own understanding: I cannot fully get the difference between the TypeContext and the ExpressionTypeContext within ExpressionTypeManager. More specifically, when does the generic -> sqlType population happens (inside TypeContext, before visitor.process(), or inside ExpressionTypeContext, after visitor.process())? It seems that it differs between UDFs and lambdas, but I'm not sure why it's the case. @lct45 could you elaborate a bit more to me?

final ExpressionTypeContext expressionTypeContext = new ExpressionTypeContext();
expressionTypeContext.setLambdaTypes(inputMapping);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own understanding: when we call setLambdaTypes here, is the mapping already populated or not? If yes, then we would not need to mapInputTypes anymore; if not, why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mapping is already set here, but it's set in either CodeGenRunner or SqlToJavaVisitor. this is a way to pass the mapping we've already done in a TypeContext into a new ExpressionTypeContext that's used to resolve ultimately resolve the types within ExpressionTypeManager. ExpressionTypeManager is used by both SqlToJavaVisitor and CodeGenRunner to do type resolution.

I think this ties into what @stevenpyzhang commented earlier, that it may make sense to combine ExpressionTypeContext and TypeContext, as it would make passing this information infinitely easier. I think I'll go ahead and do that unless @agavra has any objections.

Currently, the difference between the contexts is one, ExpressionTypeContext, is just used in ExpressionTypeManager. This existed before lambdas for passing around sql return types. TypeContext, as the code stands now, is used by SqlToJavaVisitor and CodeGenRunner to store information relevant to lambdas and pass it to the relevant methods. It doesn't contain the basic sqlType storage that ExpressionTypeManager does. ExpressionTypeManager doesn't currently use TypeContext at all.

For lambdas, the generic -> sqlType population for the variables (x in x => ucase(x)) happens when the context (either TypeContext or ExpressionTypeContext) is passed information in the visitLambdaExpression function in ExpressionTypeManager, CodeGenRunner, and SqlToJavaVisitor. This happens after visitor.process() for each of the individual flows through one of the three relevant classes.

Let me know if this was more confusing than enlightening, I'm happy to have another go at explaining the relationship between all these pieces

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't looked at the code yet, but from your description combining the two makes sense to me!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lct45 , I think my confusion was that:

If in line 111 here, which is before we start visiting the node, the genericName -> sqlType inputMapping we are set via expressionTypeContex.setLambdaTypes() in is already fully populated, then inside visitLambdaExpression (line 200) which is part of the visiting traversal, we should not need to call mapInputTypes which is trying to build the mapping again.

Now I think I get your point, which is that the inputMapping passed in via getExpressionSqlType is not fully populated, i.e. it only contains the mapping entries for variables outside the lambda function, but not contain entries for variables that are only declared inside the lambda function. Is that right?

Copy link
Contributor

@guozhangwang guozhangwang Feb 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, if we consolidate these two into a single TypeContext, would a single TypeContext object be used in multiple lambda expressions? I'm asking this because if that's the case, then its inputTypes and lambdaTypeMapping would be a global structure containing mappings for multiple expressions, and we cannot enforce inputTypes.size() == argumentList.size() in mapInputTypes.

EDIT: never mind, I saw your latest commit it is indeed one-per-expression.

Re-EDIT: actually I'm not very certain now, since one nested valueExpression could actually contain multiple primaryExpression, each being a separate lambda. In that case, the constructed TypeContext may be shared among multiple lambdas, is that right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevenpyzhang may be able to weigh in more here, but I think all the lambdas inside of a UDF should have the same number of inputs so I don't think there would be a concern if the TypeContext is shared between them. The example I'm thinking of is from the KLIP, transform_map(map, (k, v) => new_k, (k, v) => new_v), so having a global map of k,v should be fine. I did just run through it though, and the context is shared. This is necessary for the input though, since in a lambda it'll be one input mapped to the lambda variables (i.e. we only see the map once even if we have 2 lamda expressions). This does raise the issue of variable names. Should users be able to do a lambda like transform_map(map, (x,y) => x + y, (k,v) => k - v)? Because I think you're right @guozhangwang , right now I don't it would accept that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think transform_map(map, (x,y) => x + y, (k,v) => k - v) should be supported. We could just make a copy of the TypeContext and pass a fresh instance to each argument that we visit so that we can have different mappings for each Lambda node.

@stevenpyzhang stevenpyzhang self-assigned this Feb 10, 2021
Copy link
Member

@stevenpyzhang stevenpyzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some more nits

@lct45 lct45 force-pushed the lambda_type_fix branch 2 times, most recently from aec051f to e4ae803 Compare February 16, 2021 22:31
@lct45 lct45 requested a review from JimGalasyn as a code owner February 16, 2021 22:31
@lct45 lct45 force-pushed the lambda_type_fix branch 4 times, most recently from d2f2519 to 2905abb Compare February 17, 2021 20:44
@lct45 lct45 force-pushed the lambda_type_fix branch 2 times, most recently from a17e06b to 4264da9 Compare February 18, 2021 17:23

public SqlArgument(final SqlType type) {
public SqlArgument(final SqlType type, final SqlLambda lambda) {
Copy link
Member

@stevenpyzhang stevenpyzhang Feb 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consider changing the sqlType and sqlLambda variables to be Optionals so we don't need to do null checks where SqlArgument is used.

In the constructor we can use Optional.ofNullable

Copy link
Member

@stevenpyzhang stevenpyzhang Feb 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I'm thinking about this more, we could actually have SqlArgument be a base class and two classes extend it SqlArgumentType (which has a SqlType) and SqlArgumentLambda(which has a SqlLambda). There are certain portions of the code such as the UDAF's where we now need to do some unnecessary null checks (they should only be supporting SqlArgumentType). The GenericsUtil can still take in SqlArgument and we can use instance of checks instead of null checks/optional checks when deciding what behavior to use.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One idea for the optional is to have the getter functions like this

  public SqlType getSqlTypeOrThrow() {
    if (sqlType.isPresent()) {
      return sqlType.get();
    }
    if (sqlLambda.isPresent()) {
      throw new RuntimeException("Was expecting sql type as an argument");
    }
    return null;
  }

  public Optional<SqlLambda> getSqlLambda() {
    return sqlLambda;
  }

  public SqlLambda getSqlLambdaOrThrow() {
    if (sqlLambda.isPresent()) {
      return sqlLambda.get();
    }
    throw new RuntimeException("Was expecting sql lambda as an argument");
  }

We return null from getSqlTypeOrThrow if both the type and lambda are missing because this is when the function argument is NULL

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 this makes the contract way clearer

@stevenpyzhang stevenpyzhang requested a review from a team February 18, 2021 19:26
SqlTypes.INTEGER);

// When:
final KsqlScalarFunction fun = FUNC_REG.getUdfFactory(FunctionName.of("reduce_map"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Let's just use some random generic function names for these tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you have to use an existing function name so it can find it in the function library, unless there's a way to mock that? When I changed it to lambda_func I got io.confluent.ksql.util.KsqlException: Can't find any functions with the name 'lambda_func'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh really? That seems really odd since we'd break these tests if we changed the function name. It feels like we should have these functions mocked. If that's how it is then leaving this is fine then.

Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contents LGTM so giving the green check (though wait for Steven's +1 before merging), I left a spattering of nits here and there - nothing major. With regards to these two questions Steven's asked:

the usage of SqlArgument as it seems a little hacky with how it is currently.

I like your inline suggestion of making different classes for them and using instanceof checks (or using polymorphism to naturally handle it)

There's a lot of repeated code between SqlToJavaVisitor, CodeGenRunner, ExpressionTypeManager with adding types to the context for use in the child nodes to map lambda inputs to types, but we're not sure if there's any way around it.

Added a suggestion inline, maybe I'm missing something?

if (argument == null) {
return "null";
} else {
final SqlType sqlType = argument.getSqlType();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDIT: I see @stevenpyzhang already commented about this below, I like his suggestion! Just keeping this comment here for historical purposes


I see the following method:

  public static SqlArgument of(final SqlType sqlType, final SqlLambda lambdaType) {
    return new SqlArgument(sqlType, lambdaType);
  }

What does it mean for a SqlArgument to have both a SqlType and a LambdaType? Can we update the docs to describe what this does? If it's not possible, we should enforce that in the code (make sure at most one of them is null).

assertThat(fun2.name(), equalTo(EXPECTED));
assertThat(e.getMessage(), containsString("Valid alternatives are:"
+ lineSeparator()
+ "expected(MAP<VARCHAR, VARCHAR>, LAMBDA<[VARCHAR, VARCHAR], A>)"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this is shown to users, we might want to consider toString on lambdas to be something like (VARCHAR, VARCHAR) -> A instead of LAMBDA<[VARCHAR, VARCHAR], A> which is somewhat difficult to read

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should leave the LAMBDA identifier for the new toString? Just wondering if users will be able to easily identify if it's ((VARCHAR, VARCHAR) -> A

process(argExpr, context);
final SqlType newSqlType = expressionTypeManager.getExpressionSqlType(argExpr, context);
// for lambdas - if we see this it's the array/map being passed in we save the type
if (context.notAllInputsSeen()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not part of this PR, but I think I might be missing something

  public boolean notAllInputsSeen() {
    return lambdaInputTypeMapping.size() != lambdaInputTypes.size() || lambdaInputTypes.size() == 0;
  }

Can this handle lambdas with 0 arguments properly? If not that can be OK but we should make sure we fail that if anyone tries to register one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I think it would register more input types than we want it to, but I think lambdas with 0 arguments would still fail. Would that be something like array_transform(testval1, 5 => 5)? or map_transform(MAP(5:=3, 1:=1), () => 2, () => 5)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just looked back into it, in LambdaFunctionCall we check the argument size
if (arguments.size() == 0) { throw new IllegalArgumentException( String.format("Lambda expression must have at least 1 argument. => %s", body.toString())); }


public SqlArgument(final SqlType type) {
public SqlArgument(final SqlType type, final SqlLambda lambda) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 this makes the contract way clearer

@lct45 lct45 force-pushed the lambda_type_fix branch 3 times, most recently from 4db9976 to 5049d52 Compare February 19, 2021 20:37
Copy link
Contributor

@guozhangwang guozhangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I'm late on getting a final pass! I took another look, and had a meta question regarding the code gen traversal itself (not related to this PR specifically).

Filed #7066 to summarize my thoughts.

if (argExpr instanceof LambdaFunctionCall) {
argumentTypes.add(
SqlArgument.of(
SqlLambda.of(context.getLambdaInputTypes(), childContext.getSqlType())));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think this is cleaner to just use resolvedArgType as the second parameter, which will be the same as childContext.getSqlType(). Ditto on other two classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also maybe we could consolidate the shared logic in FunctionCall:

List<SqlArgument> argumentTypes resolveArgumentTypes(TypeContext context)

shared among CodeGenRunner, SqlToJavaVisitor and ExpressionTypeManager?

And I'm even wondering, why we need to try to resolve the argument types multiple times, rather than just cache the resolved types inside the FunctionCall node itself after done for the first time, and then re-use it in later iterations of traversals?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed @guozhangwang I think we can definitely optimize these 3 classes to avoid duplicate traversals. Thanks for filing the issue. I think for now since this was an existing issue even before lambdas, it's out of scope for the lambdas implementation and we can follow up on issue you filed in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants