Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-37019][SQL] Add codegen support to array transform #34294

Closed
wants to merge 2 commits into from

Conversation

Kimahriman
Copy link
Contributor

@Kimahriman Kimahriman commented Oct 15, 2021

What changes were proposed in this pull request?

This PR adds codegen support to ArrayTransform. This is my first time playing around with codegen, so definitely looking for any feedback. I ran into several issues along the way which you'll see in some checks I had to add. Specifically:

  • I initially added lambda variable tracking to the codegen context, to make sure a function can't be split out while there is any active lambda variables and using local project values for the lambda variables. But this lead to a function to large issue when testing it in production code so I changed it to just use everything off of the reference to the atomic variable
  • Made sure lambda functions themselves can never be considered for subexpression elimination

Questions I have:

  • Does it make sense to only use the reference to the atomic variable or should I try to add a mutable state or something for less function calls in the happy non-fallback case?
  • Any other weird corner cases I'm not thinking of?

Why are the changes needed?

To improve performance of transform operations, letting the children be codegen'd and participate in WholeStageCodegen

Does this PR introduce any user-facing change?

No, only performance improvements.

How was this patch tested?

Existing unit tests, let me know if there's other codegen-specific unit tests I should add.

@github-actions github-actions bot added the SQL label Oct 15, 2021
@@ -338,6 +390,68 @@ case class ArrayTransform(
result
}

override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of this can probably be abstracted out into the parent traits, but I figured that will be easier to do when implementing a second function

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@Kimahriman
Copy link
Contributor Author

I have array filter working already too, just waiting for some feedback on this.

@Kimahriman
Copy link
Contributor Author

Kimahriman commented Oct 22, 2021

Codegen example for df.selectExpr("transform(i, (x, i) -> x + i)"):

/* 038 */       if (!localtablescan_isNull_0) {
/* 039 */         final int project_numElements_0 = localtablescan_value_0.numElements();
/* 040 */
/* 041 */         ArrayData project_arrayData_0 = ArrayData.allocateArrayData(
/* 042 */           4, project_numElements_0, " transform failed.");
/* 043 */
/* 044 */         for (int project_i_0 = 0; project_i_0 < project_numElements_0; project_i_0++) {
/* 045 */           int lambda_x_6 = localtablescan_value_0.getInt(project_i_0);
/* 046 */           boolean lambda_x_6_isNull = localtablescan_value_0.isNullAt(project_i_0);
/* 047 */           ((java.util.concurrent.atomic.AtomicReference) references[1] /* x_6 */).set(lambda_x_6);
/* 048 */
/* 049 */           int lambda_i_7 = project_i_0;
/* 050 */           ((java.util.concurrent.atomic.AtomicReference) references[2] /* i_7 */).set(lambda_i_7);
/* 051 */
/* 052 */           boolean project_isNull_3 = true;
/* 053 */           int project_value_3 = -1;
/* 054 */
/* 055 */           if (!lambda_x_6_isNull) {
/* 056 */             project_isNull_3 = false; // resultCode could change nullability.
/* 057 */
/* 058 */             project_value_3 = lambda_x_6 + lambda_i_7;
/* 059 */
/* 060 */           }
/* 061 */           boolean project_isNull_2 = project_isNull_3;
/* 062 */           int project_value_2 = project_value_3;
/* 063 */
/* 064 */           if (project_isNull_2) {
/* 065 */             project_arrayData_0.setNullAt(project_i_0);
/* 066 */           } else {
/* 067 */             project_arrayData_0.setInt(project_i_0, project_value_2);
/* 068 */           }
/* 069 */
/* 070 */         }
/* 071 */         project_value_0 = project_arrayData_0;
/* 072 */
/* 073 */       }

@Kimahriman
Copy link
Contributor Author

Already ran into an issue testing this out in a real use case where making it non-splittable caused a function to grow to big and revert back to interpreted, so I changed to just always use the atomic variable reference to read the value so it can be splittable. New sample codegen:

/* 039 */         final int project_numElements_0 = localtablescan_value_0.numElements();
/* 040 */
/* 041 */         ArrayData project_arrayData_0 = ArrayData.allocateArrayData(
/* 042 */           4, project_numElements_0, " transform failed.");
/* 043 */
/* 044 */         for (int project_i_0 = 0; project_i_0 < project_numElements_0; project_i_0++) {
/* 045 */           if (localtablescan_value_0.isNullAt(project_i_0)) {
/* 046 */             ((java.util.concurrent.atomic.AtomicReference) references[2] /* x_229 */).set(null);
/* 047 */           } else {
/* 048 */             ((java.util.concurrent.atomic.AtomicReference) references[2] /* x_229 */).set(localtablescan_value_0.getInt(project_i_0));
/* 049 */           }
/* 050 */
/* 051 */           boolean project_isNull_3 = true;
/* 052 */           int project_value_3 = -1;
/* 053 */           Object project_tmpAtomic_0 = ((java.util.concurrent.atomic.AtomicReference) references[1] /* x_229 */).get();
/* 054 */           int project_value_4 = -1;
/* 055 */           boolean project_isNull_4 = project_tmpAtomic_0 == null;
/* 056 */           if (!project_isNull_4) {
/* 057 */             project_value_4 = (Integer)project_tmpAtomic_0;
/* 058 */           }
/* 059 */           if (!project_isNull_4) {
/* 060 */             project_isNull_3 = false; // resultCode could change nullability.
/* 061 */
/* 062 */             project_value_3 = project_value_4 + 1;
/* 063 */
/* 064 */           }
/* 065 */           boolean project_isNull_2 = project_isNull_3;
/* 066 */           int project_value_2 = project_value_3;
/* 067 */
/* 068 */           if (project_isNull_2) {
/* 069 */             project_arrayData_0.setNullAt(project_i_0);
/* 070 */           } else {
/* 071 */             project_arrayData_0.setInt(project_i_0, project_value_2);
/* 072 */           }
/* 073 */
/* 074 */         }
/* 075 */         project_value_0 = project_arrayData_0;
/* 076 */
/* 077 */       }

@Kimahriman
Copy link
Contributor Author

ping @viirya @HyukjinKwon since you two usually look at my PRs or might know the right other people to ping for thoughts :)

@Kimahriman
Copy link
Contributor Author

Closing in favor of #34558

@Kimahriman Kimahriman closed this Nov 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants