Add Scala API for optimize #961

Closed · wants to merge 5 commits

Conversation

@Kimahriman (Contributor)

Resolves #960

Add functions to DeltaTable to perform optimization. This is a lightweight wrapper that just calls the OptimizeExecutor directly. All the commands seem to be implemented slightly differently (some return an empty DataFrame, some return an actual result), so I wasn't sure which route to take. I just had to move the partition condition checking into the executor. I also added a few duplicate tests that use the DeltaTable API instead of SQL.

@vkorukanti (Collaborator) left a comment

@Kimahriman These APIs are useful, and thank you for the PR.

The PR looks good. Left a few comments.

*
* @since 1.2.0
*/
def optimize(condition: String): Unit = {
@vkorukanti (Collaborator):

The SQL equivalent of OPTIMIZE returns a DataFrame containing the operation metrics. We can do the same here (look at vacuum for an example).
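For reference, a minimal sketch of that SQL route (the table path is hypothetical); the result of spark.sql is the metrics DataFrame the Scala API should mirror:

```
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical path; the OPTIMIZE SQL command returns its operation
// metrics as a DataFrame, which is what the Scala API can return too.
def optimizeViaSql(spark: SparkSession): DataFrame =
  spark.sql("OPTIMIZE delta.`/tmp/events` WHERE date = '2021-11-18'")
```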

@Kimahriman (Contributor, author):

Do you know offhand the best way to convert the Seq[Row] to a DataFrame? I mostly only use the Python API. There's a createDataFrame(rows: java.util.List[Row], schema: StructType), but I'm not sure if there's a way to use the output to create that schema, or if I should just manually define it in the class as well.
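For illustration, a minimal sketch of the createDataFrame route with a manually defined, hypothetical metrics schema (not the PR's code):

```
import scala.collection.JavaConverters._
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical metrics schema, defined manually since a Seq[Row]
// carries no schema of its own.
val metricsSchema = StructType(Seq(
  StructField("path", StringType),
  StructField("metrics", StringType)))

def toMetricsDataFrame(spark: SparkSession, rows: Seq[Row]): DataFrame =
  spark.createDataFrame(rows.asJava, metricsSchema)
```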

@Kimahriman (Contributor, author):

And it looks like executeVacuum also just returns sparkSession.emptyDataFrame.

@vkorukanti (Collaborator) commented Feb 25, 2022:

It can be done similarly to executeGenerate. Instead of using the OptimizeExecutor, use the DeltaCommand implementation OptimizeTableCommand. This also removes the changes that move the partition filter check from OptimizeTableCommand into OptimizeExecutor.

For the vacuum API, please create an issue.

@Kimahriman (Contributor, author):

I didn't try that route initially because it seemed weird to call a function on DeltaTable that then has to generate the table ID to pass to another function, just so it can look up the DeltaLog again. Same for the expression: I wanted to support passing a Column expression, and it seemed weird to have to convert the Column back to SQL just so it could be parsed into an expression again.

@Kimahriman (Contributor, author):

Got that route working by creating the table identifier the way the generate command does.
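For illustration, a minimal sketch of that table-identifier construction (the function name and wiring are illustrative, not the PR's code):

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier

// Wrap the Delta table's data path in a delta.`<path>` identifier and
// parse it back, the same way the generate command builds its identifier.
def tableIdentifierFor(spark: SparkSession, dataPath: String): TableIdentifier =
  spark.sessionState.sqlParser.parseTableIdentifier(s"delta.`$dataPath`")
```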

}

/**
* Optimize data in the table that match the given `condition`.
@vkorukanti (Collaborator):

Specify that condition should only contain filters on partition columns; otherwise an [[AnalysisException]] is thrown. Same comment for the next API.
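An illustration of the documented restriction, assuming a hypothetical table partitioned by date and this PR's condition-string API:

```
// `deltaTable` is a hypothetical io.delta.tables.DeltaTable over a
// table partitioned by `date`.
deltaTable.optimize("date = '2021-11-18'") // OK: filters only a partition column
deltaTable.optimize("value > 100")         // throws AnalysisException: non-partition column
```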

@Kimahriman (Contributor, author):

Added to the comment

assert(fileListBefore.count(_.partitionValues === Map("id" -> "0")) > 1)

val versionBefore = deltaLogBefore.snapshot.version
io.delta.tables.DeltaTable.forPath(spark, path).optimize("id = 0")
@vkorukanti (Collaborator):

One suggestion to minimize the test code duplication (sketched below):

1. Add an abstract method executeOptimize(path, condition) to OptimizeCompactionSuiteBase and modify all tests to call this method rather than invoking the SQL or Scala API directly.

2. Rename OptimizeCompactionSuite to OptimizeCompactionSQLSuite and implement executeOptimize(path, condition) to issue an OPTIMIZE SQL command.

3. Add OptimizeCompactionScalaSuite extending OptimizeCompactionSuiteBase and implement executeOptimize(path, condition) to issue an OPTIMIZE Scala API call.
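A rough sketch of that layout (test bodies elided; the `spark` wiring is abstracted, and the Scala call is shown with the final builder API for illustration):

```
import org.apache.spark.sql.SparkSession

// Sketch only: test bodies are elided, and `spark` would really come
// from the shared-session test infrastructure, not SparkSession.active.
abstract class OptimizeCompactionSuiteBase {
  def spark: SparkSession

  // Each concrete suite decides how OPTIMIZE is issued against a path.
  def executeOptimize(path: String, condition: Option[String]): Unit

  // ... shared tests call executeOptimize(...) instead of a specific API ...
}

class OptimizeCompactionSQLSuite extends OptimizeCompactionSuiteBase {
  def spark: SparkSession = SparkSession.active

  override def executeOptimize(path: String, condition: Option[String]): Unit = {
    val where = condition.map(c => s" WHERE $c").getOrElse("")
    spark.sql(s"OPTIMIZE delta.`$path`$where")
  }
}

class OptimizeCompactionScalaSuite extends OptimizeCompactionSuiteBase {
  def spark: SparkSession = SparkSession.active

  override def executeOptimize(path: String, condition: Option[String]): Unit = {
    // Shown with the final builder API; at this point in the review the
    // API under discussion was still optimize(condition).
    val builder = io.delta.tables.DeltaTable.forPath(spark, path).optimize()
    condition.map(builder.where).getOrElse(builder).executeCompaction()
  }
}
```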

@Kimahriman (Contributor, author):

I'll try that out

@Kimahriman (Contributor, author):

Got that working; made two functions to handle path or table.

@vkorukanti (Collaborator)

Hi @Kimahriman, the updated changes look good. I am just thinking through how these APIs should evolve when we add Z-order (#920). Give me a couple of days to think this through.

@vkorukanti (Collaborator) commented Apr 1, 2022

Hi @Kimahriman,

Given that we plan to support two types of optimization (file compaction and Z-Order), and maybe more, here is an API proposal based on the builder pattern (similar to the Merge APIs). It makes constructing the optimize operation easier, needs fewer APIs, and is extensible. Here is the POC (built on top of this PR). Let me know what you think.

  /**
   * Optimize the data layout of the table. This returns
   * a [[DeltaOptimizeBuilder]] object that can be used to specify
   * the partition filter to limit the scope of optimize and
   * also execute different optimization techniques such as file
   * compaction or order data using Z-Order curves.
   *
   * See the [[DeltaOptimizeBuilder]] for a full description
   * of this operation.
   *
   * Scala example to run file compaction on a subset of
   * partitions in the table:
   * {{{
   *    deltaTable
   *     .optimize()
   *     .withPartitionFilter("date='2021-11-18'")
   *     .executeCompaction();
   * }}}
   *
   * Scala example to Z-Order data using given columns on a
   * subset of partitions in the table:
   * {{{
   *    deltaTable
   *     .optimize()
   *     .withPartitionFilter("date='2021-11-18'")
   *     .executeZOrderBy("city")
   * }}}
   *
   * @since 1.3.0
   */
  def optimize(): DeltaOptimizeBuilder

/**
 * Builder class for constructing OPTIMIZE command and executing.
 *
 * @param sparkSession [[SparkSession]] to use for execution
 * @param tableIdentifier Id of the table on which to
 *        execute the optimize
 * @param partitionFilter Optional partition filter.
 */
class DeltaOptimizeBuilder(
    sparkSession: SparkSession,
    tableIdentifier: TableIdentifier,
    partitionFilter: Option[String] = None)

  /**
   * Apply partition filter on this optimize command builder to limit
   * the operation on selected partitions.
   * @return [[DeltaOptimizeBuilder]] with partition filter applied
   */
  def withPartitionFilter(partitionFilter: String): DeltaOptimizeBuilder

  /**
   * Z-Order the data in the table using the given columns.
   * @param columns Zero or more columns to order the data
   *                using Z-Order curves
   * @return [[DataFrame]] containing the OPTIMIZE execution metrics
   */
  def executeZOrderBy(columns: String *): DataFrame

  /**
   * Compact the small files in selected partitions.
   * @return [[DataFrame]] containing the OPTIMIZE execution metrics
   */
  def executeCompaction(): DataFrame
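
For contrast, a hypothetical sketch of the flat-API alternative the builder avoids; each filter/technique combination would need its own overload:

```
import org.apache.spark.sql.DataFrame

// Hypothetical flat alternative: one overload per combination of
// filter and technique, multiplying with each new technique added.
trait FlatOptimizeApi {
  def optimize(): DataFrame
  def optimize(condition: String): DataFrame
  def optimizeZOrderBy(columns: String*): DataFrame
  def optimizeZOrderBy(condition: String, columns: String*): DataFrame
}
// The builder composes the same choices from a single optimize() entry point.
```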

@Kimahriman (Contributor, author)

Works for me. If you want to keep working on it, that's fine. I won't be able to work on this for a couple more weeks.

@Kimahriman force-pushed the optimize-scala branch 2 times, most recently from 28f9360 to d9f43ba on April 13, 2022.
@Kimahriman (Contributor, author)

Got it working, let me know how it looks

@Kimahriman (Contributor, author)

Not sure how the docs are supposed to work

@vkorukanti (Collaborator) left a comment

LGTM. Minor comments. Thanks for adding the APIs.

*
* @param sparkSession [[SparkSession]] to use for execution
* @param tableIdentifier Id of the table on which to
* execute the optimize
@vkorukanti (Collaborator):

Add @since 1.3.0 here.

deltaLog.update()
assert(deltaLog.snapshot.version === versionBeforeOptimize + 1)
checkDatasetUnorderly(data.toDF().as[Int], 1, 2, 3, 4, 5, 6)

// Make sure thread pool is shut down
@vkorukanti (Collaborator):

Why remove this check?

@Kimahriman (Contributor, author):

Ah, missed that in the rebase.

* deltaTable
* .optimize()
* .partitionFilter("date='2021-11-18')
* .executeZOrderBy("city", ");
@vkorukanti (Collaborator):

nit: Extra ".

@vkorukanti (Collaborator)

> Not sure how the docs are supposed to work

Check the readme here.

@Kimahriman (Contributor, author)

I'm not sure how I'm supposed to fix this:

/.../delta/core/target/java/io/delta/tables/DeltaOptimizeBuilder.java:14:1:  error: reference not found
[error]    * @return {@link DataFrame} containing the OPTIMIZE execution metrics

I see other places use [[DataFrame]] in the docs, but I don't know what makes those work or what's different about this one.

@vkorukanti (Collaborator) commented Apr 19, 2022

> I'm not sure how I'm supposed to fix this:
>
> /.../delta/core/target/java/io/delta/tables/DeltaOptimizeBuilder.java:14:1:  error: reference not found
> [error]    * @return {@link DataFrame} containing the OPTIMIZE execution metrics
>
> I see other places use [[DataFrame]] in the docs, but I don't know what makes those work or what's different about this one.

Looking at the docs, it looks like this never worked for any of the APIs we publish docs for. It must be because the referenced class is in another jar (Spark, in this case) for which we are not generating any docs.

* using Z-Order curves
* @return DataFrame containing the OPTIMIZE execution metrics
*/
def executeZOrderBy(columns: String *): DataFrame = {
@vkorukanti (Collaborator):

Hey @Kimahriman, given that this API is not yet supported, I think it is better to remove it. No need to make any changes; I will remove this before I put it into the merge queue.

@Kimahriman (Contributor, author):

Yeah, that's fine. I wasn't sure whether to include it or not.

@burakyilmaz321 mentioned this pull request on Apr 25, 2022.
* @param partitionFilter The partition filter to apply
* @return [[DeltaOptimizeBuilder]] with partition filter applied
*/
def partitionFilter(partitionFilter: String): DeltaOptimizeBuilder = {
@vkorukanti (Collaborator):

Hi @Kimahriman, I had an offline conversation with @tdas. He made a very good point about keeping the names the same as in SQL. In SQL we have WHERE to select partitions, so let's rename this method to def where(partitionFilter: String) to keep it in sync with the SQL.

We follow this pattern in other APIs as well. For example, in Merge: SQL has WHEN MATCHED, and in Scala/Python we have a similarly named method, whenMatched.

Let me know if there are any concerns with the rename. I can make the change locally and put it into the merge queue.
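For illustration, the SQL command and the renamed Scala call would then read in parallel (the path and the spark session are hypothetical):

```
// SQL:   OPTIMIZE delta.`/tmp/events` WHERE date = '2021-11-18'
// Scala equivalent after the rename:
io.delta.tables.DeltaTable
  .forPath(spark, "/tmp/events")
  .optimize()
  .where("date = '2021-11-18'")
  .executeCompaction()
```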

@Kimahriman (Contributor, author):

I have no problem with that. Just let me know if you want me to change anything or if you'll handle it.

@vkorukanti (Collaborator):

Thanks @Kimahriman, it's OK, I can make the change.

@vkorukanti closed this in 198a4bb on May 3, 2022.
jbguerraz pushed a commit to jbguerraz/delta that referenced this pull request Jul 6, 2022
Add functions to `DeltaTable` to perform optimization.

API documentation:
```
  /**
   * Optimize the data layout of the table. This returns
   * a [[DeltaOptimizeBuilder]] object that can be used to specify
   * the partition filter to limit the scope of optimize and
   * also execute different optimization techniques such as file
   * compaction or order data using Z-Order curves.
   *
   * See the [[DeltaOptimizeBuilder]] for a full description
   * of this operation.
   *
   * Scala example to run file compaction on a subset of
   * partitions in the table:
   * {{{
   *    deltaTable
   *     .optimize()
   *     .where("date='2021-11-18'")
   *     .executeCompaction();
   * }}}
   *
   * @since 1.3.0
   */
  def optimize(): DeltaOptimizeBuilder
```

```
/**
 * Builder class for constructing OPTIMIZE command and executing.
 *
 * @param sparkSession SparkSession to use for execution
 * @param tableIdentifier Id of the table on which to
 *        execute the optimize
 * @since 1.3.0
 */
class DeltaOptimizeBuilder(
    sparkSession: SparkSession,
    tableIdentifier: String) extends AnalysisHelper {

  /**
   * Apply partition filter on this optimize command builder to limit
   * the operation on selected partitions.
   * @param partitionFilter The partition filter to apply
   * @return [[DeltaOptimizeBuilder]] with partition filter applied
   */
  def where(partitionFilter: String): DeltaOptimizeBuilder

  /**
   * Compact the small files in selected partitions.
   * @return DataFrame containing the OPTIMIZE execution metrics
   */
  def executeCompaction(): DataFrame
}
```

Closes delta-io#961
Fixes delta-io#960

Signed-off-by: Venki Korukanti <[email protected]>
GitOrigin-RevId: 615e215b96fb9e9b9223d3d2b429dc18dff102f4