Optimize Min/Max using Delta metadata #1525

Closed

felipepessoto wants to merge 13 commits into delta-io:master from felipepessoto:improvedatafromstats

+1,178 −226

Contributor

felipepessoto commented Dec 17, 2022 •

edited by vkorukanti

Description

Follow-up to #1192, which optimizes COUNT. This PR adds support for MIN/MAX as well.

How was this patch tested?

Created additional unit tests to cover MIN/MAX.

Does this PR introduce any user-facing changes?

Only performance improvement

scottsand-db assigned scottsand-db and unassigned scottsand-db

scottsand-db self-requested a review

December 21, 2022 18:57

Collaborator

scottsand-db commented Jan 3, 2023

@felipepessoto just following up on this PR - is it still a WIP?

Contributor Author

felipepessoto commented Jan 3, 2023

Yes. I made these changes while the SELECT COUNT PR was in review; I think I can refine this.

vkorukanti mentioned this pull request

[Feature Request] Support reading Delta tables with Deletion Vectors #1485

Closed

3 tasks

felipepessoto force-pushed the improvedatafromstats branch from a3ace44 to 7f665e5 Compare

January 27, 2023 20:29

felipepessoto changed the title ~~[WIP] Optimize Min/Max using Delta stats~~ Optimize Min/Max using Delta stats

Contributor Author

felipepessoto commented Jan 28, 2023

@scottsand-db it is ready to review. Thanks

felipepessoto changed the title ~~Optimize Min/Max using Delta stats~~ Optimize Min/Max using Delta metadata

felipepessoto mentioned this pull request

[BUG] Fix COUNT(*) aggregate pushdown with the .show() command #1571

Closed

vkorukanti reviewed

core/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala (outdated, resolved)

felipepessoto force-pushed the improvedatafromstats branch from 7583cf2 to df2f58c Compare

March 10, 2023 02:07

scottsand-db reviewed

core/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala (outdated, resolved)

felipepessoto force-pushed the improvedatafromstats branch 2 times, most recently from d118b5b to a065457 Compare

March 31, 2023 03:27

Contributor Author

felipepessoto commented Apr 13, 2023

Hi folks, did you have a chance to review this?
Thanks

Contributor Author

felipepessoto commented Apr 20, 2023

Contributor Author

felipepessoto commented May 1, 2023

@vkorukanti, @scottsand-db, do you think we'll be able to complete this before 2.4 release?

scottsand-db requested review from vkorukanti and scottsand-db

May 4, 2023 18:48

felipepessoto mentioned this pull request

[QUESTION] Getting max value for partitioned column based on metadata #1774

Open

Contributor Author

felipepessoto commented May 25, 2023 •

edited

@scottsand-db, @vkorukanti could you review this when you have a chance, please? It would be great to have this in 2.5.

Once it is completed, I'd like to work on other improvements: support for DVs, partitioning, GROUP BY, etc.

Contributor Author

felipepessoto commented Jun 16, 2023

@scottsand-db, @vkorukanti, do we still plan to go ahead with these improvements? Let me know to rebase the changes.

Collaborator

scottsand-db commented Jun 16, 2023

@felipepessoto - thanks for following up. We are super swamped right now getting a few final features ready for next Delta release ... we will follow up when we can!

Contributor Author

felipepessoto commented Jun 29, 2023

felipepessoto mentioned this pull request

[Feature Request] optimize COUNT(*) on partitioned tables #1916

Open

8 tasks

henlue commented Aug 31, 2023

I'm wondering if this is still on the agenda? I think it would be a wonderful enhancement.

There are many practical use cases where performance improvements on such min/max queries would make a difference. Two examples:

- When incrementally loading data into a table, the first step is often to query the table's max timestamp to determine where to resume loading.
- BI tools query the min and max values of columns to configure the ranges of their filters or slicers.

Contributor Author

felipepessoto commented Sep 22, 2023

We have some folks asking for more improvements using stats here and in other issues/PRs. I think it would help in a couple of scenarios like @henlue mentioned.

#1192
#1916
#1377

@scottsand-db, @vkorukanti, @dennyglee what would be the best way to get community feedback about this? Would creating a new issue and asking people to give it a thumbs-up be useful? Is that something maintainers use to prioritize new features?

Thanks

felipepessoto mentioned this pull request

[Feature Request][Spark] Optimize Min/Max using Delta metadata #2092

Closed

8 tasks

felipepessoto force-pushed the improvedatafromstats branch 2 times, most recently from 2c76d0a to f798b76 Compare

November 22, 2023 12:47

felipepessoto added 3 commits

November 22, 2023 04:50


          Refactor to address PR review

61dbdc9

Signed-off-by: Felipe Fujiy Pessoto <[email protected]>


          Aggregates similar tests into a matrix.

0bd8be7

Add column mapping tests using the existing traits.
Add test using partitioned column filter

Signed-off-by: Felipe Fujiy Pessoto <[email protected]>


          Only extract the columns used in the query and avoids reading partition values if all values were found in stats

6ba89ce

Signed-off-by: Felipe Fujiy Pessoto <[email protected]>

felipepessoto force-pushed the improvedatafromstats branch 3 times, most recently from f290bc6 to 9a8feb9 Compare

November 22, 2023 13:04


          Extract Count and Min/Max in a single method. Allows to extract Min/Max from partitioned columns even when COUNT is not available

2c70c6b

Fix style error

Signed-off-by: Felipe Fujiy Pessoto <[email protected]>

felipepessoto force-pushed the improvedatafromstats branch from 9a8feb9 to 2c70c6b Compare

November 22, 2023 13:12

vkorukanti reviewed

spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala (outdated, resolved)

spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala (outdated, resolved)

spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala (outdated, resolved)

spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala (outdated, resolved)

spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala (resolved)

spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala

+                        c@Count(Seq(Literal(1, _))), Complete, false, None, _) =>
+                          Some(c)
+                      case AggregateExpression(
+                        min@Min(minExpr), Complete, false, None, _) if isSupportedDataType(minExpr.dataType) =>

Collaborator

vkorukanti Jan 2, 2024

Can we turn the minExpr (and maxExpr) check into a pattern match, using an extractor like this?

object SkippingEligibleColumn {
  // Returns the attribute name and data type.
  def unapply(arg: Expression): Option[(Seq[String], DataType)] = {
    // Check here that arg is an AttributeReference (not a nested column)
    // and that its data type is supported.
    ??? // placeholder: body left to the PR author
  }
}

Collaborator

vkorukanti Jan 2, 2024

The same extractor (its unapply) can also be used in the PhysicalOperation matching.
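
The extractor idea in this thread can be illustrated in isolation. The following is a minimal, self-contained sketch, not the code from this PR: `DataType`, `Expression`, `AttributeReference`, `Min`, and `Max` are simplified stand-ins for the Catalyst classes, the set of supported types is an assumption, and `eligibleColumn` is a hypothetical helper.

```scala
object ExtractorSketch {
  // Hypothetical, simplified stand-ins for Catalyst's type and expression classes.
  sealed trait DataType
  case object LongType extends DataType
  case object DateType extends DataType
  case object StringType extends DataType

  sealed trait Expression { def dataType: DataType }
  final case class AttributeReference(nameParts: Seq[String], dataType: DataType)
      extends Expression
  final case class Add(left: Expression, right: Expression) extends Expression {
    def dataType: DataType = left.dataType
  }

  sealed trait Aggregate
  final case class Min(child: Expression) extends Aggregate
  final case class Max(child: Expression) extends Aggregate

  object SkippingEligibleColumn {
    // Assumption of this sketch: only these types have usable min/max stats.
    private val supportedTypes: Set[DataType] = Set(LongType, DateType)

    // Matches a top-level (non-nested) column of a supported type and
    // returns its name parts and data type.
    def unapply(arg: Expression): Option[(Seq[String], DataType)] = arg match {
      case AttributeReference(name, dt)
          if name.length == 1 && supportedTypes.contains(dt) =>
        Some((name, dt))
      case _ => None
    }
  }

  // The same extractor serves both MIN and MAX, so the eligibility rules
  // live in one place instead of being repeated in each guard.
  def eligibleColumn(agg: Aggregate): Option[String] = agg match {
    case Min(SkippingEligibleColumn(name, _)) => Some(name.head)
    case Max(SkippingEligibleColumn(name, _)) => Some(name.head)
    case _ => None
  }
}
```

In this sketch, `Min` over a plain `LongType` column matches, while a nested column, a `StringType` column, or an arbitrary expression such as `Add` falls through to `None`.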

spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala (outdated, resolved)

spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala

+                    .map(x => x._1).toSet
+                  // Creates a tuple with physical name to avoid recalculating it multiple times
+                  val dataColumnsWithStats = dataColumns.map(x => (x, DeltaColumnMapping.getPhysicalName(x)))

Collaborator

vkorukanti Jan 2, 2024

One suggestion to simplify the code:

Add a utility method that returns the Column ref for min/max/nullCount/numRecords, for either regular or partition columns, from the DataFrame deltaScanGenerator.filesWithStatsForScan. It abstracts away the physical-name conversion and the partition-versus-data-column lookup. If the column is a partition column, it also casts the string partition value to the appropriate data type.
The next step is to construct the expression from these refs, which validates the stats and then returns the min and max. The existing expression you have should work.
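
A rough sketch of the validate-then-aggregate step described in this comment, written with plain Scala collections rather than DataFrame expressions. Everything here is hypothetical: `FileStats` is a simplified stand-in for the per-file stats (values reduced to `Long`), and treating any file with a deletion vector as unusable is an assumption of this sketch, not a statement about the merged code.

```scala
object StatsSketch {
  // Hypothetical per-file statistics; column values are simplified to Long.
  final case class FileStats(
      numRecords: Option[Long],
      minValues: Map[String, Long],
      maxValues: Map[String, Long],
      nullCount: Map[String, Long],
      hasDeletionVector: Boolean = false)

  sealed trait Result
  // Stats are missing or unreliable, so the query must scan the data.
  case object Fallback extends Result
  // range = None means the column has no non-null values (SQL NULL).
  final case class MinMax(range: Option[(Long, Long)]) extends Result

  // Outer None: stats unusable. Some(None): file contributes no values.
  private def fileRange(f: FileStats, column: String): Option[Option[(Long, Long)]] =
    if (f.hasDeletionVector) None // assumption: stats may cover deleted rows
    else (f.numRecords, f.nullCount.get(column)) match {
      case (Some(0L), _) => Some(None) // zero-row file
      case (Some(n), Some(nulls)) if nulls == n => Some(None) // all values null
      case (Some(_), Some(_)) =>
        for {
          lo <- f.minValues.get(column)
          hi <- f.maxValues.get(column)
        } yield Option((lo, hi))
      case _ => None // numRecords or nullCount missing
    }

  def minMaxFromStats(files: Seq[FileStats], column: String): Result = {
    val perFile = files.map(fileRange(_, column))
    if (perFile.exists(_.isEmpty)) Fallback
    else {
      val ranges = perFile.flatten.flatten
      if (ranges.isEmpty) MinMax(None)
      else MinMax(Some((ranges.map(_._1).min, ranges.map(_._2).max)))
    }
  }
}
```

With this shape, an empty table yields `MinMax(None)`, while a single file with missing stats forces `Fallback` to a regular scan.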

spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala (resolved)

spark/src/test/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuerySuite.scala (resolved)

lzlfred reviewed

Contributor

lzlfred left a comment

LGTM. minor comments.

spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala (resolved)

spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala (outdated, resolved)

felipepessoto added 2 commits

January 4, 2024 17:20


          Add Unit Tests:

a36413c

-table with DVs
-empty table
-table with few AddFiles having zero rows

Signed-off-by: Felipe Fujiy Pessoto <[email protected]>


          Refactor to address PR comments

ce394fb

Signed-off-by: Felipe Fujiy Pessoto <[email protected]>

felipepessoto force-pushed the improvedatafromstats branch from 262ca37 to ce394fb Compare

January 5, 2024 01:21

weiluo-db reviewed

spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala (outdated, resolved)

spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala (resolved)

spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala (outdated, resolved)

spark/src/test/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuerySuite.scala (resolved)

rishitesh-snt mentioned this pull request

[Feature Request][Spark] Pushdown "order by" with "limit" operation by using Delta metadata #2421

Open

8 tasks


          Refactor to address PR comments 2

b2b4984

Signed-off-by: Felipe Fujiy Pessoto <[email protected]>

weiluo-db reviewed

spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala (resolved)

spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala (resolved)

spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala (outdated, resolved)


          Refactor to address PR comments 3

923c423

Signed-off-by: Felipe Fujiy Pessoto <[email protected]>

weiluo-db approved these changes

Contributor

weiluo-db left a comment

LGTM (pending @vkorukanti 's final pass)!

vkorukanti reviewed

Collaborator

vkorukanti left a comment

lgtm pending one comment.

spark/src/test/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuerySuite.scala (outdated, resolved)


          Refactor to address PR comments 4. Collect validation results with optimization disabled

3549f63

Signed-off-by: Felipe Fujiy Pessoto <[email protected]>

vkorukanti approved these changes

Collaborator

vkorukanti left a comment

lgtm

Thank you for contributing this optimization.


          import ordering

42b3edb

Contributor Author

felipepessoto commented Jan 9, 2024

Are the Flink tests flaky? The previous build succeeded.

vkorukanti closed this in

384e38b

felipepessoto deleted the improvedatafromstats branch

January 9, 2024 18:13

felipepessoto mentioned this pull request

[Feature Request][Spark][WIP] Metadata only queries - Umbrella issue #2589

Open

8 tasks
