[Feature Request][Spark] Pushdown "order by" with "limit" operation by using Delta metadata #2421
Feature request
Which Delta project/connector is this regarding?
Spark
Overview
A very typical use case during exploratory analysis is to check the latest records with some limit, mostly to understand data patterns and behaviour, e.g.
select * from table order by timestamp desc limit 10
In the normal scenario Spark has to read all of the files to find the top 10 records. However, if the timestamp column forms mostly disjoint ranges across files, we can read just the per-file min/max and record counts from the Delta metadata to determine which files can contain the top 10 records.
Even with non-disjoint sets we can improve performance by reading only a subset of the files, bounded by the number specified in the limit; in the above example that would be at most 10 files.
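As a concrete illustration of the pruning logic, here is a minimal sketch in Scala. It assumes the per-file min/max/numRecords for the sort column have already been extracted from the Delta metadata (Delta stores these in each AddFile's stats); FileStats and filesForTopNDesc are illustrative names, not an existing API.

```scala
// Sketch only: decide which files can possibly contain the top-n rows for
// `order by <col> desc limit n`, given per-file statistics.
final case class FileStats(path: String, min: Long, max: Long, numRecords: Long)

object TopNFilePruning {
  def filesForTopNDesc(files: Seq[FileStats], n: Long): Seq[FileStats] = {
    // Establish a lower bound (cutoff) on the n-th largest value: visit files
    // in descending order of their min and accumulate row counts until at
    // least n rows are guaranteed to be >= the current min.
    val byMinDesc = files.sortBy(f => -f.min)
    var covered = 0L
    var cutoff = Long.MinValue
    var i = 0
    while (i < byMinDesc.length && covered < n) {
      covered += byMinDesc(i).numRecords
      cutoff = byMinDesc(i).min
      i += 1
    }
    if (covered < n) files             // table has fewer than n rows: read everything
    else files.filter(_.max >= cutoff) // files entirely below the cutoff cannot contribute
  }
}
```

With mostly disjoint ranges this collapses to a single file; in the worst (fully overlapping) case it degrades to reading every file, i.e. today's behaviour.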
Motivation
Sorting the whole table can take many minutes for 500 GB+ tables. Reading the metadata can provide the same information in seconds.
Further details
An example on disjoint sets
Query : select * from table order by timestamp desc limit 10
With this query we currently need to read all of the files. However, if we can make use of the metadata, we only need to read one file, file number 3; a layout illustrating this is sketched below.
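For illustration only (the concrete values below are hypothetical, chosen to show the disjoint case), suppose the table consists of three files whose timestamp statistics are:
file 1 : min = 2023-01-01, max = 2023-01-10, numRecords = 1,000,000
file 2 : min = 2023-01-11, max = 2023-01-20, numRecords = 1,000,000
file 3 : min = 2023-01-21, max = 2023-01-31, numRecords = 1,000,000
Because file 3's min is greater than the max of files 1 and 2, and file 3 holds at least 10 records, the 10 largest timestamps must all live in file 3, so it is the only file that needs to be read.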
An example on non disjoint sets
Query : select * from table order by timestamp asc limit 10
While working with non-disjoint sets of files, we can follow the algorithm below.
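One plausible formulation of such an algorithm for order by timestamp asc limit n (a sketch of the idea, not a committed design):
1. From the Delta log, read each file's min, max and numRecords for the order-by column; no data files are touched at this point.
2. Sort the files by their max value ascending and accumulate numRecords until at least n rows are covered; the max reached is an upper bound (cutoff) on the n-th smallest value.
3. Prune every file whose min is greater than the cutoff; those files cannot contribute to the result.
4. Run the ordinary sort + limit over the remaining subset of files.

Step 1 needs nothing beyond the statistics Delta already writes. A rough way to inspect them with plain Spark (the table path and column name are hypothetical, and this ignores checkpoints and remove actions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("delta-stats-sketch").getOrCreate()

// Hypothetical table location; this reads the JSON commit files of the Delta
// log directly, which is fine for a quick look but not for production use.
val tablePath = "/data/events"

val perFileStats = spark.read.json(s"$tablePath/_delta_log/*.json")
  .where(col("add").isNotNull)
  .select(
    col("add.path").as("file"),
    get_json_object(col("add.stats"), "$.numRecords").cast("long").as("numRecords"),
    get_json_object(col("add.stats"), "$.minValues.timestamp").as("minTimestamp"),
    get_json_object(col("add.stats"), "$.maxValues.timestamp").as("maxTimestamp"))

perFileStats.show(truncate = false)
```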
The same principle holds even after a partition filter has been applied.
Limitation:
It would be applicable only for queries with a single order by clause.
Even though it applies to a very limited set of queries, the frequency of such queries is very high.
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?