Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve unparsing for ORDER BY, UNION, Windows functions with Aggregation #12946

Merged
merged 7 commits into from
Oct 17, 2024

Conversation

sgrebnov
Copy link
Member

@sgrebnov sgrebnov commented Oct 15, 2024

Which issue does this PR close?

Improves unparsing to produce valid sql for TPC-DS benchmark queries:

Improve UNION unparsing

Fix UNION unparsing for queries where UNION is a subquery

Original query

select
	channel,
	col_name
from
	(
	select
		'store' as channel,
		'ss_customer_sk' as col_name
	from
		store_sales
union all
	select
		'web' as channel,
		'ws_ship_hdemo_sk' as col_name
	from
		web_sales
    ) as foo
group by
	channel,
	col_name
order by
	channel,
	col_name

Before (invalid):

select
	'store' as "channel",
	'ss_customer_sk' as "col_name"
from
	"store_sales"
union all
select
	'web' as "channel",
	'ws_ship_hdemo_sk' as "col_name"
from
	"web_sales"
order by
	"foo"."channel" asc nulls last,
	"foo"."col_name" asc nulls last

After (correct)

select
	"foo"."channel",
	"foo"."col_name"
from
	(
	select
		'store' as "channel",
		'ss_customer_sk' as "col_name"
	from
		"store_sales"
union all
	select
		'web' as "channel",
		'ws_ship_hdemo_sk' as "col_name"
	from
		"web_sales") as "foo"
group by
	"foo"."channel",
	"foo"."col_name"
order by
	"foo"."channel" asc nulls last,
	"foo"."col_name" asc nulls last

Fix unparsing for ORDER BY with Aggregation functions

Improve unparsing to handle cases when ORDER BY clause contains aggregation functions that needs to be unprojected, otherwise it will result in unparsed/incorrect sql - Referenced column "sum(catalog_returns.cr_net_loss)" not found in FROM clause / Referenced column "count(DISTINCT cs1.cs_order_number)" not found in FROM clause

For example TPC-DS Q42

Referenced column "sum(store_sales.ss_ext_sales_price)" not found in FROM clause

select  dt.d_year
 	,item.i_category_id
 	,item.i_category
 	,sum(ss_ext_sales_price)
 from 	date_dim dt
 	,store_sales
 	,item
 where dt.d_date_sk = store_sales.ss_sold_date_sk
 	and store_sales.ss_item_sk = item.i_item_sk
 	and item.i_manager_id = 1
 	and dt.d_moy=11
 	and dt.d_year=1998
 group by 	dt.d_year
 		,item.i_category_id
 		,item.i_category
 order by       sum(ss_ext_sales_price) desc,dt.d_year
 		,item.i_category_id
 		,item.i_category
 LIMIT 100 ;

Improve scalar functions in ORDER BY unparsing

Improve unparsing to correctly handle scalar functions in ORDER BY. If SELECT and ORDER BY contain the same expression with a scalar function, the ORDER BY expression will be replaced by a Column expression (e.g., "substr(customer.c_last_name, Int64(0), Int64(5))"), and we need to transform it back to the actual expression. This is done for scalar function expressions only.

  select c_last_name,c_first_name, substr(c_last_name,0,5)
  from customer
  order by substr(c_last_name,0,5)
  limit 10;

Before

select
	"customer"."c_last_name",
	"customer"."c_first_name",
	substr("customer"."c_last_name",0,5)
from
	"customer"
order by
	"substr(customer.c_last_name,Int64(0),Int64(5))" asc nulls last
limit 10

After

select
	"customer"."c_last_name",
	"customer"."c_first_name",
	substr("customer"."c_last_name",0,5)
from
	"customer"
order by
	substr("customer"."c_last_name",0,5) asc nulls last
limit 10

Fix unparsing for complex Window functions with Aggregation

Improve unproject_agg_exprs to unparse complex Window functions with aggregation when aggregation functions present in When/Then and Order By parts.

For example case when grouping(i_class) = 0 then i_category end and order by sum(ws_net_paid) desc in Window rank():

select
    sum(ws_net_paid) as total_sum
   ,i_category
   ,i_class
   ,grouping(i_category)+grouping(i_class) as lochierarchy
   ,rank() over (
 	partition by grouping(i_category)+grouping(i_class),
 	case when grouping(i_class) = 0 then i_category end
 	order by sum(ws_net_paid) desc) as rank_within_parent
 from
    web_sales
   ,date_dim       d1
   ,item
 where
    d1.d_month_seq between 1205 and 1205+11
 and d1.d_date_sk = ws_sold_date_sk
 and i_item_sk  = ws_item_sk
 group by rollup(i_category,i_class)
 order by
   lochierarchy desc,
   case when lochierarchy = 0 then i_category end,
   rank_within_parent
  LIMIT 100;

Are these changes tested?

Added automated tests

Are there any user-facing changes?

Unparser will generate valid sql for complex Window functions and Order By w/ aggregations and Union and Order By with Aggregation functions

@github-actions github-actions bot added the sql SQL Planner label Oct 15, 2024
Copy link
Contributor

@goldmedal goldmedal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sgrebnov, I think overall is good but one question and minor suggestions.

datafusion/sql/src/unparser/utils.rs Outdated Show resolved Hide resolved
datafusion/sql/tests/cases/plan_to_sql.rs Show resolved Hide resolved
@sgrebnov sgrebnov force-pushed the sgrebnov/improve-unparsing branch from 187f5e0 to 28670e3 Compare October 17, 2024 06:50
Copy link
Contributor

@goldmedal goldmedal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @sgrebnov and very much appreciate the help reviewing @goldmedal

@alamb alamb merged commit ad273ca into apache:main Oct 17, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants