-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix subquery alias #1067
fix subquery alias #1067
Conversation
|
Though the PR fix the bug mentioned by the issue, there are still some bugs. I am confused and need help. select
supp_nation,
cust_nation,
l_year,
sum(volume) as revenue
from
(
select
n1.n_name as supp_nation,
n2.n_name as cust_nation,
extract(year from l_shipdate) as l_year,
l_extendedprice * (1 - l_discount) as volume
from
supplier,
lineitem,
orders,
customer,
nation n1,
nation n2
where
s_suppkey = l_suppkey
and o_orderkey = l_orderkey
and c_custkey = o_custkey
and s_nationkey = n1.n_nationkey
and c_nationkey = n2.n_nationkey
and (
(n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY')
or (n1.n_name = 'GERMANY' and n2.n_name = 'FRANCE')
)
and l_shipdate between date '1995-01-01' and date '1996-12-31'
) as shipping
group by
supp_nation,
cust_nation,
l_year
order by
supp_nation,
cust_nation,
l_year; Then I printed some logs to find the potential problems.
It's OK? Then I tried to find problems in https://github.com/apache/arrow-datafusion/blob/master/benchmarks/src/bin/tpch.rs#L1094 because the test panicked at here( The projected_expr in https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/logical_plan/builder.rs#L255 is the following
But the plan and the plan schema are the followings
So there will be conflict in https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/logical_plan/dfschema.rs#L150 I have two questions
PTAL and give me some help @alamb @houqp @Dandandan. Thanks very much! |
ff057a5
to
fc3161e
Compare
Yes, i think the logical plan you got there is correct.
I believe this is due to the alias patching is done at the sql planner layer, not the plan builder layer. We only have this alias info in the sql planner right now (parsed from the sql query). When you patch the projection in the SQL planner with the alias, this alias info is lost in the logical plan. In other words, it's not possible to reconstruct the same alias patched projection plan node using only the information provided by the logical plan tree. If you look at the children of the patched projection, there is no mentioning of that subquery alias b at all.
In order to make the full plan serializable without access to the raw SQL query, we need to add the subquery alias to the logical plan tree as well. During protobuf plan ser/de in ballista, we don't pass along the SQL query, but only the planned logical plan from the SQL planner. I can see two ways to accomplish this: First approach is to add an optional alias field to our projection plan node similar to what we have with union: Then we can perform the schema qualifier patch in the plan builder's project method similar to what we do with union alias: Second approach is to introduce a new type of Alias plan node that we can use to wrap any plan node to perform this qualifier patching logic. I think adding an alias field to the projection plan node would be simpler. @alamb @Dandandan @jorgecarleitao @andygrove WDTY? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @xudong963 -- this is a great contribution
I think this PR is on the right track. After reading the code and comments, I agree with @houqp's analysis, and I think his suggestion of adding an alias field to LogicalPlan::Projection
will be the cleanest approach
First approach is to add an optional alias field to our projection plan node similar to what we have with union:
You would also have to change the logic that computes the output DFSchema
to account for this alias, but then I bet everything else "would just work"
self.query_to_plan_with_alias( | ||
} => { | ||
// if alias is None, return Err | ||
if alias.is_none() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW this is consistent with Postgres:
alamb=# select * from public.simple as a join (select * from public.simple) on a.c3;
ERROR: subquery in FROM must have an alias
LINE 1: select * from public.simple as a join (select * from public....
^
HINT: For example, FROM (SELECT ...) [AS] foo.
👍
datafusion/src/sql/planner.rs
Outdated
// if alias is None, return Err | ||
if alias.is_none() { | ||
return Err(DataFusionError::Plan( | ||
"subquery in FROM must have an alias".parse().unwrap(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"subquery in FROM must have an alias".parse().unwrap(), | |
"subquery in FROM must have an alias".to_string(), |
I think you can just create a String
here
datafusion/src/logical_plan/plan.rs
Outdated
@@ -298,6 +298,19 @@ impl LogicalPlan { | |||
| LogicalPlan::Filter { input, .. } => input.all_schemas(), | |||
} | |||
} | |||
/// schema to projection logical plan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code seems like it tries to specify that the output schema's relation name (table alias) that is different than than the input schema's relation.
I think what is desired is a new LogicalPlan::Projection that changed the schema names rather than trying to rewrite them.
For example, to change the table alias from a
to b
If the input was like
LogicalPlan::Projection(schema = {a.c1, a.c2}, expr: [a.c1, a.c2])
As @houqp says we should add a single new LogicalPlan node like:
LogicalPlan::Projection(schema = {b.c1, b.c2}, expr: [a.c1, a.c2])
LogicalPlan::Projection(schema = {a.c1, a.c2}, expr: [a.c1, a.c2])
(in other words, don't try and rewrite the existing LogicalPlan::Projection
, but put a new one on the top that changes the schema)
If you use @houqp 's suggestion to add an optional alias to LogicalPlan::Projection
then the top LogicalPlan can be created
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, good idea! o( ̄▽ ̄)d
Keep up to date with the latest developments: @houqp @alamb
One question I want to ask is if we use the strictly restrict for subquery to bring into correspondence with pg? If so, I will try to fix unlucky tests. |
yes, i think it's fine to fix the test instead, datafusion claims to be postgres compatible, so we want to be as close to postgres as possible. |
fc3161e
to
6a79299
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks great @xudong963 -- thank you very much!
@@ -2090,7 +2090,7 @@ mod tests { | |||
let results = plan_and_collect( | |||
&mut ctx, | |||
"SELECT * FROM t as t1 \ | |||
JOIN (SELECT * FROM t as t2) \ | |||
JOIN (SELECT * FROM t) as t2 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
validate_unique_names("Projections", projected_expr.iter(), input_schema)?; | ||
|
||
let schema = DFSchema::new(exprlist_to_fields(&projected_expr, input_schema)?)?; | ||
Ok(Self::from(project_with_alias( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if you could write this like self.project_with_alias(expr, None)
which might be slightly cleaner
Some(ref alias) => input_schema.replace_qualifier(alias.as_str()), | ||
None => input_schema, | ||
}; | ||
Ok(LogicalPlan::Projection { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
I plan to merge this tomorrow morning, Eastern Time, if no one else has done so (I want to let @houqp have a chance to review if he would like to) |
b0770be
to
9113f8f
Compare
datafusion/src/sql/planner.rs
Outdated
let mut df_fields_with_alias = Vec::new(); | ||
for df_field in schema.fields().iter() { | ||
let df_field_with_alias = DFField::from_qualified( | ||
&alias.as_ref().unwrap().name.value, | ||
df_field.field().clone(), | ||
); | ||
df_fields_with_alias.push(df_field_with_alias); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like df_fields_with_alias
is not being used anymore? the schema patching should have been covered by project_with_alias
right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good sight!
datafusion/src/sql/planner.rs
Outdated
.map(|(field, ident)| { | ||
col_with_table_name( | ||
field.name(), | ||
&*(alias.clone().name.value), | ||
) | ||
.alias(ident.value.as_str()) | ||
}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After taking a closer look at the code, I now agree with @alamb that we don't need to do a 2nd round of patching here. The fields
comes from plan.schema().fields()
. The schema for the plan we referenced here has already been patched in the previous relation match block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking again, I agree. Because we wrapped a projection plan with alias in the previous TableFactor::Derived
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xudong963 could you also update LogicalPlan's pub fn display
method to print the projection alias? This should help make the logical plan more readable :)
No problem |
9113f8f
to
4986e40
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, great work @xudong963 ! I will let @alamb do the final merge when he wakes up :)
4986e40
to
4790322
Compare
CI seems unstable. =================================== FAILURES ===================================
_____________________________ test_math_functions ______________________________
df = <builtins.DataFrame object at 0x7f37c5055630>
def test_math_functions(df):
values = np.array([0.1, -0.7, 0.55])
col_v = f.col("value")
df = df.select(
f.abs(col_v),
f.sin(col_v),
f.cos(col_v),
f.tan(col_v),
f.asin(col_v),
f.acos(col_v),
f.exp(col_v),
f.ln(col_v + f.lit(1)),
f.log2(col_v + f.lit(1)),
f.log10(col_v + f.lit(1)),
f.random(),
)
> result = df.collect()
E Exception: DataFusion error: Plan("No field named '<unqualified>.0QsBN24Phk.value'. Valid fields are '0QsBN24Phk.value'.") |
Yes, we can ignore that CI test for now. |
👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thank you again @xudong963 !
), | ||
)?; | ||
( | ||
project_with_alias( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
much nicer
Which issue does this PR close?
Closes #1049
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?