fix subquery alias #1067

Merged: 1 commit into apache:master on Oct 8, 2021

Conversation

xudong963 (Member)

Which issue does this PR close?

Closes #1049

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions bot added the datafusion (Changes in the datafusion crate) and sql (SQL Planner) labels on Oct 3, 2021
@xudong963 (Member, Author)

> CREATE EXTERNAL TABLE customer STORED AS CSV LOCATION '/Users/bytedance/arrow-datafusion/datafusion/tests/customer.csv';
0 rows in set. Query took 0.014 seconds.
> explain select * from customer as a join (select * from customer as b) on a.column_1=b.column_1;
Plan("subquery in FROM must have an alias")
> explain select * from customer as a join (select * from customer) as b on a.column_1=b.column_1;
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                     |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: #a.column_1, #a.column_2, #b.column_1, #b.column_2                                                                                                                           |
|               |   Join: #a.column_1 = #b.column_1                                                                                                                                                        |
|               |     TableScan: a projection=Some([0, 1])                                                                                                                                                 |
|               |     Projection: #customer.column_1, #customer.column_2                                                                                                                                   |
|               |       TableScan: customer projection=Some([0, 1])                                                                                                                                        |
| physical_plan | ProjectionExec: expr=[column_1@0 as column_1, column_2@1 as column_2, column_1@2 as column_1, column_2@3 as column_2]                                                                    |
|               |   CoalesceBatchesExec: target_batch_size=4096                                                                                                                                            |
|               |     HashJoinExec: mode=Partitioned, join_type=Inner, on=[(Column { name: "column_1", index: 0 }, Column { name: "column_1", index: 0 })]                                                 |
|               |       CoalesceBatchesExec: target_batch_size=4096                                                                                                                                        |
|               |         RepartitionExec: partitioning=Hash([Column { name: "column_1", index: 0 }], 12)                                                                                                  |
|               |           RepartitionExec: partitioning=RoundRobinBatch(12)                                                                                                                              |
|               |             CsvExec: source=Path(/Users/bytedance/arrow-datafusion/datafusion/tests/customer.csv: [/Users/bytedance/arrow-datafusion/datafusion/tests/customer.csv]), has_header=false   |
|               |       CoalesceBatchesExec: target_batch_size=4096                                                                                                                                        |
|               |         RepartitionExec: partitioning=Hash([Column { name: "column_1", index: 0 }], 12)                                                                                                  |
|               |           ProjectionExec: expr=[column_1@0 as column_1, column_2@1 as column_2]                                                                                                          |
|               |             RepartitionExec: partitioning=RoundRobinBatch(12)                                                                                                                            |
|               |               CsvExec: source=Path(/Users/bytedance/arrow-datafusion/datafusion/tests/customer.csv: [/Users/bytedance/arrow-datafusion/datafusion/tests/customer.csv]), has_header=false |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.010 seconds.

@xudong963 (Member, Author) commented Oct 3, 2021

Though this PR fixes the bug mentioned in the issue, there are still some bugs left, and I am confused and need help.
When I ran cargo test, I found four tests that couldn't pass, for example tests::ballista_round_trip::q7.
The following is the SQL executed in tests::ballista_round_trip::q7:

select
    supp_nation,
    cust_nation,
    l_year,
    sum(volume) as revenue
from
    (
        select
            n1.n_name as supp_nation,
            n2.n_name as cust_nation,
            extract(year from l_shipdate) as l_year,
            l_extendedprice * (1 - l_discount) as volume
        from
            supplier,
            lineitem,
            orders,
            customer,
            nation n1,
            nation n2
        where
                s_suppkey = l_suppkey
          and o_orderkey = l_orderkey
          and c_custkey = o_custkey
          and s_nationkey = n1.n_nationkey
          and c_nationkey = n2.n_nationkey
          and (
                (n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY')
                or (n1.n_name = 'GERMANY' and n2.n_name = 'FRANCE')
            )
          and l_shipdate between date '1995-01-01' and date '1996-12-31'
    ) as shipping
group by
    supp_nation,
    cust_nation,
    l_year
order by
    supp_nation,
    cust_nation,
    l_year;

Then I added some log output to find the potential problems.
First, I printed the plan at https://github.com/apache/arrow-datafusion/blob/master/benchmarks/src/bin/tpch.rs#L1092:

Sort: #shipping.supp_nation ASC NULLS FIRST, #shipping.cust_nation ASC NULLS FIRST, #shipping.l_year ASC NULLS FIRST
  Projection: #shipping.supp_nation, #shipping.cust_nation, #shipping.l_year, #shipping.volume
    Projection: #n1.n_name AS supp_nation, #n2.n_name AS cust_nation, datepart(Utf8("YEAR"), #lineitem.l_shipdate) AS l_year, #lineitem.l_extendedprice * Int64(1) - #lineitem.l_discount AS volume
      Filter: #n1.n_name = Utf8("FRANCE") AND #n2.n_name = Utf8("GERMANY") OR #n1.n_name = Utf8("GERMANY") AND #n2.n_name = Utf8("FRANCE") AND #lineitem.l_shipdate BETWEEN CAST(Utf8("1995-01-01") AS Date32) AND CAST(Utf8("1996-12-31") AS Date32)
        Join: #customer.c_nationkey = #n2.n_nationkey
          Join: #supplier.s_nationkey = #n1.n_nationkey
            Join: #orders.o_custkey = #customer.c_custkey
              Join: #lineitem.l_orderkey = #orders.o_orderkey
                Join: #supplier.s_suppkey = #lineitem.l_suppkey
                  TableScan: supplier projection=None
                  TableScan: lineitem projection=None
                TableScan: orders projection=None
              TableScan: customer projection=None
            TableScan: n1 projection=None
          TableScan: n2 projection=None

Does this look OK?

Then I looked for problems around https://github.com/apache/arrow-datafusion/blob/master/benchmarks/src/bin/tpch.rs#L1094 because the test panicked there (while converting the logical plan protobuf back to a logical plan).
Finally, I found that the panic happens while building the projection logical plan: https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/logical_plan/builder.rs#L240

The projected_expr in https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/logical_plan/builder.rs#L255 is the following:

[#shipping.supp_nation, #shipping.cust_nation, #shipping.l_year, #shipping.volume]

But the plan and its schema are the following:

 Projection: #n1.n_name AS supp_nation, #n2.n_name AS cust_nation, datepart(Utf8("YEAR"), #lineitem.l_shipdate) AS l_year, #lineitem.l_extendedprice * Int64(1) - #lineitem.l_discount AS volume
      Filter: #n1.n_name = Utf8("FRANCE") AND #n2.n_name = Utf8("GERMANY") OR #n1.n_name = Utf8("GERMANY") AND #n2.n_name = Utf8("FRANCE") AND #lineitem.l_shipdate BETWEEN CAST(Utf8("1995-01-01") AS Date32) AND CAST(Utf8("1996-12-31") AS Date32)
        Join: #customer.c_nationkey = #n2.n_nationkey
          Join: #supplier.s_nationkey = #n1.n_nationkey
            Join: #orders.o_custkey = #customer.c_custkey
              Join: #lineitem.l_orderkey = #orders.o_orderkey
                Join: #supplier.s_suppkey = #lineitem.l_suppkey
                  TableScan: supplier projection=None
                  TableScan: lineitem projection=None
                TableScan: orders projection=None
              TableScan: customer projection=None
            TableScan: n1 projection=None
          TableScan: n2 projection=None
DFSchema { fields: [DFField { qualifier: Some("n1"), field: Field { name: "supp_nation", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None } }, DFField { qualifier: Some("n2"), field: Field { name: "cust_nation", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None } }, DFField { qualifier: None, field: Field { name: "l_year", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None } }, DFField { qualifier: None, field: Field { name: "volume", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None } }] }

So there is a conflict at https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/logical_plan/dfschema.rs#L150

I have two questions:

  1. https://github.com/apache/arrow-datafusion/blob/master/benchmarks/src/bin/tpch.rs#L1092 runs fine and produces a plan; is that plan correct?
  2. Which part makes the conversion from the logical plan protobuf back to a logical plan fail? I don't know how to fix it yet.

PTAL and give me some help @alamb @houqp @Dandandan. Thanks very much!

@houqp (Member) commented Oct 3, 2021

https://github.com/apache/arrow-datafusion/blob/master/benchmarks/src/bin/tpch.rs#L1092 runs fine and produces a plan; is that plan correct?

Yes, I think the logical plan you got there is correct.

Which part makes the conversion from the logical plan protobuf back to a logical plan fail? I don't know how to fix it yet.

I believe this is because the alias patching is done at the SQL planner layer, not the plan builder layer. We only have this alias info in the SQL planner right now (parsed from the SQL query). When you patch the projection in the SQL planner with the alias, this alias info is lost in the logical plan. In other words, it's not possible to reconstruct the same alias-patched projection plan node using only the information provided by the logical plan tree.

If you look at the children of the patched projection, there is no mention of the subquery alias b at all.

|               |     TableScan: a projection=Some([0, 1])                                                                                                                                                 |
|               |     Projection: #customer.column_1, #customer.column_2                                                                                                                                   |
|               |       TableScan: customer projection=Some([0, 1])     

In order to make the full plan serializable without access to the raw SQL query, we need to add the subquery alias to the logical plan tree as well. During protobuf plan ser/de in ballista, we don't pass along the SQL query, but only the planned logical plan from the SQL planner.

I can see two ways to accomplish this:

The first approach is to add an optional alias field to our projection plan node, similar to what we have with union:

https://github.com/apache/arrow-datafusion/blob/2f04d67156ec91afa628a3ed47003b8f992450bf/datafusion/src/logical_plan/plan.rs#L160

Then we can perform the schema qualifier patch in the plan builder's project method, similar to what we do with the union alias:

https://github.com/apache/arrow-datafusion/blob/2f04d67156ec91afa628a3ed47003b8f992450bf/datafusion/src/logical_plan/builder.rs#L584

The second approach is to introduce a new Alias plan node type that we can use to wrap any plan node and perform this qualifier-patching logic.

I think adding an alias field to the projection plan node would be simpler. @alamb @Dandandan @jorgecarleitao @andygrove WDYT?
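
To make the first approach concrete, here is a rough, self-contained sketch with simplified stand-in types (Expr, DFSchemaRef and the variant names below are placeholders for illustration, not the real DataFusion definitions) of a Projection variant carrying an optional alias:

use std::sync::Arc;

// Simplified stand-in types; the real DataFusion Expr and DFSchemaRef are much richer.
type Expr = String;
type DFSchemaRef = Arc<Vec<(Option<String>, String)>>; // (qualifier, field name)

#[allow(dead_code)]
enum LogicalPlan {
    Projection {
        /// Expressions to evaluate over the input
        expr: Vec<Expr>,
        /// The child plan being projected
        input: Arc<LogicalPlan>,
        /// Output schema; when `alias` is set, its fields are qualified by that alias
        schema: DFSchemaRef,
        /// `Some("b")` for `FROM (SELECT ...) AS b`, `None` otherwise
        alias: Option<String>,
    },
    /// Placeholder leaf so the sketch can be instantiated
    EmptyRelation,
}

fn main() {
    let plan = LogicalPlan::Projection {
        expr: vec!["#customer.column_1".to_string()],
        input: Arc::new(LogicalPlan::EmptyRelation),
        schema: Arc::new(vec![(Some("b".to_string()), "column_1".to_string())]),
        alias: Some("b".to_string()),
    };
    // The alias now travels with the plan node, so plan serialization can see it.
    if let LogicalPlan::Projection { alias, .. } = &plan {
        println!("projection alias: {:?}", alias);
    }
}

With something like this in place, the builder's project method can set the alias and re-qualify the output schema, which is roughly what the merged project_with_alias ends up doing.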

@houqp added the bug (Something isn't working) label on Oct 3, 2021
@alamb (Contributor) left a comment

Thanks @xudong963 -- this is a great contribution

I think this PR is on the right track. After reading the code and comments, I agree with @houqp's analysis, and I think his suggestion of adding an alias field to LogicalPlan::Projection will be the cleanest approach.

First approach is to add an optional alias field to our projection plan node similar to what we have with union:

You would also have to change the logic that computes the output DFSchema to account for this alias, but then I bet everything else "would just work"

self.query_to_plan_with_alias(
} => {
// if alias is None, return Err
if alias.is_none() {
Contributor:

FWIW this is consistent with Postgres:

alamb=# select * from public.simple as a join (select * from public.simple) on a.c3;
ERROR:  subquery in FROM must have an alias
LINE 1: select * from public.simple as a join (select * from public....
                                              ^
HINT:  For example, FROM (SELECT ...) [AS] foo.

👍

// if alias is None, return Err
if alias.is_none() {
return Err(DataFusionError::Plan(
"subquery in FROM must have an alias".parse().unwrap(),
Contributor:

Suggested change
"subquery in FROM must have an alias".parse().unwrap(),
"subquery in FROM must have an alias".to_string(),

I think you can just create a String here
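
As a side note on why that suggestion works, here is a tiny standalone sketch (PlannerError below is a hypothetical stand-in for the real DataFusionError): parsing a &str into a String can never fail, so .to_string() says the same thing without the needless parse()/unwrap():

#[derive(Debug)]
enum PlannerError {
    // Stand-in for DataFusionError::Plan(String)
    Plan(String),
}

fn main() {
    // Both lines build the same owned String from the &str literal.
    let via_parse: String = "subquery in FROM must have an alias".parse().unwrap();
    let via_to_string = "subquery in FROM must have an alias".to_string();
    assert_eq!(via_parse, via_to_string);

    // `.to_string()` is the simpler, non-fallible spelling.
    let err = PlannerError::Plan(via_to_string);
    println!("{:?}", err);
}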

@@ -298,6 +298,19 @@ impl LogicalPlan {
| LogicalPlan::Filter { input, .. } => input.all_schemas(),
}
}
/// schema to projection logical plan
Contributor:

This code seems to try to give the output schema a relation name (table alias) that is different from the input schema's relation.

I think what is desired is a new LogicalPlan::Projection that changes the schema names, rather than trying to rewrite them in place.

For example, to change the table alias from a to b

If the input was like

  LogicalPlan::Projection(schema = {a.c1, a.c2}, expr: [a.c1, a.c2])

As @houqp says, we should add a single new LogicalPlan node like:

  LogicalPlan::Projection(schema = {b.c1, b.c2}, expr: [a.c1, a.c2])
    LogicalPlan::Projection(schema = {a.c1, a.c2}, expr: [a.c1, a.c2])

(in other words, don't try to rewrite the existing LogicalPlan::Projection, but put a new one on top that changes the schema)

If you use @houqp's suggestion to add an optional alias to LogicalPlan::Projection, then that top LogicalPlan node can be created.

Member Author:

Yes, good idea! o( ̄▽ ̄)d

datafusion/src/sql/planner.rs (outdated review thread, resolved)
@xudong963 (Member, Author)

An update on the latest developments, @houqp @alamb:

  1. Using the suggestion from @houqp, the four failing ballista tests now pass! 😁
  2. But because we now strictly enforce that a subquery in FROM must have an alias, some other tests fail.

One question: do we want this strict restriction on subqueries so that we match Postgres? If so, I will try to fix the failing tests.

@houqp (Member) commented Oct 6, 2021

Yes, I think it's fine to fix the tests instead. DataFusion claims to be Postgres compatible, so we want to be as close to Postgres as possible.

@xudong963 (Member, Author)

PTAL😄, thanks! @houqp @alamb

@alamb (Contributor) left a comment

I think this looks great @xudong963 -- thank you very much!

@@ -2090,7 +2090,7 @@ mod tests {
let results = plan_and_collect(
&mut ctx,
"SELECT * FROM t as t1 \
JOIN (SELECT * FROM t as t2) \
JOIN (SELECT * FROM t) as t2 \
Contributor:

👍

validate_unique_names("Projections", projected_expr.iter(), input_schema)?;

let schema = DFSchema::new(exprlist_to_fields(&projected_expr, input_schema)?)?;
Ok(Self::from(project_with_alias(
Contributor:

I wonder if you could write this like self.project_with_alias(expr, None), which might be slightly cleaner.

Some(ref alias) => input_schema.replace_qualifier(alias.as_str()),
None => input_schema,
};
Ok(LogicalPlan::Projection {
Contributor:

👍
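
To illustrate what a replace_qualifier step like the one in the hunk above accomplishes, here is a minimal, self-contained sketch with simplified stand-in types (not the actual DFSchema implementation): every field of the subquery's output schema gets re-qualified with the alias, e.g. customer.column_1 becomes b.column_1.

#[derive(Clone, Debug, PartialEq)]
struct DFField {
    qualifier: Option<String>,
    name: String,
}

#[derive(Clone, Debug, PartialEq)]
struct DFSchema {
    fields: Vec<DFField>,
}

impl DFSchema {
    /// Return a copy of the schema with every field re-qualified by `alias`.
    fn replace_qualifier(&self, alias: &str) -> DFSchema {
        DFSchema {
            fields: self
                .fields
                .iter()
                .map(|f| DFField {
                    qualifier: Some(alias.to_string()),
                    name: f.name.clone(),
                })
                .collect(),
        }
    }
}

fn main() {
    let schema = DFSchema {
        fields: vec![
            DFField { qualifier: Some("customer".into()), name: "column_1".into() },
            DFField { qualifier: Some("customer".into()), name: "column_2".into() },
        ],
    };
    let aliased = schema.replace_qualifier("b");
    // Every field is now qualified by the subquery alias `b`.
    assert!(aliased.fields.iter().all(|f| f.qualifier.as_deref() == Some("b")));
    println!("{:?}", aliased);
}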

@alamb (Contributor) commented Oct 7, 2021

I plan to merge this tomorrow morning, Eastern Time, if no one else has done so (I want to let @houqp have a chance to review if he would like to)

@xudong963 force-pushed the fix_join_alias branch 2 times, most recently from b0770be to 9113f8f on October 8, 2021 01:50
Comment on lines 549 to 556
let mut df_fields_with_alias = Vec::new();
for df_field in schema.fields().iter() {
let df_field_with_alias = DFField::from_qualified(
&alias.as_ref().unwrap().name.value,
df_field.field().clone(),
);
df_fields_with_alias.push(df_field_with_alias);
}
Member:

It looks like df_fields_with_alias is not being used anymore? The schema patching should already be covered by project_with_alias, right?

Member Author:

Good catch!

Comment on lines 595 to 592
.map(|(field, ident)| {
col_with_table_name(
field.name(),
&*(alias.clone().name.value),
)
.alias(ident.value.as_str())
}),
Member:

After taking a closer look at the code, I now agree with @alamb that we don't need to do a second round of patching here. The fields come from plan.schema().fields(), and the schema for the plan referenced here has already been patched in the previous relation match block.

Member Author:

After thinking about it again, I agree, because we already wrapped the plan in a projection with the alias in the previous TableFactor::Derived branch.

@houqp (Member) left a comment

@xudong963 could you also update LogicalPlan's pub fn display method to print the projection alias? This should help make the logical plan more readable :)

@xudong963 (Member, Author)

@xudong963 could you also update LogicalPlan's pub fn display method to print the projection alias? This should help make the logical plan more readable :)

No problem
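
For reference, a minimal sketch (assumed formatting and helper name, not DataFusion's actual Display code) of how the plan text could include the projection alias when one is present:

// Hypothetical helper showing how the plan text could render the alias;
// the names and formatting here are assumptions, not DataFusion's actual output.
fn display_projection(exprs: &[&str], alias: Option<&str>) -> String {
    let mut out = format!("Projection: {}", exprs.join(", "));
    if let Some(a) = alias {
        out.push_str(&format!(", alias={}", a));
    }
    out
}

fn main() {
    // e.g. "Projection: #customer.column_1, #customer.column_2, alias=b"
    println!(
        "{}",
        display_projection(&["#customer.column_1", "#customer.column_2"], Some("b"))
    );
}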

@houqp (Member) left a comment

LGTM, great work @xudong963 ! I will let @alamb do the final merge when he wakes up :)

@xudong963 (Member, Author) commented Oct 8, 2021

CI seems unstable.

=================================== FAILURES ===================================
_____________________________ test_math_functions ______________________________

df = <builtins.DataFrame object at 0x7f37c5055630>

    def test_math_functions(df):
        values = np.array([0.1, -0.7, 0.55])
        col_v = f.col("value")
        df = df.select(
            f.abs(col_v),
            f.sin(col_v),
            f.cos(col_v),
            f.tan(col_v),
            f.asin(col_v),
            f.acos(col_v),
            f.exp(col_v),
            f.ln(col_v + f.lit(1)),
            f.log2(col_v + f.lit(1)),
            f.log10(col_v + f.lit(1)),
            f.random(),
        )
>       result = df.collect()
E       Exception: DataFusion error: Plan("No field named '<unqualified>.0QsBN24Phk.value'. Valid fields are '0QsBN24Phk.value'.")

@houqp (Member) commented Oct 8, 2021

Yes, we can ignore that CI test for now.

@francis-du (Contributor)

👍

@alamb (Contributor) left a comment

Looks great. Thank you again @xudong963 !

),
)?;
(
project_with_alias(
Contributor:

much nicer

@alamb merged commit d331fa2 into apache:master on Oct 8, 2021
@xudong963 deleted the fix_join_alias branch on October 12, 2021 14:13
Labels: bug (Something isn't working), datafusion (Changes in the datafusion crate), sql (SQL Planner)

Successfully merging this pull request may close the issue: [SQL Syntax] - Unrecognized subquery alias