UNION ALL bug: thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', ./src/datatypes/schema.rs:165:10 #1064

Closed
omegablitz opened this issue Sep 29, 2021 · 6 comments · Fixed by #1088
Labels: bug, help wanted

Comments

@omegablitz

Thanks for the great library! Ran into the following issue while testing out some queries:

Describe the bug
UNION ALL does not give expected results when combining two tables. The minimal reproduction provided below panics.

To Reproduce
Minimal repro available here:
https://github.com/omegablitz/datafusion-bug

Expected behavior
I expect the output to be:

SUM(c) == 30
SUM(d) == 300

Instead, this example panics.

Additional context
If I explicitly change:

SELECT a, b, c, 0.0 AS d FROM table_1
UNION ALL
SELECT a, b, 0.0 AS c, d FROM table_2

to:

SELECT a AS a, b AS b, c AS c, 0.0 AS d FROM table_1
UNION ALL
SELECT a AS a, b AS b, 0.0 AS c, d AS d FROM table_2

it works

@omegablitz omegablitz added the bug Something isn't working label Sep 29, 2021
@houqp houqp added the help wanted Extra attention is needed label Sep 29, 2021
@xudong963
Member

Thanks for the report, @omegablitz!
DataFusion may be handling aliases in unions incorrectly. I have recently been following the alias question in #1049, so I will check your case later.

@omegablitz
Author

This issue is actually reproducible with the following (simpler) query:

SELECT SUM(d)
FROM (
  SELECT c, c AS d FROM table_1
  UNION ALL
  SELECT c, c AS d FROM table_1
)

@Dandandan
Contributor

Full trace

> SELECT SUM(d)
FROM (
  SELECT 1 as c, 2 as d
  UNION ALL
  SELECT 1 as c, 3 AS d
);
thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', ./src/datatypes/schema.rs:165:10
stack backtrace:
   0: rust_begin_unwind
             at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rs:515:5
   1: core::panicking::panic_fmt
             at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/panicking.rs:92:14
   2: core::panicking::panic_bounds_check
             at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/panicking.rs:69:5
   3: arrow::datatypes::schema::Schema::field
   4: <datafusion::physical_plan::expressions::column::Column as datafusion::physical_plan::PhysicalExpr>::data_type
   5: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
   6: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
   7: core::iter::adapters::process_results
   8: datafusion::physical_plan::type_coercion::coerce
   9: datafusion::physical_plan::aggregates::create_aggregate_expr
  10: datafusion::physical_plan::planner::DefaultPhysicalPlanner::create_aggregate_expr
  11: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
  12: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
  13: core::iter::adapters::process_results
  14: datafusion::physical_plan::planner::DefaultPhysicalPlanner::create_initial_plan
  15: datafusion::physical_plan::planner::DefaultPhysicalPlanner::create_initial_plan
  16: <datafusion::physical_plan::planner::DefaultPhysicalPlanner as datafusion::physical_plan::planner::PhysicalPlanner>::create_physical_plan
  17: <datafusion::execution::context::DefaultQueryPlanner as datafusion::execution::context::QueryPlanner>::create_physical_plan
  18: datafusion::execution::context::ExecutionContext::create_physical_plan
  19: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  20: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  21: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  22: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  23: tokio::runtime::thread_pool::ThreadPool::block_on
  24: tokio::runtime::Runtime::block_on
  25: datafusion_cli::main

@alamb alamb changed the title from "UNION ALL bug" to "UNION ALL bug: thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', ./src/datatypes/schema.rs:165:10" Oct 2, 2021
@alamb
Contributor

alamb commented Oct 2, 2021

This looks like a good issue for someone who wants to understand schemas and column references.

@xudong963
Member

Please assign it to me. I am working on #1029 and #1049, which should help me solve this issue. Thanks, @alamb.

@xudong963
Member

xudong963 commented Oct 7, 2021

The bug is located at https://github.com/apache/arrow-datafusion/blob/4687899957463ce81c4795a6d35d31320db0252b/datafusion/src/physical_plan/planner.rs#L836

input_dfschema comes from the logical input schema, so the column's idx is an index into the logical input schema.

That idx is wrapped in the physical expression and later used at https://github.com/apache/arrow-datafusion/blob/4687899957463ce81c4795a6d35d31320db0252b/datafusion/src/physical_plan/type_coercion.rs#L56

Note that the schema used there is the physical input schema. So when the logical input schema and the physical input schema have a different number of fields, the bug appears.
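
For readers following along, the mismatch can be sketched in plain Rust without DataFusion. Everything here is hypothetical: `SimpleSchema` is a stand-in for the real schema types, and the assumption that the physical input keeps only `d` is inferred from the panic message ("the len is 1 but the index is 1"), not taken from the planner code.

```rust
// Hypothetical, simplified stand-in for a schema; NOT DataFusion's or Arrow's real type.
#[derive(Debug)]
struct SimpleSchema {
    field_names: Vec<String>,
}

impl SimpleSchema {
    fn new(names: &[&str]) -> Self {
        SimpleSchema {
            field_names: names.iter().map(|n| n.to_string()).collect(),
        }
    }

    /// Position of `name` in this schema, if present.
    fn index_of(&self, name: &str) -> Option<usize> {
        self.field_names.iter().position(|f| f == name)
    }

    /// Checked field access. Indexing `field_names[idx]` directly would
    /// panic when `idx` is out of bounds, as in the reported trace.
    fn field(&self, idx: usize) -> Option<&str> {
        self.field_names.get(idx).map(|s| s.as_str())
    }
}

fn main() {
    // Logical input of the aggregate: the union exposes both columns.
    let logical = SimpleSchema::new(&["c", "d"]);
    // Physical input (assumption): only `d` remains, so the schema has one field.
    let physical = SimpleSchema::new(&["d"]);

    // The index is computed against the logical schema...
    let idx = logical.index_of("d").expect("d is in the logical schema");
    println!("index of d in logical schema: {}", idx);

    // ...but applied to the physical schema, where it is out of range.
    println!("physical field at that index: {:?}", physical.field(idx));

    // Resolving the column against the physical schema gives a valid index.
    println!("index of d in physical schema: {:?}", physical.index_of("d"));
}
```

The sketch uses checked access so that it runs to completion; an unchecked index into the one-field schema is what would produce the out-of-bounds panic.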

The most direct fix I can think of is to look up the column's idx in the physical input schema instead: let idx = input_schema.index_of(c.name.as_str())?;. However, the column name, the logical input schema field name, and the physical input schema field name are not always the same, as in the following case:

select
    sum(l_extendedprice * l_discount) as revenue
from
    lineitem
where
        l_shipdate >= date '1994-01-01'
  and l_shipdate < date '1995-01-01'
  and l_discount between 0.06 - 0.01 and 0.06 + 0.01
  and l_quantity < 24;
[datafusion/src/physical_plan/planner.rs:836] c = Column {
    relation: None,
    name: "SUM(lineitem.l_extendedprice * lineitem.l_discount)",
}
[datafusion/src/physical_plan/planner.rs:837] input_dfschema = DFSchema {
    fields: [
        DFField {
            qualifier: None,
            field: Field {
                name: "SUM(lineitem.l_extendedprice * lineitem.l_discount)",
                data_type: Float64,
                nullable: true,
                dict_id: 0,
                dict_is_ordered: false,
                metadata: None,
            },
        },
    ],
}
[datafusion/src/physical_plan/planner.rs:838] input_schema = Schema {
    fields: [
        Field {
            name: "SUM(lineitem.l_extendedprice Multiply lineitem.l_discount)",
            data_type: Float64,
            nullable: true,
            dict_id: 0,
            dict_is_ordered: false,
            metadata: None,
        },
    ],
    metadata: {},
}
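
The name mismatch in the debug output above is enough to make an exact-name lookup fail. A minimal sketch, using only the three strings printed above and a hypothetical `index_of` helper (not Arrow's `Schema::index_of`, which returns a `Result`):

```rust
/// Hypothetical helper: exact-name lookup over a list of field names.
fn index_of(field_names: &[&str], name: &str) -> Option<usize> {
    field_names.iter().position(|f| *f == name)
}

fn main() {
    // Strings copied from the debug output above.
    let column_name = "SUM(lineitem.l_extendedprice * lineitem.l_discount)";
    let logical_fields = ["SUM(lineitem.l_extendedprice * lineitem.l_discount)"];
    let physical_fields = ["SUM(lineitem.l_extendedprice Multiply lineitem.l_discount)"];

    // The logical schema spells the operator as `*`, so the name matches.
    println!("logical lookup:  {:?}", index_of(&logical_fields, column_name));

    // The physical schema spells it as `Multiply`, so the same name is not found.
    println!("physical lookup: {:?}", index_of(&physical_fields, column_name));
}
```

So a purely name-based lookup against the physical schema would fix the UNION ALL case but return an error for this query, unless the two naming schemes are reconciled first.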

A second option is to wrap the union logical plan in a projection plan, but the logical plan may then be rewritten by the optimizer. For the case mentioned by @Dandandan, the projection wrapped around the union is optimized so that it contains only d, so the bug is still there.

I am now trying a third approach.
