Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrame::union() does not detect schema mismatches #13287

Closed
ttencate opened this issue Nov 7, 2024 · 3 comments
Closed

DataFrame::union() does not detect schema mismatches #13287

ttencate opened this issue Nov 7, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@ttencate
Copy link

ttencate commented Nov 7, 2024

Describe the bug

Using datafusion version 42.2.0.

Follow up to #13092, which was fixed by #13117 thanks to @Omega359.

However, this fix will not catch mistakes like reordered columns. For example, if table A has columns a, b and table B has columns b, a, then DataFusion will happily compute the union, with the wrong values in the wrong columns.

So why not just compare the entire schema? Or at least the column names and types (i.e. ignoring metadata)? The docs explicitly say that the schemas must be equal.

To Reproduce

#[tokio::test]
async fn test_union() {
    use crate::data_frame;
    use datafusion::assert_batches_sorted_eq;
    use datafusion::common::arrow::array::{ArrayRef, StringArray};
    use datafusion::common::arrow::record_batch::RecordBatch;
    use std::sync::Arc;

    let ctx = SessionContext::new();
    let a = ctx
        .read_batch(
            RecordBatch::try_from_iter([
                ("a", Arc::new(StringArray::from(vec!["a"])) as ArrayRef),
                ("b", Arc::new(StringArray::from(vec!["b"])) as ArrayRef),
            ])
            .unwrap(),
        )
        .unwrap();
    let b = ctx
        .read_batch(
            RecordBatch::try_from_iter([
                ("b", Arc::new(StringArray::from(vec!["b"])) as ArrayRef),
                ("a", Arc::new(StringArray::from(vec!["a"])) as ArrayRef),
            ])
            .unwrap(),
        )
        .unwrap();

    let union = a.union(b).unwrap();
    assert_batches_sorted_eq!(
        [
            "+---+---+",
            "| a | b |",
            "+---+---+",
            "| a | b |",
            "| a | b |",
            "+---+---+",
        ],
        &union.collect().await.unwrap()
    );
}

Expected behavior

Test passes.

Actual behavior

assertion `left == right` failed: 

expected:

[
    "+---+---+",
    "| a | b |",
    "+---+---+",
    "| a | b |",
    "| a | b |",
    "+---+---+",
]
actual:

[
    "+---+---+",
    "| a | b |",
    "+---+---+",
    "| a | b |",
    "| b | a |",
    "+---+---+",
]
@ttencate ttencate added the bug Something isn't working label Nov 7, 2024
@Omega359
Copy link
Contributor

Omega359 commented Nov 7, 2024

The expected behaviour in your test is to fail, not to pass I think according to your comments.

@Omega359
Copy link
Contributor

Omega359 commented Nov 7, 2024

I think this is actually working correctly and the docs might need to be updated. I do not think union requires the fields to be the same nor have the same names. It just requires the same number of fields and each corresponding field to be coercible to a common type.

create table t1 (a,  b)  AS
VALUES
  ('a'::varchar, 'b'::varchar),
  ('c', 'd') ;

create table t2 (c, d)  AS select b, a from t1;

select a, b from t1;
select c, d from t2;
select a, b from t1 union select c, d from t2;

output:

SELECT 2
SELECT 2
 a | b 
---+---
 a | b
 c | d
(2 rows)

 c | d 
---+---
 b | a
 d | c
(2 rows)

 a | b 
---+---
 d | c
 a | b
 b | a
 c | d

https://onecompiler.com/postgresql/42xehmfuu

https://www.postgresql.org/docs/current/typeconv-union-case.html

@ttencate
Copy link
Author

ttencate commented Nov 7, 2024

Wow, yeah, I didn't realise that SQL did it the same way.

It's still a footgun though. In general, column indices rarely matter, but column names do. Here, it's just the opposite.

Closing in favour of #12650. Thanks for your consideration!

@ttencate ttencate closed this as completed Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants