Can no longer easily join duplicate schemas as of version 43 #14112

Closed
Tracked by #14008
westonpace opened this issue Jan 13, 2025 · 2 comments · Fixed by #14127
Labels: bug (Something isn't working), regression (Something that used to work no longer does)

Comments

@westonpace
Member

Describe the bug

This may be expected behavior, so feel free to close if this is intended. However, it is (potentially) different from Postgres behavior, so I figured I would mention it. The reproducer can probably explain the issue better than I can.

I'm able to work around the issue by renaming all fields on one of the inputs with a prefix. I didn't have to do this before, though, so I figured I'd report it and make sure the change is intentional.

To Reproduce

use arrow::array::{ArrayRef, Int32Array, RecordBatch};

use datafusion::prelude::*;
use std::sync::Arc;

#[tokio::main]
async fn main() {
    let ctx = SessionContext::new();

    let id: ArrayRef = Arc::new(Int32Array::from(vec![0, 1, 2]));
    let value: ArrayRef = Arc::new(Int32Array::from(vec![0, 1, 2]));
    let batch = RecordBatch::try_from_iter(vec![("id", id), ("value", value)]).unwrap();

    ctx.register_batch("tes", batch).unwrap();

    let id: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let value: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let batch = RecordBatch::try_from_iter(vec![("id", id), ("value", value)]).unwrap();

    ctx.register_batch("tes2", batch).unwrap();

    let tes = ctx.table("tes").await.unwrap();
    let tes2 = ctx.table("tes2").await.unwrap();

    // This succeeds (the two tables have different names and so the qualified names of the columns differ)
    let joined = tes
        .clone()
        .join(tes2, JoinType::Full, &["id"], &["id"], None)
        .unwrap();

    joined.show().await.unwrap();

    // This fails with the error:
    //
    // SchemaError(DuplicateQualifiedField { qualifier: Bare { table: "tes" }, name: "id" }, Some(""))
    let tes_clone = tes.clone();
    let joined = tes
        .join(tes_clone, JoinType::Full, &["id"], &["id"], None)
        .unwrap();

    joined.show().await.unwrap();
}

Expected behavior

I would expect both joins to succeed (they did in version 42).

Additional context

In Postgres, the closest I can get is:

CREATE TABLE tes (id int, val int);
INSERT INTO tes (id, val) VALUES (0, 0), (1, 1), (2, 2);
CREATE TABLE tes2 (id int, val int);
INSERT INTO tes2 (id, val) VALUES (1, 1), (2, 2), (3, 3);
SELECT * FROM tes FULL OUTER JOIN tes2 ON tes.id = tes2.id;
SELECT * FROM tes as t1 FULL OUTER JOIN tes as t2 ON t1.id = t2.id;

It's not exactly the same, since I have to alias tes. My full motivation here is to support a join we do in lance during a merge_insert: a full outer join between the existing data (target table) and the new data (source table). Since these tables have the same schema and are both created with SessionContext::read_table, they have the same name.

An alternative (and maybe simpler) fix would be to introduce a SessionContext::read_table_with_alias function which takes in an optional table name.

westonpace added the bug label Jan 13, 2025
alamb added the regression label Jan 13, 2025
@alamb
Contributor

alamb commented Jan 13, 2025

I wonder if this could be related to this ticket that @jonahgao is working on

@jonahgao
Member

jonahgao commented Jan 14, 2025

This is disallowed by #12608, because the output schema of such a join contains duplicate qualified names (here, tes.id appears twice), which can lead to ambiguous references.

Postgres also does not allow a self-join unless a different table alias is specified.
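
To make the failure mode concrete, here is a minimal, std-only sketch of this kind of duplicate check. It is a toy model and not DataFusion's implementation: `QualifiedField`, `DuplicateQualifiedField`, `field`, and `join_schema` are hypothetical names invented for illustration, chosen only to echo the shape of the error in the report.

```rust
use std::collections::HashSet;

/// Toy model (not DataFusion's type): a column identified by table qualifier + field name.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct QualifiedField {
    qualifier: String,
    name: String,
}

/// Hypothetical error, loosely echoing the shape of the error in the report.
#[derive(Debug, PartialEq)]
struct DuplicateQualifiedField {
    qualifier: String,
    name: String,
}

/// Convenience constructor for a qualified column.
fn field(qualifier: &str, name: &str) -> QualifiedField {
    QualifiedField {
        qualifier: qualifier.to_string(),
        name: name.to_string(),
    }
}

/// Concatenate both input schemas, rejecting the first (qualifier, name) pair seen twice.
fn join_schema(
    left: &[QualifiedField],
    right: &[QualifiedField],
) -> Result<Vec<QualifiedField>, DuplicateQualifiedField> {
    let mut seen = HashSet::new();
    let mut out = Vec::with_capacity(left.len() + right.len());
    for f in left.iter().chain(right.iter()) {
        // `insert` returns false when the pair is already present.
        if !seen.insert(f.clone()) {
            return Err(DuplicateQualifiedField {
                qualifier: f.qualifier.clone(),
                name: f.name.clone(),
            });
        }
        out.push(f.clone());
    }
    Ok(out)
}

fn main() {
    let tes = vec![field("tes", "id"), field("tes", "value")];
    let tes2 = vec![field("tes2", "id"), field("tes2", "value")];

    // Different qualifiers: tes.id and tes2.id are distinct columns.
    assert_eq!(join_schema(&tes, &tes2).map(|s| s.len()), Ok(4));

    // Self-join: tes.id is on both sides, so the merged schema is rejected.
    let err = join_schema(&tes, &tes).unwrap_err();
    assert_eq!(
        err,
        DuplicateQualifiedField {
            qualifier: "tes".to_string(),
            name: "id".to_string(),
        }
    );
}
```

In this model a reference such as tes.id could not be resolved to a single column after the self-join, which is the ambiguity the check guards against.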

psql (16.6 (Ubuntu 16.6-0ubuntu0.24.04.1))
Type "help" for help.

psql=> select * from t1 cross join t1;
ERROR:  table name "t1" specified more than once

psql=> select * from t1 cross join t1 t2;
 a | b | a | b
---+---+---+---
 1 | 1 | 1 | 1
(1 row)

I think we can add an alias() method to DataFrame, similar to pyspark.sql.DataFrame.alias:

impl DataFrame {
    pub fn alias(self, alias: &str) -> Result<DataFrame> {
        let plan = LogicalPlanBuilder::from(self.plan).alias(alias)?.build()?;
        Ok(DataFrame {
            session_state: self.session_state,
            plan,
        })
    }
}

Then add an alias before self-join:

    let tes_clone = tes.clone().alias("tes2").unwrap();
    let joined = tes
        .join(tes_clone, JoinType::Full, &["id"], &["id"], None)
        .unwrap();
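
Why an alias is enough can be sketched in a few lines of std-only Rust. This is a toy model under stated assumptions, not DataFusion code: `QualifiedField`, `with_alias`, and `has_duplicate` are hypothetical names. The point it illustrates is that aliasing changes only the qualifier, so the field names stay the same while the qualified names stop colliding.

```rust
/// Toy model (not DataFusion's type): a column identified by table qualifier + field name.
#[derive(Debug, Clone, PartialEq)]
struct QualifiedField {
    qualifier: String,
    name: String,
}

/// Re-qualify every column under `alias_name`, leaving the field names untouched.
fn with_alias(schema: &[QualifiedField], alias_name: &str) -> Vec<QualifiedField> {
    schema
        .iter()
        .map(|f| QualifiedField {
            qualifier: alias_name.to_string(),
            name: f.name.clone(),
        })
        .collect()
}

/// True if some (qualifier, name) pair occurs in both inputs.
fn has_duplicate(left: &[QualifiedField], right: &[QualifiedField]) -> bool {
    left.iter().any(|l| right.contains(l))
}

fn main() {
    let tes: Vec<QualifiedField> = ["id", "value"]
        .iter()
        .map(|n| QualifiedField {
            qualifier: "tes".to_string(),
            name: n.to_string(),
        })
        .collect();

    // Joining the table with an unaliased clone of itself collides on tes.id.
    assert!(has_duplicate(&tes, &tes));

    // After aliasing one side, tes.id and tes2.id are distinct qualified names.
    let aliased = with_alias(&tes, "tes2");
    assert!(!has_duplicate(&tes, &aliased));

    // The unqualified field names are unchanged by the alias.
    let names: Vec<&str> = aliased.iter().map(|f| f.name.as_str()).collect();
    assert_eq!(names, vec!["id", "value"]);
}
```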

3 participants