-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: check ambiguous column reference #12467
Conversation
ebf5aa8
to
10d7171
Compare
fcdee8d
to
a3ef9a2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @HuSen8891 -- this is a great contribution
I have one concern about handling Err
but otherwise looks good to me.
I think it would be good if someone else (perhaps @jonahgao ) also took a look at this PR to see if we are missing anything else
datafusion/common/src/dfschema.rs
Outdated
match column { | ||
Column { | ||
relation: Some(r), | ||
name: column_name, | ||
} => &r == qq && column_name == name, | ||
_ => false, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another way to express this logic I would find clearer would be
match column { | |
Column { | |
relation: Some(r), | |
name: column_name, | |
} => &r == qq && column_name == name, | |
_ => false, | |
} | |
if let Some(r) = column.relation.as_ref() { | |
r == qq && column.name == name | |
} else { | |
false | |
} |
(though this is fine too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!Refine this part by the clearer way.
@@ -186,7 +186,22 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> { | |||
let s = &ids[0..ids.len()]; | |||
// safe unwrap as s can never be empty or exceed the bounds | |||
let (relation, column_name) = form_identifier(s).unwrap(); | |||
Ok(Expr::Column(Column::new(relation, column_name))) | |||
// sanity check on column | |||
match schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are trying to avoid matching on Err
and converting to Ok
because each Err
results in at least one String
allocation (for the error message)
Since this code is potentially invoked many times during planning the allocations can add up and slow down planning
Is there any way to use a method that doesn't return Err
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The result returned is not Err in most cases, we can ignore the additional cost here?
a3ef9a2
to
ef6d915
Compare
Thanks @HuSen8891. I plan to review it later today. |
@@ -186,7 +186,22 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> { | |||
let s = &ids[0..ids.len()]; | |||
// safe unwrap as s can never be empty or exceed the bounds | |||
let (relation, column_name) = form_identifier(s).unwrap(); | |||
Ok(Expr::Column(Column::new(relation, column_name))) | |||
// sanity check on column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the comment says, how about adding a check_ambiguous_qualified_name
method to DFSchema
?
I think this would be more intuitive and simpler. The code here can be changed to the following.
// sanity check on column
schema.check_ambiguous_qualified_name(relation.as_ref(), column_name)?;
Ok(Expr::Column(Column::new(relation, column_name)))
check_ambiguous_qualified_name
is used to check whether the specified qualified name appears multiple times in the schema.
I suspect that we should not allow two identical qualified fields within a schema, but this would violate PR 6091. Therefore, introducing a method like check_ambiguous_qualified_name
to be called when needed might be a workaround.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That‘s good idea.
Add method 'check_ambiguous_name' to check ambiguous column reference.
ef6d915
to
b0e445a
Compare
@@ -412,17 +412,32 @@ impl DFSchema { | |||
} | |||
} | |||
|
|||
/// Check whether the column reference is ambiguous | |||
pub fn check_ambiguous_name( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the newly introduced functions qualified_field_with_qualified_name
and qualified_fields_with_qualified_name
do more than what we need. They might be unnecessary.
If I understand correctly, this function can be implemented as follows:
pub fn check_ambiguous_name(
&self,
qualifier: Option<&TableReference>,
name: &str,
) -> Result<()> {
let count = self
.iter()
.filter(|(field_q, f)| match (field_q, qualifier) {
(Some(q1), Some(q2)) => q1.resolved_eq(q2) && f.name() == name,
(None, None) => f.name() == name,
_ => false,
})
.take(2)
.count();
if count > 1 {
_schema_err!(SchemaError::AmbiguousReference {
field: Column {
relation: None,
name: name.to_string(),
},
})
} else {
Ok(())
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we no longer need to add these two functions🤔, qualified_field_with_qualified_name
and qualified_fields_with_qualified_name
, along with the modifications to qualified_field_with_name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two functions will check the ambiguous column reference, it's necessary, or we could not identify the case below:
create table t1(v1 int);
select count(*)
from t1
right outer join t1
on t1.v1 > 0;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need add check_ambiguous_name
at L132.
The changes to qualified_field_with_name
will cause the following query to produce an incorrect result.
DataFusion CLI v41.0.0
> create table t1(v1 int) as values(100);
0 row(s) fetched.
> select t1.v1 from t1 join t1 using(v1) cross join (select struct('foo' as v1) as t1);
+--------+
| t1[v1] |
+--------+
| foo |
+--------+
On the main branch, it gives
> select t1.v1 from t1 join t1 using(v1) cross join (select struct('foo' as v1) as t1);
+-----+
| v1 |
+-----+
| 100 |
+-----+
This is because it breaks search_dfschema, making it unable to find t1.v1
, but instead found the nested column t1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The column reference 't1.v1' is ambiguous in query.
select t1.v1 from t1 join t1 using(v1) cross join (select struct('foo' as v1) as t1);
Fix it later and also add to test case.
b0e445a
to
6e92103
Compare
6e92103
to
27ce5eb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me👍 Thanks @HuSen8891 ❤️
create table t1(v1 int) as values(100); | ||
|
||
## Query with Ambiguous column reference | ||
query error DataFusion error: Schema error: Ambiguous reference to unqualified field v1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be an "Ambiguous reference to qualified field", but currently the SchemaError cannot describe it, so the current situation is acceptable to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this error different from DuplicatedQualifiedField
?
https://github.com/apache/datafusion/blob/main/datafusion/common/src/error.rs#L149-L152
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DuplicatedQualifiedField
was used to check the schema during its creation. After PR 6091, we allow duplicate qualified fields in the schema, so this error is no longer used.
We may check ambiguous names during datafusion/datafusion/common/src/dfschema.rs Lines 358 to 369 in eaf8d16
On the other hand, having a separate Let's merge it first, and then explore other improve possibilities later. |
on t1.v1 > 0; | ||
|
||
query error DataFusion error: Schema error: Ambiguous reference to unqualified field v1 | ||
select t1.v1 from t1 join t1 using(v1) cross join (select struct('foo' as v1) as t1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Which issue does this PR close?
Closes #12337
Rationale for this change
check ambiguous column reference in query
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?