-
Notifications
You must be signed in to change notification settings - Fork 171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] connect: df.join
#3354
base: andrew/connect-with-columns-renamed
Are you sure you want to change the base?
[FEAT] connect: df.join
#3354
Conversation
andrewgazelka
commented
Nov 20, 2024
•
edited
Loading
edited
- fix
4ee6017
to
f40788c
Compare
863ac08
to
c999ae9
Compare
f40788c
to
d706ff8
Compare
c999ae9
to
394c381
Compare
d706ff8
to
264817c
Compare
394c381
to
3fb2d4f
Compare
264817c
to
252d2b4
Compare
3fb2d4f
to
4f49d30
Compare
252d2b4
to
ee02aa1
Compare
4f49d30
to
e95abbd
Compare
ee02aa1
to
22f9234
Compare
e95abbd
to
0e54699
Compare
22f9234
to
9375511
Compare
0e54699
to
cdcd749
Compare
9375511
to
7227e42
Compare
7350795
to
c52c7a0
Compare
db3b512
to
3394a66
Compare
c52c7a0
to
17f65ec
Compare
17f65ec
to
a56ba92
Compare
Graphite Automations"Notify author when CI fails" took an action on this PR • (11/21/24)1 teammate was notified to this PR based on Andrew Gazelka's automation. |
CodSpeed Performance ReportMerging #3354 will degrade performances by 54.24%Comparing Summary
Benchmarks breakdown
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3354 +/- ##
=======================================
Coverage 77.35% 77.35%
=======================================
Files 685 686 +1
Lines 83631 83682 +51
=======================================
+ Hits 64695 64735 +40
- Misses 18936 18947 +11
|
None, | ||
None, | ||
None, | ||
false, // todo(correctness): we want join keys or not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what spark expects here. Usually SQL will retain the join keys, and most dataframes will not. I'd try to perform a simple join with spark and see what happens.
if let Some(join_condition) = join_condition { | ||
bail!("Join conditions are not yet supported; use using_columns (join keys) instead; got {join_condition:?}"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should definitely support both USING
and regular joins. They are mutually exclusive though, so we'll need to error if both are provided
bail!("Right side of join is required"); | ||
}; | ||
|
||
if let Some(join_condition) = join_condition { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we'll need some complicated logic to split the join condition into left and right halves. Check out the SQL frontend. We're doing something similar to split the join condition
tests/connect/test_join.py
Outdated
def test_join(spark_session): | ||
# Create two DataFrames with overlapping IDs | ||
df1 = spark_session.range(5) | ||
df2 = spark_session.range(3, 7) | ||
|
||
# Perform inner join on 'id' column | ||
joined_df = df1.join(df2, "id", "inner") | ||
|
||
# Verify join results | ||
# assert joined_df.count() == 2, "Inner join should only return matching rows" | ||
|
||
# Convert to pandas to verify exact results | ||
joined_pandas = joined_df.toPandas() | ||
assert set(joined_pandas["id"].tolist()) == {3, 4}, "Inner join should only contain IDs 3 and 4" | ||
|
||
# Test left outer join | ||
left_joined_pandas = df1.join(df2, "id", "left").toPandas() | ||
assert set(left_joined_pandas["id"].tolist()) == {0, 1, 2, 3, 4}, "Left join should keep all rows from left DataFrame" | ||
|
||
# Test right outer join | ||
right_joined_pandas = df1.join(df2, "id", "right").toPandas() | ||
assert set(right_joined_pandas["id"].tolist()) == {3, 4, 5, 6}, "Right join should keep all rows from right DataFrame" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're going to need a lot more tests for joins. Check out the other join tests, both for dataframe and sql.
See this comment inline on Graphite.
bail!("Join type must be specified; got Unspecified") | ||
} | ||
JoinType::Inner => Ok(daft_core::join::JoinType::Inner), | ||
JoinType::FullOuter => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FullOuter
is just the same as Outer
} | ||
JoinType::LeftOuter => Ok(daft_core::join::JoinType::Left), // todo(correctness): is this correct? | ||
JoinType::RightOuter => Ok(daft_core::join::JoinType::Right), | ||
JoinType::LeftAnti => Ok(daft_core::join::JoinType::Anti), // todo(correctness): is this correct? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove comment. This is correct
JoinType::FullOuter => { | ||
bail!("Full outer joins not yet supported") // todo(completeness): add support for full outer joins if it is not already implemented | ||
} | ||
JoinType::LeftOuter => Ok(daft_core::join::JoinType::Left), // todo(correctness): is this correct? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove comment. this is correct
None, | ||
None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does spark suffix/prefix the duplicate join keys at all? We should verify and match what they do here.
dd5e05c
to
2c4cb41
Compare
|
||
impl From<daft_logical_plan::JoinType> for JoinTypeInfo { | ||
fn from(join_type: daft_logical_plan::JoinType) -> Self { | ||
JoinTypeInfo::Regular(join_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace JoinTypeInfo::Regular(join_type) with Self::Regular(join_type) to follow Rust style guidelines for impl From blocks
Spotted by Graphite Reviewer (based on CI logs)
Is this helpful? React 👍 or 👎 to let us know.
let result = match join_type { | ||
JoinTypeInfo::Cross => { | ||
left.cross_join(&right, None, None)? // todo(correctness): is this correct? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cross joins are not supported in Daft yet. This should be replaced with a bail!() statement that indicates cross joins are not supported.
Spotted by Graphite Reviewer (based on CI logs)
Is this helpful? React 👍 or 👎 to let us know.
bb9a07f
to
e5c3a8d
Compare
2c4cb41
to
02a0e34
Compare
|
||
|
||
|
||
def test_cross_join(spark_session): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't support cross joins yet
e5c3a8d
to
af8b7ca
Compare
02a0e34
to
1b0de2e
Compare