Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to DataFusion 17.0.0 #998

Merged
merged 44 commits into from
Feb 13, 2023
Merged

Conversation

andygrove
Copy link
Contributor

@andygrove andygrove commented Jan 18, 2023

Changes in DataFusion that are causing tests to fail:

  • Window frame offsets can have different data types depending on the context but Dask SQL currently expects UInt64
  • Join conditions are no longer limited to Column but can be any Expr

@andygrove
Copy link
Contributor Author

andygrove commented Jan 18, 2023

current failures:

FAILED tests/integration/test_compatibility.py::test_single_agg_count_no_group_by - ValueError: conflicting aggregation functions: [('count', 'a'), ('count', 'a')]
FAILED tests/integration/test_compatibility.py::test_multi_agg_count_no_group_by - ValueError: conflicting aggregation functions: [('count', 'a'), ('count', 'a')]
FAILED tests/integration/test_compatibility.py::test_agg_count_distinct_group_by - AssertionError: DataFrame.iloc[:, 1] (column name="cd_b") are different
FAILED tests/integration/test_compatibility.py::test_agg_count_distinct_no_group_by - AssertionError: DataFrame.iloc[:, 0] (column name="cd_a") are different
FAILED tests/integration/test_join.py::test_join_literal - pyo3_runtime.PanicException: assertion failed: !filters.is_empty()
FAILED tests/integration/test_join.py::test_join_on_unary_cond_only - pyo3_runtime.PanicException: assertion failed: !filters.is_empty()
FAILED tests/integration/test_join.py::test_intersect - TypeError: "unsupported join condition"

Comment on lines 168 to 179
match &self.frame_bound {
WindowFrameBound::Preceding(val) | WindowFrameBound::Following(val) => match val {
x if x.is_null() => Ok(None),
ScalarValue::UInt64(v) => Ok(*v),
ScalarValue::Int64(v) => Ok(v.map(|n| n as u64)),
ref x => Err(DaskPlannerError::Internal(format!(
"Unexpected window frame bound: {:?}",
x
))
.into()),
},
// The below is only safe because window bounds cannot be negative
Copy link
Collaborator

@ayushdg ayushdg Jan 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's probably a better way of writing the same logic win rust. cc @andygrove if you have suggestions

Comment on lines +14 to +41
class RexAliasPlugin(BaseRexPlugin):
"""
A RexAliasPlugin is an expression, which references a Subquery.
This plugin is thin on logic, however keeping with previous patterns
we use the plugin approach instead of placing the logic inline
"""

class_name = "RexAlias"

def convert(
self,
rel: "LogicalPlan",
rex: "Expression",
dc: DataContainer,
context: "dask_sql.Context",
) -> Union[dd.Series, Any]:
# extract the operands; there should only be a single underlying Expression
operands = rex.getOperands()
assert len(operands) == 1

sub_rex = operands[0]

value = RexConverter.convert(rel, sub_rex, dc, context=context)

if isinstance(value, DataContainer):
return value.df

return value
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this to resolve the query failures coming from our lack of handling for rex aliases; for now I've kept this separate from the relatively similar RexScalarSubqueryPlugin, not sure if there's a case to have these merged into one plugin that then conditionally grabs and converts the sub-rel/rex

@codecov-commenter
Copy link

codecov-commenter commented Jan 26, 2023

Codecov Report

Merging #998 (4cadf93) into main (4f7e7f3) will increase coverage by 0.06%.
The diff coverage is 84.37%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@            Coverage Diff             @@
##             main     #998      +/-   ##
==========================================
+ Coverage   81.09%   81.16%   +0.06%     
==========================================
  Files          77       78       +1     
  Lines        4338     4358      +20     
  Branches      779      782       +3     
==========================================
+ Hits         3518     3537      +19     
+ Misses        648      643       -5     
- Partials      172      178       +6     
Impacted Files Coverage Δ
dask_sql/physical/rex/convert.py 88.00% <ø> (ø)
dask_sql/physical/rex/core/alias.py 72.22% <72.22%> (ø)
dask_sql/context.py 94.23% <100.00%> (-1.22%) ⬇️
dask_sql/physical/rel/logical/join.py 96.03% <100.00%> (+14.58%) ⬆️
dask_sql/physical/rel/logical/window.py 95.32% <100.00%> (-1.79%) ⬇️
dask_sql/physical/rex/core/__init__.py 100.00% <100.00%> (ø)
dask_sql/physical/rex/core/subquery.py 57.14% <100.00%> (ø)
dask_sql/physical/rex/core/literal.py 58.09% <0.00%> (-2.86%) ⬇️
... and 3 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@jdye64 jdye64 changed the title Upgrade to DataFusion 16.0.0 Upgrade to DataFusion 17.0.0 Feb 9, 2023
@jdye64 jdye64 marked this pull request as ready for review February 12, 2023 21:58
Copy link
Collaborator

@jdye64 jdye64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. This was a long haul of a PR

.github/workflows/conda.yml Outdated Show resolved Hide resolved
continuous_integration/recipe/meta.yaml Show resolved Hide resolved
dask_planner/src/sql/optimizer.rs Show resolved Hide resolved
Copy link
Collaborator

@charlesbluca charlesbluca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove and @jdye64 for taking the time to unblock CI 🎉 couple of comments/questions but things mostly look good:

.github/workflows/conda.yml Outdated Show resolved Hide resolved
.github/workflows/conda.yml Outdated Show resolved Hide resolved
continuous_integration/recipe/meta.yaml Outdated Show resolved Hide resolved
[
[("a", "==", 1), ("b", "<", 10)],
[("a", "==", 1), ("b", ">", 5)],
[("b", ">", 5), ("b", "<", 10)],
[("a", "==", 1)],
],
[[("b", ">", 5), ("b", "<", 10)], [("a", "==", 1)]],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking; are these param changes related to disabling FilterColumnsPostJoin?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why those changed? I don't think it was me that changed them unless I just don't remember. It should not be related to FilterColumnsPostJoin however.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will say the new change seems more correct than the original way anyway.

Copy link
Collaborator

@ayushdg ayushdg Feb 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can comment a bit more on this. This isn't related to FilterColumnsPostJoin, but how datafusion optimizes predicates for filters/table_scan operations. Starting df14, they started using cnf (conjunctive normal form) for the expressions being passed as filters. (See #903 (comment)).
Since this particular test already had filters in dnf, the conversion of dnf->cnf (by df) -> dnf (by dask-sql since that's what arrow expects) resulted in an overly complex filter.

df16/17 seems to simplify some of these expressions and doesn't convert everything to cnf.

Copy link
Collaborator

@ayushdg ayushdg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally lgtm! Just minor comments/questions that should be a quick fix

dask_planner/Cargo.toml Outdated Show resolved Hide resolved
optimizer: Optimizer,
}

impl DaskSqlOptimizer {
/// Creates a new instance of the DaskSqlOptimizer with all the DataFusion desired
/// optimizers as well as any custom `OptimizerRule` trait impls that might be desired.
pub fn new(skip_failing_rules: bool) -> Self {
pub fn new() -> Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why we removed this

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That configuration value is no longer exposed via datafusion therefore this was "dead_code" and the compiler issuing warnings and cargo clippy was failing

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah they might have renamed that one to skip_failed_rules. https://docs.rs/datafusion-common/17.0.0/datafusion_common/config/struct.OptimizerOptions.html

We can update in a followup if needed.

tests/integration/test_join.py Outdated Show resolved Hide resolved
@jdye64
Copy link
Collaborator

jdye64 commented Feb 13, 2023

@ayushdg how do the changes suit you?

@jdye64 jdye64 merged commit f895346 into dask-contrib:main Feb 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants