Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement IGNORE NULLS for FIRST_VALUE #9411

Merged
merged 11 commits into from
Mar 5, 2024
Merged

Conversation

huaxingao
Copy link
Contributor

Which issue does this PR close?

Closes #.
Related #9055

Rationale for this change

Spark has ignore null in First_value, so we want to support this option too.
The logic for ignore null in Last_value is similar. I will have a separate PR for ignore null in Last_value

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the physical-expr Physical Expressions label Mar 1, 2024
Ok((!indices.is_empty()).then_some(indices.value(0) as _))

if self.is_ignore_null {
let indices = lexsort_to_indices(&sort_columns, Some(value.len()))?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead you can use

Suggested change
let indices = lexsort_to_indices(&sort_columns, Some(value.len()))?;
let indices = lexsort_to_indices(&sort_columns, None)?;

to be more explicit about we are sorting whole array

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thanks!

)
}

pub fn with_ignore_null(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this API, maybe we can use builder API as below

pub fn with_ignore_null(mut self, ignore_null: bool) -> Self{
    self.ignore_null = ignore_null;
    self
}

However, this purely stylistic. Feel free to proceed as you wish

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed. Thanks for the suggestion!

Copy link
Contributor

@mustafasrepo mustafasrepo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor comments. Other than these this PR is LGTM!. Thanks @huaxingao for this PR.

@@ -213,6 +235,7 @@ struct FirstValueAccumulator {
ordering_req: LexOrdering,
// Stores whether incoming data already satisfies the ordering requirement.
requirement_satisfied: bool,
is_ignore_null: bool,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be good to add a comment like above parameters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. Thanks

// If ignoring nulls, find the first non-null value.
for i in 0..value.len() {
if !value.is_null(i) {
return Ok((!value.is_empty()).then_some(i));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once you enter this loop, I think value is not empty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thanks

// If ignoring nulls, find the first non-null value.
for index in indices.iter().flatten() {
if !value.is_null(index as usize) {
return Ok((!value.is_empty()).then_some(index as usize));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I saw you implement it in physical expression FirstValue. But do you add ignore null parameter to logical one and pass it through logical to physical? I.e., does a query with first value + ignore null work now? Can you add an e2e test to sqllogictests?

@@ -44,6 +44,7 @@ pub struct FirstValue {
expr: Arc<dyn PhysicalExpr>,
ordering_req: LexOrdering,
requirement_satisfied: bool,
is_ignore_null: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can use ignore_nulls: bool, it is consistent with SQL keywords and also window functions that in parallel implementing IGNORE NULLS .

Would be also nice to have a comments in struct about this field

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thanks

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @huaxingao
Would be that possible to add tests to aggregate.slt?

@comphead
Copy link
Contributor

comphead commented Mar 1, 2024

Hmm, I saw you implement it in physical expression FirstValue. But do you add ignore null parameter to logical one and pass it through logical to physical? I.e., does a query with first value + ignore null work now? Can you add an e2e test to sqllogictests?

It should start with logical plan, from #9221 the null treatment is allowed for window functions now https://github.com/apache/arrow-datafusion/blob/main/datafusion/sql/src/expr/function.rs#L61

For agg I suppose we need to do the same

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate substrait labels Mar 2, 2024
@huaxingao huaxingao force-pushed the ignore_null_first branch 3 times, most recently from 9cc5ce3 to 366e7ef Compare March 2, 2024 19:11
@huaxingao huaxingao force-pushed the ignore_null_first branch from 3e84596 to b3f8952 Compare March 2, 2024 19:24
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Mar 3, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution @huaxingao -- this looks good to me. I had some style suggestions but nothing that would prevent the PR from merging in my mind

@@ -246,6 +246,7 @@ fn create_physical_name(e: &Expr, is_first_expr: bool) -> Result<String> {
args,
filter,
order_by,
..
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As as style thing I have found it helpful to not use a .. match all, and instead explicitly ignore each field like

Suggested change
..
null_treatment: _,

With this treatment, if/when someone adds a new field to AggregateFunction in the future, the compiler will tell them all the places that may need to be changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thanks for the suggestion!

@@ -44,6 +44,7 @@ pub struct FirstValue {
expr: Arc<dyn PhysicalExpr>,
ordering_req: LexOrdering,
requirement_satisfied: bool,
ignore_null: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a very minor nit is that some parts of this PR use ignore_null and some parts use ignore_nulls (with an s). It might be nice to make them uniform

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have change all of them to ignore_nulls. Thanks for catching this!

@@ -672,7 +672,7 @@ pub fn to_substrait_agg_measure(
),
) -> Result<Measure> {
match expr {
Expr::AggregateFunction(expr::AggregateFunction { func_def, args, distinct, filter, order_by }) => {
Expr::AggregateFunction(expr::AggregateFunction { func_def, args, distinct, filter, order_by, .. }) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment here about matching all vs explicit field matches

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thanks!

@@ -544,6 +544,7 @@ pub struct AggregateFunction {
pub filter: Option<Box<Expr>>,
/// Optional ordering
pub order_by: Option<Vec<Expr>>,
pub null_treatment: Option<NullTreatment>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment as other parameters did.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. Thanks

..
}) => {
fmt_function(f, func_def.name(), *distinct, args, true)?;
if let Some(nt) = null_treatment {
write!(f, "{}", nt)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
write!(f, "{}", nt)?;
write!(f, " {}", nt)?;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed. Thanks

@@ -1823,6 +1832,9 @@ fn create_name(e: &Expr) -> Result<String> {
if let Some(order_by) = order_by {
info += &format!(" ORDER BY [{}]", expr_vec_fmt!(order_by));
};
if let Some(nt) = null_treatment {
info += &format!("{}", nt);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
info += &format!("{}", nt);
info += &format!(" {}", nt);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed. Thanks

}
}

pub fn ignore_null(mut self, ignore_null: bool) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a builder style API, a common API name pattern should be (e.g., the existing with_requirement_satisfied):

Suggested change
pub fn ignore_null(mut self, ignore_null: bool) -> Self {
pub fn with_ignore_null(mut self, ignore_null: bool) -> Self {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed. Thanks!

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I have a few minor comments. Thanks @huaxingao.

@huaxingao
Copy link
Contributor Author

Test failed:

External error: query result mismatch:
[SQL] EXPLAIN SELECT DISTINCT i FROM t1000;
[Diff] (-expected|+actual)
    logical_plan
    Aggregate: groupBy=[[t1000.i]], aggr=[[]]
    --TableScan: t1000 projection=[i]
    physical_plan
    AggregateExec: mode=FinalPartitioned, gby=[i@0 as i], aggr=[]
    --CoalesceBatchesExec: target_batch_size=8192
    ----RepartitionExec: partitioning=Hash([i@0], 4), input_partitions=4
    ------AggregateExec: mode=Partial, gby=[i@0 as i], aggr=[]
-   --------MemoryExec: partitions=4, --------MemoryExec: partitions=4, partition_sizes=[1, 1, 2, 1]
+   --------MemoryExec: partitions=4, partition_sizes=[1, 2, 1, 1]
at test_files/limit.slt:392

partition_sizes=[1, 2, 1, 1] should also be correct, right? I got partition_sizes=[1, 1, 2, 1] when running the test on my local.

@@ -3214,7 +3214,7 @@ JOIN sales_global AS e
ON s.currency = e.currency AND
s.ts >= e.ts
GROUP BY s.sn, s.zip_code, s.country, s.ts, s.currency
ORDER BY s.sn
ORDER BY s.sn, s.zip_code
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expected result is

0 GRC 0 2022-01-01T06:00:00 EUR 30
1 FRA 1 2022-01-01T08:00:00 EUR 50
1 TUR 2 2022-01-01T11:30:00 TRY 75
1 FRA 3 2022-01-02T12:00:00 EUR 200
0 GRC 4 2022-01-03T10:00:00 EUR 80
1 TUR 4 2022-01-03T10:00:00 TRY 100

but my test has

0 GRC 0 2022-01-01T06:00:00 EUR 30
1 FRA 1 2022-01-01T08:00:00 EUR 50
1 TUR 2 2022-01-01T11:30:00 TRY 75
1 FRA 3 2022-01-02T12:00:00 EUR 200
1 TUR 4 2022-01-03T10:00:00 TRY 100
0 GRC 4 2022-01-03T10:00:00 EUR 80

I think the above is also legal. I added one more column in ORDER BY to make the returned order deterministic.

@alamb
Copy link
Contributor

alamb commented Mar 4, 2024

I believe the CI failures were due to a logical conflict on main, fixed in #9444

So merging up from main will likely fix them

Update: I think this is unrelated -- I filed #9450 to track

@alamb
Copy link
Contributor

alamb commented Mar 4, 2024

Closing / reopening to trigger CI

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @huaxingao -- I am trying to clear the PR backlog to keep the code in this repo moving. I noticed this PR had conflicts due to #8891, so I took the liberty of merging up from main to resolve the conflicts (and hopefully get this PR ready to merge)

Thanks again for the contribution and I am sorry this one has not yet merged.

@huaxingao
Copy link
Contributor Author

Thanks @alamb

@alamb alamb merged commit 3aba67e into apache:main Mar 5, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented Mar 5, 2024

🚀

@huaxingao
Copy link
Contributor Author

Thank you all very much for helping me with this PR! @alamb @comphead @mustafasrepo @viirya

@huaxingao huaxingao deleted the ignore_null_first branch March 5, 2024 23:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt) substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants