Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support IGNORE NULLS for FIRST/LAST window function #9470

Merged
merged 11 commits into from
Mar 12, 2024

Conversation

huaxingao
Copy link
Contributor

Which issue does this PR close?

Closes #.
Related #9055

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels Mar 5, 2024
@huaxingao huaxingao marked this pull request as draft March 5, 2024 18:08
) -> Result<Self> {
if ignore_nulls {
return exec_err!("NTH_VALUE ignore_nulls is not supported yet");
Copy link
Contributor Author

@huaxingao huaxingao Mar 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will have a separate PR for this soon.

@huaxingao huaxingao marked this pull request as ready for review March 5, 2024 23:43
@huaxingao huaxingao closed this Mar 5, 2024
@huaxingao huaxingao reopened this Mar 5, 2024
@huaxingao
Copy link
Contributor Author

cc @viirya @comphead @alamb @mustafasrepo
Could you please review this PR when you have a moment? Thanks a lot!

@viirya
Copy link
Member

viirya commented Mar 6, 2024

Thanks @huaxingao. I'll review this in few days.

) -> Result<Self> {
// If ignoring nulls, find the next non-null index.
if ignore_nulls {
while index < (array.len() - 1) && !array.is_valid(index) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is probably cheaper to get all null indices first? and then find index in the array?

array.nulls().unwrap().valid_indices().collect::<Vec<_>>()

@huaxingao huaxingao closed this Mar 7, 2024
@huaxingao huaxingao reopened this Mar 7, 2024
Comment on lines 4331 to 4335
----
4
4
4
4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect, result of this query to be

----
NULL
NULL
4
4

ran it on duck db with following commands

CREATE TABLE t AS VALUES  (3, 4), (4, 3), (null::bigint, 1), (null::bigint, 1);

SELECT FIRST_VALUE(col0 ignore nulls) OVER(ORDER BY col1) FROM t;

duckdb produces result above.


let mut valid_indices = Vec::new();
if self.ignore_nulls {
valid_indices = arr.nulls().unwrap().valid_indices().collect::<Vec<_>>();
Copy link
Contributor

@mustafasrepo mustafasrepo Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, the unexpected result I mentioned is related to here. You might need calculate valid_indices within the range not on whole arr. Also, maybe offset handling will be needed on valid_indices onwards once calculation is done within the range.

3

query I
SELECT FIRST_VALUE(column1) OVER(ORDER BY column2) FROM t;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also queries that include ORDER BY <expr> might as well include <expr> in the SELECT. That way query is much more readable, and we can keep track expected behaviour from the result. However, this is purely stylistic.

@@ -42,6 +42,7 @@ pub struct NthValue {
/// Output data type
data_type: DataType,
kind: NthValueKind,
ignore_nulls: bool,
Copy link
Member

@viirya viirya Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also for NTH_VALUE right? Maybe add it to the title?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, nvm, saw your comment below.

NthValueKind::Last => ScalarValue::try_from_array(arr, range.end - 1),
NthValueKind::First => {
let index: usize = if self.ignore_nulls {
valid_indices[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @mustafasrepo said, you will pick the first non-null value in the array, not the first non-null value in the range of the array.

}
NthValueKind::Last => {
let index = if self.ignore_nulls {
valid_indices[valid_indices.len() - 1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

@huaxingao huaxingao force-pushed the ignore_nulls_window_first branch from 9d9364f to fd82234 Compare March 10, 2024 01:28
NULL 1

# query II
# SELECT LAST_VALUE(column1) IGNORE NULLS OVER(ORDER BY column2 DESC NULLS LAST), column2 FROM t;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test is currently failing and I got

6 6
5 5
3 4
4 3
4 2
NULL 1

I suspect it's because the code is existing here. For OVER(ORDER BY column2 DESC NULLS LAST), i.e. OVER (ORDER BY column2 DESC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), should is_causal be true?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I examined this problem. It seemed like it is related to memoization. Sent a commit to fix this test case (Uncommented the test also). Thanks @huaxingao for triggering this error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @mustafasrepo for your help!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @huaxingao -- this looks great to me.

Thanks to everyone else who review and contributed comments and fixes 🙏

NULL 4

query II
SELECT FIRST_VALUE(column1) IGNORE NULLS OVER(ORDER BY column2), column2 FROM t;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the "all nulls" case is very crazy. Nice work testing

@huaxingao huaxingao force-pushed the ignore_nulls_window_first branch from ecf8d44 to 7c7aa8b Compare March 12, 2024 00:40
@alamb alamb merged commit 92d046d into apache:main Mar 12, 2024
23 checks passed
@huaxingao
Copy link
Contributor Author

Thank you all very much for helping me with this PR!

@huaxingao huaxingao deleted the ignore_nulls_window_first branch March 12, 2024 16:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants