-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support IGNORE NULLS for FIRST/LAST window function #9470
Conversation
) -> Result<Self> { | ||
if ignore_nulls { | ||
return exec_err!("NTH_VALUE ignore_nulls is not supported yet"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will have a separate PR for this soon.
cc @viirya @comphead @alamb @mustafasrepo |
Thanks @huaxingao. I'll review this in few days. |
datafusion/common/src/scalar/mod.rs
Outdated
) -> Result<Self> { | ||
// If ignoring nulls, find the next non-null index. | ||
if ignore_nulls { | ||
while index < (array.len() - 1) && !array.is_valid(index) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is probably cheaper to get all null indices first? and then find index in the array?
array.nulls().unwrap().valid_indices().collect::<Vec<_>>()
---- | ||
4 | ||
4 | ||
4 | ||
4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect, result of this query to be
----
NULL
NULL
4
4
ran it on duck db with following commands
CREATE TABLE t AS VALUES (3, 4), (4, 3), (null::bigint, 1), (null::bigint, 1);
SELECT FIRST_VALUE(col0 ignore nulls) OVER(ORDER BY col1) FROM t;
duckdb produces result above.
|
||
let mut valid_indices = Vec::new(); | ||
if self.ignore_nulls { | ||
valid_indices = arr.nulls().unwrap().valid_indices().collect::<Vec<_>>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, the unexpected result I mentioned is related to here. You might need calculate valid_indices within the range
not on whole arr
. Also, maybe offset handling will be needed on valid_indices
onwards once calculation is done within the range
.
3 | ||
|
||
query I | ||
SELECT FIRST_VALUE(column1) OVER(ORDER BY column2) FROM t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also queries that include ORDER BY <expr>
might as well include <expr>
in the SELECT
. That way query is much more readable, and we can keep track expected behaviour from the result. However, this is purely stylistic.
@@ -42,6 +42,7 @@ pub struct NthValue { | |||
/// Output data type | |||
data_type: DataType, | |||
kind: NthValueKind, | |||
ignore_nulls: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also for NTH_VALUE
right? Maybe add it to the title?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, nvm, saw your comment below.
NthValueKind::Last => ScalarValue::try_from_array(arr, range.end - 1), | ||
NthValueKind::First => { | ||
let index: usize = if self.ignore_nulls { | ||
valid_indices[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @mustafasrepo said, you will pick the first non-null value in the array, not the first non-null value in the range of the array.
} | ||
NthValueKind::Last => { | ||
let index = if self.ignore_nulls { | ||
valid_indices[valid_indices.len() - 1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
9d9364f
to
fd82234
Compare
NULL 1 | ||
|
||
# query II | ||
# SELECT LAST_VALUE(column1) IGNORE NULLS OVER(ORDER BY column2 DESC NULLS LAST), column2 FROM t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test is currently failing and I got
6 6
5 5
3 4
4 3
4 2
NULL 1
I suspect it's because the code is existing here. For OVER(ORDER BY column2 DESC NULLS LAST)
, i.e. OVER (ORDER BY column2 DESC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
, should is_causal
be true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I examined this problem. It seemed like it is related to memoization
. Sent a commit to fix this test case (Uncommented the test also). Thanks @huaxingao for triggering this error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @mustafasrepo for your help!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @huaxingao -- this looks great to me.
Thanks to everyone else who review and contributed comments and fixes 🙏
NULL 4 | ||
|
||
query II | ||
SELECT FIRST_VALUE(column1) IGNORE NULLS OVER(ORDER BY column2), column2 FROM t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the "all nulls" case is very crazy. Nice work testing
ecf8d44
to
7c7aa8b
Compare
Thank you all very much for helping me with this PR! |
Which issue does this PR close?
Closes #.
Related #9055
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?