Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Push SortOptions into DynComparator Allowing Nested Comparisons (#5426) #5792

Merged
merged 5 commits into from
May 28, 2024

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #5426
Closes #5199
Closes #5559

Rationale for this change

See tickets

What changes are included in this PR?

This pushes null handling out of LexicographicalComparator and into DynComparator. This not only closes off a potential footgun, but also allows supporting nested types in comparison operators.

This probably needs a few more tests prior to merge, in particular StructArray was not supported by LexicographicalComparator before and therefore needs new tests, but the broad meat of the PR I think is ready for review

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label May 21, 2024
/// # use arrow_ord::cmp;
/// # use arrow_ord::ord::make_comparator;
/// # use arrow_schema::{ArrowError, SortOptions};
/// fn eq(a: &dyn Array, b: &dyn Array) -> Result<BooleanArray, ArrowError> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I debated adding SortOptions to the existing comparison kernels, but that would not only be breaking change, but is also somewhat incoherent. Whilst I accept postgres and some other systems do this, it is rather surprising, at least to me.

Having them separate also makes it more obvious what has an optimised vectorised kernel and what doesn't

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the API choice 👍

I recommend adding the rationale to the docstring at the stop ("this function does not support overrriding NULL handling to ensure optimized kernels, see make_comparator for more control over ordering)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring talks about comparsions and total order and this function implement eq -- woudl it be better ti implement cmp to better match the text?

@tustvold tustvold changed the title Push SortOptions into DynComparator (#5426) Push SortOptions into DynComparator Allowing Nested Comparisons (#5426) May 21, 2024
@tustvold tustvold force-pushed the sort-options-comparator branch from 598c2f1 to 15b180e Compare May 21, 2024 17:02
@tustvold tustvold force-pushed the sort-options-comparator branch from 15b180e to 10ce321 Compare May 21, 2024 17:16
fn compare_boolean(left: &dyn Array, right: &dyn Array) -> DynComparator {
let left: BooleanArray = left.as_boolean().clone();
let right: BooleanArray = right.as_boolean().clone();
fn compare_impl<const NULLS_FIRST: bool, const DESCENDING: bool, F>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the benefit over using boolean parameters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoids conditional branches in the comparator closure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. 💯

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I think of this is that the <const ...> template parameters basically force the compiler to make special versions of this function -- so the downside is potentially more code (you now get a new copy of the function for each distinct set of type parameters) but the upside is that each of those copies has more optimization opportunities (like the arguments are constant)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me -- thank you @tustvold . I think it is a very elegant way to solve the problems

Maybe I missed it, but I didn't see any new tests for comparing structs / lists with nested nullability, nor did I see any changes to existing tests.

I think it is important to add tests to both document precisely what the expected behavior as well as ensure we don't introduce a regression in some future refactot

Can you please also run the benchmarks to ensure this doesn't cause any regressions?

fn compare_boolean(left: &dyn Array, right: &dyn Array) -> DynComparator {
let left: BooleanArray = left.as_boolean().clone();
let right: BooleanArray = right.as_boolean().clone();
fn compare_impl<const NULLS_FIRST: bool, const DESCENDING: bool, F>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I think of this is that the <const ...> template parameters basically force the compiler to make special versions of this function -- so the downside is potentially more code (you now get a new copy of the function for each distinct set of type parameters) but the upside is that each of those copies has more optimization opportunities (like the arguments are constant)

///
/// # Postgres-compatible Nested Comparison
///
/// Whilst SQL prescribes ternary logic for nulls, that is comparing a value against a null yields
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the use of terms ternary logic / total ordering/etc somewhat hard to grok exactly what this is saying

Perhaps we can offer an example here -- like

If you want to compare nested nulls in your structures so that

{ a: 1, b: null }
null
{ a: 2, b: 1 }

Sorts like

{ a: 2, b: 1 }
{ a: 1, b: null }
null

if you specify DESC NULL LAST

(if I got that right)

/// assert_eq!(std::cmp::Ordering::Less, cmp(0, 1));
/// assert_eq!(cmp(0, 1), Ordering::Less);
///
/// let array1 = Int32Array::from(vec![Some(1), None]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think making this a separate section / example might be clearer

Suggested change
/// let array1 = Int32Array::from(vec![Some(1), None]);
///
/// # Example ordering with NULL
/// let array1 = Int32Array::from(vec![Some(1), None]);

Also it might be helpful here to add an example of SortOptions that are not the default (to show the difference)

/// # use arrow_ord::cmp;
/// # use arrow_ord::ord::make_comparator;
/// # use arrow_schema::{ArrowError, SortOptions};
/// fn eq(a: &dyn Array, b: &dyn Array) -> Result<BooleanArray, ArrowError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the API choice 👍

I recommend adding the rationale to the docstring at the stop ("this function does not support overrriding NULL handling to ensure optimized kernels, see make_comparator for more control over ordering)

/// # use arrow_ord::cmp;
/// # use arrow_ord::ord::make_comparator;
/// # use arrow_schema::{ArrowError, SortOptions};
/// fn eq(a: &dyn Array, b: &dyn Array) -> Result<BooleanArray, ArrowError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring talks about comparsions and total order and this function implement eq -- woudl it be better ti implement cmp to better match the text?

///
/// If `nulls_first` is true `NULL` values will be considered less than any non-null value,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might help to point people at the faster kernels too

like "see kernels in ord::cmp for fast kernels

@tustvold
Copy link
Contributor Author

Tests are written but waiting on #5811 (as they found a bug in the row format)

@tustvold tustvold marked this pull request as draft May 28, 2024 15:51
@tustvold tustvold marked this pull request as ready for review May 28, 2024 16:56
/// of nested nulls. That is nulls within nested types are either greater than any value,
/// or less than any value (Spark). This could be implemented as below
/// Whilst SQL prescribes ternary logic for nulls, that is comparing a value against a NULL yields
/// a NULL, many systems, including postgres, instead apply a total ordering to comparison of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@tustvold tustvold merged commit 9828bf0 into apache:master May 28, 2024
21 of 22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
3 participants