Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support Substrait's IntervalCompound type/literal instead of interval-month-day-nano UDT #12112

Merged
merged 8 commits into from
Oct 29, 2024

Conversation

Blizzara
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

Previously, Substrait didn't have a type & literal for the intervals that combine year/month and day/time, so DF had to use a custom type to support those. substrait-io/substrait#665 however added support for IntervalCompound, which is a superset of DF's IntervalMonthDayNano - so we can switch over.

What changes are included in this PR?

Add support for consuming Substrait's IntervalCompound type/literals (in addition to existing UDTs, for backwards compatibility)
Switch Substrait producing to produce IntervalCompound type/literals

Are these changes tested?

Tested through existing unit tests.

However, as we now don't produce any UDTs, we cannot easily test the UDT-producing code anymore.

Are there any user-facing changes?

Substrait now produces IntervalCompound types instead of interval-month-day-nano UDTs. Both types can be consumed.

Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Oct 22, 2024
@Blizzara Blizzara force-pushed the avo/substrait-interval-compound branch from d89da8e to 66cb65a Compare October 28, 2024 00:06
@Blizzara Blizzara marked this pull request as ready for review October 28, 2024 00:06
)]
pub const INTERVAL_MONTH_DAY_NANO_TYPE_REF: u32 = 3;

/// For [`DataType::Interval`] with [`IntervalUnit::MonthDayNano`].
///
/// [`DataType::Interval`]: datafusion::arrow::datatypes::DataType::Interval
/// [`IntervalUnit::MonthDayNano`]: datafusion::arrow::datatypes::IntervalUnit::MonthDayNano
#[deprecated(
since = "42.1.0",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that should be next version, what will it be? :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

43 I believe: #12470

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated in d527b4e

@@ -2310,39 +2306,6 @@ mod test {
Ok(())
}

#[test]
fn custom_type_literal_extensions() -> Result<()> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sadly as we no longer produce custom types, there's no easy way to test for them either :/
maybe if we one day implement some UDT support then that can be used for testing here instead. But for now there's no good replacement.

@Blizzara Blizzara force-pushed the avo/substrait-interval-compound branch from 66cb65a to f1e5176 Compare October 28, 2024 00:44
@Blizzara Blizzara force-pushed the avo/substrait-interval-compound branch from f1e5176 to e18ec17 Compare October 28, 2024 01:38
@Blizzara
Copy link
Contributor Author

this should be good to review, fyi @alamb @vbarua @westonpace @tokoko (feel free to lmk if you'd prefer not to be tagged :))

@github-actions github-actions bot removed the Stale PR has not had any activity for some time label Oct 28, 2024
p
);
}
let nanos = *subseconds * i64::pow(10, (*p - 9) as u32);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't precision handling here and on L2294 for IntervalDayToSecond be similar? Seems like only 0, 3, 6, 9 are supported over there. This one makes more sense to me, just curious why they are different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a very fair question! I think in general the answer is "yes", but there's a catch - the IntervalDayTime in Arrow only supports millisecond precision, so it cannot be done with a single i64::pow. It could be done with two (making it more complicated), or with a fp64::pow (leading to floating point ops).

But there is an additional question of should we allow the silent precision-losing operations (and the answer is probably no, at least not w/o some user config option) so I think that parts needs a better look anyways and I was thinking of doing it separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good. I think there are two questions (probably unrelated?) to explore here: 1) precision-losing (for p>3) and 2) whether in-between values are valid (p==1, p==2)

@alamb
Copy link
Contributor

alamb commented Oct 28, 2024

this should be good to review, fyi @alamb @vbarua @westonpace @tokoko (feel free to lmk if you'd prefer not to be tagged :))

Please do continue to tag me -- thank you very much

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thanks @Blizzara

Copy link
Contributor

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 comments but nothing blocking from me

@@ -1842,7 +1848,7 @@ fn from_substrait_type(
),
}
} else {
// Kept for backwards compatibility, new plans should include the extension instead
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: I would suggest updating this warning here and further down to something like

// Kept for backwards compatibility
// Producers should use IntervalCompound instead

to make it clearer what/who should be using the new type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -1831,8 +1832,13 @@ fn from_substrait_type(
Ok(DataType::Interval(IntervalUnit::YearMonth))
}
r#type::Kind::IntervalDay(_) => Ok(DataType::Interval(IntervalUnit::DayTime)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly out of scope of this PR, but do we need to checking the precision on this type now?

Based on https://github.com/substrait-io/substrait/blob/683f4179a058c2c99c04501b920a48ff372356ff/proto/substrait/type.proto#L132-L140 DataFusion can only handle this type with precision loss when precision is set to 0 or 3 (seconds or milliseconds).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup see #12112 (comment). I think we should change it, maybe with some option so that user can decide if they want to throw or accept the precision loss. But will be a separate PR :)

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks awesome! Only one small question.

p
);
}
let nanos = *subseconds * i64::pow(10, (*p - 9) as u32);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably butchering this but should this be 9 - *p instead of *p - 9? If precision is 0 then we want 10^9 right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I think you’re right 🙈 math is hard.. I’ll fix this tomorrow. Thanks for catching it!

Copy link
Contributor

@alamb alamb Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I think you’re right 🙈 math is hard.. I’ll fix this tomorrow. Thanks for catching it!

Marking this PR as draft so we don't accidentally merge it until it is fixed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 79f6341 and added a test in ea8bc4a, thanks again for catching this @westonpace!

Thanks @alamb, I converted back to PR now and I think it's ready to merge :)

@alamb alamb marked this pull request as draft October 29, 2024 11:34
@Blizzara Blizzara marked this pull request as ready for review October 29, 2024 13:38
@alamb
Copy link
Contributor

alamb commented Oct 29, 2024

I'll plan to merge this PR once the CI checks have completed

@alamb
Copy link
Contributor

alamb commented Oct 29, 2024

🚀

@alamb alamb merged commit 444a673 into apache:main Oct 29, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Oct 29, 2024

Thanks again @Blizzara @vbarua @westonpace and @tokoko 👥

@Blizzara Blizzara deleted the avo/substrait-interval-compound branch October 29, 2024 15:15
@alamb alamb mentioned this pull request Nov 5, 2024
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants