-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add duration datatypes #683
Add duration datatypes #683
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @billylanchantin, this is looking awesome!
I dropped some comments, but I think this is a good start!
native/explorer/src/encoding.rs
Outdated
@@ -467,6 +479,7 @@ pub fn term_from_value<'b>(v: AnyValue, env: Env<'b>) -> Result<Term<'b>, Explor | |||
AnyValue::Date(v) => encode_date(v, env), | |||
AnyValue::Time(v) => encode_time(v, env), | |||
AnyValue::Datetime(v, time_unit, None) => encode_datetime(v, time_unit, env), | |||
AnyValue::Duration(v, _time_unit) => Ok(Some(v).encode(env)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could have a custom type (struct) for this at the Elixir side, and return the time unit with it. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually started down that road before I went with integers for simplicity. I still think that may be the right choice, but I wanted to hear y'all's thoughts first.
In particular, bare integers can't capture the units and so don't survive (de)serialization. E.g. this doesn't work:
import Explorer.Series
dt1 = from_list([~N[2023-01-01 00:00:00]])
dt2 = from_list([~N[2023-01-02 00:00:00]])
dt2 |> subtract(dt1) |> to_list |> from_list |> add(dt1)
You get
** (ArgumentError) cannot invoke Explorer.Series.add/2 with mismatched dtypes: :integer and {:datetime, :millisecond}
If we had a custom %Explorer.Duration{}
or similar, the above would work just fine. The downside is that Explorer
would need to internally represent durations. That would increase the maintenance burden and (possibly?) make it harder to implement custom backends.
I still think it may be worth it but that's the tradeoff IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is fine to keep it as integer, but we need to choose a unique representation, which would most likely be nanoseconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarification: I was saying a %Explorer.Duration{}
struct might be the way to go despite the overhead. (Sorry about my wording.)
Representing durations as integers in Elixir forces the programmer to always pass along type hints whenever a duration is initially created. That may be cumbersome if you're doing multiple transformations involving a duration column. This is unlike datetime where, once you've created the initial series, you can do:
datetime = from_list([1649883642 * 1_000 * 1_000], dtype: {:datetime, :millisecond})
#=> #Explorer.Series<
#=> Polars[1]
#=> datetime[μs] [54252-09-24 11:40:00.000000]
#=> >
then
datetime |> to_list |> from_list
#=> #Explorer.Series<
#=> Polars[1]
#=> datetime[μs] [54252-09-24 11:40:00.000000]
which preserves the precision without having to re-pass in the :dtype
arg.
However, it looks like we're already in this situation WRT binary:
binary = from_list([<<228, 146, 51>>, "Elixir"], dtype: :binary)
#=> #Explorer.Series<
#=> Polars[2]
#=> binary [<<228, 146, 51>>, "Elixir"]
#=> >
Here you have that same problem (actually it's even worse because we raise):
binary |> to_list |> from_list
#=> ** (ArgumentError) argument error
So I think my objection isn't as strong.
If we do stay with integers, then I think normalizing on nanosecond precision is the way to go. I'm gonna assume that's the plan for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an issue in normalizing to nanoseconds: a very large value will overflow 64 bits, which is fine for Elixir, but it will lead issues when converting back to Explorer. So I think you are right, Explorer.Duration
is best way to go indeed.
How would we implement it? Do we keep two fields? Value and precision? Elixir's Calendar types represent the precision as a number (from 0 to 6) we could do the same here and represent it as a number from 0 to 9, but given it will be different anyway, the precision as one of :millisecond | :microsecond | :nanosecond is fine.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something to add: in Elixir, the precision is about the representation. If we say the datetime is a single digit, then it is a single digit precision. In here, the precision is about low-level storage (i.e. the biggest/smallest number we can represent). Given they are different, let's go with the precision being an atom (the time unit).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was thinking exactly that:
%Explorer.Duration{
value: integer,
precision: :millisecond | :microsecond | :nanosecond
}
We may find we want more fields down the road of course. But right now this struct only needs to hold on to enough information to make the mapping into polars unambiguous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed by: 98a8ff7
lib/explorer/series.ex
Outdated
@@ -2201,14 +2215,13 @@ defmodule Explorer.Series do | |||
@spec skew(series :: Series.t(), opts :: Keyword.t()) :: float() | non_finite() | nil | |||
def skew(series, opts \\ []) | |||
|
|||
def skew(%Series{dtype: dtype} = series, opts) | |||
when is_numeric_or_date_dtype(dtype) do | |||
# BILLY-NOTE: Bug? They're checking for numeric-or-temporal but only allowing numeric |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is simpler, let's only allow numeric here for now.
lib/explorer/series.ex
Outdated
right :: Series.t() | number() | NaiveDateTime.t() | ||
) :: Series.t() | ||
def add(%NaiveDateTime{} = left, %Series{dtype: {:duration, timeunit}} = right), | ||
do: apply_series_list(:add, [left, right]) |> cast({:datetime, timeunit}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a nitpick: I am not sure if we should cast here. If we cast here, we are assuming that the backend does not return something with the correct dtype. I think that's a backend logic: we say what is the expected dtype and the backend is responsible for dealing with it. If we remove the cast, what will happen? Do we get an integer or duration or datetime back?
If we get a different behaviour than polars, could this mean that we are actually not casting NaiveDateTime properly when sending it to Polars? Or are we choosing a different behaviour than Polars? If so, why?
The same rationale applies to cast
in subtract, multiply, and divide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we remove the cast, what will happen? Do we get an integer or duration or datetime back?
The rationale was maintaining highest precision. If we do
{:datetime, :millisecond} + {:duration, :nanosecond}
I wanted to make sure we ended up with
{:datetime, :nanosecond}
as the return type. Otherwise we'd lose information. However, I'm now realizing that we've already lost information because of the way I've written it 🤦. I need to be casting before the call to polars like I did with the other clauses.
import Explorer.Series
datetime = ~N[2023-01-01 00:00:00]
duration = from_list([100], dtype: {:duration, :nanosecond})
# Current implementation:
datetime |> add(duration) |> cast(:integer) |> to_list
#=> [1672531200000000000]
# What I actually wanted:
[datetime] |> from_list(dtype: {:datetime, :nanosecond}) |> add(duration) |> cast(:integer) |> to_list
#=> [1672531200000000100]
I'll rework the clauses where I'm post-casting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I would like to avoid the postcast in general. I also think we should consider converting the NaiveDateTime to a series in series.ex
, so the backend does not have to handle it either.
There was a reason why we didn't do it, but i don't fully recall. I will investigate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double checked, I think this will be fine: if any of the arguments is a %NaiveDateTime{}
, you can call from_list([dt], {:datetime, precision})
. This means we won't need to deal with it in the backend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did this for add and subtract here: 2f77e76
Multiply and divide seem determined to get returned as integers. Will investigate what's going on tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I think I know what's going on. Polars seems to be treating scalar and element-wise multiplication differently.
https://www.rustexplorer.com/b/4cdukd
println!("{:?}", diff); // [1d]
println!("{:?}", Series::new("prod1", &diff * 10)); // [10d]
println!("{:?}", Series::new("prod2", (10).mul(&diff))); // [10d]
println!("{:?}", Series::new("prod3", &ten * &diff)); // [864000000]
println!("{:?}", Series::new("prod4", &diff * &ten)); // [864000000]
Unfortunately it's the element-wise multiplication that's returning integers. Also:
println!("{:?}", Series::new("prod4", &diff * &diff));
//=> mul operation not supported for dtypes `duration[ms]` and `duration[ms]`
So I think we either have to drop support or post-cast.
I wonder if this is a bug in polars. Seems like an odd choice if it's intentional, especially since pandas supports it:
import pandas as pd
duration = pd.Series([pd.Timedelta(1, "d"), pd.Timedelta(2, "d")])
integer = pd.Series([10, 10])
print(duration * integer)
#=> 0 10 days
#=> 1 20 days
#=> dtype: timedelta64[ns]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing work @billylanchantin! 💯
The direction is great, we added only some minor comments so we can go ahead with this.
Also note we need to add Series.to_iovec support. This should be similar to encoding datetime in encoding.rs, so I am hoping it is not a major addition. :)
Co-authored-by: José Valim <[email protected]>
@@ -493,6 +506,7 @@ pub fn list_from_series(s: ExSeries, env: Env) -> Result<Term, ExplorerError> { | |||
DataType::Date => date_series_to_list(&s, env), | |||
DataType::Time => time_series_to_list(&s, env), | |||
DataType::Datetime(time_unit, None) => datetime_series_to_list(&s, *time_unit, env), | |||
DataType::Duration(time_unit) => duration_series_to_list(&s, *time_unit, env), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note that whatever we decide above we need to mirror here. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed by: 98a8ff7
lib/explorer/backend/lazy_series.ex
Outdated
end | ||
end | ||
|
||
defp resolve_numeric_temporal_dtype(op, left, %Series{dtype: dtype} = right) do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove this clause, right? NaiveDateTimes will be normalized to series in series.ex, so no longer a concern here. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think so. Done: 98c2885
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we can make a similar simplification to resolve_numeric_dtype/1
in that same module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have another PR for this, which I want to merge after this one to not mess up your code. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did another pass. I have added some minor notes, but it seems the major decisions have been taken, so we are on a solid path!
@josevalim Awesome, thanks! I feel like we're in a pretty good spot too :) One other major decision to be made: do we want to support |
Co-authored-by: José Valim <[email protected]>
Let’s add date-date in another PR so we pack this one sooner. :) |
# Each series displays its values as "[1ms]" as well as the correct precision. | ||
assert inspect(ms) == """ | ||
#Explorer.Series< | ||
Polars[1] | ||
duration[ms] [1ms] | ||
>\ | ||
""" | ||
|
||
assert inspect(us) == """ | ||
#Explorer.Series< | ||
Polars[1] | ||
duration[μs] [1ms] | ||
>\ | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a pass at implementing the string-related protocols while I was adding the duration struct. Not having them broke one of my tests (plus I was tired of counting 0's). Feel free to 👍 / 👎 on the specifics. I mostly tried to copy what polars was doing. But for inspect
I went with:
inspect(%Explorer.Duration{value: 1, precision: :millisecond})
#=> "Duration[1ms]"
I don't feel strongly about that choice, I just went with something.
This is awesome @billylanchantin, thank you so much! 💚 💙 💜 💛 ❤️ This is complete as is and will unblock us on upcoming work. Feel free to send PRs for the time-time and the other missing pieces. :) |
Just one additional note. I removed the integer/duration operations from multiply/divide because we are not allowing integers on add/subtract, so I thought we should wait and tackle all of this at once. :) |
Another note: to support operations between duration and integers, I think you can change basic_numeric_operation directly. I have just pushed some changes to make that a bit easier too. |
Of course! I'm glad we could add this :) Thanks for all the feedback, @josevalim and @philss!
Will do! Not sure when exactly, but I've got a vested interest in this functionality. I'm working on a project that's attempting to predict the length of a duration.
👍
Noted! Things I'll be keeping in mind:
|
[toolchain] | ||
channel = "nightly-2023-06-23" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot to call this out earlier. The most recent rust nightly didn't work for me and I had to use the specific version from the README. I've never really used rust before and IDK if what I did is a good practice. 🤷♂️
Just wanted to make sure this wasn't lost in the shuffle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem! I think it's a good idea to keep this file. Polars is doing the same.
I think we can simplify our CI files to read from this, but I don't know how yet.
Thanks for pointing this out!
I have just pushed a refactoring to |
Description
This is a WIP. I'm hoping for some feedback before I get too into the weeds!
EDIT: Rewrote this section for clarity.
Adds the following dtypes:
{:duration, :nanosecond}
{:duration, :microsecond}
{:duration, :millisecond}
Adds support for some basic operations with those dtypes:
Explorer.Series.from_list/2
Explorer.Series.{add/2, subtract/2, multiply/2, divide/2}
{:datetime, *}
dtypesIn principle we could also support most 1-arity functions which accept
:integer
since durations are:integer
under the hood. But I wanted to stop here to see if this is roughly what the team had in mind.Examples
Adapted from
test/duration_test.exs
:Series.add/2
Click to show/hide more examples
Series.subtract/2
DataFrame.mutate/2
Related Work
Overview of approach
EDIT: Rewrote this section for clarity.
The main idea is to rely on polars native support for durations. For example:
https://www.rustexplorer.com/b/yzh47r
To enable this, I've created explicit duration dtypes on the Elixir side and mappings to/from the Rust side which mirror the equivalent ones for
datetime
:s_from_list_datetime
s_from_list_duration
<-- newdatetime_series_to_list
duration_series_to_list
<-- newThen on the Elixir side, I'm doing a bunch of case work to make sure the
Explorer.Series.{add,subtract,multiply,divide}
functions recognize situations where a duration is involved and let that operation through. Before they'd only allow something like (pseudocode)::integer - :integer
They now allow things like:
{:datetime, :microsecond} - {:duration, :nanosecond}
This required that I figure out what situations are allowed ahead of time. I took a stab at what I thought the allowed operations should be:
Click to show/hide
:add
%NaiveDateTime{}
{:duration, r}
{:datetime, r}
:add
{:duration, l}
%NaiveDateTime{}
{:datetime, l}
:add
{:datetime, l}
{:duration, r}
{:datetime, higher(l, r)}
:add
{:duration, l}
{:datetime, r}
{:datetime, higher(l, r)}
:add
{:duration, l}
{:duration, r}
{:duration, higher(l, r)}
:subtract
%NaiveDateTime{}
{:datetime, r}
{:duration, r}
:subtract
{:datetime, l}
%NaiveDateTime{}
{:duration, l}
:subtract
%NaiveDateTime{}
{:duration, r}
{:datetime, r}
:subtract
{:datetime, l}
{:datetime, r}
{:duration, higher(l, r)}
:subtract
{:datetime, l}
{:duration, r}
{:datetime, higher(l, r)}
:subtract
{:duration, l}
{:duration, r}
{:duration, higher(l, r)}
:multiply
{:duration, l}
:integer
{:duration, l}
:multiply
:integer
{:duration, r}
{:duration, r}
:divide
{:duration, l}
:integer
{:duration, l}
where
higher/2
selects the higher precision assuming:nanosecond > :microsecond > :millisecond
.We can edit this as needed.
Notably, durations are currently getting mapped to bare integers on the Elixir side. E.g.
Since there's no native duration in Elixir, I think that's the best we can do with this approach unless we create our own
%Explorer.Duration{}
.Open questions
%Explorer.Duration{}
Time
andDate
? I only implemented comparingNaiveDateTime
s. Doing so will be a bit tricky.date - date
in a later PR and won't supporttime - time
at all.To-do
Series.to_iovec
support%Explorer.Duration{}
instead of bare integersSeries.skew