Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add duration datatypes #683

Merged
merged 30 commits into from
Aug 24, 2023

Conversation

billylanchantin
Copy link
Member

@billylanchantin billylanchantin commented Aug 20, 2023

Description

This is a WIP. I'm hoping for some feedback before I get too into the weeds!

EDIT: Rewrote this section for clarity.

Adds the following dtypes:

  • {:duration, :nanosecond}
  • {:duration, :microsecond}
  • {:duration, :millisecond}

Adds support for some basic operations with those dtypes:

  • Explorer.Series.from_list/2
  • Explorer.Series.{add/2, subtract/2, multiply/2, divide/2}
    • These now work in combination with the {:datetime, *} dtypes

In principle we could also support most 1-arity functions which accept :integer since durations are :integer under the hood. But I wanted to stop here to see if this is roughly what the team had in mind.

Examples

Adapted from test/duration_test.exs:

Series.add/2

iex(1)> eleven = Series.from_list([~N[2023-08-20 11:00:00.0000000]])
iex(2)> one_hour = Series.from_list([3600 * 1_000], dtype: {:duration, :millisecond})
#Explorer.Series<
  Polars[1]
  duration[ms] [3600000]
>
iex(3)> sum = Series.add(eleven, one_hour)
#Explorer.Series<
  Polars[1]
  datetime[μs] [2023-08-20 12:00:00.000000]
>
Click to show/hide more examples

Series.subtract/2

iex(1)> eleven = Series.from_list([~N[2023-08-20 11:00:00.0000000]])
iex(2)> twelve = Series.from_list([~N[2023-08-20 12:00:00.0000000]])
iex(3)> one_hour = Series.subtract(twelve, eleven)
#Explorer.Series<
  Polars[1]
  duration[μs] [3600000000]
>

DataFrame.mutate/2

iex(1)> df = DF.new(eleven: [~N[2023-08-20 12:00:00.0000000]], twelve: [~N[2023-08-20 11:00:00.0000000]])
iex(2)> df_with_diff = DF.mutate(df, diff: twelve - eleven)
#Explorer.DataFrame<
  Polars[1 x 3]
  eleven datetime[μs] [2023-08-20 11:00:00.000000]
  twelve datetime[μs] [2023-08-20 12:00:00.000000]
  diff duration[μs] [3600000000]
>

Related Work

Overview of approach

EDIT: Rewrote this section for clarity.

The main idea is to rely on polars native support for durations. For example:

https://www.rustexplorer.com/b/yzh47r

// Excerpt from example
// ...
let dt1 = Series::new("datetime1", dt1_values);
let dt2 = Series::new("datetime2", dt2_values);
let diff = Series::new("diff", &dt2 - &dt1);
println!("{:?}", diff);
shape: (2,)
Series: 'diff' [duration[ms]]
[
	2d
	3d
]

To enable this, I've created explicit duration dtypes on the Elixir side and mappings to/from the Rust side which mirror the equivalent ones for datetime:

  • s_from_list_datetime
  • s_from_list_duration <-- new
  • datetime_series_to_list
  • duration_series_to_list <-- new

Then on the Elixir side, I'm doing a bunch of case work to make sure the Explorer.Series.{add,subtract,multiply,divide} functions recognize situations where a duration is involved and let that operation through. Before they'd only allow something like (pseudocode):

  • :integer - :integer

They now allow things like:

  • {:datetime, :microsecond} - {:duration, :nanosecond}

This required that I figure out what situations are allowed ahead of time. I took a stab at what I thought the allowed operations should be:

Click to show/hide
Operation Left Right Output
:add %NaiveDateTime{} {:duration, r} {:datetime, r}
:add {:duration, l} %NaiveDateTime{} {:datetime, l}
:add {:datetime, l} {:duration, r} {:datetime, higher(l, r)}
:add {:duration, l} {:datetime, r} {:datetime, higher(l, r)}
:add {:duration, l} {:duration, r} {:duration, higher(l, r)}
:subtract %NaiveDateTime{} {:datetime, r} {:duration, r}
:subtract {:datetime, l} %NaiveDateTime{} {:duration, l}
:subtract %NaiveDateTime{} {:duration, r} {:datetime, r}
:subtract {:datetime, l} {:datetime, r} {:duration, higher(l, r)}
:subtract {:datetime, l} {:duration, r} {:datetime, higher(l, r)}
:subtract {:duration, l} {:duration, r} {:duration, higher(l, r)}
:multiply {:duration, l} :integer {:duration, l}
:multiply :integer {:duration, r} {:duration, r}
:divide {:duration, l} :integer {:duration, l}

where higher/2 selects the higher precision assuming :nanosecond > :microsecond > :millisecond.


We can edit this as needed.

Notably, durations are currently getting mapped to bare integers on the Elixir side. E.g.

eleven = Series.from_list([~N[2023-08-20 11:00:00.0000000]])
twelve = Series.from_list([~N[2023-08-20 12:00:00.0000000]])
diff = Series.subtract(twelve, eleven)

diff |> Series.to_list() |> IO.inspect() #=> [3_600_000_000]

Since there's no native duration in Elixir, I think that's the best we can do with this approach unless we create our own %Explorer.Duration{}.

Open questions

  • Are we okay with the durations being bare integers on the elixir side?
    • No, we're going with %Explorer.Duration{}
  • Should we support durations for Time and Date? I only implemented comparing NaiveDateTimes. Doing so will be a bit tricky.
    • No, we'll support date - date in a later PR and won't support time - time at all.
  • How do we feel about picking the highest precision? (See the operation examples)
    • 👍

To-do

  • Series.to_iovec support
  • Use %Explorer.Duration{} instead of bare integers
  • Don't post-cast
    • add / subtract
    • multiply / divide (blocked)
  • More tests for series
  • More tests for dataframes
  • Revisit property tests
  • Revisit 1-arity functions like Series.skew

Copy link
Contributor

@philss philss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @billylanchantin, this is looking awesome!

I dropped some comments, but I think this is a good start!

lib/explorer/backend/lazy_series.ex Outdated Show resolved Hide resolved
lib/explorer/backend/lazy_series.ex Show resolved Hide resolved
@@ -467,6 +479,7 @@ pub fn term_from_value<'b>(v: AnyValue, env: Env<'b>) -> Result<Term<'b>, Explor
AnyValue::Date(v) => encode_date(v, env),
AnyValue::Time(v) => encode_time(v, env),
AnyValue::Datetime(v, time_unit, None) => encode_datetime(v, time_unit, env),
AnyValue::Duration(v, _time_unit) => Ok(Some(v).encode(env)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could have a custom type (struct) for this at the Elixir side, and return the time unit with it. Thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually started down that road before I went with integers for simplicity. I still think that may be the right choice, but I wanted to hear y'all's thoughts first.

In particular, bare integers can't capture the units and so don't survive (de)serialization. E.g. this doesn't work:

import Explorer.Series

dt1 = from_list([~N[2023-01-01 00:00:00]])
dt2 = from_list([~N[2023-01-02 00:00:00]])

dt2 |> subtract(dt1) |> to_list |> from_list |> add(dt1)

You get

** (ArgumentError) cannot invoke Explorer.Series.add/2 with mismatched dtypes: :integer and {:datetime, :millisecond}

If we had a custom %Explorer.Duration{} or similar, the above would work just fine. The downside is that Explorer would need to internally represent durations. That would increase the maintenance burden and (possibly?) make it harder to implement custom backends.

I still think it may be worth it but that's the tradeoff IMO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine to keep it as integer, but we need to choose a unique representation, which would most likely be nanoseconds?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarification: I was saying a %Explorer.Duration{} struct might be the way to go despite the overhead. (Sorry about my wording.)

Representing durations as integers in Elixir forces the programmer to always pass along type hints whenever a duration is initially created. That may be cumbersome if you're doing multiple transformations involving a duration column. This is unlike datetime where, once you've created the initial series, you can do:

datetime = from_list([1649883642 * 1_000 * 1_000], dtype: {:datetime, :millisecond})
#=> #Explorer.Series<
#=>   Polars[1]
#=>   datetime[μs] [54252-09-24 11:40:00.000000]
#=> >

then

datetime |> to_list |> from_list
#=> #Explorer.Series<
#=>   Polars[1]
#=>   datetime[μs] [54252-09-24 11:40:00.000000]

which preserves the precision without having to re-pass in the :dtype arg.

However, it looks like we're already in this situation WRT binary:

binary = from_list([<<228, 146, 51>>, "Elixir"], dtype: :binary)
#=> #Explorer.Series<
#=>   Polars[2]
#=>   binary [<<228, 146, 51>>, "Elixir"]
#=> >

Here you have that same problem (actually it's even worse because we raise):

binary |> to_list |> from_list
#=> ** (ArgumentError) argument error

So I think my objection isn't as strong.

If we do stay with integers, then I think normalizing on nanosecond precision is the way to go. I'm gonna assume that's the plan for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an issue in normalizing to nanoseconds: a very large value will overflow 64 bits, which is fine for Elixir, but it will lead issues when converting back to Explorer. So I think you are right, Explorer.Duration is best way to go indeed.

How would we implement it? Do we keep two fields? Value and precision? Elixir's Calendar types represent the precision as a number (from 0 to 6) we could do the same here and represent it as a number from 0 to 9, but given it will be different anyway, the precision as one of :millisecond | :microsecond | :nanosecond is fine.

Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something to add: in Elixir, the precision is about the representation. If we say the datetime is a single digit, then it is a single digit precision. In here, the precision is about low-level storage (i.e. the biggest/smallest number we can represent). Given they are different, let's go with the precision being an atom (the time unit).

Copy link
Member Author

@billylanchantin billylanchantin Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was thinking exactly that:

%Explorer.Duration{
  value: integer,
  precision: :millisecond | :microsecond | :nanosecond
}

We may find we want more fields down the road of course. But right now this struct only needs to hold on to enough information to make the mapping into polars unambiguous.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed by: 98a8ff7

@@ -2201,14 +2215,13 @@ defmodule Explorer.Series do
@spec skew(series :: Series.t(), opts :: Keyword.t()) :: float() | non_finite() | nil
def skew(series, opts \\ [])

def skew(%Series{dtype: dtype} = series, opts)
when is_numeric_or_date_dtype(dtype) do
# BILLY-NOTE: Bug? They're checking for numeric-or-temporal but only allowing numeric
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is simpler, let's only allow numeric here for now.

right :: Series.t() | number() | NaiveDateTime.t()
) :: Series.t()
def add(%NaiveDateTime{} = left, %Series{dtype: {:duration, timeunit}} = right),
do: apply_series_list(:add, [left, right]) |> cast({:datetime, timeunit})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nitpick: I am not sure if we should cast here. If we cast here, we are assuming that the backend does not return something with the correct dtype. I think that's a backend logic: we say what is the expected dtype and the backend is responsible for dealing with it. If we remove the cast, what will happen? Do we get an integer or duration or datetime back?

If we get a different behaviour than polars, could this mean that we are actually not casting NaiveDateTime properly when sending it to Polars? Or are we choosing a different behaviour than Polars? If so, why?

The same rationale applies to cast in subtract, multiply, and divide.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we remove the cast, what will happen? Do we get an integer or duration or datetime back?

The rationale was maintaining highest precision. If we do

{:datetime, :millisecond} + {:duration, :nanosecond}

I wanted to make sure we ended up with

{:datetime, :nanosecond}

as the return type. Otherwise we'd lose information. However, I'm now realizing that we've already lost information because of the way I've written it 🤦. I need to be casting before the call to polars like I did with the other clauses.

import Explorer.Series

datetime = ~N[2023-01-01 00:00:00]
duration = from_list([100], dtype: {:duration, :nanosecond})

# Current implementation:
datetime |> add(duration) |> cast(:integer) |> to_list
#=> [1672531200000000000]

# What I actually wanted:
[datetime] |> from_list(dtype: {:datetime, :nanosecond}) |> add(duration) |> cast(:integer) |> to_list
#=> [1672531200000000100]

I'll rework the clauses where I'm post-casting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I would like to avoid the postcast in general. I also think we should consider converting the NaiveDateTime to a series in series.ex, so the backend does not have to handle it either.

There was a reason why we didn't do it, but i don't fully recall. I will investigate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked, I think this will be fine: if any of the arguments is a %NaiveDateTime{}, you can call from_list([dt], {:datetime, precision}). This means we won't need to deal with it in the backend.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this for add and subtract here: 2f77e76

Multiply and divide seem determined to get returned as integers. Will investigate what's going on tomorrow.

Copy link
Member Author

@billylanchantin billylanchantin Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I think I know what's going on. Polars seems to be treating scalar and element-wise multiplication differently.

https://www.rustexplorer.com/b/4cdukd

println!("{:?}", diff);                                  // [1d]
println!("{:?}", Series::new("prod1", &diff * 10));      // [10d]
println!("{:?}", Series::new("prod2", (10).mul(&diff))); // [10d]
println!("{:?}", Series::new("prod3", &ten * &diff));    // [864000000]
println!("{:?}", Series::new("prod4", &diff * &ten));    // [864000000]

Unfortunately it's the element-wise multiplication that's returning integers. Also:

println!("{:?}", Series::new("prod4", &diff * &diff));
//=> mul operation not supported for dtypes `duration[ms]` and `duration[ms]`

So I think we either have to drop support or post-cast.

I wonder if this is a bug in polars. Seems like an odd choice if it's intentional, especially since pandas supports it:

import pandas as pd

duration = pd.Series([pd.Timedelta(1, "d"), pd.Timedelta(2, "d")])
integer = pd.Series([10, 10])
print(duration * integer)
#=> 0   10 days
#=> 1   20 days
#=> dtype: timedelta64[ns]

Copy link
Member

@josevalim josevalim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work @billylanchantin! 💯

The direction is great, we added only some minor comments so we can go ahead with this.

Also note we need to add Series.to_iovec support. This should be similar to encoding datetime in encoding.rs, so I am hoping it is not a major addition. :)

lib/explorer/series.ex Outdated Show resolved Hide resolved
lib/explorer/series.ex Outdated Show resolved Hide resolved
@@ -493,6 +506,7 @@ pub fn list_from_series(s: ExSeries, env: Env) -> Result<Term, ExplorerError> {
DataType::Date => date_series_to_list(&s, env),
DataType::Time => time_series_to_list(&s, env),
DataType::Datetime(time_unit, None) => datetime_series_to_list(&s, *time_unit, env),
DataType::Duration(time_unit) => duration_series_to_list(&s, *time_unit, env),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note that whatever we decide above we need to mirror here. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed by: 98a8ff7

end
end

defp resolve_numeric_temporal_dtype(op, left, %Series{dtype: dtype} = right) do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove this clause, right? NaiveDateTimes will be normalized to series in series.ex, so no longer a concern here. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think so. Done: 98c2885

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we can make a similar simplification to resolve_numeric_dtype/1 in that same module.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have another PR for this, which I want to merge after this one to not mess up your code. :)

Copy link
Member

@josevalim josevalim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did another pass. I have added some minor notes, but it seems the major decisions have been taken, so we are on a solid path!

@billylanchantin
Copy link
Member Author

@josevalim Awesome, thanks! I feel like we're in a pretty good spot too :)

One other major decision to be made: do we want to support date - date? I checked and polars does support it (and does not support time - time):

https://www.rustexplorer.com/b/cpbrw8

@josevalim
Copy link
Member

Let’s add date-date in another PR so we pack this one sooner. :)

Comment on lines +112 to +125
# Each series displays its values as "[1ms]" as well as the correct precision.
assert inspect(ms) == """
#Explorer.Series<
Polars[1]
duration[ms] [1ms]
>\
"""

assert inspect(us) == """
#Explorer.Series<
Polars[1]
duration[μs] [1ms]
>\
"""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a pass at implementing the string-related protocols while I was adding the duration struct. Not having them broke one of my tests (plus I was tired of counting 0's). Feel free to 👍 / 👎 on the specifics. I mostly tried to copy what polars was doing. But for inspect I went with:

inspect(%Explorer.Duration{value: 1, precision: :millisecond})
#=> "Duration[1ms]"

I don't feel strongly about that choice, I just went with something.

lib/explorer/series.ex Outdated Show resolved Hide resolved
lib/explorer/series.ex Outdated Show resolved Hide resolved
@josevalim josevalim merged commit 6836f29 into elixir-explorer:main Aug 24, 2023
@josevalim
Copy link
Member

This is awesome @billylanchantin, thank you so much! 💚 💙 💜 💛 ❤️

This is complete as is and will unblock us on upcoming work. Feel free to send PRs for the time-time and the other missing pieces. :)

@josevalim
Copy link
Member

josevalim commented Aug 24, 2023

Just one additional note. I removed the integer/duration operations from multiply/divide because we are not allowing integers on add/subtract, so I thought we should wait and tackle all of this at once. :)

@josevalim
Copy link
Member

Another note: to support operations between duration and integers, I think you can change basic_numeric_operation directly. I have just pushed some changes to make that a bit easier too.

@billylanchantin
Copy link
Member Author

Of course! I'm glad we could add this :) Thanks for all the feedback, @josevalim and @philss!

Feel free to send PRs for the time-time and the other missing pieces.

Will do! Not sure when exactly, but I've got a vested interest in this functionality. I'm working on a project that's attempting to predict the length of a duration.

I thought we should wait and tackle all of [the integer/duration stuff] at once. :)

👍

to support operations between duration and integers, I think you can change basic_numeric_operation directly.

Noted! Things I'll be keeping in mind:

  • Maintaining highest precision
  • Allowable types are operation-dependent, e.g.:
    • duration + datetime is allowed
    • duration - datetime is not
  • Polars is inconsistent with their types AFAICT

Comment on lines +1 to +2
[toolchain]
channel = "nightly-2023-06-23"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to call this out earlier. The most recent rust nightly didn't work for me and I had to use the specific version from the README. I've never really used rust before and IDK if what I did is a good practice. 🤷‍♂️

Just wanted to make sure this wasn't lost in the shuffle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem! I think it's a good idea to keep this file. Polars is doing the same.
I think we can simplify our CI files to read from this, but I don't know how yet.

Thanks for pointing this out!

@josevalim
Copy link
Member

I have just pushed a refactoring to basic_numeric_operation and the boolean mask now called cast_to_ordered_series which should make it easier to add operations between integers and durations.

@billylanchantin billylanchantin deleted the add-duration-datatypes branch August 25, 2023 01:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants