Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix datetime encoding overflow #1020

Merged
merged 7 commits into from
Nov 21, 2024
Merged

Conversation

billylanchantin
Copy link
Contributor

@billylanchantin billylanchantin commented Nov 20, 2024

Fixes one of the #1014 issues with printing dataframes.

This one stems from how we encode datetimes. Example:

dtypes = [{"col_j", {:naive_datetime, :millisecond}}]
rows = [[{"col_j", ~N[8332-06-03 16:03:40.489446]}]]
df = DF.new(rows, dtypes: dtypes)
DF.print(df)

Panics with:

thread '<unnamed>' panicked at src/encoding.rs:167:35:
attempt to multiply with overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at /Users/billy/.cargo/registry/src/index.crates.io-6f17d22bba15001f/polars-arrow-0.43.1/src/temporal_conversions.rs:159:37:
invalid or out-of-range datetime


  1) test specific property failures 1 (Explorer.DataFrameTest)
     test/explorer/data_frame_test.exs:4850
     ** (ErlangError) Erlang error: :nif_panicked
     code: DF.print(df)
     stacktrace:
       (explorer 0.11.0-dev) Explorer.PolarsBackend.Native.s_to_list(#Explorer.PolarsBackend.Series<
  #Reference<0.2007251331.1252655135.131382>
>)
       (explorer 0.11.0-dev) lib/explorer/polars_backend/shared.ex:24: Explorer.PolarsBackend.Shared.apply_series/3
       (explorer 0.11.0-dev) lib/explorer/data_frame.ex:5981: anonymous fn/2 in Explorer.DataFrame.non_empty_table_string/2
       (elixir 1.16.0) lib/enum.ex:1700: Enum."-map/2-lists^map/1-1-"/2
       (explorer 0.11.0-dev) lib/explorer/data_frame.ex:5979: Explorer.DataFrame.non_empty_table_string/2
       (explorer 0.11.0-dev) lib/explorer/data_frame.ex:5945: Explorer.DataFrame.print/2
       test/explorer/data_frame_test.exs:4854: (test)

The culprit is how we encode datetimes: we always turn them into microseconds first. This is usually fine. But with certain operations like printing, especially when the datetime is far from the unix epoch, it can result in overflow.

My proposed fix is to build the i64 representation directly from the (%DateTime{}, time_unit) pair using (mostly) Polars functions. This works -- and I added some properties to be confident -- but it has the drawback of us no longer being able to support impl From<ExNaiveDateTime> for i64 since we need to know which time unit the user wants to represent the datetime with. We didn't use that impl much so I think the drawback is acceptable.

We're also handling the case where nanosecond precision datetimes can't be represented more explicitly.

Before we were always converting first to a
microsecond-based representation then to the
final representation. The intermediate conversion
is unecessary and risks overflows when trying to
convert to a different time unit later. This approach
converts directly to i64 from the Elixir struct and
time unit.

The drawback is that we can no longer build i64
datetimes directly from Elixir structs.
* Restrict nanosecond datetimes to a set
  which is representable by i64
* Make datetime generation funs public
I'm surprised this worked...
* `format!("...")` to `"...".to_string()`
* `.map_or(...)` to `.ok_or`
* `x.clone()` to `*x`
* `.map(|x| Some(x))` to `.map(Some)`
native/explorer/src/datatypes.rs Outdated Show resolved Hide resolved
native/explorer/src/encoding.rs Show resolved Hide resolved
@billylanchantin
Copy link
Contributor Author

Turns out this branch fixes more than I thought. main is not encoding :nanosecond datetimes correctly:

MIX_ENV=test iex -S mix # test env to get access to non-UTC timezones
iex> dt = %DateTime{year: 2017, month: 11, day: 7, zone_abbr: "CET",
...>                hour: 11, minute: 45, second: 18, microsecond: {123456, 6},
...>                utc_offset: 3600, std_offset: 0, time_zone: "Europe/Paris"}
#DateTime<2017-11-07 11:45:18.123456+01:00 CET Europe/Paris>

# This branch (correct)
iex> [dt] |> Explorer.Series.from_list(dtype: {:datetime, :nanosecond, dt.time_zone})
#Explorer.Series<
  Polars[1]
  datetime[ns, Europe/Paris] [2017-11-07 11:45:18.123456+01:00 CET Europe/Paris]
>

# main (way off)
iex> [dt] |> Explorer.Series.from_list(dtype: {:datetime, :nanosecond, dt.time_zone})
#Explorer.Series<
  Polars[1]
  datetime[ns, Europe/Paris] [1970-01-18 11:27:35.118123+01:00 CET Europe/Paris]
>

This change reverts that. It also refactors the
change in `series.rs` to be more readable.
@billylanchantin billylanchantin merged commit f287e08 into main Nov 21, 2024
4 checks passed
@billylanchantin billylanchantin deleted the bl-fix-datetime-encoding-overflow branch November 21, 2024 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants