How to handle datetime arrays with timezone information? #3656

Open
weiji14 opened this issue Nov 28, 2024 · 3 comments
Labels
question Further information is requested

Comments

@weiji14
Member

weiji14 commented Nov 28, 2024

Originally posted by @weiji14 in #3621 (comment)

Opening a discussion on whether to:

  1. Follow GMT (allow users to plot data at a non-UTC timezone, by ignoring the timezone offset) [breaking change]
  2. Follow NumPy, whereby data in any non-UTC timezone is always converted to UTC [current behaviour]

If going with option 2, we should at least raise a warning when a non-UTC timezone is used, to say that a conversion is taking place. If going with option 1, we would need to special-case datetime types, which might mean extra logic in the _to_numpy function, or keeping array_to_datetime (so don't merge #3507 yet).

| class/type | has timezone support | Link |
|---|---|---|
| Python datetime | Yes | https://docs.python.org/3/library/datetime.html#datetime.datetime.tzinfo |
| pandas.Timestamp | Yes | https://pandas.pydata.org/pandas-docs/version/2.2/reference/api/pandas.Timestamp.html |
| pyarrow.timestamp (type) | Yes | https://arrow.apache.org/docs/17.0/python/generated/pyarrow.timestamp.html |
| numpy.datetime64 | No | https://numpy.org/doc/2.1/reference/arrays.scalars.html#numpy.datetime64 |

Context is that we need to decide on whether users would expect that datetime arrays/types (e.g. Python datetime.datetime, pandas.Series with pandas.Timestamp, pyarrow.timestamp, etc) that have non-UTC timezones should be plotted in the same non-UTC timezone, or have an automated conversion to a UTC timezone before plotting.
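
To make the two options concrete, here is a minimal sketch using pandas only. The helper names `to_wallclock_datetime64` and `to_utc_datetime64` are hypothetical (they are not PyGMT functions), and the warning text is only a placeholder for whatever wording we would settle on.

```python
import warnings

import numpy as np
import pandas as pd


def to_wallclock_datetime64(series: pd.Series) -> np.ndarray:
    """Option 1 (hypothetical helper): keep the local wall-clock time, drop the offset."""
    if series.dt.tz is not None:
        series = series.dt.tz_localize(None)
    return series.to_numpy(dtype="datetime64[ns]")


def to_utc_datetime64(series: pd.Series) -> np.ndarray:
    """Option 2 (hypothetical helper): convert to UTC, warning if a conversion happens."""
    tz = series.dt.tz
    if tz is not None:
        if str(tz) != "UTC":
            warnings.warn(
                f"Converting datetimes from timezone '{tz}' to UTC before plotting.",
                UserWarning,
                stacklevel=2,
            )
        series = series.dt.tz_convert("UTC").dt.tz_localize(None)
    return series.to_numpy(dtype="datetime64[ns]")


# Midnight and 01:00 in Shanghai (UTC+8)
shanghai = pd.Series(
    pd.to_datetime(["2024-01-01 00:00", "2024-01-01 01:00"]).tz_localize("Asia/Shanghai")
)
wallclock = to_wallclock_datetime64(shanghai)  # first value stays on 2024-01-01
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    utc = to_utc_datetime64(shanghai)  # first value moves to 2023-12-31 16:00
```

With option 1 the first point lands on 2024-01-01T00:00, while with option 2 it lands eight hours earlier on 2023-12-31T16:00 and a warning is emitted.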

Vote 🚀 for option 1, and 👀 for option 2. Do comment down below to explain.

@seisman
Member

seisman commented Nov 28, 2024

> datetime arrays (e.g. pandas.Series with pandas.Timestamp, pyarrow.date32, pyarrow.date64, etc) that have non-UTC timezones

pyarrow.date32/pyarrow.date64 are timezone-unaware. As far as I know, only datetime.datetime, datetime.time, pandas.DatetimeTZDtype and pyarrow.timestamp are timezone-aware.
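
As a rough illustration of how awareness could be detected for two of these types, here is a small sketch using only the standard library and pandas. The function `is_aware` is a hypothetical name, and pyarrow is left out on purpose to keep the example dependency-light.

```python
import datetime
import zoneinfo

import pandas as pd


def is_aware(dt: datetime.datetime) -> bool:
    """Hypothetical helper: True if a datetime carries a usable UTC offset."""
    return dt.tzinfo is not None and dt.utcoffset() is not None


aware_dt = datetime.datetime(2024, 1, 1, tzinfo=zoneinfo.ZoneInfo("Asia/Shanghai"))
naive_dt = datetime.datetime(2024, 1, 1)

# For pandas, timezone-aware columns use the DatetimeTZDtype extension dtype,
# while naive columns use a plain NumPy datetime64 dtype.
aware_series = pd.Series(pd.to_datetime(["2024-01-01"]).tz_localize("Asia/Shanghai"))
naive_series = pd.Series(pd.to_datetime(["2024-01-01"]))

aware_series_is_tz = isinstance(aware_series.dtype, pd.DatetimeTZDtype)
naive_series_is_tz = isinstance(naive_series.dtype, pd.DatetimeTZDtype)
```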


I feel we should go with option 1, so the behavior is consistent with both GMT and the Python world (at least for matplotlib and pandas).

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
    {
        # Hourly timestamps localized to UTC+8 ("h" replaces the deprecated "H" alias)
        "datetime": pd.date_range(start="2024-01-01", periods=10, freq="h").tz_localize("Asia/Shanghai"),
        "y": np.random.rand(10),
    }
)
print(df["datetime"])

plt.figure(figsize=(10, 6))
plt.plot(df["datetime"], df["y"], marker="o", linestyle="-", color="blue", label="y values")
plt.title("Time Series Data (Hourly)")
plt.xlabel("Datetime (UTC+8)")
plt.ylabel("Y")
plt.show()

# Use pandas' plotting function
df.plot(
    x="datetime",
    y="y",
    kind="line",
    marker="o",
    title="Time Series Data (Hourly)",
    figsize=(10, 6),
    grid=True,
    xlabel="Datetime (UTC+8)",
    ylabel="Y Values",
    legend=True
)
plt.show()


@weiji14
Member Author

weiji14 commented Nov 28, 2024

> > datetime arrays (e.g. pandas.Series with pandas.Timestamp, pyarrow.date32, pyarrow.date64, etc) that have non-UTC timezones
>
> pyarrow.date32/pyarrow.date64 are timezone-unaware. As far as I know, only datetime.datetime, datetime.time, pandas.DatetimeTZDtype and pyarrow.timestamp are timezone-aware.

Yep, I was typing this out too quickly. I have updated the top post to clarify this.


> I feel we should go with option 1, so the behavior is consistent with both GMT and the Python world (at least for matplotlib and pandas).

I'm trying to get some historical context on how matplotlib/pandas have handled timezone plotting before. It seems matplotlib used to have a plot_date() function that allowed setting a tz parameter, but this was deprecated in matplotlib/matplotlib#18154, and it's unclear/confusing what the regular matplotlib .plot function does now in terms of timezone handling.

There's also an interesting case on what to do when plotting two or more arrays with different timezones (e.g. UTC and Europe/Berlin time) at matplotlib/matplotlib#8072 (comment). With option 2, everything is converted to UTC so things will be plotted 'correctly'; but with option 1, ignoring the timezones means the data would be plotted in a non-deterministic manner depending on which array (UTC or Europe/Berlin) was passed in first.
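
A small sketch of that two-array case, using pandas only (variable names are illustrative): the same instant stored in UTC and in Europe/Berlin ends up at the same position under option 2, but one hour apart under option 1.

```python
import pandas as pd

# The same instant expressed in two timezones (Berlin is UTC+1 in January)
utc = pd.Series(pd.to_datetime(["2024-01-01 12:00"]).tz_localize("UTC"))
berlin = utc.dt.tz_convert("Europe/Berlin")
same_instant = utc.iloc[0] == berlin.iloc[0]

# Option 2: convert everything to UTC, then drop the timezone
opt2_utc = utc.dt.tz_convert("UTC").dt.tz_localize(None)
opt2_berlin = berlin.dt.tz_convert("UTC").dt.tz_localize(None)

# Option 1: drop the timezone and keep each array's wall-clock time
opt1_utc = utc.dt.tz_localize(None)
opt1_berlin = berlin.dt.tz_localize(None)

# Gap between the two arrays under each option
gap_opt2 = opt2_berlin.iloc[0] - opt2_utc.iloc[0]
gap_opt1 = opt1_berlin.iloc[0] - opt1_utc.iloc[0]
```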

@seisman
Member

seisman commented Nov 29, 2024

> it's unclear/confusing what the regular matplotlib .plot function does now in terms of timezone handling.

Isn't it clear that matplotlib's plot method is timezone-aware from my example above? The first point is plotted at 01-01 (Asia/Shanghai), not 12-31 (UTC).

> with option 1, ignoring the timezones means the data would be plotted in a non-deterministic manner depending on which array (UTC or Europe/Berlin) was passed in first.

This won't be the case in GMT or PyGMT, since arrays can be processed independently. What causes more trouble is handling datetimes with different timezones within a single argument (although mixing timezones is unlikely and not recommended), for example:

import datetime
import zoneinfo

# xmin and xmax are given in two different timezones
region = [
    datetime.datetime(2024, 1, 1, 10, 0, 0, tzinfo=zoneinfo.ZoneInfo("Asia/Shanghai")),
    datetime.datetime(2024, 1, 2, 10, 0, 0, tzinfo=zoneinfo.ZoneInfo("UTC")),
    0,
    10,
]
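
One possible way to deal with such an argument is to detect the mixed offsets up front and then either reject the input or normalize it to UTC. The sketch below uses only the standard library; `has_mixed_offsets` is a hypothetical helper name, not an existing PyGMT function.

```python
import datetime
import zoneinfo


def has_mixed_offsets(values) -> bool:
    """Hypothetical helper: True if the datetimes in `values` use different UTC offsets."""
    offsets = {v.utcoffset() for v in values if isinstance(v, datetime.datetime)}
    return len(offsets) > 1


region = [
    datetime.datetime(2024, 1, 1, 10, 0, 0, tzinfo=zoneinfo.ZoneInfo("Asia/Shanghai")),
    datetime.datetime(2024, 1, 2, 10, 0, 0, tzinfo=zoneinfo.ZoneInfo("UTC")),
    0,
    10,
]

mixed = has_mixed_offsets(region)

# Normalizing every datetime bound to UTC gives an unambiguous interval
utc_bounds = [
    v.astimezone(datetime.timezone.utc)
    for v in region
    if isinstance(v, datetime.datetime)
]
```

Whether mixed offsets should raise an error or be normalized silently is exactly the kind of decision this issue needs to settle.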
