ENH/API: resolution inference in vectorized datetime parsing #55564

jbrockmendel · 2023-10-17T15:40:02Z

I'm in the process of implementing resolution inference for vectorized datetime parsing in array_to_datetime. This issue is to track and discuss design issues.

Should we implement this as a breaking change in 3.0 or bugfix in 2.x? We changed the scalar behavior in 2.0.
What to do when mixed resolutions are detected? e.g. pd.to_datetime(["2016-01-01", "2016-01-01T02:03:04.050607"])? ATM in the branch I have going I get the resolution from the first non-NaT entry and apply that everywhere.
i) Are we OK with that value-dependent behavior?
ii) What if we see a np.datetime64("nat", "s") i.e. it has a reso attached. should we infer "s" from that?
What resolution to infer for "today" or "now"? ATM the Timestamp constructor gives "ns", but bc it goes through the stdlib datetime.now, I'm thinking "us" might be more appropriate. (i also don't care that much)
ATM i only handle array_to_datetime, also need to handle array_strptime.
Need to do the same for the timedelta64 paths.

ATM 1455 tests are failing locally. Hopefully these are mostly bc they hard-code ns in "expected".

cc @MarcoGorelli

Issues that I think implementing this will address

The text was updated successfully, but these errors were encountered:

mroeschke · 2023-10-17T16:11:34Z

Initial responses:

I would say bug fix in 2.x. I think a few bug reports are due to discrepancy in behaviors from the scalar vs vectorized cases.
I would assume the least breaking solution is to apply the highest inferred resolution value to all values.
2ii. I would assume "s" would be picked up from np.datetime("nat", "s")
Agree with "us"

jbrockmendel · 2023-10-17T16:18:52Z

highest inferred resolution

This is reasonable, but has some downsides that we need to be clear on:

It can require a 2-pass parsing. Iterate over the values once to find the resolution, then again to parse with that resolution. In cases with uniform resolution we can avoid the second pass.
In cases with e.g. ["2999-01-01", "1970-01-01T02:03:04.000000000"] the reso inferred from the second entry will be "ns" which will cause us to raise OutOfBoundsDatetime on the first entry. But since the tail of the second entry is all zeros, we could use "us" and avoid either raising or rounding. This is a pretty weird corner case though.

jbrockmendel · 2023-10-19T21:42:43Z

Current branch uses highest-encountered reso. Two options for how to do that (current uses the second)

Iterate once to find the highest resolution, then again to store the result with that resolution
Parse with the first encountered resolution. If a higher one is encountered, start over with that resolution. Downside here is we could potentially have to parse up to 4 times if we encounter s, ms, us, ns in that order. Upside is in many cases we will only have to iterate once.

mroeschke · 2023-10-19T22:12:12Z

Is resolution inference expensive? While 2) gives a 1 time parse opportunity, the predictable performance profile of 1) is appealing.

jbrockmendel · 2023-10-19T22:26:55Z

Is resolution inference expensive?

depends on what kind of objects we're dealing with. For strings we have to parse them in order to get the resolution, so we'd be looking at pretty much doubling the current cost. (could probably get single-pass for homogeneous cases).

ATM i still have 230 tests failing locally. Most of these boil down to updating expected which im slowly whittling down, but some of them are surfacing real bugs. Then we'll need to make some API decisions, e.g. ATM on the branch read_excel is producing dt64[s] for .ods files and dt64[us] for everything else (vs dt64[ns] in all cases on main); are we OK with that?

mroeschke · 2023-10-19T22:50:45Z

Then we'll need to make some API decisions, e.g. ATM on the branch read_excel is producing dt64[s] for .ods files and dt64[us] for everything else (vs dt64[ns] in all cases on main); are we OK with that?

I'm OK with those difference if there's some inherent resolution specific limitation to an excel format

MarcoGorelli · 2023-10-20T16:12:08Z

Should we implement this as a breaking change in 3.0 or bugfix in 2.x? We changed the scalar behavior in 2.0.

3.0 break sounds fine

What to do when mixed resolutions are detected? e.g. pd.to_datetime(["2016-01-01", "2016-01-01T02:03:04.050607"])? ATM in the branch I have going I get the resolution from the first non-NaT entry and apply that everywhere.

I think either the following should be fine:

take the resolution from the first non-nat element. for stdlib interpret as 'us', for np.datetime64 take the element's one. let the user overwrite the resolution with some argument (unit?)
regardless of values, default to 'us', but let the user overwrite it with unit

Regarding the second one, you said there were some implementation details getting in the way of this - reckon they're surmountable?

If so, then speaking totally theoretically, I think this is what I'd prefer. Auto-infer for scalars, but don't infer for arrays

But, no strong preference really. Even a double-pass wouldn't be terrible, would be OK with option 1 you've described

jbrockmendel · 2023-10-22T21:42:38Z

@bashtage the branch i have for this currently breaks a bunch of stata tests where resolution no longer round-trips. Do you know what resolution(s) stata stores datetimes in? Thoughts on the desired behavior?

bashtage · 2023-10-22T21:53:45Z

Stata uses a wide variety of resolutions.

This is the key function:

pandas/pandas/io/stata.py

Line 240 in ac5587c

def _stata_elapsed_date_to_datetime_vec(dates: Series, fmt: str) -> Series:

There is a note that once other resolutions are implemented, other than ns, that it would make sense to use the same format as Stata does if possible, e.g., a years date time series would be returned as s yearly pandas datetime.

It has

subsecond (milli?), code tc
days, code td
week of the year (tw)
month (tm)
quarter (tq)
half years (th)
year (ty)

jbrockmendel · 2023-11-02T19:59:37Z

Status Update:

I have a branch that implements resolution inference in DatetimeIndex and to_datetime. ATM 63 tests are failing, down from an initial estimate of "zillions".

Many tests that currently hard-code "ns" in expected are updated to hard-code a new resolution. In some cases this is fine, in others a better solution would be to parametrize over units.

Still surfacing new bugs which I'm addressing as they come up.

Most of the remaining failing tests are in the io tests. This includes 11 for test_stata that may be pre-empted by #55642 and a handful for SAS that can potentially be pre-empted by changing return types to non-nao to better match the SAS format (though see #28047 (comment)). ATM JSON cases look most liable to be problematic to users; will put an example up soonish.

POCs for the cython-level implementation #55741 and #55778.

jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 17, 2023

This was referenced Oct 18, 2023

DEPR: read_csv keywords: keep_date_col, delim_whitespace #55569

Open

REF: add creso keyword to parse_pydatetime #55579

Merged

This was referenced Oct 21, 2023

REF: implement ParseState #55617

Merged

BUG: pytables with non-nano dt64 #55622

Merged

DEPR: check_datetimelike_compat in tm.assert_foo functions #55638

Open

This was referenced Oct 23, 2023

ENH: read_stata return non-nano #55642

Merged

TST: de-duplicate, be stricter in excel/test_readers.py #55669

Merged

ziadk mentioned this issue Oct 26, 2023

BUG: DatetimeIndex.equals returns false if units mismatch #55694

Open

3 tasks

jbrockmendel added the Datetime Datetime data dtype label Nov 1, 2023

jbrockmendel added Constructors Series/DataFrame/Index/pd.array Constructors Non-Nano datetime64/timedelta64 with non-nanosecond resolution Enhancement and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 1, 2023

jbrockmendel mentioned this issue Nov 15, 2023

BUG: dt64 astype silent overflows #55979

Merged

5 tasks

jbrockmendel mentioned this issue Nov 24, 2023

DEPR: dtype inference in value_counts #56161

Merged

5 tasks

mroeschke closed this as completed in #55901 May 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH/API: resolution inference in vectorized datetime parsing #55564

ENH/API: resolution inference in vectorized datetime parsing #55564

jbrockmendel commented Oct 17, 2023 •

edited

Loading

mroeschke commented Oct 17, 2023 •

edited

Loading

jbrockmendel commented Oct 17, 2023

jbrockmendel commented Oct 19, 2023

mroeschke commented Oct 19, 2023

jbrockmendel commented Oct 19, 2023

mroeschke commented Oct 19, 2023

MarcoGorelli commented Oct 20, 2023

jbrockmendel commented Oct 22, 2023

bashtage commented Oct 22, 2023

jbrockmendel commented Nov 2, 2023

ENH/API: resolution inference in vectorized datetime parsing #55564

ENH/API: resolution inference in vectorized datetime parsing #55564

Comments

jbrockmendel commented Oct 17, 2023 • edited Loading

mroeschke commented Oct 17, 2023 • edited Loading

jbrockmendel commented Oct 17, 2023

jbrockmendel commented Oct 19, 2023

mroeschke commented Oct 19, 2023

jbrockmendel commented Oct 19, 2023

mroeschke commented Oct 19, 2023

MarcoGorelli commented Oct 20, 2023

jbrockmendel commented Oct 22, 2023

bashtage commented Oct 22, 2023

jbrockmendel commented Nov 2, 2023

jbrockmendel commented Oct 17, 2023 •

edited

Loading

mroeschke commented Oct 17, 2023 •

edited

Loading