ENH/API: resolution inference in vectorized datetime parsing #55564
Comments
Initial responses:
This is reasonable, but has some downsides that we need to be clear on:
Current branch uses highest-encountered reso. Two options for how to do that (current uses the second)
Is resolution inference expensive? While 2) gives a one-time parse opportunity, the predictable performance profile of 1) is appealing.
Depends on what kind of objects we're dealing with. For strings we have to parse them in order to get the resolution, so we'd be looking at pretty much doubling the current cost (we could probably get single-pass for homogeneous cases). ATM I still have 230 tests failing locally. Most of these boil down to updating
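To make the cost trade-off concrete, here is a rough pure-Python sketch of "highest-encountered reso" done two ways. It is not the pandas implementation: `infer_unit`, `highest_unit_two_pass`, and `highest_unit_single_pass` are hypothetical names, the digit-counting rule is an assumption, and the pairing of these two strategies with the options discussed above is only one plausible reading.

```python
import re

# Units ordered from coarsest to finest.
UNITS = ["s", "ms", "us", "ns"]

_FRACTION = re.compile(r"\.(\d+)")


def infer_unit(value: str) -> str:
    """Hypothetical helper: guess a unit from the fractional-second digits."""
    match = _FRACTION.search(value)
    if match is None:
        return "s"
    ndigits = len(match.group(1))
    if ndigits <= 3:
        return "ms"
    if ndigits <= 6:
        return "us"
    return "ns"


def highest_unit_two_pass(values):
    # Pass 1: inspect every string just to find the finest unit needed.
    # Pass 2 (not shown) would then parse everything directly into that unit,
    # which is why string input costs roughly twice as much.
    return max((infer_unit(v) for v in values), key=UNITS.index, default="s")


def highest_unit_single_pass(values):
    # Parse as we go; whenever a finer unit shows up, the values parsed so
    # far would need to be cast up. Count how often that happens.
    unit = None
    upcasts = 0
    for v in values:
        found = infer_unit(v)
        if unit is None:
            unit = found
        elif UNITS.index(found) > UNITS.index(unit):
            unit = found
            upcasts += 1
    return unit or "s", upcasts


values = ["2016-01-01", "2016-01-01T02:03:04.050607"]
print(highest_unit_two_pass(values))     # us
print(highest_unit_single_pass(values))  # ('us', 1)
```

For homogeneous input the single-pass variant never upcasts, which matches the remark that single-pass is probably achievable in that case.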
I'm OK with those differences if there's some inherent resolution-specific limitation to an Excel format.
3.0 break sounds fine
I think either the following should be fine:
Regarding the second one, you said there were some implementation details getting in the way of this. Reckon they're surmountable? If so, then speaking totally theoretically, I think this is what I'd prefer: auto-infer for scalars, but don't infer for arrays. But no strong preference really. Even a double-pass wouldn't be terrible; I would be OK with option 1 as you've described it.
@bashtage the branch I have for this currently breaks a bunch of stata tests where resolution no longer round-trips. Do you know what resolution(s) Stata stores datetimes in? Thoughts on the desired behavior?
Stata uses a wide variety of resolutions. This is the key function: Line 240 in ac5587c
There is a note that once resolutions other than ns are implemented, it would make sense to use the same format as Stata does if possible, e.g., a yearly datetime series would be returned as a yearly pandas datetime. It has
Status Update: I have a branch that implements resolution inference in DatetimeIndex and to_datetime. ATM 63 tests are failing, down from an initial estimate of "zillions". Many tests that currently hard-code "ns" in Still surfacing new bugs which I'm addressing as they come up. Most of the remaining failing tests are in the io tests. This includes 11 for test_stata that may be pre-empted by #55642 and a handful for SAS that can potentially be pre-empted by changing return types to non-nano to better match the SAS format (though see #28047 (comment)). ATM JSON cases look most liable to be problematic to users; will put an example up soonish.
I'm in the process of implementing resolution inference for vectorized datetime parsing in array_to_datetime. This issue is to track and discuss design issues.
pd.to_datetime(["2016-01-01", "2016-01-01T02:03:04.050607"])?
ATM in the branch I have going I get the resolution from the first non-NaT entry and apply that everywhere.
i) Are we OK with that value-dependent behavior?
ii) What if we see a np.datetime64("nat", "s"), i.e. it has a reso attached? Should we infer "s" from that?
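To illustrate what "value-dependent" means for question i), here is a small stdlib-only sketch of a first-non-NaT rule. It is not the branch's code: `infer_unit` and `first_non_nat_unit` are hypothetical helpers, and both the digit-counting rule and the treatment of `None`/`"NaT"` as missing are assumptions.

```python
import re

_FRACTION = re.compile(r"\.(\d+)")


def infer_unit(value: str) -> str:
    """Hypothetical helper: guess a unit from the fractional-second digits."""
    match = _FRACTION.search(value)
    if match is None:
        return "s"
    ndigits = len(match.group(1))
    if ndigits <= 3:
        return "ms"
    if ndigits <= 6:
        return "us"
    return "ns"


def first_non_nat_unit(values):
    """Return the unit of the first non-missing entry, or None if all are missing."""
    for v in values:
        if v is None or v == "NaT":
            continue
        return infer_unit(v)
    return None


values = ["2016-01-01", "2016-01-01T02:03:04.050607"]
print(first_non_nat_unit(values))        # s
print(first_non_nat_unit(values[::-1]))  # us
print(first_non_nat_unit([None, "NaT"])) # None
```

The same two strings yield a different unit depending only on their order, which is the behavior the question asks about. The sketch deliberately ignores question ii), since a bare missing marker here carries no unit.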
ATM 1455 tests are failing locally. Hopefully these are mostly bc they hard-code ns in "expected".
cc @MarcoGorelli
Issues that I think implementing this will address