Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API revert inference of resolution when parsing string in Timestamp? #50704

Closed
MarcoGorelli opened this issue Jan 12, 2023 · 7 comments
Closed
Labels
Datetime Datetime data dtype

Comments

@MarcoGorelli
Copy link
Member

Should #49737 be reverted?

This was brought in the dev call yesterday. It's great that non-nanosecond support is coming along, but it's still not complete and there are still some bugs. Hence, it probably shouldn't happen by default yet, but rather should be opt-in.

Arguably:

In [2]: Timestamp(np.datetime64('2000', 's')).unit
Out[2]: 's'

counts as opt-in, because the user had to go through a numpy object which already had a set resolution. But perhaps it's not the case for

In [7]: Timestamp('2000').unit
Out[7]: 's'

, where users would have been used to receiving 'ns'.

This would be fine once non-nano support is complete, but currently would introduce bugs - e.g. of a fairly common operation I came across this week: #50656. I think @jorisvandenbossche also had an example of this breaking the arrow CI (though from a quick look I couldn't find it - could you please share it?)

@MarcoGorelli MarcoGorelli added the Datetime Datetime data dtype label Jan 12, 2023
@phofl
Copy link
Member

phofl commented Jan 12, 2023

cc @jbrockmendel

@jbrockmendel
Copy link
Member

Is it the Timestamp object or the Series etc having non-nano dtype causing troubles?

If we revert inference on strings, do we also revert support for strings that are out of bounds for nanos? Or do inference only on those cases? IIRC joris and i agreed a while back that the latter was undesirable.

xref #49076 which might alleviate some of the changiness for the Timestamp object. (not a First Issue; I have a branch that stalled out)

@jorisvandenbossche
Copy link
Member

The example from the pyarrow CI where I noticed this, is the change of the return value of pd.Timestamp("2022-01-01").value, which is now a number of seconds instead of a number of nanoseconds, i.e. the #49076 issues you linked.
(this was only in a test for checking the expected result, so that is easy to fix on our side (although we also use .value in actual conversion code, I have to check if that is affected), but in general that is a breaking change)

Is it the Timestamp object or the Series etc having non-nano dtype causing troubles?

I think both. Although I suppose the only aspect about the Timestamp object that might affect users is the .value attribute? (#49076)
But, if this causes your Series to get a non-nano dtype, that does have a wider impact.

@MarcoGorelli
Copy link
Member Author

I only noticed issues when it's part of a DatetimeIndex - and that wouldn't be affected, its reso would still be 'ns':

In [2]: ts = Timestamp('2000')

In [3]: ts.unit
Out[3]: 's'

In [4]: to_datetime([ts])
Out[4]: DatetimeIndex(['2000-01-01'], dtype='datetime64[ns]', freq=None)

If the only issue with Timestamp is .value, then I'm not sure that this would be a strong-enough reason to revert - the suggestion in #49076 (comment) looks fine

@jorisvandenbossche
Copy link
Member

But, if this causes your Series to get a non-nano dtype, that does have a wider impact.

Ah, I was assuming that we would use the unit of the Timestamp object to determine the dtype when creating a Series or column. But if the above is not the case, then it's indeed only .value that has user-facing impact at this moment.

@jbrockmendel
Copy link
Member

Ah, I was assuming that we would use the unit of the Timestamp object to determine the dtype when creating a Series or column

In the relevant cases (object dtype or sequence-without-dtype) we go through array_to_datetime, which is the last big place to implement non-nano support. Not coincidentally, it is the most difficult. It is straightforward to add a reso keyword to array_to_datetime, but doing inference there is tough bc inferred resos might not be homogeneous.

@MarcoGorelli
Copy link
Member Author

resolved in #49076

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype
Projects
None yet
Development

No branches or pull requests

4 participants