[ArrayManager] Ensure to store datetimelike data as DatetimeArray/TimedeltaArray (and not ndarray) #40147

jorisvandenbossche · 2021-03-01T14:12:51Z

Precursor for #39991

Until now, we did not really check whether datetimelike data was consistently stored as the EA (DatetimeArray, TimedeltaArray) or as an ndarray. Enforcing this in the ArrayManager constructor turns up a few failures.

I think it will be easiest to always store them as EAs and not as ndarrays (e.g. for many other operations, we would otherwise wrap them in the EA anyway).

jorisvandenbossche · 2021-03-01T14:15:52Z

pandas/core/internals/array_manager.py

+        elif is_datetime64_ns_dtype(dtype):
+            result = DatetimeArray._from_sequence(result, dtype=dtype)._data
+        elif is_timedelta64_ns_dtype(dtype):
+            result = TimedeltaArray._from_sequence(result, dtype=dtype)._data


I am doing this here because the following fails:

In [4]: np.array([pd.NaT], dtype="M8[ns]")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-99d42e913a1c> in <module>
----> 1 np.array([pd.NaT], dtype="M8[ns]")

ValueError: cannot convert float NaN to integer

(This is what the np.array([arr[loc] for arr in self.arrays], dtype=temp_dtype) call above can end up doing when the resulting dtype is M8[ns] or m8[ns].)
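For readers who want to reproduce this, here is a minimal sketch of the failure and of a construction that avoids it. It uses the public pd.array constructor rather than the private DatetimeArray._from_sequence used in the diff, so it is an approximation of the fix and not the PR's code. The helper names naive_row and row_via_extension_array are hypothetical, and the exact exception raised by NumPy may differ between versions, which is why the sketch reports it instead of relying on it.

```python
import numpy as np
import pandas as pd


def naive_row(values, dtype):
    """Try the plain NumPy constructor; return the error instead of raising.

    Hypothetical helper for illustration only.
    """
    try:
        return np.array(values, dtype=dtype)
    except (TypeError, ValueError) as err:
        # At the time of the PR this was
        # "ValueError: cannot convert float NaN to integer".
        return err


def row_via_extension_array(values, dtype):
    """Build the row through pandas, which understands pd.NaT.

    pd.array is the public constructor; for a datetime64[ns] dtype it
    returns a DatetimeArray (assumption: the PR itself goes through
    DatetimeArray._from_sequence instead).
    """
    return pd.array(values, dtype=dtype)


values = [pd.Timestamp("2021-03-01"), pd.NaT]

naive = naive_row([pd.NaT], "M8[ns]")
safe = row_via_extension_array(values, "datetime64[ns]")

print(type(naive).__name__, naive)
print(type(safe).__name__, safe.isna().tolist())
```

The point of the sketch is only that the pandas constructor handles the missing value, whatever the NumPy constructor does with it in a given version.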

@jbrockmendel do you know whether we do this elsewhere as well, or whether there is existing code for it?

IIRC we can get BM to mess up if we get here with non-consolidated all-td64 blocks.

I think the thing to do (also similar change in the BM method) is to define result on L737 as a list, only wrap with np.array if none of these conditions hold

> IIRC we can get BM to mess up if we get here with non-consolidated all-td64 blocks.

You can't get here in that case, though, since this is an ArrayManager method. And with the current BM you don't have this problem since it doesn't store DatetimeArray/TimedeltaArray with numpy dtypes.

> I think the thing to do (also similar change in the BM method) is to define result on L737 as a list, only wrap with np.array if none of these conditions hold

Thanks, that's a good idea

Updated with this simplification.
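As a rough sketch of the simplification discussed above: collect the row values in a plain list first, and only hand the list to np.array when the target dtype is not datetime64[ns] or timedelta64[ns]. The function name fast_xs_sketch and its signature are hypothetical, and it uses the public pd.array where the actual diff calls DatetimeArray._from_sequence / TimedeltaArray._from_sequence, so treat it as an illustration of the control flow rather than the merged implementation.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_datetime64_ns_dtype, is_timedelta64_ns_dtype


def fast_xs_sketch(arrays, loc, dtype):
    """Return the cross-section at position ``loc`` over 1D column arrays.

    Hypothetical stand-in for the ArrayManager method under review.
    """
    # Keep the scalars in a list; do not build an ndarray yet.
    values = [arr[loc] for arr in arrays]

    # The np.ndarray constructor cannot be relied on to handle pd.NaT,
    # so datetimelike rows are built by pandas instead.
    if is_datetime64_ns_dtype(dtype) or is_timedelta64_ns_dtype(dtype):
        return pd.array(values, dtype=dtype)

    # Only wrap with np.array when none of the conditions above hold.
    return np.array(values, dtype=dtype)


dt_columns = [
    pd.array([pd.Timestamp("2021-03-01"), pd.NaT], dtype="datetime64[ns]"),
    pd.array([pd.NaT, pd.Timestamp("2021-03-02")], dtype="datetime64[ns]"),
]
dt_row = fast_xs_sketch(dt_columns, 1, np.dtype("M8[ns]"))

int_columns = [np.array([1, 2]), np.array([3, 4])]
int_row = fast_xs_sketch(int_columns, 0, np.dtype("int64"))
```

Here dt_row is built by pandas and keeps the missing value, while int_row goes through the ordinary NumPy path.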

jorisvandenbossche · 2021-03-02T18:36:16Z

@jbrockmendel more comments here?

jbrockmendel · 2021-03-02T20:19:19Z

pandas/core/internals/array_manager.py

+        # for datetime64/timedelta64, the np.ndarray constructor cannot handle pd.NaT
+        elif is_datetime64_ns_dtype(dtype):
+            result = DatetimeArray._from_sequence(values, dtype=dtype)._data
+        elif is_timedelta64_ns_dtype(dtype):


we have a little-used is_ea_or_datetimelike_dtype, could use an analogous helper to get DatetimeArray/TimedeltaArray in these cases (not for this PR)

jbrockmendel · 2021-03-02T20:19:57Z

pandas/core/internals/array_manager.py

+        elif is_timedelta64_ns_dtype(dtype):
+            result = TimedeltaArray._from_sequence(values, dtype=dtype)._data
+        else:
+            result = np.array(values, dtype=dtype)


did you check if this is relevant for the BlockManager case?

Yes, see my partial answer at #40147 (comment).
Moreover, the BlockManager method assigns values from each Block into the resulting array element by element:

pandas/pandas/core/internals/managers.py

Lines 978 to 982 in b835ca2

        for blk in self.blocks:
            # Such assignment may incorrectly coerce NaT to None
            # result[blk.mgr_locs] = blk._slice((slice(None), loc))
            for i, rl in enumerate(blk.mgr_locs):
                result[rl] = blk.iget((i, loc))

So that is quite different from the code here, and the idea of first keeping the values in a list doesn't really apply.

jbrockmendel · 2021-03-02T20:59:23Z

LGTM cc @jreback

jorisvandenbossche · 2021-03-02T21:15:50Z

@jbrockmendel thanks for the review

I updated #39991 now based on this

[ArrayManager] Ensure to store datetime/timedelta data as DatetimeArray/TimedeltaArray (and not ndarray) (1001a16)

jorisvandenbossche added the Refactor (Internal refactoring of code) and Internals (Related to non-user accessible pandas implementation) labels Mar 1, 2021

jorisvandenbossche requested a review from jbrockmendel March 1, 2021 14:12

jorisvandenbossche mentioned this pull request Mar 1, 2021

[ArrayManager] DataFrame constructors #39991

Merged

jorisvandenbossche added 2 commits March 2, 2021 08:24

simplify (d848205)

Merge remote-tracking branch 'upstream/master' into am-datetimelike-storage (6c43282)

jbrockmendel reviewed Mar 2, 2021

jorisvandenbossche merged commit 4f18821 into pandas-dev:master Mar 2, 2021

jorisvandenbossche deleted the am-datetimelike-storage branch March 2, 2021 21:08

jorisvandenbossche added this to the 1.3 milestone Mar 3, 2021
