-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ArrayManager] DataFrame constructors #39991
[ArrayManager] DataFrame constructors #39991
Conversation
@jbrockmendel other comments? |
lgtm cc @jreback |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls rebase as well
pandas/core/frame.py
Outdated
mgr = self._init_mgr( | ||
data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy | ||
) | ||
|
||
elif isinstance(data, dict): | ||
mgr = init_dict(data, index, columns, dtype=dtype) | ||
mgr = init_dict(data, index, columns, dtype=dtype, typ=manager) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you rename typ -> manager?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note this is the same keyword as I use for arrays_to_mgr
, mgr_to_mgr
, etc, where I think the typ
keyword makes sense (it's already clear from the name of the function that it is creating a manager).
Alternatively, we could make the init_dict
function name more explicit (and consistent with the others) and eg rename to dict_to_mgr
. I am happy to do this in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok yeah +1 on the rename (followon ok ) and anything to make the keyword more obvious
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> #40074
@jreback more answers / comments here? |
@@ -107,7 +111,8 @@ def arrays_to_mgr( | |||
|
|||
# don't force copy because getting jammed in an ndarray anyway | |||
arrays = _homogenize(arrays, index, dtype) | |||
|
|||
if typ == "array": | |||
arrays = [ensure_wrapped_if_datetimelike(arr) for arr in arrays] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could do this unconditionally. tradeoff: slightly less complex code here, perf of unecessary wrapping/unwrapping in DatetimeBlock/TimedeltaBlock._maybe_coerce_values (for now, until upcoming PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would then leave that until it is useful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you do this inside e.g. in ArrayManager constructor (or add a constructor on L127), prob should do that anyhow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a constructor on L127
What do you mean exactly with that?
We already use the main constructor ArrayManager(arrays, [index, columns])
on L127 because that constructor already works from "arrays" (in constrast to the BlockManager, which needs to consolidate arrays etc, so therefore there is a helper create_blockmanager_from_arrays
, but for ArrayManager this is exactly the __init__
already)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the suggestion is to move the call on L115 to just before the return ArrayManager
on L127. i think thatd be a small improvement, not a big deal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intentionally put it here because, in theory, I think this wrapping only needs to be done with the default of verify_integrity=True
(so you have a way to skip it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean on line L127 this should call a method inside internal and not directly calling the manager. Then you can handle the special cases. on L110 inside that function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you want me to write this function?
def create_array_manager_from_arrays(arrays, axes, verify_integrity=True):
if verify_integrity:
arrays = [ensure_wrapped_if_datetimelike(arr) for arr in arrays]
return ArrayManaget(arrays, axes)
(IMO this just splits the handling of verify_integrity
to multiple places (as mentioned above, that's the reason that I put this wrapping where it is located now), but I don't care if this makes you happy)
this should call a method inside internal
Note you are commenting on a file that already is inside internals
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its L110 which needs to be handled in array_manager.py and not here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, moved inside array_manager.py
@@ -107,7 +111,8 @@ def arrays_to_mgr( | |||
|
|||
# don't force copy because getting jammed in an ndarray anyway | |||
arrays = _homogenize(arrays, index, dtype) | |||
|
|||
if typ == "array": | |||
arrays = [ensure_wrapped_if_datetimelike(arr) for arr in arrays] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you do this inside e.g. in ArrayManager constructor (or add a constructor on L127), prob should do that anyhow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a rebase
@@ -107,7 +111,8 @@ def arrays_to_mgr( | |||
|
|||
# don't force copy because getting jammed in an ndarray anyway | |||
arrays = _homogenize(arrays, index, dtype) | |||
|
|||
if typ == "array": | |||
arrays = [ensure_wrapped_if_datetimelike(arr) for arr in arrays] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean on line L127 this should call a method inside internal and not directly calling the manager. Then you can handle the special cases. on L110 inside that function.
Moving the datetime wrapping into the ArrayManager constructor revealed some other issues, so first doing a precursor for this here -> #40147 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm.
xref #39146
In this PR, I am trying to ensure what when we construct a DataFrame, we directly construct the proper manager (depending on the option settings) instead of converting at the end of the constructor (as is done now). To that end, I am passing through a keyword to indicate the type of the manager to be created to the different internal construction methods like
arrays_to_mgr
etc.