Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ArrayManager] DataFrame constructors #39991

Merged

Conversation

jorisvandenbossche
Copy link
Member

xref #39146

In this PR, I am trying to ensure what when we construct a DataFrame, we directly construct the proper manager (depending on the option settings) instead of converting at the end of the constructor (as is done now). To that end, I am passing through a keyword to indicate the type of the manager to be created to the different internal construction methods like arrays_to_mgr etc.

@jorisvandenbossche jorisvandenbossche added Refactor Internal refactoring of code Internals Related to non-user accessible pandas implementation Constructors Series/DataFrame/Index/pd.array Constructors labels Feb 23, 2021
@jorisvandenbossche
Copy link
Member Author

@jbrockmendel other comments?

@jbrockmendel
Copy link
Member

lgtm cc @jreback

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls rebase as well

pandas/core/frame.py Show resolved Hide resolved
pandas/core/frame.py Show resolved Hide resolved
mgr = self._init_mgr(
data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
)

elif isinstance(data, dict):
mgr = init_dict(data, index, columns, dtype=dtype)
mgr = init_dict(data, index, columns, dtype=dtype, typ=manager)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you rename typ -> manager?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this is the same keyword as I use for arrays_to_mgr, mgr_to_mgr, etc, where I think the typ keyword makes sense (it's already clear from the name of the function that it is creating a manager).

Alternatively, we could make the init_dict function name more explicit (and consistent with the others) and eg rename to dict_to_mgr. I am happy to do this in a separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok yeah +1 on the rename (followon ok ) and anything to make the keyword more obvious

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> #40074

pandas/core/internals/array_manager.py Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member Author

@jreback more answers / comments here?

@@ -107,7 +111,8 @@ def arrays_to_mgr(

# don't force copy because getting jammed in an ndarray anyway
arrays = _homogenize(arrays, index, dtype)

if typ == "array":
arrays = [ensure_wrapped_if_datetimelike(arr) for arr in arrays]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could do this unconditionally. tradeoff: slightly less complex code here, perf of unecessary wrapping/unwrapping in DatetimeBlock/TimedeltaBlock._maybe_coerce_values (for now, until upcoming PR)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would then leave that until it is useful

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do this inside e.g. in ArrayManager constructor (or add a constructor on L127), prob should do that anyhow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a constructor on L127

What do you mean exactly with that?
We already use the main constructor ArrayManager(arrays, [index, columns]) on L127 because that constructor already works from "arrays" (in constrast to the BlockManager, which needs to consolidate arrays etc, so therefore there is a helper create_blockmanager_from_arrays, but for ArrayManager this is exactly the __init__ already)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the suggestion is to move the call on L115 to just before the return ArrayManager on L127. i think thatd be a small improvement, not a big deal

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally put it here because, in theory, I think this wrapping only needs to be done with the default of verify_integrity=True (so you have a way to skip it)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean on line L127 this should call a method inside internal and not directly calling the manager. Then you can handle the special cases. on L110 inside that function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you want me to write this function?

def create_array_manager_from_arrays(arrays, axes, verify_integrity=True):
    if verify_integrity:
        arrays = [ensure_wrapped_if_datetimelike(arr) for arr in arrays]
    return ArrayManaget(arrays, axes)

(IMO this just splits the handling of verify_integrity to multiple places (as mentioned above, that's the reason that I put this wrapping where it is located now), but I don't care if this makes you happy)

this should call a method inside internal

Note you are commenting on a file that already is inside internals

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its L110 which needs to be handled in array_manager.py and not here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, moved inside array_manager.py

@@ -107,7 +111,8 @@ def arrays_to_mgr(

# don't force copy because getting jammed in an ndarray anyway
arrays = _homogenize(arrays, index, dtype)

if typ == "array":
arrays = [ensure_wrapped_if_datetimelike(arr) for arr in arrays]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do this inside e.g. in ArrayManager constructor (or add a constructor on L127), prob should do that anyhow.

@jorisvandenbossche
Copy link
Member Author

All good here? (this is blocking a few other PRs that are mostly ready: #39700, #40050)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a rebase

@@ -107,7 +111,8 @@ def arrays_to_mgr(

# don't force copy because getting jammed in an ndarray anyway
arrays = _homogenize(arrays, index, dtype)

if typ == "array":
arrays = [ensure_wrapped_if_datetimelike(arr) for arr in arrays]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean on line L127 this should call a method inside internal and not directly calling the manager. Then you can handle the special cases. on L110 inside that function.

@jorisvandenbossche
Copy link
Member Author

Moving the datetime wrapping into the ArrayManager constructor revealed some other issues, so first doing a precursor for this here -> #40147

@jorisvandenbossche
Copy link
Member Author

#40147 is merged, so updated this (I think the only remaining discussion item was about where the datetime wrapping happened, which is resolved by #40147)

@jreback jreback added this to the 1.3 milestone Mar 2, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Constructors Series/DataFrame/Index/pd.array Constructors Internals Related to non-user accessible pandas implementation Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants