
Add method to return two numpy arrays where first one is the first column and the second is all other columns #8565

Closed
deanm0000 opened this issue Apr 28, 2023 · 8 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@deanm0000
Collaborator

Problem description

This is somewhat in conjunction with another feature request, but each request stands alone.

In this case I'd like to be able to do something like

y, X = df.select("myYcol", "Xcol1", pl.col("Xcol2").someexpressions(), more_expressions).to_np_tuple()

where it'd be the equivalent of

y=df.select('myYcol').to_numpy()
X=df.select("Xcol1", pl.col("Xcol2").someexpressions(), more_expressions).to_numpy()

The distinction is that whatever the first column is becomes y, and X is all the other columns. Ideally it would retain the column names, as referenced in the other feature request, but even without that this would be nice.

I can monkey-patch this functionality in like this:

def to_np_tuple(self):
    # First column becomes y, everything else becomes X.
    y = self.select(self.columns[0]).to_numpy()
    X = self.select(self.columns[1:]).to_numpy()
    return y, X

# Attach to DataFrame, then remove the module-level name.
pl.DataFrame.to_np_tuple = to_np_tuple
del to_np_tuple
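
A quick usage sketch of the patched method (the column names here are hypothetical):

import polars as pl

# Hypothetical frame: first column is the target, the rest are features.
df = pl.DataFrame({"y": [1.0, 2.0, 3.0], "x1": [4, 5, 6], "x2": [7, 8, 9]})

y, X = df.to_np_tuple()  # y has shape (3, 1), X has shape (3, 2)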
@deanm0000 deanm0000 added the enhancement New feature or an improvement of an existing feature label Apr 28, 2023
@mcrumiller
Contributor

mcrumiller commented Apr 28, 2023

This is a rather unique request that I think would be best handled with a custom function. Why not make your function a little more general:

def partition_by_columns(df, partitions):
    """
    Divide df into separate numpy arrays using the column groups found in partitions.
    If partitions is a single string, return that column as its own partition and the
    rest of the columns as a separate partition.

    Inputs:
        * df         - polars DataFrame
        * partitions - str, or list of lists of column names

    Returns: tuple of numpy arrays
    """
    # Normalize a single column name into one single-column group.
    partitions = [[partitions]] if isinstance(partitions, str) else partitions
    # Columns not claimed by any group form the final partition.
    used = {col for group in partitions for col in group}
    remaining = [col for col in df.columns if col not in used]
    return tuple(df.select(columns).to_numpy() for columns in [*partitions, remaining])
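
A hedged usage sketch of that helper (column names are hypothetical):

import polars as pl

df = pl.DataFrame({"y": [1, 2, 3], "x1": [4, 5, 6], "x2": [7, 8, 9]})
y, X = partition_by_columns(df, "y")  # y from column "y", X from the remaining columns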

@zundertj
Collaborator

On the monkey-patching bit, we have an official extension API.

@deanm0000
Collaborator Author

@zundertj I know, but I usually don't want custom functions in a separate namespace; I just want them at the root level. In case I'm misusing terms, what I mean is that I don't want to have to do df.custom.to_np_tuple(); I just want df.to_np_tuple().

@zundertj
Collaborator

zundertj commented May 1, 2023

OK, as long as you are aware that Polars upgrades may break your monkey-patching code. Hence my suggestion to use the extension API.

@deanm0000
Collaborator Author

deanm0000 commented May 1, 2023

@mcrumiller appreciate the idea. However, for this purpose, the idea is to sort of mimic patsy/formulaic's output of y and X numpy arrays for input into something like statsmodels. As such, the desired output will always have size 2 so that the y, X = thisfunction() syntax works.
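
For context, a minimal sketch of the downstream pattern being mimicked, assuming statsmodels is installed; the arrays below are hypothetical stand-ins for the (y, X) pair produced by such a split:

import numpy as np
import statsmodels.api as sm

# Hypothetical y and X standing in for the output of the proposed split.
y = np.array([1.0, 2.0, 3.0, 4.0])
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])

model = sm.OLS(y, sm.add_constant(X)).fit()  # OLS(endog, exog), the call shape patsy/formulaic users expect
print(model.params)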

@mcrumiller
Contributor

mcrumiller commented May 1, 2023

@deanm0000 try this one:

import polars as pl
import numpy as np

@pl.api.register_dataframe_namespace("partition")
class Partition:
    __slots__ = ["_df"]

    def __init__(self, df: pl.DataFrame):
        self._df = df

    def on_column(self, column: str) -> tuple[np.ndarray, np.ndarray]:
        """Partition the dataframe on a single column, returning (that column, all remaining columns) as numpy arrays."""
        remaining = [x for x in self._df.columns if x != column]
        return (self._df[column].to_numpy(), self._df.select(remaining).to_numpy())


df = pl.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [6, 7, 8, 9, 10],
    'c': [11, 12, 13, 14, 15],
})

y, X = df.partition.on_column('a')

print(y)
print(X)
>>> python test_pl_extension.py
[1 2 3 4 5]
[[ 6 11]
 [ 7 12]
 [ 8 13]
 [ 9 14]
 [10 15]]

@deanm0000
Collaborator Author

How about making the default column None and then doing:

if not column:
    column=self._df.columns[0]

It's probably more performant to compute remaining with the list/set/difference approach, or to just use drop, but it probably will never matter (a combined sketch follows the snippet below).

remaining=self._df.drop(column).columns
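
A sketch of how those two tweaks might look folded into the Partition namespace from the earlier comment (this re-registers the same "partition" namespace, and the default-to-first-column behaviour is the suggestion above, not an existing polars API):

import numpy as np
import polars as pl

@pl.api.register_dataframe_namespace("partition")
class Partition:
    __slots__ = ["_df"]

    def __init__(self, df: pl.DataFrame):
        self._df = df

    def on_column(self, column=None) -> tuple[np.ndarray, np.ndarray]:
        """Partition on `column` (default: the first column), returning (y, X) numpy arrays."""
        if column is None:
            column = self._df.columns[0]  # default to the first column, per the suggestion
        remaining = self._df.drop(column).columns  # drop-based "remaining", per the suggestion
        return (self._df[column].to_numpy(), self._df.select(remaining).to_numpy())

y, X = pl.DataFrame({"a": [1, 2], "b": [3, 4]}).partition.on_column()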

As an aside, what does

    __slots__ = ["_df"]

do?

@mcrumiller
Contributor

Sure, those are pretty minor edits and will never be the performance bottleneck, unless you have a dataframe with millions of columns, in which case it probably still won't be the bottleneck.

__slots__ is a pretty standard class optimization. Instances in Python typically store their attributes in a dictionary (__dict__), which allows you to add new attributes on the fly, as in:

class MyClass:
    def __init__(self):
        pass

x = MyClass()
x.my_attribute = 3

print(x.__dict__)
print(x.my_attribute)
{'my_attribute': 3}
3

If you define __slots__, you explicitly declare the instance attributes, and new attributes cannot be added on the fly. This means Python doesn't have to create a __dict__ for each instance and also skips the __weakref__ slot, which supports weak references:

class MyClass:
    __slots__ = ["slot_attribute"]

    def __init__(self):
        pass

x = MyClass()
x.slot_attribute = 3 # ok
x.my_attribute = 3   # uh-oh. We can only use slot_attribute
AttributeError: 'MyClass' object has no attribute 'my_attribute'
