
Add method to return two numpy arrays where first one is the first column and the second is all other columns #8565

Closed
deanm0000 opened this issue Apr 28, 2023 · 8 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@deanm0000
Collaborator

Problem description

This is somewhat in conjunction with another feature request, but each request stands alone.

In this case I'd like to be able to do something like

y, X = df.select("myYcol", "Xcol1", pl.col("Xcol2").someexpressions(), more_expressions).to_np_tuple()

where it'd be the equivalent of

y=df.select('myYcol').to_numpy()
X=df.select("Xcol1", pl.col("Xcol2").someexpressions(), more_expressions).to_numpy()

The distinction is that whatever the first column is becomes y, and X is all the other columns. Ideally it would retain the column names, as referenced in the other feature request, but even without that this would be nice.

I can monkey-patch this functionality in like this:

def to_np_tuple(self):
    # First column becomes y, everything else becomes X.
    y = self.select(self.columns[0]).to_numpy()
    X = self.select(self.columns[1:]).to_numpy()
    return y, X

# Attach to DataFrame, then remove the module-level name.
pl.DataFrame.to_np_tuple = to_np_tuple
del to_np_tuple
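
A quick usage sketch of the patched method (the column names here are hypothetical):

import polars as pl

# Hypothetical frame: first column is the target, the rest are features.
df = pl.DataFrame({"y": [1.0, 2.0, 3.0], "x1": [4, 5, 6], "x2": [7, 8, 9]})

y, X = df.to_np_tuple()  # y has shape (3, 1), X has shape (3, 2)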
@deanm0000 deanm0000 added the enhancement New feature or an improvement of an existing feature label Apr 28, 2023
@mcrumiller
Contributor

mcrumiller commented Apr 28, 2023

This is a rather unique request that I think would be best handled with a custom function. Why not make your function a little more general:

def partition_by_columns(df, partitions):
    """
    Divide df into separate numpy arrays using the column groups found in partitions.
    If partitions is a single string, return that column as its own partition and the
    rest of the columns as a separate partition.

    Inputs:
        * df         - polars DataFrame
        * partitions - str, or list of lists of column names

    Returns: tuple of numpy arrays
    """
    # Normalize a single column name into one single-column group.
    partitions = [[partitions]] if isinstance(partitions, str) else partitions
    # Columns not claimed by any group form the final partition.
    used = {col for group in partitions for col in group}
    remaining = [col for col in df.columns if col not in used]
    return tuple(df.select(columns).to_numpy() for columns in [*partitions, remaining])
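
A hedged usage sketch of that helper (column names are hypothetical):

import polars as pl

df = pl.DataFrame({"y": [1, 2, 3], "x1": [4, 5, 6], "x2": [7, 8, 9]})
y, X = partition_by_columns(df, "y")  # y from column "y", X from the remaining columns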

@zundertj
Collaborator

On the monkey-patching bit, we have an official extension API.

@deanm0000
Collaborator Author

@zundertj I know, but I usually don't want custom functions in a separate namespace; I just want them at the root level. In case I'm misusing terms, what I mean is that I don't want to have to do df.custom.to_np_tuple(); I just want df.to_np_tuple().

@zundertj
Collaborator

zundertj commented May 1, 2023

OK, as long as you are aware that Polars upgrades may break your monkey-patching code. Hence my suggestion to use the extension API.

@deanm0000
Collaborator Author

deanm0000 commented May 1, 2023

@mcrumiller appreciate the idea. However, for this purpose, the idea is to sort of mimic patsy/formulaic's output of y and X numpy arrays for input into something like statsmodels. As such, the desired output will always have size 2 so that the y, X = thisfunction() syntax works.
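
For context, a minimal sketch of the downstream pattern being mimicked, assuming statsmodels is installed; the arrays below are hypothetical stand-ins for the (y, X) pair produced by such a split:

import numpy as np
import statsmodels.api as sm

# Hypothetical y and X standing in for the output of the proposed split.
y = np.array([1.0, 2.0, 3.0, 4.0])
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])

model = sm.OLS(y, sm.add_constant(X)).fit()  # OLS(endog, exog), the call shape patsy/formulaic users expect
print(model.params)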

@mcrumiller
Contributor

mcrumiller commented May 1, 2023

@deanm0000 try this one:

import polars as pl
import numpy as np

@pl.api.register_dataframe_namespace("partition")
class Partition:
    __slots__ = ["_df"]

    def __init__(self, df: pl.DataFrame):
        self._df = df

    def on_column(self, column: str) -> tuple[np.ndarray, np.ndarray]:
        """Partition the dataframe on a single column, returning (that column, all remaining columns) as numpy arrays."""
        remaining = [x for x in self._df.columns if x != column]
        return (self._df[column].to_numpy(), self._df.select(remaining).to_numpy())


df = pl.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [6, 7, 8, 9, 10],
    'c': [11, 12, 13, 14, 15],
})

y, X = df.partition.on_column('a')

print(y)
print(X)
>>> python test_pl_extension.py
[1 2 3 4 5]
[[ 6 11]
 [ 7 12]
 [ 8 13]
 [ 9 14]
 [10 15]]

@deanm0000
Collaborator Author

How about making the default column None and then doing:

if not column:
    column=self._df.columns[0]

It's probably more performant to compute remaining with the list/set/difference approach, or to just use drop, but it probably will never matter (a combined sketch follows the snippet below).

remaining=self._df.drop(column).columns
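
A sketch of how those two tweaks might look folded into the Partition namespace from the earlier comment (this re-registers the same "partition" namespace, and the default-to-first-column behaviour is the suggestion above, not an existing polars API):

import numpy as np
import polars as pl

@pl.api.register_dataframe_namespace("partition")
class Partition:
    __slots__ = ["_df"]

    def __init__(self, df: pl.DataFrame):
        self._df = df

    def on_column(self, column=None) -> tuple[np.ndarray, np.ndarray]:
        """Partition on `column` (default: the first column), returning (y, X) numpy arrays."""
        if column is None:
            column = self._df.columns[0]  # default to the first column, per the suggestion
        remaining = self._df.drop(column).columns  # drop-based "remaining", per the suggestion
        return (self._df[column].to_numpy(), self._df.select(remaining).to_numpy())

y, X = pl.DataFrame({"a": [1, 2], "b": [3, 4]}).partition.on_column()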

As an aside, what does

    __slots__ = ["_df"]

do?

@mcrumiller
Contributor

Sure, those are pretty minor edits and will never be the performance bottleneck, unless you have a dataframe with millions of columns, in which case it probably still won't be the bottleneck.

__slots__ is a pretty standard class optimization. Instances in Python typically store their attributes in a dictionary (__dict__), which allows you to add new attributes on the fly, as in:

class MyClass:
    def __init__(self):
        pass

x = MyClass()
x.my_attribute = 3

print(x.__dict__)
print(x.my_attribute)
{'my_attribute': 3}
3

If you define __slots__, you explicitly declare the instance attributes, and new attributes cannot be added on the fly. This means Python doesn't have to create a __dict__ for each instance and also skips the __weakref__ slot, which supports weak references:

class MyClass:
    __slots__ = ["slot_attribute"]

    def __init__(self):
        pass

x = MyClass()
x.slot_attribute = 3 # ok
x.my_attribute = 3   # uh-oh. We can only use slot_attribute
AttributeError: 'MyClass' object has no attribute 'my_attribute'
