Add method to return two numpy arrays where first one is the first column and the second is all other columns #8565
Comments
This is a rather unique request that I think would be best served by a custom function. Why not make your function a little more general:

def partition_by_columns(df, partitions):
    """
    Divide df into separate numpy arrays using the columns found in partitions. If partitions is a single string,
    return that column as its own partition and the rest of the columns as a separate partition.
    Inputs:
    * df - polars dataframe
    * partitions - str or list of lists
    Returns: tuple of numpy arrays
    """
    # A single column name becomes one partition holding just that column.
    partitions = [[partitions]] if isinstance(partitions, str) else partitions
    used = {column for columns in partitions for column in columns}
    # Leftover columns form one final partition, in the dataframe's column order.
    remaining = [column for column in df.columns if column not in used]
    return tuple(df.select(columns).to_numpy() for columns in [*partitions, remaining])
On the monkey patching bit, we have an official extension API.
@zundertj I know, but I usually don't want custom functions in a separate namespace; I just want them at the root level. In case I'm misusing terms, what I mean is that I don't want to have to do
Ok, as long as you are aware that Polars upgrades may break your monkey patching code. Hence my suggestion to use the extension API.
@mcrumiller appreciate the idea. However, for this purpose, the idea is to sort of mimic patsy/formulaic's output of a y and X numpy array for input into something like statsmodels. As such, the desired output will always be size 2 so that the
@deanm0000 try this one:

import polars as pl
import numpy as np

@pl.api.register_dataframe_namespace("partition")
class Partition:
    __slots__ = ["_df"]

    def __init__(self, df: pl.DataFrame):
        self._df = df

    def on_column(self, column: str) -> tuple[np.ndarray, np.ndarray]:
        """Partition a dataframe on a single column and return the tuple (column, remaining columns) as numpy arrays"""
        remaining = [x for x in self._df.columns if x != column]
        return (self._df[column].to_numpy(), self._df.select(remaining).to_numpy())

df = pl.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [6, 7, 8, 9, 10],
    'c': [11, 12, 13, 14, 15],
})

y, X = df.partition.on_column('a')
print(y)
print(X)
How about making the default
It's probably more performant to make remaining use the list/set/difference approach or just use
As an aside what does
do?
Sure, those are pretty minor edits and will never be the performance bottleneck, unless you have a dataframe with millions of columns, in which case it probably still won't be the bottleneck.
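One practical difference between the two ways of computing the remaining columns is ordering rather than speed. A small sketch on a plain list of column names (the names are illustrative only): the list comprehension keeps the dataframe's column order, while a set difference makes no promise about order, so the columns of X could come out shuffled.

```python
columns = ["a", "b", "c", "d"]  # stand-in for df.columns
target = "a"

# List comprehension: keeps the original column order.
ordered = [name for name in columns if name != target]

# Set difference: same members, but the order is not guaranteed.
unordered = list(set(columns).difference([target]))

print(ordered)
print(sorted(unordered) == sorted(ordered))
```

If the set-based version is preferred, sorting the result or re-filtering `columns` against the set restores a predictable column order.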
class MyClass:
    def __init__(self):
        pass

x = MyClass()
x.my_attribute = 3
print(x.__dict__)
print(x.my_attribute)

If you defined

class MyClass:
    __slots__ = ["slot_attribute"]

    def __init__(self):
        pass

x = MyClass()
x.slot_attribute = 3 # ok
x.my_attribute = 3 # uh-oh. We can only use slot_attribute
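The difference can be checked in one runnable snippet. This is a minimal sketch with made-up class names (`WithDict`, `WithSlots`): an ordinary class stores attributes in a per-instance `__dict__`, while a class that declares `__slots__` has no `__dict__` and rejects any attribute that is not listed.

```python
class WithDict:
    """Ordinary class: attributes live in a per-instance __dict__."""


class WithSlots:
    """Only the names listed in __slots__ may be assigned."""
    __slots__ = ["slot_attribute"]


a = WithDict()
a.my_attribute = 3           # fine, stored in a.__dict__
print(a.__dict__)

b = WithSlots()
b.slot_attribute = 3         # fine, declared in __slots__
try:
    b.my_attribute = 3       # not declared, so Python refuses
except AttributeError as exc:
    error = exc
    print(type(exc).__name__)

print(hasattr(b, "__dict__"))
```

For a namespace class that only ever holds `_df`, declaring `__slots__` mainly documents that intent and avoids creating a per-instance dictionary.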
Problem description
This is somewhat in conjunction with this but each request stands alone.
In this case I'd like to be able to do something like
where it'd be the equivalent of
The distinction would be that whatever the first column is would be the y and the X is all the other columns. Ideally, it'd be in a way that retains the column names, as referenced in the other feature request, but even without that this would be nice.
I can monkey patch this functionality in like this