Add support for numpy structured array conversion to and from #8564

Closed
deanm0000 opened this issue Apr 28, 2023 · 8 comments · Fixed by #8628
Labels
enhancement New feature or an improvement of an existing feature

Comments

@deanm0000
Collaborator

Problem description

A while back there was this question, which would have benefited from being able to convert a NumPy structured array into a polars DataFrame.

Another use case, this time for the export direction (something like a to_structured_array method), is feeding the result to a library such as statsmodels so that column names and dtypes are preserved.

Right now I can do something like:

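from numpy.lib.recfunctions import unstructured_to_structured

x_cols = ['col1', 'col2', 'col3']
# to_numpy() returns one 2-D array with a single common dtype;
# names= assigns the field names in the same call
X = unstructured_to_structured(df.select(x_cols).to_numpy(), names=x_cols)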

but then every field has a float dtype. I don't think that matters for OLS, but in other contexts retaining the original dtypes may be more important.
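
The dtype loss described above can be reproduced with NumPy alone. This is a minimal sketch, assuming a mixed integer/float table; the `unstructured` array is a hypothetical stand-in for what `df.select(x_cols).to_numpy()` would return, not output taken from the issue.

```python
import numpy as np
from numpy.lib.recfunctions import unstructured_to_structured

x_cols = ["col1", "col2", "col3"]

# Hypothetical stand-in for df.select(x_cols).to_numpy(): stacking an integer
# column next to float columns forces one common dtype for the 2-D array.
unstructured = np.column_stack([
    np.array([1, 2, 3], dtype=np.int64),
    np.array([0.5, 1.5, 2.5]),
    np.array([10.0, 20.0, 30.0]),
])

# Every field inherits the dtype of the 2-D array, so the integer column
# can no longer be told apart from the float ones.
X = unstructured_to_structured(unstructured, names=x_cols)

print(X.dtype.names)
print([X.dtype[name] for name in x_cols])
```

The names survive, but the original integer dtype of the first column does not, which is the gap a native export would close.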

Alternatively, other packages like patsy and formulaic produce their own matrix datatype, which is just an np.ndarray that retains the column names in a way that statsmodels recognizes. I haven't dug into the source code enough to see how that works, but it would be enough for me, although perhaps not as extensible.

@s-banach
Contributor

Can you use df.to_pandas()?

@deanm0000
Collaborator Author

In a literal sense, yes, but I'm trying to reduce dependence on pandas.

@tikkanz
Contributor

tikkanz commented Apr 29, 2023

Does the answer to this stackoverflow question help?

@zundertj
Collaborator

For the statsmodels + formulaic interface, it seems rather easy to create your own wrapper; see "working with large datasets" in the user guide. Something along the lines of (note: not tested):

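class DataSet(dict):
    """Dict-like wrapper that converts a column to NumPy only when it is requested."""

    def __init__(self, df):
        self._df = df

    def __getitem__(self, key):
        try:
            # Look the column up on the wrapped frame, not on a global `df`
            return self._df[key].to_numpy()
        except Exception as exc:
            # Report a missing column as KeyError, as dict-like consumers expect
            raise KeyError(key) from exc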
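
A variant of the same idea that runs without polars, sketched under assumptions: `ColumnMapping` is a hypothetical name, and a plain dict of lists stands in for the DataFrame. It subclasses `collections.abc.Mapping` rather than `dict`, so membership tests and iteration follow from the three methods defined here.

```python
from collections.abc import Mapping

import numpy as np


class ColumnMapping(Mapping):
    """Read-only, dict-like view that converts a column to NumPy on access."""

    def __init__(self, columns):
        # `columns` is a hypothetical stand-in for a DataFrame: name -> values.
        self._columns = columns

    def __getitem__(self, key):
        # A missing name raises KeyError from the underlying dict lookup.
        return np.asarray(self._columns[key])

    def __iter__(self):
        return iter(self._columns)

    def __len__(self):
        return len(self._columns)


data = ColumnMapping({"y": [1.0, 2.0, 3.0], "x": [1, 2, 3]})
print(data["x"].dtype.kind)  # kind code of the converted column
print("z" in data)           # Mapping turns the KeyError into False
```

Only the columns a formula actually references get converted, which is the point of the wrapper approach.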

@alexander-beedie alexander-beedie self-assigned this May 1, 2023
@deanm0000
Collaborator Author

@tikkanz yes, it's helpful, although their return-trip method uses the structured array that they started from. That being said, we can fix that like this:

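import numpy as np

# Allocate a structured array whose fields mirror each column's NumPy dtype, then fill it column by column
npdf = np.empty(df.height, dtype=[(x, df.head(0).get_column(x).to_numpy().dtype.str) for x in df.columns])
for x in df.columns:
    npdf[x] = df.get_column(x).to_numpy()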
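
The same column-by-column fill can be tried without polars. This is a minimal sketch, assuming a dict of 1-D arrays as a hypothetical stand-in for the DataFrame; the names `columns`, `structured`, and `round_trip` are illustrative only.

```python
import numpy as np

# Hypothetical stand-in for the DataFrame: column name -> 1-D array.
columns = {
    "id": np.array([1, 2, 3], dtype=np.int64),
    "price": np.array([9.5, 7.25, 3.0], dtype=np.float64),
    "flag": np.array([True, False, True]),
}
height = 3

# One field per column, each keeping that column's own dtype.
record_dtype = [(name, values.dtype) for name, values in columns.items()]
structured = np.empty(height, dtype=record_dtype)
for name, values in columns.items():
    structured[name] = values

# Return trip: the field names and dtypes are read from the structured
# array itself, so no reference to the original table is needed.
round_trip = {name: structured[name] for name in structured.dtype.names}

print(structured.dtype)
```

Unlike the unstructured route, each field keeps its own dtype here because the 2-D intermediate array is never built.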

@alexander-beedie
Collaborator

Done; we'll have native init/export support for numpy structured/record arrays in the upcoming 0.17.12 release 👍

@deanm0000
Collaborator Author

What's the syntax to export a structured array? Also, thanks for your efforts, they're much appreciated.

@alexander-beedie
Collaborator

alexander-beedie commented May 2, 2023

What's the syntax to export a structured array?

Have a look at the pull request - it's all detailed there ;)
TLDR: df.to_numpy(structured=True)


5 participants