Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify type "category" in read_sql #17862

Closed
sfromm-plm opened this issue Oct 13, 2017 · 4 comments
Closed

Specify type "category" in read_sql #17862

sfromm-plm opened this issue Oct 13, 2017 · 4 comments
Labels
Categorical Categorical Data Type Duplicate Report Duplicate issue or pull request IO SQL to_sql, read_sql, read_sql_query

Comments

@sfromm-plm
Copy link

Would it be possible to add a feature allowing one to specify that a particular column to be downloaded via read_sql has a particular data type? I'm particularly interested in forcing certain dataframe columns to be categorical, to conserve memory usage.

Yes, I could transform the data after receiving it from the database, but the datasets I'm working with are really large, causing huge memory usage high water marks. (An alternative is to break transfer the data in chunks and then transform the data in each chunk, but that complicates the code and, in my test runs, takes a little longer.)

@jreback
Copy link
Contributor

jreback commented Oct 14, 2017

well if sql were to actually provide a dictionary / category encoded column then this would work. but this is generally an implementation detail of the particular backend and I don't believe is exposed to users.

you might want to look at #17790 which is highly performant.

@jreback jreback closed this as completed Oct 14, 2017
@jreback jreback added Categorical Categorical Data Type IO SQL to_sql, read_sql, read_sql_query labels Oct 14, 2017
@jreback jreback added this to the No action milestone Oct 14, 2017
@jorisvandenbossche
Copy link
Member

We have had several similar enhancement requests, and I think we can add this (just like read_csv has one as well). See #13049, #6798, #17560
In general this is exactly the same as doing astype after reading, but for the case of using chunksize I think it can be easier to do this inside read_sql to every chunk is consistent.

@jorisvandenbossche jorisvandenbossche added the Duplicate Report Duplicate issue or pull request label Oct 16, 2017
@jorisvandenbossche
Copy link
Member

@sfromm-plm But to be clear, @jreback is correct in that this won't really "work" for categorical type of data, in the sense that it will not give much memory usage gain. As the values are first returned from the database as plain values, and would only afterwards be converted to Categorical.

@sfromm-plm
Copy link
Author

@jorisvandenbossche : Right. @jreback's answer wasn't immediately clear to me, but I settled on what you just said.

Thanks for the replies, guys, even if it wasn't what I was hoping to hear.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Duplicate Report Duplicate issue or pull request IO SQL to_sql, read_sql, read_sql_query
Projects
None yet
Development

No branches or pull requests

3 participants