Handle categorical columns correctly on dask dataframes #653

Closed
philippjfr opened this issue Sep 10, 2018 · 2 comments · Fixed by #798

Comments

@philippjfr
Member

philippjfr commented Sep 10, 2018

In #634 we discovered that when loading data using dask, categorical columns are not necessarily handled correctly. Specifically, the code currently calls .head() to load the categories into memory, but this only works correctly for parquet files written with fastparquet. For other data sources, the categories can only be computed reliably by loading the whole column into memory.

The appropriate solution is to ensure that categorize or as_known is called on the categorical columns before aggregation. However, this can cause significant slowdowns in many scenarios, so we have to consider carefully how best to implement it: either adopt some caching strategy, or strongly suggest that users run this beforehand so that the categories do not have to be inferred on every aggregation.

@jbednar
Member

jbednar commented Sep 11, 2018

Agreed. So the current situation is that with fastparquet files, the results will be fast, but other parquet files will fail to load properly? Sounds like we do need to address that after this coming release, then.

@philippjfr
Member Author

So the current situation is that with fastparquet files, the results will be fast, but other parquet files will fail to load properly?

The situation is that the code that tries to infer categories might fail to detect all existing categories unless the data is loaded from a parquet file written using fastparquet, which stores the available categories in the metadata. To guarantee correct behavior, we would always have to load the entire column into memory to find the unique categories.
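To illustrate the failure mode, here is a rough sketch that uses plain pandas Series as stand-ins for dask partitions. The data is made up, and the head-based step only mimics the idea of sampling the first rows; it is not the actual inference code:

```python
import pandas as pd

# Two hypothetical partitions of one column (stand-ins for dask partitions).
partitions = [
    pd.Series(["a", "a", "b"]),
    pd.Series(["b", "c", "d"]),
]

# Head-based inference: only the first rows of the first partition are seen.
head_categories = set(partitions[0].head().unique())

# Full scan: every partition is loaded to collect the unique values.
full_categories = set(pd.concat(partitions).unique())

# Categories that a head-based approach would silently miss.
missed = full_categories - head_categories
```

Here `missed` holds the categories that only appear in the second partition, which is why only a full scan (or metadata that already lists the categories) is reliable.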
