Handle categorical columns correctly on dask dataframes #653

philippjfr · 2018-09-10T17:16:07Z

In #634 we discovered that categorical columns are not necessarily handled correctly when data is loaded using dask. Specifically, the code currently calls .head() to load the categories into memory, but this only works correctly for parquet files written with fastparquet. For other data sources, the categories can only be computed reliably by loading the whole column into memory.

The appropriate solution is to ensure that categorize or as_known is called on the categorical columns before aggregation. However, this can cause significant slowdowns in many scenarios, so we have to consider carefully how best to implement it, e.g. adopt some caching strategy or strongly suggest that users run this beforehand, so that the categories do not have to be inferred on every aggregation.


jbednar · 2018-09-11T08:54:25Z

Agreed. So the current situation is that with fastparquet files, the results will be fast, but other parquet files will fail to load properly? Sounds like we do need to address that after this coming release, then.

philippjfr · 2018-09-12T16:44:11Z

> So the current situation is that with fastparquet files, the results will be fast, but other parquet files will fail to load properly?

The situation is that the code that tries to infer categories might fail to detect all existing categories unless the data is loaded from a parquet file which was written using fastparquet, which stores the available categories in the metadata. To guarantee correct behavior we would always have to load the entire column into memory to find the unique categories.
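To make the failure mode concrete, here is a small pandas-only sketch (the data is made up) showing how inferring categories from the first few rows can miss a category that only appears later in the column:

```python
import pandas as pd

# Hypothetical data: category "c" only appears in the last two rows.
df = pd.DataFrame({
    "cat": ["a", "b"] * 5 + ["c"] * 2,
    "value": range(12),
})

# Categories seen when only the first rows are inspected (head() returns 5 rows).
head_categories = sorted(df.head()["cat"].unique())

# Categories found by scanning the entire column.
full_categories = sorted(df["cat"].unique())

# Categories that the head-based inference failed to detect.
missing = sorted(set(full_categories) - set(head_categories))
print(head_categories, full_categories, missing)
```

Here the head-based inference sees only "a" and "b", while the full scan also finds "c".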

jbednar mentioned this issue Dec 19, 2018

Datashader internals to-do list #672

Open

13 tasks

philippjfr mentioned this issue Oct 4, 2019

Optimize dshape_from_dask #798

Merged

philippjfr closed this as completed in #798 Oct 4, 2019
