In #634 we discovered that categorical columns are not necessarily handled correctly when loading data with dask. Specifically, the code currently calls `.head()` to load the categories into memory, but this only works correctly for parquet files written with fastparquet; for other data sources the categories can only be computed reliably by loading the whole column into memory.
The appropriate solution is to ensure that `categorize` or `as_known` is called on the categorical columns before aggregation. However, this can cause significant slowdowns in many scenarios, so we have to consider carefully how best to implement it, e.g. adopt some caching strategy, or strongly suggest that users run this step beforehand so that the categories do not have to be inferred on every aggregation.
Agreed. So the current situation is that with fastparquet files the results will be fast, but other parquet files will fail to load properly? Sounds like we do need to address that after this coming release, then.
> So the current situation is that with fastparquet files the results will be fast, but other parquet files will fail to load properly?
The situation is that the code that tries to infer categories may fail to detect all existing categories unless the data is loaded from a parquet file written with fastparquet, which stores the available categories in its metadata. To guarantee correct behavior, we would always have to load the entire column into memory to find the unique categories.