Optimize dshape_from_dask #798
Conversation
philippjfr commented on Oct 4, 2019 (edited)
- Fixes #797 (Memoize or otherwise optimize dshape_from_dask)
- Closes #653 (Handle categorical columns correctly on dask dataframes)
Yeah, I think this is a good approach. I expect it will, but can you double-check that it works for categorical columns?

Will do, although I don't think we handle those correctly right now either; see #653.
Okay, I've pushed a fix to handle the case where the categories aren't known. Just wondering whether maybe we should modify the columns on the dataframe in place so that the categories only need to be inferred once, i.e. something like:

```python
for c in df.columns:
    col = df[c]
    if isinstance(col.dtype, type(pd.Categorical.dtype)) or isinstance(col.dtype, pd.api.types.CategoricalDtype):
        df[c] = col.cat.as_known()
```
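The in-place conversion above only needs to touch categorical columns. A minimal, pandas-only sketch of that dtype check (column names are made up for illustration; dask's `.cat.as_known()` is not exercised here):

```python
import pandas as pd

df = pd.DataFrame({
    'a': pd.Categorical(['x', 'y', 'x']),  # categorical column
    'b': [1, 2, 3],                        # plain integer column
})

# Collect the columns whose dtype is categorical, mirroring the isinstance
# check in the snippet above (using the CategoricalDtype class directly).
cat_columns = [c for c in df.columns
               if isinstance(df[c].dtype, pd.api.types.CategoricalDtype)]
print(cat_columns)  # ['a']
```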
It seems like the original guidance I got from dask folks wasn't quite correct: calling .head() does seem to load all the categories, not just the categories that are actually present in the head. So #653 was not actually an issue as far as I can tell.
The 1-2 seems a bit arbitrary, and I would think it wouldn't always be the optimal cutoff. If you call .head() on only a subset of the columns, would that be cheaper?
No, the overhead of .head() doesn't seem to reduce much when selecting a subset of columns (in fact it increases in some cases), so this approach would be much worse than even the current performance.

Ok, then your approach seems the most reasonable 🙂

I've set the cut-off at zero or one categorical columns, because those are the cases where it was consistently faster.
Force-pushed from 01ecc81 to bc700af
Force-pushed from bc700af to aa286d6
That all sounds good to me.
```python
if len(cat_columns) > 1:
    # If there is more than one categorical column it is faster
    # to compute the df.head() which will contain all categories
    return datashape.var * dshape_from_pandas(df.head()).measure
```
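A hedged, pandas-only sketch of the branch above — the `datashape`/`dshape_from_pandas` calls are stubbed out, and `describe_strategy` is my name for illustration, not part of the PR:

```python
import pandas as pd

def describe_strategy(df):
    """Report which inference path the PR's cutoff would take (sketch only)."""
    cat_columns = [c for c in df.columns
                   if isinstance(df[c].dtype, pd.api.types.CategoricalDtype)]
    if len(cat_columns) > 1:
        # One df.head() materializes every categorical column at once.
        return 'head'
    # Zero or one categorical columns: infer each column separately instead.
    return 'per-column'

df = pd.DataFrame({
    'a': pd.Categorical(['x', 'y']),
    'b': pd.Categorical(['u', 'v']),
    'c': [1, 2],
})
print(describe_strategy(df))  # 'head'
```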
I don't suppose it will get any faster if you specify `df.head(1)`, but maybe it's more explicit? No strong opinion.
I tried, it's slower.
Hmm, that's very strange. Is `df.head(5)` (or whatever the default is) also slower? Seems fishy!
Not too surprising; it still has to load the first partition, and the network-overhead difference between 1 and 5 rows is tiny. So the only difference is that it adds another task to the graph, which runs the iloc.
```diff
@@ -361,6 +361,8 @@ def dshape_from_pandas_helper(col):
     dataframe.
     """
     if isinstance(col.dtype, type(pd.Categorical.dtype)) or isinstance(col.dtype, pd.api.types.CategoricalDtype):
+        if not getattr(col.cat, 'known', True):
+            col = col.cat.as_known()
```
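The `getattr(col.cat, 'known', True)` guard works because only dask's categorical accessor exposes a `known` attribute; on a plain pandas Series the lookup falls back to the default. A minimal sketch of the pandas side of that behaviour:

```python
import pandas as pd

col = pd.Series(pd.Categorical(['x', 'y', 'x']))

# Plain pandas categorical accessors have no `known` attribute (dask adds it),
# so getattr defaults to True and the as_known() conversion is skipped.
known = getattr(col.cat, 'known', True)
print(known)  # True
```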
This is potentially expensive, and since you're dropping the reference this is work that is lost. I'm not sure if there's a better way; users interested in performance should provide dataframes with known categories beforehand.
> This is potentially expensive, and since you're dropping the reference this is work that is lost.

Right, I considered assigning back to the original dataframe, but that would be a bit too magical. I think if we can get memoization to work I won't feel so bad about this.
While I've got you here: I remember you telling me that calling `.head()` was not usually sufficient for inferring the categories, and that only fastparquet loads and stores the categories in metadata. Having played around with loading a variety of different datasets from csv, parquet loaded via pyarrow, and simply converting a pandas dataframe to a dask dataframe, `.head()` seems to have consistently returned all available categories even when they are not present in the head. Is that just a quirk of the examples I've tried, or can this be relied on after all?
That's just a quirk of the examples you've tried.
```python
In [38]: df = pd.DataFrame({'a': ['a', 'b', 'b']})

In [39]: df2 = pd.DataFrame({'a': ['b', 'b', 'c']})

In [40]: ddf = dd.concat([df, df2])

In [41]: ddf2 = ddf.astype({'a': 'category'})

In [42]: ddf2
Out[42]:
Dask DataFrame Structure:
                               a
npartitions=2
               category[unknown]
                             ...
                             ...
Dask Name: astype, 6 tasks

In [43]: ddf2.head(1).a.cat.categories
Out[43]: Index(['a', 'b'], dtype='object')

In [44]: ddf2.a.cat.as_known().cat.categories
Out[44]: Index(['a', 'b', 'c'], dtype='object')
```
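The dask session above can be mimicked with plain pandas to show where each set of categories comes from — a sketch only, not how dask implements `as_known()` (the two frames stand in for the two partitions):

```python
import pandas as pd

# Two "partitions", each cast to category independently.
part1 = pd.DataFrame({'a': ['a', 'b', 'b']}).astype({'a': 'category'})
part2 = pd.DataFrame({'a': ['b', 'b', 'c']}).astype({'a': 'category'})

# Categories visible from the first partition alone (what head(1) can see):
print(list(part1['a'].cat.categories))  # ['a', 'b']

# Union over all partitions (the full set that as_known() must compute):
union = part1['a'].cat.categories.union(part2['a'].cat.categories)
print(list(union))  # ['a', 'b', 'c']
```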
Thanks!