
Optimize dshape_from_dask #798

Merged 3 commits into master on Oct 4, 2019

Conversation

@philippjfr (Member) commented Oct 4, 2019

@philippjfr philippjfr added the wip label Oct 4, 2019
@jonmmease (Collaborator)

Yeah, I think this is a good approach. I expect it will, but can you double check that it works for categorical columns?

@philippjfr (Member, Author)

Will do, although I don't think we handle those correctly right now either, see #653

@philippjfr (Member, Author) commented Oct 4, 2019

Okay, I've pushed a fix to handle the case where the categories aren't known. I'm just wondering whether we should modify the columns on the dataframe in place, so that the categories only need to be inferred once, i.e. something like:

for c in df.columns:
    col = df[c]
    # Convert unknown categoricals in place so the categories
    # only need to be inferred once.
    if isinstance(col.dtype, pd.CategoricalDtype):
        df[c] = col.cat.as_known()

@philippjfr (Member, Author) commented Oct 4, 2019

It seems the original guidance I got from the dask folks wasn't quite correct: calling .head() does load all the categories, not just those actually present in the head, so #653 was not actually an issue as far as I can tell. Explicitly calling as_known is therefore only needed for the new approach. Calling it is faster than calling .head() when there are one or two categorical columns, but ends up slower when there are more; it is, however, much faster when there are no categorical columns at all. So maybe, to get the best of both worlds, we should use the old approach when there is more than one categorical column and the new approach for zero or one. Does that sound reasonable?
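As a plain-pandas sketch of the proposed cutoff check (the helper name and exact cutoff are hypothetical, not part of the PR):

```python
import pandas as pd

def has_few_categoricals(df, cutoff=1):
    # Count categorical columns; the per-column as_known() path is only
    # worthwhile when there are at most `cutoff` of them.
    n_cat = sum(isinstance(dtype, pd.CategoricalDtype) for dtype in df.dtypes)
    return n_cat <= cutoff

df = pd.DataFrame({
    'a': pd.Categorical(['x', 'y']),
    'b': [1, 2],
})
has_few_categoricals(df)  # True: only one categorical column
```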

@jonmmease (Collaborator)

The 1-2 cutoff seems a bit arbitrary, and I would think it wouldn't always be optimal.

If you call head on the subset of columns that are categorical, do you get comparable performance for the 1-2 category case? E.g. df[cat_cols].head(). If so, then I'd prefer the approach of always calling head on that subset (but not calling it at all if there are no categoricals), and then handling the rest of the columns as in your original PR here.

@philippjfr (Member, Author)

If you call head on the subset of columns that are categorical do you get comparable performance for the 1-2 category case?

No, the overhead of .head doesn't seem to reduce much when selecting a subset of columns (in fact it increases in some cases). So this approach would be much worse than even the current performance.

@jonmmease (Collaborator)

Ok, then your approach seems the most reasonable 🙂

@philippjfr (Member, Author)

I've set the cut-off at zero or one categorical columns, since those are the cases where the new approach was consistently faster.

@philippjfr force-pushed the philippjfr/dask_datashape_optimization branch from 01ecc81 to bc700af on October 4, 2019 12:17
@philippjfr force-pushed the philippjfr/dask_datashape_optimization branch from bc700af to aa286d6 on October 4, 2019 12:47
@jbednar (Member) commented Oct 4, 2019

That all sounds good to me.

if len(cat_columns) > 1:
    # If there is more than one categorical column it is faster
    # to compute df.head(), which will contain all categories
    return datashape.var * dshape_from_pandas(df.head()).measure
Member:

I don't suppose it will get any faster if you specify df.head(1), but maybe it's more explicit? No strong opinion.

Member (Author):

I tried, it's slower.

Member:

Hmm; that's very strange. Is df.head(5) (or whatever the default is) also slower? Seems fishy!

Member (Author):

Not too surprising: it still has to load the first partition, and the network-overhead difference between 1 and 5 rows is tiny. So the only difference is that it adds another task to the graph, which runs the iloc.

@philippjfr philippjfr merged commit 4fad824 into master Oct 4, 2019
@@ -361,6 +361,8 @@ def dshape_from_pandas_helper(col):
dataframe.
"""
if isinstance(col.dtype, type(pd.Categorical.dtype)) or isinstance(col.dtype, pd.api.types.CategoricalDtype):
if not getattr(col.cat, 'known', True):
col = col.cat.as_known()
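As an aside on the dtype check in the diff above: in recent pandas a single isinstance test suffices, since pd.api.types.CategoricalDtype is the same class as pd.CategoricalDtype. A minimal sketch:

```python
import pandas as pd

# pd.api.types.CategoricalDtype and pd.CategoricalDtype are the same class,
# so one isinstance check detects a categorical column's dtype.
col = pd.Series(['a', 'b', 'b']).astype('category')
isinstance(col.dtype, pd.CategoricalDtype)  # True
```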
Collaborator:

This is potentially expensive, and since you're dropping the reference this is work that is lost. I'm not sure there's a better way; users interested in performance should provide dataframes with known categories beforehand.

@philippjfr (Member, Author) commented Oct 4, 2019

This is potentially expensive, and since you're dropping the reference this is work that is lost.

Right, I considered assigning back to the original dataframe, but that would be a bit too magical. I think if we can get memoization to work I won't feel so bad about this.

While I've got you here: I remember you telling me that calling .head() was not usually sufficient for inferring the categories, and that only fastparquet loads and stores the categories in metadata. Having played around with loading a variety of datasets from csv, from parquet via pyarrow, and from simply converting a pandas dataframe to a dask dataframe, .head() has consistently returned all available categories even when they are not present in the head. Is that just a quirk of the examples I've tried, or can this be relied on after all?

Collaborator:

That's just a quirk of the examples you've tried:

In [38]: df = pd.DataFrame({'a': ['a', 'b', 'b']})

In [39]: df2 = pd.DataFrame({'a': ['b', 'b', 'c']})

In [40]: ddf = dd.concat([df, df2])

In [41]: ddf2 = ddf.astype({'a': 'category'})

In [42]: ddf2
Out[42]:
Dask DataFrame Structure:
                               a
npartitions=2
               category[unknown]
                             ...
                             ...
Dask Name: astype, 6 tasks

In [43]: ddf2.head(1).a.cat.categories
Out[43]: Index(['a', 'b'], dtype='object')

In [44]: ddf2.a.cat.as_known().cat.categories
Out[44]: Index(['a', 'b', 'c'], dtype='object')
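The same pitfall can be reproduced in plain pandas by categorizing each "partition" independently, which is effectively what a dask dataframe with unknown categories holds:

```python
import pandas as pd

# Each partition is categorized on its own, mimicking a dask dataframe
# whose categories are unknown.
part1 = pd.Series(['a', 'b', 'b']).astype('category')
part2 = pd.Series(['b', 'b', 'c']).astype('category')

list(part1.cat.categories)  # ['a', 'b']; inspecting only part1 misses 'c'
# The union over all partitions recovers the full set, as as_known() does:
sorted(set(part1.cat.categories) | set(part2.cat.categories))  # ['a', 'b', 'c']
```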

Member (Author):

Thanks!

@maximlt maximlt deleted the philippjfr/dask_datashape_optimization branch December 25, 2021 17:21

Successfully merging this pull request may close these issues.

Memoize or otherwise optimize dshape_from_dask
Handle categorical columns correctly on dask dataframes
4 participants