Optimize dshape_from_dask #798
Conversation
philippjfr commented on Oct 4, 2019 (edited)
- Fixes #797 (Memoize or otherwise optimize dshape_from_dask)
- Closes #653 (Handle categorical columns correctly on dask dataframes)
Yeah, I think this is a good approach. I expect it will, but can you double-check that it works for categorical columns?

Will do, although I don't think we handle those correctly right now either; see #653.
Okay, I've pushed a fix to handle the case where the categories aren't known. Just wondering whether maybe we should modify the columns on the dataframe in place so that the categories only need to be inferred once, i.e. something like:

```python
for c in df.columns:
    col = df[c]
    if isinstance(col.dtype, type(pd.Categorical.dtype)) or isinstance(col.dtype, pd.api.types.CategoricalDtype):
        df[c] = col.cat.as_known()
```
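The in-place conversion above only needs to touch categorical columns. A minimal, pandas-only sketch of that dtype check (column names are made up for illustration; dask's `.cat.as_known()` is not exercised here):

```python
import pandas as pd

df = pd.DataFrame({
    'a': pd.Categorical(['x', 'y', 'x']),  # categorical column
    'b': [1, 2, 3],                        # plain integer column
})

# Collect the columns whose dtype is categorical, mirroring the isinstance
# check in the snippet above (using the CategoricalDtype class directly).
cat_columns = [c for c in df.columns
               if isinstance(df[c].dtype, pd.api.types.CategoricalDtype)]
print(cat_columns)  # ['a']
```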
It seems like the original guidance I got from dask folks wasn't quite correct: calling .head() does seem to load all the categories, not just the categories that are actually present in the head. So #653 was not actually an issue as far as I can tell.
The 1-2 seems a bit arbitrary, and I would think it wouldn't always be the optimal cutoff. If you call .head() on only a subset of the columns, would that be cheaper?
No, the overhead of .head() doesn't seem to reduce much when selecting a subset of columns (in fact it increases in some cases), so this approach would be much worse than even the current performance.

Ok, then your approach seems the most reasonable 🙂

I've set the cut-off at zero or one categorical columns, because those are the cases where it was consistently faster.
Force-pushed from 01ecc81 to bc700af
Force-pushed from bc700af to aa286d6
That all sounds good to me.
```python
if len(cat_columns) > 1:
    # If there is more than one categorical column it is faster
    # to compute the df.head() which will contain all categories
    return datashape.var * dshape_from_pandas(df.head()).measure
```
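A hedged, pandas-only sketch of the branch above — the `datashape`/`dshape_from_pandas` calls are stubbed out, and `describe_strategy` is my name for illustration, not part of the PR:

```python
import pandas as pd

def describe_strategy(df):
    """Report which inference path the PR's cutoff would take (sketch only)."""
    cat_columns = [c for c in df.columns
                   if isinstance(df[c].dtype, pd.api.types.CategoricalDtype)]
    if len(cat_columns) > 1:
        # One df.head() materializes every categorical column at once.
        return 'head'
    # Zero or one categorical columns: infer each column separately instead.
    return 'per-column'

df = pd.DataFrame({
    'a': pd.Categorical(['x', 'y']),
    'b': pd.Categorical(['u', 'v']),
    'c': [1, 2],
})
print(describe_strategy(df))  # 'head'
```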
I don't suppose it will get any faster if you specify `df.head(1)`, but maybe it's more explicit? No strong opinion.
I tried, it's slower.
Hmm, that's very strange. Is `df.head(5)` (or whatever the default is) also slower? Seems fishy!
Not too surprising; it still has to load the first partition, and the network-overhead difference between 1 and 5 rows is tiny. So the only difference is that it adds another task to the graph, which runs the iloc.
```diff
@@ -361,6 +361,8 @@ def dshape_from_pandas_helper(col):
     dataframe.
     """
     if isinstance(col.dtype, type(pd.Categorical.dtype)) or isinstance(col.dtype, pd.api.types.CategoricalDtype):
+        if not getattr(col.cat, 'known', True):
+            col = col.cat.as_known()
```
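The `getattr(col.cat, 'known', True)` guard works because only dask's categorical accessor exposes a `known` attribute; on a plain pandas Series the lookup falls back to the default. A minimal sketch of the pandas side of that behaviour:

```python
import pandas as pd

col = pd.Series(pd.Categorical(['x', 'y', 'x']))

# Plain pandas categorical accessors have no `known` attribute (dask adds it),
# so getattr defaults to True and the as_known() conversion is skipped.
known = getattr(col.cat, 'known', True)
print(known)  # True
```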
This is potentially expensive, and since you're dropping the reference this is work that is lost. I'm not sure if there's a better way; users interested in performance should provide dataframes with known categories beforehand.
> This is potentially expensive, and since you're dropping the reference this is work that is lost.

Right, I considered assigning back to the original dataframe, but that would be a bit too magical. I think if we can get memoization to work I won't feel so bad about this.
While I've got you here: I remember you telling me that calling `.head()` was not usually sufficient for inferring the categories, and that only fastparquet loads and stores the categories in metadata. Having played around with loading a variety of different datasets from csv, parquet loaded via pyarrow, and simply converting a pandas dataframe to a dask dataframe, `.head()` seems to have consistently returned all available categories even when they are not present in the head. Is that just a quirk of the examples I've tried, or can this be relied on after all?
That's just a quirk of the examples you've tried.
```python
In [38]: df = pd.DataFrame({'a': ['a', 'b', 'b']})

In [39]: df2 = pd.DataFrame({'a': ['b', 'b', 'c']})

In [40]: ddf = dd.concat([df, df2])

In [41]: ddf2 = ddf.astype({'a': 'category'})

In [42]: ddf2
Out[42]:
Dask DataFrame Structure:
                               a
npartitions=2
               category[unknown]
                             ...
                             ...
Dask Name: astype, 6 tasks

In [43]: ddf2.head(1).a.cat.categories
Out[43]: Index(['a', 'b'], dtype='object')

In [44]: ddf2.a.cat.as_known().cat.categories
Out[44]: Index(['a', 'b', 'c'], dtype='object')
```
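The dask session above can be mimicked with plain pandas to show where each set of categories comes from — a sketch only, not how dask implements `as_known()` (the two frames stand in for the two partitions):

```python
import pandas as pd

# Two "partitions", each cast to category independently.
part1 = pd.DataFrame({'a': ['a', 'b', 'b']}).astype({'a': 'category'})
part2 = pd.DataFrame({'a': ['b', 'b', 'c']}).astype({'a': 'category'})

# Categories visible from the first partition alone (what head(1) can see):
print(list(part1['a'].cat.categories))  # ['a', 'b']

# Union over all partitions (the full set that as_known() must compute):
union = part1['a'].cat.categories.union(part2['a'].cat.categories)
print(list(union))  # ['a', 'b', 'c']
```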
Thanks!