[ENH] Implement groupby.sample
#12882
Conversation
We needed groupby-sample for some work; this was the fastest implementation I could come up with that didn't involve writing specific C++. This is only a partial implementation of the full pandas API, and I am using the Python random module, since both cupy and numpy are horribly slow when taking samples: there is no ufunc/vectorised version of random.sample. Is this a reasonable approach? Or should I come up with some way of doing the sample all-at-once on device?

Trivial benchmark:
Bumping to 100_000 groups, pandas takes 4s, this approach takes 260ms.
To do so, obtain the group offsets and values (and hence index). Sample within each group, and then pull out rows from the original object. The fastest way to do this in Python is via the builtin random library, since neither numpy nor cupy offer a broadcasted/ufunc random.sample, and looping over the groups is very slow using either of them. Looping over the groups and using python random.sample is also slow, but less so.
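A rough host-side sketch of that idea (illustrative only, not the PR's actual code; it assumes a pandas-style frame, a stable argsort to group the rows, and that every group has at least n rows):

import random

import numpy as np
import pandas as pd


def groupby_sample_n(df, by, n):
    # Group rows together by argsorting the key column, compute each group's
    # offsets, draw n positions per group with Python's random.sample, and
    # gather the corresponding rows from the original frame.
    keys = df[by].to_numpy()
    order = np.argsort(keys, kind="stable")  # row indices, grouped by key
    sorted_keys = keys[order]
    boundaries = np.flatnonzero(sorted_keys[1:] != sorted_keys[:-1]) + 1
    offsets = np.concatenate(([0], boundaries, [len(keys)]))
    picks = []
    for start, end in zip(offsets[:-1], offsets[1:]):
        picks.extend(random.sample(range(start, end), n))
    return df.take(order[picks])


# e.g. two rows per group (toy data):
frame = pd.DataFrame({"g": [0, 1, 0, 1, 0, 1], "x": range(6)})
print(groupby_sample_n(frame, "g", n=2))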
Force-pushed from 1774fa7 to 008cfe1.
A while ago I prototyped a version of this: https://gist.github.com/shwina/ac4cdffe00ce341a65e5e27b78d0b2a0 (feedback welcome -- I'm not sure if I might have overlooked something)
To inline that code for discussion:

def groupby_sample(df, by, n=2):
    # Shuffle the whole dataframe, then keep the first n rows seen per group.
    df = df.sample(frac=1).reset_index(drop=True)  # randomise order of dataframe
    return df.loc[df.groupby(by)[by].rank("first") <= n, :]

With some futzing around, this works because, for the default case of sampling without replacement, a global shuffle followed by taking the first n rows of each group is equivalent to sampling within each group. As soon as we introduce keyword arguments into the mix (and to be fair, I didn't get round to implementing those in this PR), the commutator is non-zero, and so we need an alternate approach (e.g. groupby.sample(replace=True) samples with replacement within each group, whereas a global shuffle with replacement followed by a per-group selection does not do the same thing).
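A concrete (made-up) pandas illustration of that non-commuting replace=True case; this is not code from the PR, just a sketch of the point above:

import pandas as pd

df = pd.DataFrame({"a": [0, 0, 1, 1], "b": range(4)})

# Per-group sampling with replacement: always exactly 2 rows per value of "a".
per_group = df.groupby("a").sample(n=2, replace=True)

# Shuffling the whole frame with replacement first does not commute with the
# groupby: a given group may contribute anywhere from 0 to 4 of the resampled
# rows, so taking the "first 2 per group" afterwards is a different operation.
shuffled = df.sample(frac=1, replace=True).reset_index(drop=True)
first_two = shuffled.loc[shuffled.groupby("a")["a"].rank("first") <= 2, :]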
I agree -- perhaps we could use the
Benchmarks incoming. @bdice suggested a way to squash one more slow path, but I worry that it might suffer from unfair sampling:
So I haven't implemented this one. The general approach is ready to look at though.
@wence- If you can do a segmented argsort, I may have a solution for sampling without replacement (SWOR). Based on https://timvieira.github.io/blog/post/2019/09/16/algorithms-for-sampling-without-replacement/

edit: ooh, maybe not. The analytical probabilities of SWOR may be different here because the sampling should be with respect to each group.

edit 2: On second thought, maybe this is still legitimate? The probabilities don't depend on group size, and doing this for each segment individually should be equivalent to doing all segments together, so long as you use the segmented argsort correctly. Use a modified version of that trick:

import numpy as np

def swor_exp_segmented(total_samples, segment_indices):
    E = -np.log(np.random.uniform(0, 1, size=total_samples))
    # You can implement segmented_argsort using CUB. There are probably
    # primitives in cupy/cudf that can do this?
    return segmented_argsort(E, segment_indices)

This gives you a randomly sorted list of indices within each segment. Then you can use a gather to fetch the desired number of indices from each segment, and then use that result as a gather map of indices to fetch from the original groupings.
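For what it's worth, one way to realise the segmented_argsort assumed above is a lexsort keyed on (segment id, random key). A minimal cupy sketch, with made-up segment sizes; the helper is illustrative rather than an existing cudf/cupy primitive:

import numpy as np
import cupy as cp

def segmented_argsort(keys, segment_ids):
    # cp.lexsort uses the *last* key as the primary sort key, so this orders
    # rows by segment first and by the random keys within each segment.
    return cp.lexsort(cp.stack([keys, segment_ids]))

# Example: three segments of sizes 2, 3 and 1.
sizes = np.array([2, 3, 1])
segment_ids = cp.asarray(np.repeat(np.arange(len(sizes)), sizes))
E = -cp.log(cp.random.uniform(0.0, 1.0, size=int(sizes.sum())))
perm = segmented_argsort(E, segment_ids)  # randomly ordered within each segment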
Oh that's a cute trick!
In fact, you don’t even need the “log” bit. Assuming all the sample elements have equal weight, any random permutation of the group indices will be equally likely.
Yeah, so that post is about fast ways of sampling from the categorical distribution (where you have specified weights for each element of the group, and all elements are unique). That will be useful (since groupby.sample with weights is something that is in the pandas API, though I am punting on it in this PR), though here we have a slightly simpler problem: we have a group of some size from which we want a fixed number of samples without replacement, which we can get by randomly permuting the group's indices and taking a prefix. Indeed, this is exactly the implementation used in the PR for the without-replacement case.

I'll noodle through a few more ideas...
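A sketch of that "permute within the group, then take a prefix" step, assuming we already have the group offsets and a within-group permutation (e.g. from the segmented argsort above); the helper name is illustrative, and every group is assumed to contain at least k rows:

import numpy as np
import cupy as cp

def take_first_k(permuted_indices, group_offsets, k):
    # permuted_indices is randomly ordered within each group; group_offsets
    # (length num_groups + 1) delimits the groups. The first k entries of each
    # group are a uniform without-replacement sample, so gather exactly those.
    starts = np.asarray(group_offsets)[:-1]
    take = (starts[:, None] + np.arange(k)[None, :]).ravel()
    return cp.asarray(permuted_indices)[cp.asarray(take)]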
Also remove large memory footprint for sampling without replacement.
It’s not so bad! You’re only dealing with integer columns of indices for gathering, not full dataframes (all columns) as far as I can tell. Usually I’d say that adding 2-3 columns of integers with the same number of rows as the dataframe is an acceptable memory cost. It is hefty for single-column dataframes but not a big deal for 10 column dataframes, as a fraction of the dataframe memory. Are there copies of the full dataframe that I’m not considering?
Some notes.
Also speed up index masking and add pointers to implementation ideas for weighted sampling.
I did some basic benchmarking of this code.

import gc
import time

import cupy as cp
import pandas as pd

import cudf


def create_df(n, nunique):
    return cudf.DataFrame(
        {"a": cp.random.randint(0, nunique, size=n), "b": cp.arange(n)}
    )


def sample_n(df, n, replace):
    start = time.time()
    _ = df.groupby("a").sample(n=n, replace=replace)
    end = time.time()
    return end - start


def sample_frac(df, frac, replace):
    start = time.time()
    _ = df.groupby("a").sample(frac=frac, replace=replace)
    end = time.time()
    return end - start


def gather_results():
    data = []
    for n in [100_000, 1_000_000, 10_000_000, 100_000_000]:
        print(f"DataFrame size {n=}")
        nuniques = [10, 100, 1000, 10_000, 100_000, 1_000_000, 10_000_000, 100_000_000]
        nuniques = nuniques[: nuniques.index(n) - 1]
        for nunique in nuniques:
            df = create_df(n, nunique)
            for pandas in [False, True]:
                if pandas:
                    idf = df.to_pandas()
                else:
                    idf = df
                for replace in [False, True]:
                    for nsamp in [1, 2, 10, 50]:
                        frac = nunique * nsamp / n
                        t = sample_n(idf, nsamp, replace)
                        data.append(
                            (
                                n,
                                nunique,
                                nsamp,
                                frac,
                                replace,
                                False,
                                "pandas" if pandas else "cudf",
                                t,
                            )
                        )
                        gc.collect()
                        t = sample_frac(idf, frac, replace)
                        data.append(
                            (
                                n,
                                nunique,
                                nsamp,
                                frac,
                                replace,
                                True,
                                "pandas" if pandas else "cudf",
                                t,
                            )
                        )
                        gc.collect()
    return pd.DataFrame(
        data,
        columns=[
            "df_size",
            "unique_vals",
            "sample_size",
            "sample_frac",
            "replace",
            "use_frac",
            "backend",
            "time",
        ],
    )

Run on an A6000 (cudf) and an Intel Xeon Gold 6226R CPU (pandas). Parquet data for further analysis is attached (groupby-sample-benchmarks.zip), but I also upload some faceted plots (log scale on the timings axis).

tl;dr: For a dataframe with

With-replacement sampling is slower because there isn't a fast cupy way, AFAICT, to generate the random numbers, which takes about a second of the run. Without-replacement sampling could be faster if cupy had a batched/segmented shuffle. In contrast, if one does the same thing but only selects a single value from each group, then the cudf implementation takes between 0.25 and 0.5s.
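For the with-replacement case the random numbers needed are just k draws per group of an offset in [0, group_size). A hedged cupy sketch of building such a gather map; the function name and details are illustrative, not the PR's implementation:

import numpy as np
import cupy as cp

def with_replacement_gather_map(group_offsets, k):
    # One uniform draw per (group, sample), scaled by the group's size and
    # shifted by its start offset, then flattened into a single gather map.
    offsets = np.asarray(group_offsets)
    starts = cp.asarray(offsets[:-1])
    sizes = cp.asarray(np.diff(offsets))
    u = cp.random.uniform(0.0, 1.0, size=(len(offsets) - 1, k))
    return (starts[:, None] + (u * sizes[:, None]).astype(cp.int64)).ravel()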
groupby.sample
Fixed up the merge conflicts, so I think this is good for a final look.
Nice job! A few comments.
def _():
    return grouper.sample(**kwargs)

benchmark(_)
Does this work?
Suggested change:

-def _():
-    return grouper.sample(**kwargs)
-benchmark(_)
+benchmark(grouper.sample, **kwargs)
Oh, probably yes...
python/cudf/cudf/_lib/sort.pyx
if asc:
    c_column_order.push_back(order.ASCENDING)
else:
    c_column_order.push_back(order.DESCENDING)
Suggested change:

-if asc:
-    c_column_order.push_back(order.ASCENDING)
-else:
-    c_column_order.push_back(order.DESCENDING)
+c_column_order.push_back(order.ASCENDING if asc else order.DESCENDING)
I guess I was just copying the previous order_by implementation, but done.
Thanks, I think everything is addressed.
Force-pushed from 12ff2cd to 558541d.
/merge