[FEA] Groupby rolling standard deviation #8696
Labels: feature request, libcudf, Python
Today, I can calculate rolling mean, sum, and a variety of other aggregations. I can also calculate rolling mean and sum on a per group basis (groupby.rolling / grouped window). I'd like to also calculate the groupby rolling standard deviation.
To use the same example as in the related #8695 (rolling standard deviation), I might have a large set of sensor data. To make sure my sensors are behaving within their normal range, I'd like to measure the rolling standard deviation and post-process the results to alert me if any window has a standard deviation more than some threshold beyond an acceptable range. However, in this case, I have many sensors, so I need to measure the rolling standard deviation for each individual sensor. (In the example below, I'm assuming the data is pre-sorted in the relevant order, as the pandas groupby preserves the existing sort order by default.)
Spark differentiates between the sample and population standard deviation (stddev_samp vs stddev_pop), while pandas instead parameterizes its std function with an argument for degrees of freedom (ddof).
I'm filing this as a separate issue per a chat with @jrhemstad, since the implementations may be different enough to warrant discussing them separately.
Pandas:
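The original snippet did not survive on this page, so the following is only a minimal sketch of what the pandas side might look like. The column names (`sensor`, `reading`), the window size of 3, and the alert threshold are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sensor data, already sorted in the relevant order.
df = pd.DataFrame(
    {
        "sensor": ["a", "a", "a", "a", "b", "b", "b", "b"],
        "reading": [1.0, 2.0, 3.0, 5.0, 10.0, 10.0, 10.0, 13.0],
    }
)

# Rolling sample standard deviation per sensor (ddof=1 is the pandas default).
# The result is indexed by (sensor, original row label).
rolled = df.groupby("sensor")["reading"].rolling(3).std()

# Population standard deviation over the same windows, selected via ddof=0.
rolled_pop = df.groupby("sensor")["reading"].rolling(3).std(ddof=0)

# Post-processing step: keep only windows above a hypothetical threshold.
# Windows with fewer than 3 rows are NaN and therefore never flagged.
threshold = 1.5
alerts = rolled[rolled > threshold]
```

With an integer window, pandas requires a full window by default, so the first two rows of each sensor come back as NaN.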
Spark:
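The Spark snippet is also missing here. As a rough sketch only, the same computation could be expressed in Spark SQL with a window partitioned by sensor. The view name `readings`, the columns `sensor`, `ts`, and `value`, and the helper `rolling_std_by_sensor` are hypothetical names chosen for this example; the caller is assumed to supply an existing SparkSession.

```python
# Window of the current row plus the two preceding rows, per sensor.
# Table and column names are assumptions for illustration.
WINDOW = "PARTITION BY sensor ORDER BY ts ROWS BETWEEN 2 PRECEDING AND CURRENT ROW"

ROLLING_STD_SQL = f"""
SELECT sensor, ts, value,
       stddev_samp(value) OVER ({WINDOW}) AS rolling_std_samp,
       stddev_pop(value) OVER ({WINDOW}) AS rolling_std_pop
FROM readings
"""


def rolling_std_by_sensor(spark, rows):
    """Run the query on `rows`, a list of (sensor, ts, value) tuples.

    `spark` is an existing SparkSession supplied by the caller.
    """
    df = spark.createDataFrame(rows, ["sensor", "ts", "value"])
    df.createOrReplaceTempView("readings")
    return spark.sql(ROLLING_STD_SQL)
```

Given a local session, `rolling_std_by_sensor(spark, rows).show()` would display both columns side by side. Unlike the pandas default, this window frame also evaluates the leading rows of each sensor over a partial window, so the two outputs would not match row for row without extra handling.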