
[FEA] Rolling standard deviation #8695

Closed · beckernick opened this issue Jul 8, 2021 · 4 comments · Fixed by #9097
beckernick (Member) commented Jul 8, 2021

Today, I can calculate rolling average, sum, and a variety of other aggregations. I'd like to also calculate the rolling standard deviation.

As an example, I might have a large set of sensor data. To make sure my sensors are behaving within a normal range, I'd like to compute the rolling standard deviation and post-process the results to alert me whenever a window's standard deviation exceeds an acceptable threshold, as sketched below.
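
A rough illustration of that post-processing workflow in pandas (the data, window size, and threshold are hypothetical):

import pandas as pd

readings = pd.Series([10.1, 10.0, 10.2, 10.1, 14.9, 10.0])  # hypothetical sensor readings
rolling_std = readings.rolling(3).std()
THRESHOLD = 1.0  # hypothetical acceptable limit
alerts = rolling_std > THRESHOLD  # boolean mask of out-of-range windows
print(readings[alerts])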

Spark differentiates between the sample and population standard deviation (stddev_samp vs. stddev_pop), while pandas instead parameterizes its std function with a degrees-of-freedom argument (ddof).
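
The two conventions map onto each other: pandas' default of ddof=1 matches stddev_samp, and ddof=0 matches stddev_pop. A minimal sketch:

import pandas as pd

s = pd.Series([10, 3, 4, 2, -3, 9, 10])
print(s.rolling(3).std())        # ddof=1 (default): sample std, like stddev_samp
print(s.rolling(3).std(ddof=0))  # ddof=0: population std, like stddev_pop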

Pandas:

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql import functions as F
import pandas as pd

spark = SparkSession.builder \
    .master("local") \
    .getOrCreate()

df = pd.DataFrame({
    "a": [10,3,4,2,-3,9,10],
    "b": [10,23,-4,2,-3,9,19],
    "c": [10,-23,-4,21,-3,19,19],
})

print(df.a.rolling(3).std())
0         NaN
1         NaN
2    3.785939
3    1.000000
4    3.605551
5    6.027714
6    7.234178
Name: a, dtype: float64

Spark:

sdf = spark.createDataFrame(df)
sdf.createOrReplaceTempView("df")

sdf.withColumn(
    "std",
    F.stddev_samp("a").over(Window.rowsBetween(-2, 0))
).show()
+---+---+---+------------------+
|  a|  b|  c|               std|
+---+---+---+------------------+
| 10| 10| 10|              null|
|  3| 23|-23| 4.949747468305833|
|  4| -4| -4|3.7859388972001824|
|  2|  2| 21|               1.0|
| -3| -3| -3| 3.605551275463989|
|  9|  9| 19| 6.027713773341708|
| 10| 19| 19| 7.234178138070234|
+---+---+---+------------------+
@beckernick beckernick added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels Jul 8, 2021
@beckernick beckernick added this to the Time Series Analysis milestone Jul 14, 2021
@isVoid isVoid self-assigned this Jul 20, 2021
beckernick (Member, Author) commented Jul 20, 2021

Chatted with @isVoid offline to discuss this in the context of data types (decimal, datetime, and timedelta).

Datetime

Neither Spark nor pandas supports this operation on built-in datetime types.

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql import functions as F
import pandas as pd
import numpy as np

np.random.seed(12)

spark = SparkSession.builder \
    .master("local") \
    .getOrCreate()

nrows = 100
keycol = [0] * (nrows//2) + [1] * (nrows//2)

df = pd.DataFrame({
    "key": keycol,
    "a": np.random.randint(0, 100, nrows),
    "b": np.random.randint(0, 100, nrows),
    "c": np.random.randint(0, 100, nrows),
    "d": pd.date_range(start="2001-01-01", periods=nrows, freq="D"),
})
df["e"] = pd.to_timedelta(df.d.astype("int"))


# df.rolling(4).d.std().head(10) # NotImplementedError

sdf = spark.createDataFrame(df)
sdf.createOrReplaceTempView("df")

# sdf.withColumn(
#     "std",
#     F.stddev_samp("d").over(Window.rowsBetween(-2, 0))
# ).show(5) # AnalysisException

Decimal

Spark supports this operation on Decimal types. Pandas doesn't have a built-in decimal dtype, but the operation succeeds on an object column of Decimals.

new = sdf.withColumn("b_decimal", sdf.b.cast("Decimal"))
new.select(["b_decimal"]).withColumn(
    "std",
    F.stddev_samp("b_decimal").over(Window.rowsBetween(-2, 0))
).show(5)
+---------+------------------+
|b_decimal|               std|
+---------+------------------+
|       68|              null|
|       25|30.405591591021544|
|       44|21.548395145191982|
|       22|11.930353445448853|
|       69|23.515952032609693|
+---------+------------------+
only showing top 5 rows

from decimal import Decimal
s = pd.Series([Decimal("10.0"), Decimal("10.0"), Decimal("11.0")])
s.rolling(2).std()
0         NaN
1    0.000000
2    0.707107
dtype: float64

Timedelta

Pandas does not support this operation on the timedelta dtype, and I believe Spark does not have an analogous type to timedelta (please correct me if I'm wrong!).

df.e.rolling(2).std()
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/pandas/core/window/rolling.py in _apply_series(self, homogeneous_func, name)
    368             input = obj.values if name != "count" else notna(obj.values).astype(int)
--> 369             values = self._prep_values(input)
    370         except (TypeError, NotImplementedError) as err:

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/pandas/core/window/rolling.py in _prep_values(self, values)
    276         elif needs_i8_conversion(values.dtype):
--> 277             raise NotImplementedError(
    278                 f"ops for {self._window_type} for this "

NotImplementedError: ops for Rolling for this dtype timedelta64[ns] are not implemented

harrism (Member) commented Jul 20, 2021

@sameerz for Spark

revans2 (Contributor) commented Jul 27, 2021

From the Spark perspective, we really would like to be able to do stddev_samp and stddev_pop. I am neither a data scientist nor a statistician, so I don't know if there is a way for us to get stddev_samp, stddev_pop, and degrees of freedom from the same core aggregation. If there is, we are happy to use it, even if it requires some extra post-processing. Spark only supports stddev_samp and stddev_pop on double values; it will automatically convert many other types to doubles before doing the computation.
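
Both statistics can in fact come from one core aggregation: the sum of squared deviations from the mean (M2), divided by n - ddof. A minimal sketch of that relationship, in plain Python:

import math

def stddev(xs, ddof):
    # One core aggregation: the mean and the sum of squared deviations (M2).
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs)
    # Only the divisor differs: n - ddof.
    return math.sqrt(m2 / (n - ddof))

xs = [10, 3, 4]
print(stddev(xs, ddof=1))  # sample std (stddev_samp): 3.7859...
print(stddev(xs, ddof=0))  # population std (stddev_pop)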

Spark is trying to become more ANSI compliant and is adding some timedelta-like support, but it is not something that the RAPIDS plugin is working on right now. Spark does support a CalendarInterval type, which is a combination of month, day, and microsecond intervals, but it is mostly used for operations like adding 3 months and 2 days to a date column. You can have a column of CalendarIntervals, but it is not common.

revans2 (Contributor) commented Jul 27, 2021

OK, I looked at the math Spark uses to calculate stddev_pop vs. stddev_samp and at the ddof explanation in #8809, so it looks like this will work for us.

rapids-bot bot pushed a commit that referenced this issue Sep 8, 2021
Part 1 of #8695 

This PR adds support for `STD` and `VARIANCE` rolling aggregations in libcudf.
- Supported types include numeric and fixed-point types. Chrono types are not supported - see the thread in this issue.

Implementation notes:
- [Welford](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm)'s algorithm is used

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - David Wendt (https://github.com/davidwendt)

URL: #8809
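
For reference, a minimal sketch of Welford's online algorithm for variance, in plain Python (this illustrates the algorithm named in the commit, not libcudf's actual kernel):

def welford_variance(xs, ddof=1):
    # Single pass: maintain the count, the running mean, and M2
    # (the sum of squared deviations from the current mean).
    count, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # uses the updated mean
    return m2 / (count - ddof)

print(welford_variance([10, 3, 4]))  # 14.333..., whose square root is 3.7859...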
rapids-bot bot pushed a commit that referenced this issue Sep 15, 2021
…rolling.std` (#9097)

Closes #8695
Closes #8696 

This PR creates Python bindings for the variance and standard deviation rolling aggregations. Unlike pandas, the underlying libcudf implementation computes each window independently of the other windows.

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Sheilah Kirui (https://github.com/skirui-source)

URL: #9097
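
With these bindings merged, usage should mirror pandas; a sketch, assuming the cuDF API matches the pandas signature (including ddof):

import cudf

gs = cudf.Series([10, 3, 4, 2, -3, 9, 10])
print(gs.rolling(3).std())        # sample std, ddof=1 by default
print(gs.rolling(3).var(ddof=0))  # population variance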