-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wrap rolling sum updateBy operation #3399
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -105,7 +105,8 @@ def ema_tick_decay(time_scale_ticks: int, cols: Union[str, List[str]], | |
|
||
Args: | ||
time_scale_ticks (int): the decay rate in ticks | ||
cols (Union[str, List[str]]): the column(s) to be operated on, can be renaming expressions, i.e. "new_col = col" | ||
cols (Union[str, List[str]]): the column(s) to be operated on, can include expressions to rename the output, | ||
i.e. "new_col = col"; when empty, update_by perform the ema operation on all columns. | ||
op_control (OperationControl): defines how special cases should behave, when None, the default OperationControl | ||
settings as specified in :meth:`~OperationControl.__init__` will be used | ||
|
||
|
@@ -140,7 +141,8 @@ def ema_time_decay(ts_col: str, time_scale: Union[int, str], cols: Union[str, Li | |
|
||
time_scale (Union[int, str]): the decay rate, can be expressed as an integer in nanoseconds or a time | ||
interval string, e.g. "00:00:00.001" | ||
cols (Union[str, List[str]]): the column(s) to be operated on, can be renaming expressions, i.e. "new_col = col" | ||
cols (Union[str, List[str]]): the column(s) to be operated on, can include expressions to rename the output, | ||
i.e. "new_col = col"; when empty, update_by perform the ema operation on all columns. | ||
op_control (OperationControl): defines how special cases should behave, when None, the default OperationControl | ||
settings as specified in :meth:`~OperationControl.__init__` will be used | ||
|
||
|
@@ -151,8 +153,7 @@ def ema_time_decay(ts_col: str, time_scale: Union[int, str], cols: Union[str, Li | |
DHError | ||
""" | ||
try: | ||
if isinstance(time_scale, str): | ||
time_scale = _JDateTimeUtils.expressionToNanos(time_scale) | ||
time_scale = _JDateTimeUtils.expressionToNanos(time_scale) if isinstance(time_scale, str) else time_scale | ||
cols = to_sequence(cols) | ||
if op_control is None: | ||
return UpdateByOperation(j_updateby_op=_JUpdateByOperation.Ema(ts_col, time_scale, *cols)) | ||
|
@@ -267,3 +268,86 @@ def forward_fill(cols: Union[str, List[str]]) -> UpdateByOperation: | |
return UpdateByOperation(j_updateby_op=_JUpdateByOperation.Fill(cols)) | ||
except Exception as e: | ||
raise DHError(e, "failed to create a forward fill UpdateByOperation.") from e | ||
|
||
|
||
def rolling_sum_tick(cols: Union[str, List[str]], rev_ticks: int, fwd_ticks: int = 0) -> UpdateByOperation: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jmao-denver thoughts on the argument ordering vs other methods in this module? e.g. did we do There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The parameter ordering of ema_tick_decay/ema_time_decay matches the Java ones more closely, which is nice. But it is not something we can do with rolling sums. I have no objection to making them consistent if you feel strongly about it. Even though it is a breaking change, I doubt we'll affect users so much as our own docs/examples. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I also would prefer a closer match with Java/Groovy for |
||
"""Creates a rolling sum UpdateByOperation for the supplied column names, using ticks as the windowing unit. Ticks | ||
are row counts, and you may specify the reverse and forward window in number of rows to include. The current row | ||
is considered to belong to the reverse window but not the forward window. Also, negative values are allowed and | ||
can be used to generate completely forward or completely reverse windows. | ||
|
||
Here are some examples of window values: | ||
rev_ticks = 1, fwd_ticks = 0 - contains only the current row | ||
rev_ticks = 10, fwd_ticks = 0 - contains 9 previous rows and the current row | ||
rev_ticks = 0, fwd_ticks = 10 - contains the following 10 rows, excludes the current row | ||
rev_ticks = 10, fwd_ticks = 10 - contains the previous 9 rows, the current row and the 10 rows following | ||
rev_ticks = 10, fwd_ticks = -5 - contains 5 rows, beginning at 9 rows before, ending at 5 rows before the | ||
current row (inclusive) | ||
rev_ticks = 11, fwd_ticks = -1 - contains 10 rows, beginning at 10 rows before, ending at 1 row before the | ||
current row (inclusive) | ||
rev_ticks = -5, fwd_ticks = 10 - contains 5 rows, beginning 5 rows following, ending at 10 rows following the | ||
current row (inclusive) | ||
|
||
|
||
Args: | ||
cols (Union[str, List[str]]): the column(s) to be operated on, can include expressions to rename the output, | ||
i.e. "new_col = col"; when empty, update_by perform the rolling sum operation on all columns. | ||
rev_ticks (int): the look-behind window size (in rows/ticks) | ||
fwd_ticks (int): the look-forward window size (int rows/ticks), default is 0 | ||
|
||
Returns: | ||
an UpdateByOperation | ||
|
||
Raises: | ||
DHError | ||
""" | ||
try: | ||
cols = to_sequence(cols) | ||
return UpdateByOperation(j_updateby_op=_JUpdateByOperation.RollingSum(rev_ticks, fwd_ticks, *cols)) | ||
except Exception as e: | ||
raise DHError(e, "failed to create a rolling sum (tick) UpdateByOperation.") from e | ||
|
||
|
||
def rolling_sum_time(ts_col: str, cols: Union[str, List[str]], rev_time: Union[int, str], | ||
jmao-denver marked this conversation as resolved.
Show resolved
Hide resolved
|
||
fwd_time: Union[int, str] = 0) -> UpdateByOperation: | ||
"""Creates a rolling sum UpdateByOperation for the supplied column names, using time as the windowing unit. This | ||
function accepts nanoseconds or time strings as the reverse and forward window parameters. Negative values are | ||
allowed and can be used to generate completely forward or completely reverse windows. A row containing a null in | ||
the timestamp column belongs to no window and will not have a value computed or be considered in the windows of | ||
other rows. | ||
|
||
Here are some examples of window values: | ||
rev_time = 0, fwd_time = 0 - contains rows that exactly match the current row timestamp | ||
rev_time = "00:10:00", fwd_time = "0" - contains rows from 10m before through the current row timestamp ( | ||
inclusive) | ||
rev_time = 0, fwd_time = 600_000_000_000 - contains rows from the current row through 10m following the | ||
current row timestamp (inclusive) | ||
rev_time = "00:10:00", fwd_time = "00:10:00" - contains rows from 10m before through 10m following | ||
the current row timestamp (inclusive) | ||
rev_time = "00:10:00", fwd_time = "-00:05:00" - contains rows from 10m before through 5m before the | ||
current row timestamp (inclusive), this is a purely backwards looking window | ||
rev_time = "-00:05:00", fwd_time = "00:10:00"} - contains rows from 5m following through 10m | ||
following the current row timestamp (inclusive), this is a purely forwards looking window | ||
|
||
Args: | ||
ts_col (str): | ||
cols (Union[str, List[str]]): the column(s) to be operated on, can include expressions to rename the output, | ||
i.e. "new_col = col"; when empty, update_by perform the rolling sum operation on all columns. | ||
rev_time (int): the look-behind window size, can be expressed as an integer in nanoseconds or a time | ||
interval string, e.g. "00:00:00.001" | ||
fwd_time (int): the look-ahead window size, can be expressed as an integer in nanoseconds or a time | ||
interval string, e.g. "00:00:00.001", default is 0 | ||
|
||
Returns: | ||
an UpdateByOperation | ||
|
||
Raises: | ||
DHError | ||
""" | ||
try: | ||
cols = to_sequence(cols) | ||
rev_time = _JDateTimeUtils.expressionToNanos(rev_time) if isinstance(rev_time, str) else rev_time | ||
jmao-denver marked this conversation as resolved.
Show resolved
Hide resolved
|
||
fwd_time = _JDateTimeUtils.expressionToNanos(fwd_time) if isinstance(fwd_time, str) else fwd_time | ||
return UpdateByOperation(j_updateby_op=_JUpdateByOperation.RollingSum(ts_col, rev_time, fwd_time, *cols)) | ||
except Exception as e: | ||
raise DHError(e, "failed to create a rolling sum (time) UpdateByOperation.") from e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing this made me wonder if we made an API mistake. For the rolling sum, ema, etc, we have separate methods for different flavors (e.g.
ema_tick_decay
andema_time_decay
). It seems like it might be more pythonic to haveema
androlling_sum
, where the flavor is determined by the inputs. I'm not sure how strong my thoughts are, so this is more of a discussion item.@jmao-denver @jjbrosnan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see the benefits of folding them into one function, but I prefer to have separate functions (especially when the parameter sets are quite different), as it is more explicit, cleaner, less chance for users to make mistake as far as arguments are concerned, also easier for us to document/maintain the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind the explicitness of the
ticks
/time
signatures but I have near-zero python experience in a professional role.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the parameter sets need to be that different. If these were combined into one function, you could have the first argument
ts_col
be optional with a default value ofNone
. The rest of the arguments could look like they do in the time-based rolling sum functions, where each argument must be an int or string. In a tick case, you give ints. In the time case, if you give an int, it assumes seconds, and if you give a string, it can be whatever.Then, the try/except block at the end can be separated into two parts, which depend on whether or not a timestamp column is given.
I think the benefits of combining each into one function outweigh the drawbacks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Essentially, for the rolling sum, it would be (without comments):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jjbrosnan some extra testing in the
ticks
case will be needed. Time strings need to be rejected as they make no sense for row-based computations. You will get an exception from the call, so as long as the error is clear this looks good.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is doable (we'll need to add checks for invalid argument combinations) but to me the extra cost and the potential cognitive burden on a user to understand the difference outweighs the gains.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also having ts_col as optional pretty much forces the users to use keyword arguments in the tick case, which may not be a bad thing but then it is not consistent with the time case.