Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resampling MVP #1495

Merged
merged 34 commits into from
May 30, 2024
Merged
Changes from 1 commit
Commits
Show all changes
34 commits
Select commit Hold shift + click to select a range
b256370
Enhancement 1010: Resampling MVP
alexowens90 Apr 3, 2024
2e67393
Revert to C++17
alexowens90 May 9, 2024
e752b65
Comment changes
alexowens90 May 10, 2024
03df99c
Revert change to lmdb_version_store_tiny_segment
alexowens90 May 10, 2024
6c3739c
Remove check that input has initial expected get calls in split_by_ro…
alexowens90 May 10, 2024
519e56c
Use Bucket class in aggregation as well
alexowens90 May 10, 2024
37d9810
Remove Pandas date_range timing/logging, and modified some formatting
alexowens90 May 10, 2024
8ac1c3b
Move all sorted aggregation stuff to own files
alexowens90 May 10, 2024
cac141e
Renaming refactor
alexowens90 May 13, 2024
28404a8
Renaming refactor
alexowens90 May 13, 2024
6464068
Remove unused function
alexowens90 May 13, 2024
9e79d0f
Remove summing timestamps from supported aggregations
alexowens90 May 13, 2024
76e0baf
Started refactoring aggregation
alexowens90 May 13, 2024
b4e6aa7
Added push_to_aggregator method
alexowens90 May 13, 2024
39ab291
USe push_to_aggregator in the other relevant place
alexowens90 May 14, 2024
91080f6
Fixed test_resampling_unsupported_aggregation_type_combos
alexowens90 May 14, 2024
d2369eb
Factor out finalize_aggregator
alexowens90 May 14, 2024
32547a2
Presize output index column in blocks, and trim unused blocks at the end
alexowens90 May 14, 2024
868ffa0
USe constexpr where possible
alexowens90 May 14, 2024
108867a
Reinstate all tests, reorder source files
alexowens90 May 14, 2024
91e02f4
Comment changes
alexowens90 May 14, 2024
de3bd8e
Use ColumnDataIterator in copy_frame_data_to_buffer
alexowens90 May 14, 2024
7f1d178
Revert accidentally committed change to task scheduler
alexowens90 May 14, 2024
e04b124
Comment updates
alexowens90 May 14, 2024
bb3639c
Move profile_resample.py out of tests directory
alexowens90 May 14, 2024
d9c0506
Resample docstring
alexowens90 May 14, 2024
1d5ab87
Fix mac build?
alexowens90 May 14, 2024
13f598d
Fix tests
alexowens90 May 15, 2024
b7c7a1d
Make resamply.py in ASV benchmarks dir
alexowens90 May 15, 2024
05b92ca
Dummy commit
alexowens90 May 16, 2024
4ecf145
Resampling ASV benchmarks
alexowens90 May 16, 2024
0264dc4
Update benchmarks.json file too
alexowens90 May 16, 2024
cf6772f
Remove ASV features added in 0.6.0
alexowens90 May 17, 2024
c2d994c
Address review comments
alexowens90 May 30, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Resample docstring
alexowens90 committed May 14, 2024
commit d9c0506a74b08846d8bc4f3a754e1e8d4559015a
113 changes: 112 additions & 1 deletion python/arcticdb/version_store/processing.py
Original file line number Diff line number Diff line change
@@ -540,7 +540,7 @@ def groupby(self, name: str):
return self

def agg(self, aggregations: Dict[str, Union[str, Tuple[str, str]]]):
# Only makes sense if previous stage is a group-by
# Only makes sense if previous stage is a group-by or resample
check(
len(self.clauses) and isinstance(self.clauses[-1], (_GroupByClause, _ResampleClauseLeftClosed, _ResampleClauseRightClosed)),
f"Aggregation only makes sense after groupby or resample",
@@ -571,6 +571,117 @@ def resample(
closed: Optional[str] = None,
label: Optional[str] = None,
):
"""
Resample symbol on the index. Symbol must be datetime indexed. Resample operations must be followed by an
aggregation operator. Currently, the following 7 aggregation operators are supported:
* "mean" - compute the mean of the group
* "sum" - compute the sum of the group
* "min" - compute the min of the group
* "max" - compute the max of the group
* "count" - compute the count of group
* "first" - compute the first value in the group
* "last" - compute the last value in the group
Note that not all aggregators are supported with all column types.:
* Numeric columns - support all aggregators
* Bool columns - support all aggregators
* String columns - support count, first, and last aggregators
* Datetime columns - support all aggregators EXCEPT sum
Note that time-buckets which contain no index values in the symbol will NOT be included in the returned
DataFrame. This is not the same as Pandas default behaviour.
Resampling is currently not supported with:
* Dynamic schema where an aggregation column is missing from one or more of the row-slices.
* Sparse data.

Parameters
----------
rule: Union[`str`, 'pd.DataOffset']
The frequency at which to resample the data. Supported rule strings are ns, us, ms, s, min, h, and D, and
multiples/combinations of these, such as 1h30min. pd.DataOffset objects representing frequencies from this
set are also accepted.
closed: Optional['str'], default=None
Which boundary of each time-bucket is closed. Must be one of 'left' or 'right'. If not provided, the default
is left for all currently supported frequencies.
label: Optional['str'], default=None
Which boundary of each time-bucket is used as the index value in the returned DataFrame. Must be one of
'left' or 'right'. If not provided, the default is left for all currently supported frequencies.

Returns
-------
QueryBuilder
Modified QueryBuilder object.

Raises
-------
ArcticDbNotYetImplemented
A frequency string or Pandas DateOffset object are provided to the rule argument outside the supported
frequencies listed above.
ArcticNativeException
The closed or label arguments are not one of "left" or "right"
SchemaException
Raised on call to read if:
* If the aggregation specified is not compatible with the type of the column being aggregated as
specified above.
* The library has dynamic schema enabled, and at least one of the columns being aggregated is missing
from at least one row-slice.
* At least one of the columns being aggregated contains sparse data.

Examples
--------
Resample two hours worth of minutely data down to hourly data, summing the column 'to_sum':
>>> df = pd.DataFrame(
{
"to_sum": np.arange(120),
},
index=pd.date_range("2024-01-01", freq="min", periods=120),
)
>>> q = adb.QueryBuilder()
>>> q = q.resample("h").agg({"to_sum": "sum"})
>>> lib.write("symbol", df)
>>> lib.read("symbol", query_builder=q).data
to_sum
2024-01-01 00:00:00 1770
2024-01-01 01:00:00 5370

As above, but specifying that the closed boundary of each time-bucket is the right hand side, and also to label
the output by the right boundary.
>>> q = adb.QueryBuilder()
>>> q = q.resample("h", closed="right", label="right").agg({"to_sum": "sum"})
>>> lib.read("symbol", query_builder=q).data
to_sum
2024-01-01 00:00:00 0
2024-01-01 01:00:00 1830
2024-01-01 02:00:00 5310

Nones, NaNs, and NaTs are omitted from aggregations
>>> df = pd.DataFrame(
{
"to_mean": [1.0, np.nan, 2.0],
},
index=pd.date_range("2024-01-01", freq="min", periods=3),
)
>>> q = adb.QueryBuilder()
>>> q = q.resample("h").agg({"to_mean": "mean"})
>>> lib.write("symbol", df)
>>> lib.read("symbol", query_builder=q).data
to_mean
2024-01-01 00:00:00 1.5

Output column names can be controlled through the format of the dict passed to agg
>>> df = pd.DataFrame(
{
"agg_1": [1, 2, 3, 4, 5],
"agg_2": [1.0, 2.0, 3.0, np.nan, 5.0],
},
index=pd.date_range("2024-01-01", freq="min", periods=5),
)
>>> q = adb.QueryBuilder()
>>> q = q.resample("h")
>>> q = q.agg({"agg_1_min": ("agg_1", "min"), "agg_1_max": ("agg_1", "max"), "agg_2": "mean"})
>>> lib.write("symbol", df)
>>> lib.read("symbol", query_builder=q).data
agg_1_min agg_1_max agg_2
2024-01-01 00:00:00 1 5 2.75
"""
check(not len(self.clauses), "resample only supported as first clause in the pipeline")
rule = rule.freqstr if isinstance(rule, pd.DateOffset) else rule
# We use floor and ceiling later to round user-provided date ranges and or start/end index values of the symbol

Unchanged files with check annotations Beta

sorted_ranges_and_keys.emplace_back(row_range, col_range, key);
}
auto ranges_and_keys = sorted_ranges_and_keys;
std::random_shuffle(ranges_and_keys.begin(), ranges_and_keys.end());

Check failure on line 57 in cpp/arcticdb/processing/test/rapidcheck_resample.cpp

GitHub Actions / Windows C++ Tests / compile (windows, windows-cl, win_amd64, C:/cpp_build, C:/vcpkg_packages, *.pdb, *.lib, *.ilk, *....

'random_shuffle': is not a member of 'std'

Check failure on line 57 in cpp/arcticdb/processing/test/rapidcheck_resample.cpp

GitHub Actions / Windows C++ Tests / compile (windows, windows-cl, win_amd64, C:/cpp_build, C:/vcpkg_packages, *.pdb, *.lib, *.ilk, *....

'random_shuffle': identifier not found
// Create vector of bucket boundary pairs, inclusive at both ends
// bucket_id will be used to refer to the index in these vectors of a specific bucket