From 335862bf5bc8433d5d0896adf8210a13c5a21ab3 Mon Sep 17 00:00:00 2001
From: Alessandro Bellina
Date: Mon, 13 Dec 2021 15:08:39 -0600
Subject: [PATCH 1/4] Fix fallback to sort aggregation for grouping only hash aggregate (#9891)

The following fixes what looks like an unintended fallback to sort-based aggregation, introduced in https://github.com/rapidsai/cudf/pull/9545, for the grouping-only (no aggregation requests) case. In that PR, the `std::all_of` function is used to determine whether the aggregation requests are for struct types. However, when there are no aggregation requests, `std::all_of` returns true (it is vacuously true for an empty range), causing a fallback to the sort aggregation (relevant code: https://github.com/rapidsai/cudf/pull/9545/files#diff-e409f72ddc11ad10fa0099e21b409b92f12bfac8ba1817266696c34a620aa081R645-R650). A standalone sketch of this empty-range behavior follows the benchmark results below.

I added a benchmark `group_no_requests_benchmark.cu` by mostly copying `group_sum_benchmark.cu`, but I changed one critical part: the `groupby` object is re-created on every iteration of the benchmark loop:

```
for (auto _ : state) {
  cuda_event_timer timer(state, true);
  cudf::groupby::groupby gb_obj(cudf::table_view({keys}));
  auto result = gb_obj.aggregate(requests);
}
```

This shows what happens in the scenario where the `groupby` instance is created each time an aggregate is issued, which re-creates the `helper` each time for the sorted case. If the `groupby` object is not re-created each time, the difference in performance between the before/after cases is negligible. We never recycle a `groupby` instance when using the groupby API from Spark.

Posting this as draft for feedback, as I am not sure whether I handled the benchmark part correctly.

This was executed on a T4 GPU. Before the patch:

```
Groupby/BasicNoRequest/10000/manual_time               0.158 ms        0.184 ms         4420
Groupby/BasicNoRequest/1000000/manual_time              1.72 ms         1.74 ms          408
Groupby/BasicNoRequest/10000000/manual_time             18.9 ms         18.9 ms           37
Groupby/BasicNoRequest/100000000/manual_time             198 ms          198 ms            3
```
Full output

```
2021-12-12T13:41:08+00:00
Running ./GROUPBY_BENCH
Run on (64 X 2801.89 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x32)
  L1 Instruction 32 KiB (x32)
  L2 Unified 1024 KiB (x32)
  L3 Unified 22528 KiB (x2)
Load Average: 1.01, 0.78, 0.42
--------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations
--------------------------------------------------------------------------------------------
Groupby/Basic/10000/manual_time                        0.117 ms        0.143 ms         5906
Groupby/Basic/1000000/manual_time                      0.524 ms        0.542 ms         1352
Groupby/Basic/10000000/manual_time                      4.41 ms         4.43 ms          159
Groupby/Basic/100000000/manual_time                     50.1 ms         50.1 ms           13
Groupby/PreSorted/1000000/manual_time                  0.332 ms        0.350 ms         2118
Groupby/PreSorted/10000000/manual_time                  2.22 ms         2.23 ms          315
Groupby/PreSorted/100000000/manual_time                 22.2 ms         22.2 ms           30
Groupby/PreSortedNth/1000000/manual_time               0.160 ms        0.188 ms         4381
Groupby/PreSortedNth/10000000/manual_time              0.890 ms        0.917 ms          774
Groupby/PreSortedNth/100000000/manual_time              8.43 ms         8.46 ms           68
Groupby/Shift/1000000/manual_time                      0.764 ms        0.785 ms          902
Groupby/Shift/10000000/manual_time                      9.51 ms         9.53 ms           63
Groupby/Shift/100000000/manual_time                      145 ms          145 ms            4
Groupby/Aggregation/10000/manual_time                   1.56 ms         1.58 ms          442
Groupby/Aggregation/16384/manual_time                   1.59 ms         1.62 ms          435
Groupby/Aggregation/65536/manual_time                   1.73 ms         1.76 ms          404
Groupby/Aggregation/262144/manual_time                  2.95 ms         2.98 ms          237
Groupby/Aggregation/1048576/manual_time                 9.20 ms         9.23 ms           73
Groupby/Aggregation/4194304/manual_time                 36.3 ms         36.3 ms           19
Groupby/Aggregation/10000000/manual_time                92.0 ms         92.1 ms            7
Groupby/Scan/10000/manual_time                          1.56 ms         1.58 ms          447
Groupby/Scan/16384/manual_time                          1.62 ms         1.65 ms          429
Groupby/Scan/65536/manual_time                          1.85 ms         1.88 ms          378
Groupby/Scan/262144/manual_time                         3.54 ms         3.56 ms          197
Groupby/Scan/1048576/manual_time                        12.0 ms         12.0 ms           57
Groupby/Scan/4194304/manual_time                        48.6 ms         48.6 ms           14
Groupby/Scan/10000000/manual_time                        126 ms          126 ms            4
Groupby/BasicNoRequest/10000/manual_time               0.158 ms        0.184 ms         4420
Groupby/BasicNoRequest/1000000/manual_time              1.72 ms         1.74 ms          408
Groupby/BasicNoRequest/10000000/manual_time             18.9 ms         18.9 ms           37
Groupby/BasicNoRequest/100000000/manual_time             198 ms          198 ms            3
Groupby/PreSortedNoRequests/1000000/manual_time        0.194 ms        0.214 ms         3624
Groupby/PreSortedNoRequests/10000000/manual_time        1.25 ms         1.27 ms          571
Groupby/PreSortedNoRequests/100000000/manual_time       12.6 ms         12.7 ms           50
```

After the patch:

```
Groupby/BasicNoRequest/10000/manual_time               0.058 ms        0.085 ms        11991
Groupby/BasicNoRequest/1000000/manual_time             0.282 ms        0.301 ms         2478
Groupby/BasicNoRequest/10000000/manual_time             2.42 ms         2.44 ms          291
Groupby/BasicNoRequest/100000000/manual_time            29.2 ms         29.2 ms           21
```
Full output

```
2021-12-12T13:37:50+00:00
Running ./GROUPBY_BENCH
Run on (64 X 2654.22 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x32)
  L1 Instruction 32 KiB (x32)
  L2 Unified 1024 KiB (x32)
  L3 Unified 22528 KiB (x2)
Load Average: 0.64, 0.50, 0.26
--------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations
--------------------------------------------------------------------------------------------
Groupby/Basic/10000/manual_time                        0.116 ms        0.142 ms         5918
Groupby/Basic/1000000/manual_time                      0.523 ms        0.542 ms         1374
Groupby/Basic/10000000/manual_time                      4.37 ms         4.39 ms          162
Groupby/Basic/100000000/manual_time                     51.4 ms         51.5 ms           10
Groupby/PreSorted/1000000/manual_time                  0.331 ms        0.350 ms         2121
Groupby/PreSorted/10000000/manual_time                  2.21 ms         2.23 ms          316
Groupby/PreSorted/100000000/manual_time                 22.2 ms         22.2 ms           27
Groupby/PreSortedNth/1000000/manual_time               0.160 ms        0.188 ms         4384
Groupby/PreSortedNth/10000000/manual_time              0.888 ms        0.915 ms          775
Groupby/PreSortedNth/100000000/manual_time              8.36 ms         8.39 ms           70
Groupby/Shift/1000000/manual_time                      0.764 ms        0.785 ms          904
Groupby/Shift/10000000/manual_time                      9.50 ms         9.52 ms           63
Groupby/Shift/100000000/manual_time                      146 ms          146 ms            4
Groupby/Aggregation/10000/manual_time                   1.53 ms         1.55 ms          446
Groupby/Aggregation/16384/manual_time                   1.58 ms         1.61 ms          437
Groupby/Aggregation/65536/manual_time                   1.72 ms         1.75 ms          405
Groupby/Aggregation/262144/manual_time                  2.93 ms         2.96 ms          236
Groupby/Aggregation/1048576/manual_time                 9.18 ms         9.21 ms           74
Groupby/Aggregation/4194304/manual_time                 36.2 ms         36.3 ms           19
Groupby/Aggregation/10000000/manual_time                91.5 ms         91.6 ms            7
Groupby/Scan/10000/manual_time                          1.55 ms         1.57 ms          452
Groupby/Scan/16384/manual_time                          1.60 ms         1.62 ms          434
Groupby/Scan/65536/manual_time                          1.84 ms         1.87 ms          379
Groupby/Scan/262144/manual_time                         3.54 ms         3.56 ms          198
Groupby/Scan/1048576/manual_time                        12.0 ms         12.0 ms           57
Groupby/Scan/4194304/manual_time                        48.4 ms         48.4 ms           14
Groupby/Scan/10000000/manual_time                        125 ms          125 ms            4
Groupby/BasicNoRequest/10000/manual_time               0.058 ms        0.085 ms        11991
Groupby/BasicNoRequest/1000000/manual_time             0.282 ms        0.301 ms         2478
Groupby/BasicNoRequest/10000000/manual_time             2.42 ms         2.44 ms          291
Groupby/BasicNoRequest/100000000/manual_time            29.2 ms         29.2 ms           21
Groupby/PreSortedNoRequests/1000000/manual_time        0.195 ms        0.215 ms         3604
Groupby/PreSortedNoRequests/10000000/manual_time        1.25 ms         1.27 ms          575
Groupby/PreSortedNoRequests/100000000/manual_time       12.7 ms         12.8 ms           50
```

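As a side note, the empty-range behavior at the root of this bug is standard C++: `std::all_of` is vacuously true when the range is empty, so a request list with no aggregations passed the struct check and forced the sort path. A minimal standalone sketch of that behavior (illustrative only, not part of this patch):

```
#include <algorithm>
#include <iostream>
#include <vector>

int main()
{
  // Stands in for an empty list of aggregation requests.
  std::vector<int> requests;

  // With an empty range there is no element for which the predicate can
  // fail, so std::all_of returns true regardless of the predicate.
  bool const all_match =
    std::all_of(requests.begin(), requests.end(), [](int) { return false; });

  std::cout << std::boolalpha << all_match << std::endl;  // prints "true"
  return 0;
}
```
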
Authors: - Alessandro Bellina (https://github.com/abellina) Approvers: - Jake Hemstad (https://github.com/jrhemstad) - Nghia Truong (https://github.com/ttnghia) - Conor Hoekstra (https://github.com/codereport) URL: https://github.com/rapidsai/cudf/pull/9891 --- cpp/benchmarks/CMakeLists.txt | 8 +- .../groupby/group_no_requests_benchmark.cu | 115 ++++++++++++++++++ cpp/src/groupby/hash/groupby.cu | 27 ++-- 3 files changed, 131 insertions(+), 19 deletions(-) create mode 100644 cpp/benchmarks/groupby/group_no_requests_benchmark.cu diff --git a/cpp/benchmarks/CMakeLists.txt b/cpp/benchmarks/CMakeLists.txt index 71b452a1004..63f6857ee08 100644 --- a/cpp/benchmarks/CMakeLists.txt +++ b/cpp/benchmarks/CMakeLists.txt @@ -171,8 +171,12 @@ ConfigureBench(FILL_BENCH filling/repeat_benchmark.cpp) # ################################################################################################## # * groupby benchmark ----------------------------------------------------------------------------- ConfigureBench( - GROUPBY_BENCH groupby/group_sum_benchmark.cu groupby/group_nth_benchmark.cu - groupby/group_shift_benchmark.cu groupby/group_struct_benchmark.cu + GROUPBY_BENCH + groupby/group_sum_benchmark.cu + groupby/group_nth_benchmark.cu + groupby/group_shift_benchmark.cu + groupby/group_struct_benchmark.cu + groupby/group_no_requests_benchmark.cu ) # ################################################################################################## diff --git a/cpp/benchmarks/groupby/group_no_requests_benchmark.cu b/cpp/benchmarks/groupby/group_no_requests_benchmark.cu new file mode 100644 index 00000000000..7dbe1888cee --- /dev/null +++ b/cpp/benchmarks/groupby/group_no_requests_benchmark.cu @@ -0,0 +1,115 @@ +/* + * Copyright (c) 2021, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +class Groupby : public cudf::benchmark { +}; + +// TODO: put it in a struct so `uniform` can be remade with different min, max +template +T random_int(T min, T max) +{ + static unsigned seed = 13377331; + static std::mt19937 engine{seed}; + static std::uniform_int_distribution uniform{min, max}; + + return uniform(engine); +} + +void BM_basic_no_requests(benchmark::State& state) +{ + using wrapper = cudf::test::fixed_width_column_wrapper; + + const cudf::size_type column_size{(cudf::size_type)state.range(0)}; + + auto data_it = cudf::detail::make_counting_transform_iterator( + 0, [=](cudf::size_type row) { return random_int(0, 100); }); + + wrapper keys(data_it, data_it + column_size); + wrapper vals(data_it, data_it + column_size); + + std::vector requests; + + for (auto _ : state) { + cuda_event_timer timer(state, true); + cudf::groupby::groupby gb_obj(cudf::table_view({keys})); + auto result = gb_obj.aggregate(requests); + } +} + +BENCHMARK_DEFINE_F(Groupby, BasicNoRequest)(::benchmark::State& state) +{ + BM_basic_no_requests(state); +} + +BENCHMARK_REGISTER_F(Groupby, BasicNoRequest) + ->UseManualTime() + ->Unit(benchmark::kMillisecond) + ->Arg(10000) + ->Arg(1000000) + ->Arg(10000000) + ->Arg(100000000); + +void BM_pre_sorted_no_requests(benchmark::State& state) +{ + using wrapper = cudf::test::fixed_width_column_wrapper; + + const cudf::size_type column_size{(cudf::size_type)state.range(0)}; + + auto data_it = cudf::detail::make_counting_transform_iterator( + 0, [=](cudf::size_type row) { return random_int(0, 100); }); + auto valid_it = cudf::detail::make_counting_transform_iterator( + 0, [=](cudf::size_type row) { return random_int(0, 100) < 90; }); + + wrapper keys(data_it, data_it + column_size); + wrapper vals(data_it, data_it + column_size, valid_it); + + auto keys_table = cudf::table_view({keys}); + auto sort_order = cudf::sorted_order(keys_table); + auto sorted_keys = cudf::gather(keys_table, *sort_order); + // No need to sort values using sort_order because they were generated randomly + + std::vector requests; + + for (auto _ : state) { + cuda_event_timer timer(state, true); + cudf::groupby::groupby gb_obj(*sorted_keys, cudf::null_policy::EXCLUDE, cudf::sorted::YES); + auto result = gb_obj.aggregate(requests); + } +} + +BENCHMARK_DEFINE_F(Groupby, PreSortedNoRequests)(::benchmark::State& state) +{ + BM_pre_sorted_no_requests(state); +} + +BENCHMARK_REGISTER_F(Groupby, PreSortedNoRequests) + ->UseManualTime() + ->Unit(benchmark::kMillisecond) + ->Arg(1000000) + ->Arg(10000000) + ->Arg(100000000); diff --git a/cpp/src/groupby/hash/groupby.cu b/cpp/src/groupby/hash/groupby.cu index 1474cfd8a19..4f2cb4de14b 100644 --- a/cpp/src/groupby/hash/groupby.cu +++ b/cpp/src/groupby/hash/groupby.cu @@ -636,23 +636,16 @@ std::unique_ptr groupby(table_view const& keys, */ bool can_use_hash_groupby(table_view const& keys, host_span requests) { - auto const all_hash_aggregations = - std::all_of(requests.begin(), requests.end(), [](aggregation_request const& r) { - return cudf::has_atomic_support(r.values.type()) and - std::all_of(r.aggregations.begin(), r.aggregations.end(), [](auto const& a) { - return is_hash_aggregation(a->kind); - }); - }); - - // Currently, structs are not supported in any of hash-based aggregations. - // Therefore, if any request contains structs then we must fallback to sort-based aggregations. - // TODO: Support structs in hash-based aggregations. 
- auto const has_struct = - std::all_of(requests.begin(), requests.end(), [](aggregation_request const& r) { - return r.values.type().id() == type_id::STRUCT; - }); - - return all_hash_aggregations && !has_struct; + return std::all_of(requests.begin(), requests.end(), [](aggregation_request const& r) { + // Currently, structs are not supported in any of hash-based aggregations. + // Therefore, if any request contains structs then we must fallback to sort-based aggregations. + // TODO: Support structs in hash-based aggregations. + return not(r.values.type().id() == type_id::STRUCT) and + cudf::has_atomic_support(r.values.type()) and + std::all_of(r.aggregations.begin(), r.aggregations.end(), [](auto const& a) { + return is_hash_aggregation(a->kind); + }); + }); } // Hash-based groupby

From b3b299ae22f31adb8f380d7add1ce2bdb726ab26 Mon Sep 17 00:00:00 2001
From: Michael Wang
Date: Mon, 13 Dec 2021 20:00:37 -0800
Subject: [PATCH 2/4] Break tie for `top` categorical columns in `Series.describe` (#9867)

Closes #9825

This PR fixes a bug where, when there are ties among the most frequent categories in a categorical series, the order returned is undefined. We break the tie by sorting the tied categories and taking the first one as `top`.

Also included in this PR: `describe` is refactored to build a dictionary of the items to describe, for better readability, and the test at https://github.com/rapidsai/cudf/blob/024003ca444f9d1a8374a1133337419f22cc880a/python/cudf/cudf/tests/test_series.py#L405 is enabled.

Authors: - Michael Wang (https://github.com/isVoid) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Bradley Dice (https://github.com/bdice) URL: https://github.com/rapidsai/cudf/pull/9867 --- python/cudf/cudf/core/series.py | 144 ++++++++++++++------------ python/cudf/cudf/tests/test_series.py | 37 +++---- 2 files changed, 93 insertions(+), 88 deletions(-) diff --git a/python/cudf/cudf/core/series.py b/python/cudf/cudf/core/series.py index 3ca77105d1b..bbeae1adc5e 100644 --- a/python/cudf/cudf/core/series.py +++ b/python/cudf/cudf/core/series.py @@ -3312,51 +3312,55 @@ def _format_percentile_names(percentiles): return ["{0}%".format(int(x * 100)) for x in percentiles] def _format_stats_values(stats_data): - return list(map(lambda x: round(x, 6), stats_data)) + return map(lambda x: round(x, 6), stats_data) def _describe_numeric(self): # mimicking pandas - index = ( - ["count", "mean", "std", "min"] - + _format_percentile_names(percentiles) - + ["max"] - ) - data = ( - [self.count(), self.mean(), self.std(), self.min()] - + self.quantile(percentiles).to_numpy(na_value=np.nan).tolist() - + [self.max()] - ) - data = _format_stats_values(data) + data = { + "count": self.count(), + "mean": self.mean(), + "std": self.std(), + "min": self.min(), + **dict( + zip( + _format_percentile_names(percentiles), + self.quantile(percentiles) + .to_numpy(na_value=np.nan) + .tolist(), + ) + ), + "max": self.max(), + } return Series( - data=data, index=index, nan_as_null=False, name=self.name, + data=_format_stats_values(data.values()), + index=data.keys(), + nan_as_null=False, + name=self.name, ) def _describe_timedelta(self): # mimicking pandas - index = ( - ["count", "mean", "std", "min"] - + _format_percentile_names(percentiles) - + ["max"] - ) - - data = ( - [ - str(self.count()), - str(self.mean()), - str(self.std()), - str(pd.Timedelta(self.min())), - ] - + self.quantile(percentiles) - .astype("str") - .to_numpy(na_value=None) - .tolist() - + [str(pd.Timedelta(self.max()))] - ) + data = { + "count": 
str(self.count()), + "mean": str(self.mean()), + "std": str(self.std()), + "min": str(pd.Timedelta(self.min())), + **dict( + zip( + _format_percentile_names(percentiles), + self.quantile(percentiles) + .astype("str") + .to_numpy(na_value=np.nan) + .tolist(), + ) + ), + "max": str(pd.Timedelta(self.max())), + } return Series( - data=data, - index=index, + data=data.values(), + index=data.keys(), dtype="str", nan_as_null=False, name=self.name, @@ -3365,51 +3369,55 @@ def _describe_timedelta(self): def _describe_categorical(self): # blocked by StringColumn/DatetimeColumn support for # value_counts/unique - index = ["count", "unique", "top", "freq"] - val_counts = self.value_counts(ascending=False) - data = [self.count(), self.unique().size] - - if data[1] > 0: - top, freq = val_counts.index[0], val_counts.iloc[0] - data += [str(top), freq] - # If the DataFrame is empty, set 'top' and 'freq' to None - # to maintain output shape consistency - else: - data += [None, None] + data = { + "count": self.count(), + "unique": len(self.unique()), + "top": None, + "freq": None, + } + if data["count"] > 0: + # In case there's a tie, break the tie by sorting the index + # and take the top. + val_counts = self.value_counts(ascending=False) + tied_val_counts = val_counts[ + val_counts == val_counts.iloc[0] + ].sort_index() + data.update( + { + "top": tied_val_counts.index[0], + "freq": tied_val_counts.iloc[0], + } + ) return Series( - data=data, + data=data.values(), dtype="str", - index=index, + index=data.keys(), nan_as_null=False, name=self.name, ) def _describe_timestamp(self): - - index = ( - ["count", "mean", "min"] - + _format_percentile_names(percentiles) - + ["max"] - ) - - data = ( - [ - str(self.count()), - str(self.mean().to_numpy().astype("datetime64[ns]")), - str(pd.Timestamp(self.min().astype("datetime64[ns]"))), - ] - + self.quantile(percentiles) - .astype("str") - .to_numpy(na_value=None) - .tolist() - + [str(pd.Timestamp((self.max()).astype("datetime64[ns]")))] - ) + data = { + "count": str(self.count()), + "mean": str(pd.Timestamp(self.mean())), + "min": str(pd.Timestamp(self.min())), + **dict( + zip( + _format_percentile_names(percentiles), + self.quantile(percentiles) + .astype(self.dtype) + .astype("str") + .to_numpy(na_value=np.nan), + ) + ), + "max": str(pd.Timestamp((self.max()))), + } return Series( - data=data, + data=data.values(), dtype="str", - index=index, + index=data.keys(), nan_as_null=False, name=self.name, ) diff --git a/python/cudf/cudf/tests/test_series.py b/python/cudf/cudf/tests/test_series.py index f214e54c57e..1e11e862329 100644 --- a/python/cudf/cudf/tests/test_series.py +++ b/python/cudf/cudf/tests/test_series.py @@ -12,7 +12,6 @@ import cudf from cudf.testing._utils import ( - DATETIME_TYPES, NUMERIC_TYPES, TIMEDELTA_TYPES, assert_eq, @@ -402,30 +401,21 @@ def test_series_describe_numeric(dtype): assert_eq(expected, actual) -@pytest.mark.xfail(reason="https://github.com/rapidsai/cudf/issues/6219") -@pytest.mark.parametrize("dtype", DATETIME_TYPES) +@pytest.mark.parametrize("dtype", ["datetime64[ns]"]) def test_series_describe_datetime(dtype): + # Note that other datetime units are not tested because pandas does not + # support them. When specified coarser units, cuDF datetime columns cannot + # represent fractional time for quantiles of the column, which may require + # interpolation, this differs from pandas which always stay in [ns] unit. 
gs = cudf.Series([0, 1, 2, 3, 1, 2, 3], dtype=dtype) ps = gs.to_pandas() - pdf_results = ps.describe(datetime_is_numeric=True) - gdf_results = gs.describe() - - # Assert count - p_count = pdf_results["count"] - g_count = gdf_results["count"] - - assert_eq(int(g_count), p_count) - - # Assert Index - assert_eq(gdf_results.index, pdf_results.index) + # Treating datetimes as categoricals is deprecated in pandas and will + # be removed in future. Future behavior is treating datetime as numeric. + expected = ps.describe(datetime_is_numeric=True) + actual = gs.describe() - # Assert rest of the element apart from - # the first index('count') - actual = gdf_results.tail(-1).astype("datetime64[ns]") - expected = pdf_results.tail(-1).astype("str").astype("datetime64[ns]") - - assert_eq(expected, actual) + assert_eq(expected.astype("str"), actual) @pytest.mark.parametrize("dtype", TIMEDELTA_TYPES) @@ -446,6 +436,13 @@ def test_series_describe_timedelta(dtype): pd.Series([True, False, True, True, False]), pd.Series([], dtype="str"), pd.Series(["a", "b", "c", "a"], dtype="category"), + pd.Series(["d", "e", "f"], dtype="category"), + pd.Series(pd.Categorical(["d", "e", "f"], categories=["f", "e", "d"])), + pd.Series( + pd.Categorical( + ["d", "e", "f"], categories=["f", "e", "d"], ordered=True ) ), ], ) def test_series_describe_other_types(ps):

From 262715390f739075f0bdac01ff8c92206a1c2fb5 Mon Sep 17 00:00:00 2001
From: GALI PREM SAGAR
Date: Tue, 14 Dec 2021 13:58:28 -0600
Subject: [PATCH 3/4] Change default `dtype` of all nulls column from `float` to `object` (#9803)

Fixes: #9337

- [x] This PR changes the default `dtype` of an all-nulls column from `float64` to `object`.
- [x] To make `np.nan` values be read as a `float` column, `nan_as_null` has to be passed as `False` to the `cudf.DataFrame` constructor.
  - This change is in line with what the `cudf.Series` constructor already supports.
- [x] Added `has_nans` & `nan_count` properties, which are needed for some of the checks.
- [x] Cached `nan_count`, since it is repeatedly used in math operations, and cleared it in the regular `_clear_cache` call.
- [x] Fixes pytests that are going to break due to this breaking change of types.
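A rough sketch of the resulting behavior (a hypothetical session with illustrative values, not code from this PR):

```
import numpy as np
import cudf

# An all-nulls column now defaults to `object` dtype instead of `float64`.
s = cudf.Series([None, None])
print(s.dtype)  # expected: object

# Passing nan_as_null=False keeps np.nan values, so the column stays float.
df = cudf.DataFrame({"a": [np.nan, np.nan]}, nan_as_null=False)
print(df["a"].dtype)  # expected: float64
```
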
Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - https://github.com/brandon-b-miller - Ashwin Srinath (https://github.com/shwina) URL: https://github.com/rapidsai/cudf/pull/9803 --- python/cudf/cudf/_lib/column.pyi | 3 +- python/cudf/cudf/_lib/column.pyx | 3 +- python/cudf/cudf/core/_base_index.py | 2 +- python/cudf/cudf/core/column/column.py | 34 ++++++---- python/cudf/cudf/core/column/datetime.py | 2 +- python/cudf/cudf/core/column/numerical.py | 64 +++++++++++++++++-- .../cudf/cudf/core/column/numerical_base.py | 9 ++- python/cudf/cudf/core/column/string.py | 15 +---- python/cudf/cudf/core/dataframe.py | 44 +++++++++---- python/cudf/cudf/core/frame.py | 2 +- python/cudf/cudf/core/index.py | 2 +- python/cudf/cudf/core/multiindex.py | 2 +- python/cudf/cudf/core/series.py | 2 +- python/cudf/cudf/core/tools/datetimes.py | 2 +- python/cudf/cudf/core/tools/numeric.py | 2 +- python/cudf/cudf/core/window/rolling.py | 2 +- python/cudf/cudf/tests/test_dataframe.py | 54 +++++++++++++--- python/cudf/cudf/tests/test_interpolate.py | 8 ++- python/cudf/cudf/tests/test_list.py | 15 +++-- python/cudf/cudf/tests/test_onehot.py | 12 +++- python/cudf/cudf/tests/test_repr.py | 6 +- python/cudf/cudf/tests/test_series.py | 2 +- python/cudf/cudf/tests/test_stats.py | 5 +- 23 files changed, 210 insertions(+), 82 deletions(-) diff --git a/python/cudf/cudf/_lib/column.pyi b/python/cudf/cudf/_lib/column.pyi index dafaa8f4d1d..235cb4fd973 100644 --- a/python/cudf/cudf/_lib/column.pyi +++ b/python/cudf/cudf/_lib/column.pyi @@ -70,8 +70,7 @@ class Column: def nullable(self) -> bool: ... - @property - def has_nulls(self) -> bool: + def has_nulls(self, include_nan: bool=False) -> bool: ... @property diff --git a/python/cudf/cudf/_lib/column.pyx b/python/cudf/cudf/_lib/column.pyx index ff3f3050e63..5e0ee3136b7 100644 --- a/python/cudf/cudf/_lib/column.pyx +++ b/python/cudf/cudf/_lib/column.pyx @@ -146,8 +146,7 @@ cdef class Column: def nullable(self): return self.base_mask is not None - @property - def has_nulls(self): + def has_nulls(self, include_nan=False): return self.null_count != 0 @property diff --git a/python/cudf/cudf/core/_base_index.py b/python/cudf/cudf/core/_base_index.py index 2fcc976d8e1..ac5e152d011 100644 --- a/python/cudf/cudf/core/_base_index.py +++ b/python/cudf/cudf/core/_base_index.py @@ -147,7 +147,7 @@ def _clean_nulls_from_index(self): methods using this method to replace or handle representation of the actual types correctly. """ - if self._values.has_nulls: + if self._values.has_nulls(): return cudf.Index( self._values.astype("str").fillna(cudf._NA_REP), name=self.name ) diff --git a/python/cudf/cudf/core/column/column.py b/python/cudf/cudf/core/column/column.py index 1d113f6e159..a98052ce906 100644 --- a/python/cudf/cudf/core/column/column.py +++ b/python/cudf/cudf/core/column/column.py @@ -139,7 +139,7 @@ def values_host(self) -> "np.ndarray": if len(self) == 0: return np.array([], dtype=self.dtype) - if self.has_nulls: + if self.has_nulls(): raise ValueError("Column must have no nulls.") return self.data_array_view.copy_to_host() @@ -152,7 +152,7 @@ def values(self) -> "cupy.ndarray": if len(self) == 0: return cupy.array([], dtype=self.dtype) - if self.has_nulls: + if self.has_nulls(): raise ValueError("Column must have no nulls.") return cupy.asarray(self.data_array_view) @@ -193,7 +193,7 @@ def all(self, skipna: bool = True) -> bool: def any(self, skipna: bool = True) -> bool: # Early exit for fast cases. 
result_col = self.nans_to_nulls() if skipna else self - if not skipna and result_col.has_nulls: + if not skipna and result_col.has_nulls(): return True elif skipna and result_col.null_count == result_col.size: return False @@ -786,7 +786,7 @@ def as_mask(self) -> Buffer: Buffer """ - if self.has_nulls: + if self.has_nulls(): raise ValueError("Column must have no nulls.") return bools_to_mask(self) @@ -797,13 +797,13 @@ def is_unique(self) -> bool: @property def is_monotonic_increasing(self) -> bool: - return not self.has_nulls and self.as_frame()._is_sorted( + return not self.has_nulls() and self.as_frame()._is_sorted( ascending=None, null_position=None ) @property def is_monotonic_decreasing(self) -> bool: - return not self.has_nulls and self.as_frame()._is_sorted( + return not self.has_nulls() and self.as_frame()._is_sorted( ascending=[False], null_position=None ) @@ -942,7 +942,7 @@ def as_categorical_column(self, dtype, **kwargs) -> ColumnBase: ) # columns include null index in factorization; remove: - if self.has_nulls: + if self.has_nulls(): cats = cats._column.dropna(drop_nan=False) min_type = min_unsigned_type(len(cats), 8) labels = labels - 1 @@ -1216,10 +1216,10 @@ def _process_for_reduction( if skipna: result_col = self.nans_to_nulls() - if result_col.has_nulls: + if result_col.has_nulls(): result_col = result_col.dropna() else: - if self.has_nulls: + if self.has_nulls(): return cudf.utils.dtypes._get_nan_for_dtype(self.dtype) result_col = self @@ -1766,12 +1766,20 @@ def as_column( "https://issues.apache.org/jira/browse/ARROW-3802" ) col = ColumnBase.from_arrow(arbitrary) + if isinstance(arbitrary, pa.NullArray): - if type(dtype) == str and dtype == "empty": - new_dtype = cudf.dtype(arbitrary.type.to_pandas_dtype()) + new_dtype = cudf.dtype(arbitrary.type.to_pandas_dtype()) + if dtype is not None: + # Cast the column to the `dtype` if specified. + col = col.astype(dtype) + elif len(arbitrary) == 0: + # If the column is empty, it has to be + # a `float64` dtype. + col = col.astype("float64") else: - new_dtype = cudf.dtype(dtype) - col = col.astype(new_dtype) + # If the null column is not empty, it has to + # be of `object` dtype. 
+ col = col.astype(new_dtype) return col diff --git a/python/cudf/cudf/core/column/datetime.py b/python/cudf/cudf/core/column/datetime.py index 08d72f1c6ee..24ec25acbbb 100644 --- a/python/cudf/cudf/core/column/datetime.py +++ b/python/cudf/cudf/core/column/datetime.py @@ -292,7 +292,7 @@ def __cuda_array_interface__(self) -> Mapping[builtins.str, Any]: "version": 1, } - if self.nullable and self.has_nulls: + if self.nullable and self.has_nulls(): # Create a simple Python object that exposes the # `__cuda_array_interface__` attribute here since we need to modify diff --git a/python/cudf/cudf/core/column/numerical.py b/python/cudf/cudf/core/column/numerical.py index becb303feeb..c947440edb1 100644 --- a/python/cudf/cudf/core/column/numerical.py +++ b/python/cudf/cudf/core/column/numerical.py @@ -3,7 +3,16 @@ from __future__ import annotations from types import SimpleNamespace -from typing import Any, Callable, Mapping, Sequence, Tuple, Union, cast +from typing import ( + Any, + Callable, + Mapping, + Optional, + Sequence, + Tuple, + Union, + cast, +) import cupy import numpy as np @@ -47,6 +56,8 @@ class NumericalColumn(NumericalBaseColumn): mask : Buffer, optional """ + _nan_count: Optional[int] + def __init__( self, data: Buffer, @@ -62,7 +73,7 @@ def __init__( raise ValueError("Buffer size must be divisible by element size") if size is None: size = (data.size // dtype.itemsize) - offset - + self._nan_count = None super().__init__( data, size=size, @@ -72,6 +83,10 @@ def __init__( null_count=null_count, ) + def _clear_cache(self): + super()._clear_cache() + self._nan_count = None + def __contains__(self, item: ScalarLike) -> bool: """ Returns True if column contains item, else False. @@ -90,6 +105,11 @@ def __contains__(self, item: ScalarLike) -> bool: self, column.as_column([item], dtype=self.dtype) ).any() + def has_nulls(self, include_nan=False): + return self.null_count != 0 or ( + self.nan_count != 0 if include_nan else False + ) + @property def __cuda_array_interface__(self) -> Mapping[str, Any]: output = { @@ -100,7 +120,7 @@ def __cuda_array_interface__(self) -> Mapping[str, Any]: "version": 1, } - if self.nullable and self.has_nulls: + if self.nullable and self.has_nulls(): # Create a simple Python object that exposes the # `__cuda_array_interface__` attribute here since we need to modify @@ -280,6 +300,15 @@ def as_numerical_column(self, dtype: Dtype, **kwargs) -> NumericalColumn: return self return libcudf.unary.cast(self, dtype) + @property + def nan_count(self) -> int: + if self.dtype.kind != "f": + self._nan_count = 0 + elif self._nan_count is None: + nan_col = libcudf.unary.is_nan(self) + self._nan_count = nan_col.sum() + return self._nan_count + def _process_values_for_isin( self, values: Sequence ) -> Tuple[ColumnBase, ColumnBase]: @@ -296,6 +325,20 @@ def _process_values_for_isin( return lhs, rhs + def _can_return_nan(self, skipna: bool = None) -> bool: + return not skipna and self.has_nulls(include_nan=True) + + def _process_for_reduction( + self, skipna: bool = None, min_count: int = 0 + ) -> Union[ColumnBase, ScalarLike]: + skipna = True if skipna is None else skipna + + if self._can_return_nan(skipna=skipna): + return cudf.utils.dtypes._get_nan_for_dtype(self.dtype) + return super()._process_for_reduction( + skipna=skipna, min_count=min_count + ) + def _default_na_value(self) -> ScalarLike: """Returns the default NA value for this column""" dkind = self.dtype.kind @@ -319,8 +362,19 @@ def find_and_replace( """ Return col with *to_replace* replaced with *value*. 
""" + + # If all of `to_replace`/`replacement` are `None`, + # dtype of `to_replace_col`/`replacement_col` + # is inferred as `string`, but this is a valid + # float64 column too, Hence we will need to type-cast + # to self.dtype. to_replace_col = column.as_column(to_replace) + if to_replace_col.null_count == len(to_replace_col): + to_replace_col = to_replace_col.astype(self.dtype) + replacement_col = column.as_column(replacement) + if replacement_col.null_count == len(replacement_col): + replacement_col = replacement_col.astype(self.dtype) if type(to_replace_col) != type(replacement_col): raise TypeError( @@ -578,7 +632,7 @@ def to_pandas( arrow_array = self.to_arrow() pandas_array = pandas_nullable_dtype.__from_arrow__(arrow_array) pd_series = pd.Series(pandas_array, copy=False) - elif str(self.dtype) in NUMERIC_TYPES and not self.has_nulls: + elif str(self.dtype) in NUMERIC_TYPES and not self.has_nulls(): pd_series = pd.Series(cupy.asnumpy(self.values), copy=False) else: pd_series = self.to_arrow().to_pandas(**kwargs) @@ -597,6 +651,8 @@ def _normalize_find_and_replace_input( ) col_to_normalize_dtype = normalized_column.dtype if isinstance(col_to_normalize, list): + if normalized_column.null_count == len(normalized_column): + normalized_column = normalized_column.astype(input_column_dtype) col_to_normalize_dtype = min_column_type( normalized_column, input_column_dtype ) diff --git a/python/cudf/cudf/core/column/numerical_base.py b/python/cudf/cudf/core/column/numerical_base.py index 853fb360c50..1f84cb88e37 100644 --- a/python/cudf/cudf/core/column/numerical_base.py +++ b/python/cudf/cudf/core/column/numerical_base.py @@ -77,10 +77,13 @@ def sum_of_squares( "sum_of_squares", skipna=skipna, dtype=dtype, min_count=min_count ) + def _can_return_nan(self, skipna: bool = None) -> bool: + return not skipna and self.has_nulls() + def kurtosis(self, skipna: bool = None) -> float: skipna = True if skipna is None else skipna - if len(self) == 0 or (not skipna and self.has_nulls): + if len(self) == 0 or self._can_return_nan(skipna=skipna): return cudf.utils.dtypes._get_nan_for_dtype(self.dtype) self = self.nans_to_nulls().dropna() # type: ignore @@ -105,7 +108,7 @@ def kurtosis(self, skipna: bool = None) -> float: def skew(self, skipna: bool = None) -> ScalarLike: skipna = True if skipna is None else skipna - if len(self) == 0 or (not skipna and self.has_nulls): + if len(self) == 0 or self._can_return_nan(skipna=skipna): return cudf.utils.dtypes._get_nan_for_dtype(self.dtype) self = self.nans_to_nulls().dropna() # type: ignore @@ -148,7 +151,7 @@ def quantile( def median(self, skipna: bool = None) -> NumericalBaseColumn: skipna = True if skipna is None else skipna - if not skipna and self.has_nulls: + if self._can_return_nan(skipna=skipna): return cudf.utils.dtypes._get_nan_for_dtype(self.dtype) # enforce linear in case the default ever changes diff --git a/python/cudf/cudf/core/column/string.py b/python/cudf/cudf/core/column/string.py index 2a91abc5701..1c9a013810a 100644 --- a/python/cudf/cudf/core/column/string.py +++ b/python/cudf/cudf/core/column/string.py @@ -5018,7 +5018,7 @@ def edit_distance_matrix(self) -> ParentType: raise ValueError( "Require size >= 2 to compute edit distance matrix." ) - if self._column.has_nulls: + if self._column.has_nulls(): raise ValueError( "Cannot compute edit distance between null strings. " "Consider removing them using `dropna` or fill with `fillna`." 
@@ -5440,20 +5440,7 @@ def find_and_replace( """ to_replace_col = column.as_column(to_replace) - if to_replace_col.null_count == len(to_replace_col): - # If all of `to_replace` are `None`, dtype of `to_replace_col` - # is inferred as `float64`, but this is a valid - # string column too, Hence we will need to type-cast - # to self.dtype. - to_replace_col = to_replace_col.astype(self.dtype) - replacement_col = column.as_column(replacement) - if replacement_col.null_count == len(replacement_col): - # If all of `replacement` are `None`, dtype of `replacement_col` - # is inferred as `float64`, but this is a valid - # string column too, Hence we will need to type-cast - # to self.dtype. - replacement_col = replacement_col.astype(self.dtype) if type(to_replace_col) != type(replacement_col): raise TypeError( diff --git a/python/cudf/cudf/core/dataframe.py b/python/cudf/cudf/core/dataframe.py index bbe691595e7..88c8aaebd9e 100644 --- a/python/cudf/cudf/core/dataframe.py +++ b/python/cudf/cudf/core/dataframe.py @@ -440,6 +440,11 @@ class DataFrame(IndexedFrame, Serializable, GetAttrGetItemMixin): Data type to force. Only a single dtype is allowed. If None, infer. + nan_as_null : bool, Default True + If ``None``/``True``, converts ``np.nan`` values to + ``null`` values. + If ``False``, leaves ``np.nan`` values as is. + Examples -------- @@ -514,7 +519,9 @@ class DataFrame(IndexedFrame, Serializable, GetAttrGetItemMixin): _iloc_indexer_type = _DataFrameIlocIndexer @annotate("DATAFRAME_INIT", color="blue", domain="cudf_python") - def __init__(self, data=None, index=None, columns=None, dtype=None): + def __init__( + self, data=None, index=None, columns=None, dtype=None, nan_as_null=True + ): super().__init__() @@ -523,7 +530,7 @@ def __init__(self, data=None, index=None, columns=None, dtype=None): if isinstance(data, (DataFrame, pd.DataFrame)): if isinstance(data, pd.DataFrame): - data = self.from_pandas(data) + data = self.from_pandas(data, nan_as_null=nan_as_null) if index is not None: if not data.index.equals(index): @@ -546,11 +553,14 @@ def __init__(self, data=None, index=None, columns=None, dtype=None): self.columns = data.columns elif isinstance(data, (cudf.Series, pd.Series)): if isinstance(data, pd.Series): - data = cudf.Series.from_pandas(data) + data = cudf.Series.from_pandas(data, nan_as_null=nan_as_null) name = data.name or 0 self._init_from_dict_like( - {name: data}, index=index, columns=columns + {name: data}, + index=index, + columns=columns, + nan_as_null=nan_as_null, ) elif data is None: if index is None: @@ -620,7 +630,9 @@ def __init__(self, data=None, index=None, columns=None, dtype=None): if not is_dict_like(data): raise TypeError("data must be list or dict-like") - self._init_from_dict_like(data, index=index, columns=columns) + self._init_from_dict_like( + data, index=index, columns=columns, nan_as_null=nan_as_null + ) if dtype: self._data = self.astype(dtype)._data @@ -759,7 +771,9 @@ def _init_from_list_like(self, data, index=None, columns=None): self.columns = columns - def _init_from_dict_like(self, data, index=None, columns=None): + def _init_from_dict_like( + self, data, index=None, columns=None, nan_as_null=None + ): if columns is not None: # remove all entries in `data` that are # not in `columns` @@ -794,7 +808,9 @@ def _init_from_dict_like(self, data, index=None, columns=None): if is_scalar(data[col_name]): num_rows = num_rows or 1 else: - data[col_name] = column.as_column(data[col_name]) + data[col_name] = column.as_column( + data[col_name], nan_as_null=nan_as_null + ) 
num_rows = len(data[col_name]) self._index = RangeIndex(0, num_rows) else: @@ -806,7 +822,9 @@ def _init_from_dict_like(self, data, index=None, columns=None): self._data.multiindex = self._data.multiindex and isinstance( col_name, tuple ) - self.insert(i, col_name, data[col_name]) + self.insert( + i, col_name, data[col_name], nan_as_null=nan_as_null + ) if columns is not None: self.columns = columns @@ -1747,7 +1765,7 @@ def _clean_nulls_from_dataframe(self, df): if is_list_dtype(df._data[col]) or is_struct_dtype(df._data[col]): # TODO we need to handle this pass - elif df._data[col].has_nulls: + elif df._data[col].has_nulls(): df[col] = df._data[col].astype("str").fillna(cudf._NA_REP) else: df[col] = df._data[col] @@ -2582,7 +2600,7 @@ def take(self, indices, axis=0, keep_index=None): return out @annotate("INSERT", color="green", domain="cudf_python") - def insert(self, loc, name, value): + def insert(self, loc, name, value, nan_as_null=None): """Add a column to DataFrame at the index specified by loc. Parameters @@ -2625,11 +2643,11 @@ def insert(self, loc, name, value): ) self._data = new_data elif isinstance(value, (pd.Series, Series)): - value = Series(value)._align_to_index( + value = Series(value, nan_as_null=nan_as_null)._align_to_index( self._index, how="right", sort=False ) - value = column.as_column(value) + value = column.as_column(value, nan_as_null=nan_as_null) self._data.insert(name, value, loc=loc) @@ -3081,7 +3099,7 @@ def as_gpu_matrix(self, columns=None, order="F"): dtype = find_common_type([col.dtype for col in cols]) for k, c in self._data.items(): - if c.has_nulls: + if c.has_nulls(): raise ValueError( f"column '{k}' has null values. " f"hint: use .fillna() to replace null values" diff --git a/python/cudf/cudf/core/frame.py b/python/cudf/cudf/core/frame.py index 61ce64e7d6b..c85ed0c8555 100644 --- a/python/cudf/cudf/core/frame.py +++ b/python/cudf/cudf/core/frame.py @@ -4755,7 +4755,7 @@ def _scan(self, op, axis=None, skipna=True, cast_to_int=False): result_col = self._data[name].nans_to_nulls() else: result_col = self._data[name].copy() - if result_col.has_nulls: + if result_col.has_nulls(include_nan=True): # Workaround as find_first_value doesn't seem to work # incase of bools. first_index = int( diff --git a/python/cudf/cudf/core/index.py b/python/cudf/cudf/core/index.py index 0002aaf38c5..29e0d17bc39 100644 --- a/python/cudf/cudf/core/index.py +++ b/python/cudf/cudf/core/index.py @@ -2515,7 +2515,7 @@ def _clean_nulls_from_index(self): Convert all na values(if any) in Index object to `` as a preprocessing step to `__repr__` methods. 
""" - if self._values.has_nulls: + if self._values.has_nulls(): return self.fillna(cudf._NA_REP) else: return self diff --git a/python/cudf/cudf/core/multiindex.py b/python/cudf/cudf/core/multiindex.py index 65c79b4cf59..c403c697e3d 100644 --- a/python/cudf/cudf/core/multiindex.py +++ b/python/cudf/cudf/core/multiindex.py @@ -386,7 +386,7 @@ def __repr__(self): else: preprocess = self.copy(deep=False) - if any(col.has_nulls for col in preprocess._data.columns): + if any(col.has_nulls() for col in preprocess._data.columns): preprocess_df = preprocess.to_frame(index=False) for name, col in preprocess._data.items(): if isinstance( diff --git a/python/cudf/cudf/core/series.py b/python/cudf/cudf/core/series.py index bbeae1adc5e..036c8c1ee00 100644 --- a/python/cudf/cudf/core/series.py +++ b/python/cudf/cudf/core/series.py @@ -1474,7 +1474,7 @@ def has_nulls(self): >>> series.dropna().has_nulls False """ - return self._column.has_nulls + return self._column.has_nulls() def dropna(self, axis=0, inplace=False, how=None): """ diff --git a/python/cudf/cudf/core/tools/datetimes.py b/python/cudf/cudf/core/tools/datetimes.py index 34d62ffc048..3efbd982b53 100644 --- a/python/cudf/cudf/core/tools/datetimes.py +++ b/python/cudf/cudf/core/tools/datetimes.py @@ -330,7 +330,7 @@ def _process_col(col, unit, dayfirst, infer_datetime_format, format): col = col.as_datetime_column(dtype=_unit_dtype_map[unit]) elif col.dtype.kind in ("O"): - if unit not in (None, "ns"): + if unit not in (None, "ns") or col.null_count == len(col): try: col = col.astype(dtype="int64") except ValueError: diff --git a/python/cudf/cudf/core/tools/numeric.py b/python/cudf/cudf/core/tools/numeric.py index 7c688b92009..bd1b505c57f 100644 --- a/python/cudf/cudf/core/tools/numeric.py +++ b/python/cudf/cudf/core/tools/numeric.py @@ -165,7 +165,7 @@ def to_numeric(arg, errors="raise", downcast=None): if isinstance(arg, (cudf.Series, pd.Series)): return cudf.Series(col) else: - if col.has_nulls: + if col.has_nulls(): # To match pandas, always return a floating type filled with nan. 
col = col.astype(float).fillna(np.nan) return col.values diff --git a/python/cudf/cudf/core/window/rolling.py b/python/cudf/cudf/core/window/rolling.py index 617dbdeaea5..0f4256e49a6 100644 --- a/python/cudf/cudf/core/window/rolling.py +++ b/python/cudf/cudf/core/window/rolling.py @@ -326,7 +326,7 @@ def apply(self, func, *args, **kwargs): """ has_nulls = False if isinstance(self.obj, cudf.Series): - if self.obj._column.has_nulls: + if self.obj._column.has_nulls(): has_nulls = True else: for col in self.obj._data: diff --git a/python/cudf/cudf/tests/test_dataframe.py b/python/cudf/cudf/tests/test_dataframe.py index c40f9f0b0a5..ab0856fad1e 100644 --- a/python/cudf/cudf/tests/test_dataframe.py +++ b/python/cudf/cudf/tests/test_dataframe.py @@ -246,7 +246,7 @@ def test_series_init_none(): sr1 = cudf.Series() got = sr1.to_string() - expect = "Series([], dtype: float64)" + expect = sr1.to_pandas().__repr__() # values should match despite whitespace difference assert got.split() == expect.split() @@ -254,7 +254,7 @@ def test_series_init_none(): sr2 = cudf.Series(None) got = sr2.to_string() - expect = "Series([], dtype: float64)" + expect = sr2.to_pandas().__repr__() # values should match despite whitespace difference assert got.split() == expect.split() @@ -1308,7 +1308,7 @@ def test_concat_empty_dataframe(df_1, df_2): [ {"a": [1, 2], "b": [1, 2], "c": ["s1", "s2"], "d": [1.0, 2.0]}, {"b": [1.9, 10.9], "c": ["s1", "s2"]}, - {"c": ["s1"], "b": [None], "a": [False]}, + {"c": ["s1"], "b": pd.Series([None], dtype="float"), "a": [False]}, ], ) @pytest.mark.parametrize( @@ -2008,8 +2008,8 @@ def test_dataframe_count_reduction(data, func): {"x": [np.nan, 2, 3, 4, 100, np.nan], "y": [4, 5, 6, 88, 99, np.nan]}, {"x": [1, 2, 3], "y": [4, 5, 6]}, {"x": [np.nan, np.nan, np.nan], "y": [np.nan, np.nan, np.nan]}, - {"x": [], "y": []}, - {"x": []}, + {"x": pd.Series([], dtype="float"), "y": pd.Series([], dtype="float")}, + {"x": pd.Series([], dtype="int")}, ], ) @pytest.mark.parametrize("ops", ["sum", "product", "prod"]) @@ -2017,7 +2017,7 @@ def test_dataframe_count_reduction(data, func): @pytest.mark.parametrize("min_count", [-10, -1, 0, 1, 2, 3, 10]) def test_dataframe_min_count_ops(data, ops, skipna, min_count): psr = pd.DataFrame(data) - gsr = cudf.DataFrame(data) + gsr = cudf.from_pandas(psr) assert_eq( getattr(psr, ops)(skipna=skipna, min_count=min_count), @@ -2498,7 +2498,7 @@ def test_series_all_null(num_elements, null_type): # Typecast Pandas because None will return `object` dtype expect = pd.Series(data, dtype="float64") - got = cudf.Series(data) + got = cudf.Series(data, dtype="float64") assert_eq(expect, got) @@ -8480,10 +8480,10 @@ def test_agg_for_dataframe_with_string_columns(aggs): ) def test_update_for_dataframes(data, data2, join, overwrite, errors): pdf = pd.DataFrame(data) - gdf = cudf.DataFrame(data) + gdf = cudf.DataFrame(data, nan_as_null=False) other_pd = pd.DataFrame(data2) - other_gd = cudf.DataFrame(data2) + other_gd = cudf.DataFrame(data2, nan_as_null=False) pdf.update(other=other_pd, join=join, overwrite=overwrite, errors=errors) gdf.update(other=other_gd, join=join, overwrite=overwrite, errors=errors) @@ -8949,7 +8949,9 @@ def test_frame_series_where_other(data): ( { "id": ["a", "a", "b", "b", "c", "c"], - "val": [None, None, None, None, None, None], + "val": cudf.Series( + [None, None, None, None, None, None], dtype="float64" + ), }, ["id"], ), @@ -9041,6 +9043,38 @@ def test_pearson_corr_multiindex_dataframe(): assert_eq(expected, actual) +@pytest.mark.parametrize( + "data", + 
[ + {"a": [np.nan, 1, 2], "b": [None, None, None]}, + {"a": [1, 2, np.nan, 2], "b": [np.nan, np.nan, np.nan, np.nan]}, + { + "a": [1, 2, np.nan, 2, None], + "b": [np.nan, np.nan, None, np.nan, np.nan], + }, + {"a": [1, 2, 2, None, 1.1], "b": [1, 2.2, 3, None, 5]}, + ], +) +@pytest.mark.parametrize("nan_as_null", [True, False]) +def test_dataframe_constructor_nan_as_null(data, nan_as_null): + actual = cudf.DataFrame(data, nan_as_null=nan_as_null) + + if nan_as_null: + assert ( + not ( + actual.astype("float").replace( + cudf.Series([np.nan], nan_as_null=False), cudf.Series([-1]) + ) + == -1 + ) + .any() + .any() + ) + else: + actual = actual.select_dtypes(exclude=["object"]) + assert (actual.replace(np.nan, -1) == -1).any().any() + + def test_dataframe_add_prefix(): cdf = cudf.DataFrame({"A": [1, 2, 3, 4], "B": [3, 4, 5, 6]}) pdf = cdf.to_pandas() diff --git a/python/cudf/cudf/tests/test_interpolate.py b/python/cudf/cudf/tests/test_interpolate.py index 66556c48828..2c544dfc17c 100644 --- a/python/cudf/cudf/tests/test_interpolate.py +++ b/python/cudf/cudf/tests/test_interpolate.py @@ -50,7 +50,9 @@ def test_interpolate_series(data, method, axis): expect = psr.interpolate(method=method, axis=axis) got = gsr.interpolate(method=method, axis=axis) - assert_eq(expect, got) + assert_eq( + expect, got, check_dtype=False if psr.dtype == "object" else True + ) @pytest.mark.parametrize( @@ -88,7 +90,9 @@ def test_interpolate_series_values_or_index(data, index, method): expect = psr.interpolate(method=method) got = gsr.interpolate(method=method) - assert_eq(expect, got) + assert_eq( + expect, got, check_dtype=False if psr.dtype == "object" else True + ) @pytest.mark.parametrize( diff --git a/python/cudf/cudf/tests/test_list.py b/python/cudf/cudf/tests/test_list.py index 2b71ca7ac36..b898222d7d7 100644 --- a/python/cudf/cudf/tests/test_list.py +++ b/python/cudf/cudf/tests/test_list.py @@ -76,10 +76,14 @@ def test_leaves(data): pa_array = pa.array(data) while hasattr(pa_array, "flatten"): pa_array = pa_array.flatten() - dtype = "int8" if isinstance(pa_array, pa.NullArray) else None - expect = cudf.Series(pa_array, dtype=dtype) + + expect = cudf.Series(pa_array) got = cudf.Series(data).list.leaves - assert_eq(expect, got) + assert_eq( + expect, + got, + check_dtype=False if isinstance(pa_array, pa.NullArray) else True, + ) def test_list_to_pandas_nullable_true(): @@ -269,7 +273,10 @@ def test_get(data, index, expect): sr = cudf.Series(data) expect = cudf.Series(expect) got = sr.list.get(index) - assert_eq(expect, got) + + assert_eq( + expect, got, check_dtype=False if expect.isnull().all() else True + ) def test_get_nested_lists(): diff --git a/python/cudf/cudf/tests/test_onehot.py b/python/cudf/cudf/tests/test_onehot.py index ed55fb86820..f2a20a73b63 100644 --- a/python/cudf/cudf/tests/test_onehot.py +++ b/python/cudf/cudf/tests/test_onehot.py @@ -113,10 +113,18 @@ def test_get_dummies(data): encoded_expected = pd.get_dummies(pdf, prefix="test") encoded_actual = cudf.get_dummies(gdf, prefix="test") - utils.assert_eq(encoded_expected, encoded_actual) + utils.assert_eq( + encoded_expected, + encoded_actual, + check_dtype=False if len(data) == 0 else True, + ) encoded_actual = cudf.get_dummies(gdf, prefix="test", dtype=np.uint8) - utils.assert_eq(encoded_expected, encoded_actual) + utils.assert_eq( + encoded_expected, + encoded_actual, + check_dtype=False if len(data) == 0 else True, + ) @pytest.mark.parametrize("n_cols", [5, 10, 20]) diff --git a/python/cudf/cudf/tests/test_repr.py 
b/python/cudf/cudf/tests/test_repr.py index 736bcf131cc..fe95b2930df 100644 --- a/python/cudf/cudf/tests/test_repr.py +++ b/python/cudf/cudf/tests/test_repr.py @@ -328,10 +328,14 @@ def test_dataframe_sliced(gdf, slice, max_seq_items, max_rows): ), ( cudf.Index([None, None, None], name="hello"), + "StringIndex([None None None], dtype='object', name='hello')", + ), + ( + cudf.Index([None, None, None], dtype="float", name="hello"), "Float64Index([, , ], dtype='float64', name='hello')", ), ( - cudf.Index([None], name="hello"), + cudf.Index([None], dtype="float64", name="hello"), "Float64Index([], dtype='float64', name='hello')", ), ( diff --git a/python/cudf/cudf/tests/test_series.py b/python/cudf/cudf/tests/test_series.py index 1e11e862329..583d2c7a8dd 100644 --- a/python/cudf/cudf/tests/test_series.py +++ b/python/cudf/cudf/tests/test_series.py @@ -971,7 +971,7 @@ def test_series_update(data, other): @pytest.mark.parametrize("nan_as_null", [True, False]) @pytest.mark.parametrize("fill_value", [1.2, 332, np.nan]) def test_fillna_with_nan(data, nan_as_null, fill_value): - gs = cudf.Series(data, nan_as_null=nan_as_null) + gs = cudf.Series(data, dtype="float64", nan_as_null=nan_as_null) ps = gs.to_pandas() expected = ps.fillna(fill_value) diff --git a/python/cudf/cudf/tests/test_stats.py b/python/cudf/cudf/tests/test_stats.py index ebe78d56c3f..142ca6c6831 100644 --- a/python/cudf/cudf/tests/test_stats.py +++ b/python/cudf/cudf/tests/test_stats.py @@ -460,7 +460,8 @@ def test_df_corr(): @pytest.mark.parametrize("skipna", [True, False, None]) def test_nans_stats(data, ops, skipna): psr = cudf.utils.utils._create_pandas_series(data=data) - gsr = cudf.Series(data) + gsr = cudf.Series(data, nan_as_null=False) + assert_eq( getattr(psr, ops)(skipna=skipna), getattr(gsr, ops)(skipna=skipna) ) @@ -486,7 +487,7 @@ def test_nans_stats(data, ops, skipna): @pytest.mark.parametrize("min_count", [-10, -1, 0, 1, 2, 3, 5, 10]) def test_min_count_ops(data, ops, skipna, min_count): psr = pd.Series(data) - gsr = cudf.Series(data) + gsr = cudf.Series(data, nan_as_null=False) assert_eq( getattr(psr, ops)(skipna=skipna, min_count=min_count), From 7a23f1a01547648db7ad684fa3dc0482b7ac813f Mon Sep 17 00:00:00 2001 From: David Wendt <45795991+davidwendt@users.noreply.github.com> Date: Tue, 14 Dec 2021 15:28:59 -0500 Subject: [PATCH 4/4] Add utility to format ninja-log build times (#9631) Reference: https://github.com/rapidsai/ops/issues/1896 Generate build times log from formatted, sorted `.ninja_log` file. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) - Robert Maynard (https://github.com/robertmaynard) - Karthikeyan (https://github.com/karthikeyann) URL: https://github.com/rapidsai/cudf/pull/9631 --- build.sh | 18 +++++ ci/gpu/build.sh | 12 +++- cpp/scripts/sort_ninja_log.py | 121 ++++++++++++++++++++++++++++++++++ 3 files changed, 150 insertions(+), 1 deletion(-) create mode 100755 cpp/scripts/sort_ninja_log.py diff --git a/build.sh b/build.sh index d0ccd4821e0..adf6e220744 100755 --- a/build.sh +++ b/build.sh @@ -172,6 +172,12 @@ if buildAll || hasArg libcudf; then echo "Building for *ALL* supported GPU architectures..." 
fi + # get the current count before the compile starts + FILES_IN_CCACHE="" + if [ -x "$(command -v ccache)" ]; then + FILES_IN_CCACHE=$(ccache -s | grep "files in cache") + fi + cmake -S $REPODIR/cpp -B ${LIB_BUILD_DIR} \ -DCMAKE_INSTALL_PREFIX=${INSTALL_PREFIX} \ ${CUDF_CMAKE_CUDA_ARCHITECTURES} \ @@ -185,7 +191,19 @@ if buildAll || hasArg libcudf; then cd ${LIB_BUILD_DIR} + compile_start=$(date +%s) cmake --build . -j${PARALLEL_LEVEL} ${VERBOSE_FLAG} + compile_end=$(date +%s) + compile_total=$(( compile_end - compile_start )) + + # Record build times + if [[ -f "${LIB_BUILD_DIR}/.ninja_log" ]]; then + echo "Formatting build times" + python ${REPODIR}/cpp/scripts/sort_ninja_log.py ${LIB_BUILD_DIR}/.ninja_log --fmt xml > ${LIB_BUILD_DIR}/ninja_log.xml + message="$FILES_IN_CCACHE

$PARALLEL_LEVEL parallel build time is $compile_total seconds" + echo "$message" + python ${REPODIR}/cpp/scripts/sort_ninja_log.py ${LIB_BUILD_DIR}/.ninja_log --fmt html --msg "$message" > ${LIB_BUILD_DIR}/ninja_log.html + fi if [[ ${INSTALL_TARGET} != "" ]]; then cmake --build . -j${PARALLEL_LEVEL} --target install ${VERBOSE_FLAG} diff --git a/ci/gpu/build.sh b/ci/gpu/build.sh index d8b5cc7ba4c..00ad6bf812d 100755 --- a/ci/gpu/build.sh +++ b/ci/gpu/build.sh @@ -166,16 +166,26 @@ else gpuci_logger "Check GPU usage" nvidia-smi - gpuci_logger "GoogleTests" set -x cd $LIB_BUILD_DIR + gpuci_logger "GoogleTests" + for gt in gtests/* ; do test_name=$(basename ${gt}) echo "Running GoogleTest $test_name" ${gt} --gtest_output=xml:"$WORKSPACE/test-results/" done + # Copy libcudf build time results + echo "Checking for build time log $LIB_BUILD_DIR/ninja_log.html" + if [[ -f "$LIB_BUILD_DIR/ninja_log.html" ]]; then + gpuci_logger "Copying build time results" + cp "$LIB_BUILD_DIR/ninja_log.xml" "$WORKSPACE/test-results/buildtimes-junit.xml" + mkdir -p "$WORKSPACE/build-metrics" + cp "$LIB_BUILD_DIR/ninja_log.html" "$WORKSPACE/build-metrics/BuildMetrics.html" + fi + ################################################################################ # MEMCHECK - Run compute-sanitizer on GoogleTest (only in nightly builds) ################################################################################ diff --git a/cpp/scripts/sort_ninja_log.py b/cpp/scripts/sort_ninja_log.py new file mode 100755 index 00000000000..5eada13aea2 --- /dev/null +++ b/cpp/scripts/sort_ninja_log.py @@ -0,0 +1,121 @@ +# +# Copyright (c) 2021, NVIDIA CORPORATION. +# +import argparse +import os +import sys +import xml.etree.ElementTree as ET +from xml.dom import minidom + +parser = argparse.ArgumentParser() +parser.add_argument( + "log_file", type=str, default=".ninja_log", help=".ninja_log file" +) +parser.add_argument( + "--fmt", + type=str, + default="csv", + choices=["csv", "xml", "html"], + help="output format (to stdout)", +) +parser.add_argument( + "--msg", + type=str, + default=None, + help="optional message to include in html output", +) +args = parser.parse_args() + +log_file = args.log_file +log_path = os.path.dirname(os.path.abspath(log_file)) + +output_fmt = args.fmt + +# build a map of the log entries +entries = {} +with open(log_file, "r") as log: + for line in log: + entry = line.split() + if len(entry) > 4: + elapsed = int(entry[1]) - int(entry[0]) + obj_file = entry[3] + file_size = ( + os.path.getsize(os.path.join(log_path, obj_file)) + if os.path.exists(obj_file) + else 0 + ) + entries[entry[3]] = (elapsed, file_size) + +# check file could be loaded +if len(entries) == 0: + print("Could not parse", log_file) + exit() + +# sort the keys by build time (descending order) +keys = list(entries.keys()) +sl = sorted(keys, key=lambda k: entries[k][0], reverse=True) + +if output_fmt == "xml": + # output results in XML format + root = ET.Element("testsuites") + testsuite = ET.Element( + "testsuite", + attrib={ + "name": "build-time", + "tests": str(len(keys)), + "failures": str(0), + "errors": str(0), + }, + ) + root.append(testsuite) + for key in sl: + entry = entries[key] + elapsed = float(entry[0]) / 1000 + item = ET.Element( + "testcase", + attrib={ + "classname": "BuildTime", + "name": key, + "time": str(elapsed), + }, + ) + testsuite.append(item) + + tree = ET.ElementTree(root) + xmlstr = minidom.parseString(ET.tostring(root)).toprettyxml(indent=" ") + print(xmlstr) + +elif output_fmt == "html": + # output results in 
HTML format + print("Sorted Ninja Build Times") + print("") + print("") + if args.msg is not None: + print("

", args.msg, "

") + print("
") + print( + "", + "", + "", + sep="", + ) + for key in sl: + result = entries[key] + print( + "", + sep="", + ) + print("
FileCompile time (ms)Size (bytes)
", + key, + "", + result[0], + "", + result[1], + "
") + +else: + # output results in CSV format + print("time,size,file") + for key in sl: + result = entries[key] + print(result[0], result[1], key, sep=",")