From 335862bf5bc8433d5d0896adf8210a13c5a21ab3 Mon Sep 17 00:00:00 2001
From: Alessandro Bellina
Date: Mon, 13 Dec 2021 15:08:39 -0600
Subject: [PATCH 1/4] Fix fallback to sort aggregation for grouping only hash aggregate (#9891)

The following fixes what looks like an unintended fallback to sort-based aggregation, introduced in https://github.com/rapidsai/cudf/pull/9545, for the grouping-only (no aggregation requests) case. In that PR, the `std::all_of` function is used to determine whether the aggregation requests are for struct types. However, when there are no aggregation requests, `std::all_of` returns true (it is vacuously true for an empty range), causing a fallback to the sort aggregation (relevant code: https://github.com/rapidsai/cudf/pull/9545/files#diff-e409f72ddc11ad10fa0099e21b409b92f12bfac8ba1817266696c34a620aa081R645-R650). A standalone sketch of this empty-range behavior follows the benchmark results below.

I added a benchmark `group_no_requests_benchmark.cu` by mostly copying `group_sum_benchmark.cu`, but I changed one critical part: the `groupby` object is re-created on every iteration of the benchmark loop:

```
for (auto _ : state) {
  cuda_event_timer timer(state, true);
  cudf::groupby::groupby gb_obj(cudf::table_view({keys}));
  auto result = gb_obj.aggregate(requests);
}
```

This shows what happens in the scenario where the `groupby` instance is created each time an aggregate is issued, which re-creates the `helper` each time for the sorted case. If the `groupby` object is not re-created each time, the difference in performance between the before/after cases is negligible. We never recycle a `groupby` instance when using the groupby API from Spark.

Posting this as draft for feedback, as I am not sure whether I handled the benchmark part correctly.

This was executed on a T4 GPU. Before the patch:

```
Groupby/BasicNoRequest/10000/manual_time               0.158 ms        0.184 ms         4420
Groupby/BasicNoRequest/1000000/manual_time              1.72 ms         1.74 ms          408
Groupby/BasicNoRequest/10000000/manual_time             18.9 ms         18.9 ms           37
Groupby/BasicNoRequest/100000000/manual_time             198 ms          198 ms            3
```
Full output

```
2021-12-12T13:41:08+00:00
Running ./GROUPBY_BENCH
Run on (64 X 2801.89 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x32)
  L1 Instruction 32 KiB (x32)
  L2 Unified 1024 KiB (x32)
  L3 Unified 22528 KiB (x2)
Load Average: 1.01, 0.78, 0.42
--------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations
--------------------------------------------------------------------------------------------
Groupby/Basic/10000/manual_time                        0.117 ms        0.143 ms         5906
Groupby/Basic/1000000/manual_time                      0.524 ms        0.542 ms         1352
Groupby/Basic/10000000/manual_time                      4.41 ms         4.43 ms          159
Groupby/Basic/100000000/manual_time                     50.1 ms         50.1 ms           13
Groupby/PreSorted/1000000/manual_time                  0.332 ms        0.350 ms         2118
Groupby/PreSorted/10000000/manual_time                  2.22 ms         2.23 ms          315
Groupby/PreSorted/100000000/manual_time                 22.2 ms         22.2 ms           30
Groupby/PreSortedNth/1000000/manual_time               0.160 ms        0.188 ms         4381
Groupby/PreSortedNth/10000000/manual_time              0.890 ms        0.917 ms          774
Groupby/PreSortedNth/100000000/manual_time              8.43 ms         8.46 ms           68
Groupby/Shift/1000000/manual_time                      0.764 ms        0.785 ms          902
Groupby/Shift/10000000/manual_time                      9.51 ms         9.53 ms           63
Groupby/Shift/100000000/manual_time                      145 ms          145 ms            4
Groupby/Aggregation/10000/manual_time                   1.56 ms         1.58 ms          442
Groupby/Aggregation/16384/manual_time                   1.59 ms         1.62 ms          435
Groupby/Aggregation/65536/manual_time                   1.73 ms         1.76 ms          404
Groupby/Aggregation/262144/manual_time                  2.95 ms         2.98 ms          237
Groupby/Aggregation/1048576/manual_time                 9.20 ms         9.23 ms           73
Groupby/Aggregation/4194304/manual_time                 36.3 ms         36.3 ms           19
Groupby/Aggregation/10000000/manual_time                92.0 ms         92.1 ms            7
Groupby/Scan/10000/manual_time                          1.56 ms         1.58 ms          447
Groupby/Scan/16384/manual_time                          1.62 ms         1.65 ms          429
Groupby/Scan/65536/manual_time                          1.85 ms         1.88 ms          378
Groupby/Scan/262144/manual_time                         3.54 ms         3.56 ms          197
Groupby/Scan/1048576/manual_time                        12.0 ms         12.0 ms           57
Groupby/Scan/4194304/manual_time                        48.6 ms         48.6 ms           14
Groupby/Scan/10000000/manual_time                        126 ms          126 ms            4
Groupby/BasicNoRequest/10000/manual_time               0.158 ms        0.184 ms         4420
Groupby/BasicNoRequest/1000000/manual_time              1.72 ms         1.74 ms          408
Groupby/BasicNoRequest/10000000/manual_time             18.9 ms         18.9 ms           37
Groupby/BasicNoRequest/100000000/manual_time             198 ms          198 ms            3
Groupby/PreSortedNoRequests/1000000/manual_time        0.194 ms        0.214 ms         3624
Groupby/PreSortedNoRequests/10000000/manual_time        1.25 ms         1.27 ms          571
Groupby/PreSortedNoRequests/100000000/manual_time       12.6 ms         12.7 ms           50
```

After the patch:

```
Groupby/BasicNoRequest/10000/manual_time               0.058 ms        0.085 ms        11991
Groupby/BasicNoRequest/1000000/manual_time             0.282 ms        0.301 ms         2478
Groupby/BasicNoRequest/10000000/manual_time             2.42 ms         2.44 ms          291
Groupby/BasicNoRequest/100000000/manual_time            29.2 ms         29.2 ms           21
```
Full output

```
2021-12-12T13:37:50+00:00
Running ./GROUPBY_BENCH
Run on (64 X 2654.22 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x32)
  L1 Instruction 32 KiB (x32)
  L2 Unified 1024 KiB (x32)
  L3 Unified 22528 KiB (x2)
Load Average: 0.64, 0.50, 0.26
--------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations
--------------------------------------------------------------------------------------------
Groupby/Basic/10000/manual_time                        0.116 ms        0.142 ms         5918
Groupby/Basic/1000000/manual_time                      0.523 ms        0.542 ms         1374
Groupby/Basic/10000000/manual_time                      4.37 ms         4.39 ms          162
Groupby/Basic/100000000/manual_time                     51.4 ms         51.5 ms           10
Groupby/PreSorted/1000000/manual_time                  0.331 ms        0.350 ms         2121
Groupby/PreSorted/10000000/manual_time                  2.21 ms         2.23 ms          316
Groupby/PreSorted/100000000/manual_time                 22.2 ms         22.2 ms           27
Groupby/PreSortedNth/1000000/manual_time               0.160 ms        0.188 ms         4384
Groupby/PreSortedNth/10000000/manual_time              0.888 ms        0.915 ms          775
Groupby/PreSortedNth/100000000/manual_time              8.36 ms         8.39 ms           70
Groupby/Shift/1000000/manual_time                      0.764 ms        0.785 ms          904
Groupby/Shift/10000000/manual_time                      9.50 ms         9.52 ms           63
Groupby/Shift/100000000/manual_time                      146 ms          146 ms            4
Groupby/Aggregation/10000/manual_time                   1.53 ms         1.55 ms          446
Groupby/Aggregation/16384/manual_time                   1.58 ms         1.61 ms          437
Groupby/Aggregation/65536/manual_time                   1.72 ms         1.75 ms          405
Groupby/Aggregation/262144/manual_time                  2.93 ms         2.96 ms          236
Groupby/Aggregation/1048576/manual_time                 9.18 ms         9.21 ms           74
Groupby/Aggregation/4194304/manual_time                 36.2 ms         36.3 ms           19
Groupby/Aggregation/10000000/manual_time                91.5 ms         91.6 ms            7
Groupby/Scan/10000/manual_time                          1.55 ms         1.57 ms          452
Groupby/Scan/16384/manual_time                          1.60 ms         1.62 ms          434
Groupby/Scan/65536/manual_time                          1.84 ms         1.87 ms          379
Groupby/Scan/262144/manual_time                         3.54 ms         3.56 ms          198
Groupby/Scan/1048576/manual_time                        12.0 ms         12.0 ms           57
Groupby/Scan/4194304/manual_time                        48.4 ms         48.4 ms           14
Groupby/Scan/10000000/manual_time                        125 ms          125 ms            4
Groupby/BasicNoRequest/10000/manual_time               0.058 ms        0.085 ms        11991
Groupby/BasicNoRequest/1000000/manual_time             0.282 ms        0.301 ms         2478
Groupby/BasicNoRequest/10000000/manual_time             2.42 ms         2.44 ms          291
Groupby/BasicNoRequest/100000000/manual_time            29.2 ms         29.2 ms           21
Groupby/PreSortedNoRequests/1000000/manual_time        0.195 ms        0.215 ms         3604
Groupby/PreSortedNoRequests/10000000/manual_time        1.25 ms         1.27 ms          575
Groupby/PreSortedNoRequests/100000000/manual_time       12.7 ms         12.8 ms           50
```

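As a side note, the empty-range behavior at the root of this bug is standard C++: `std::all_of` is vacuously true when the range is empty, so a request list with no aggregations passed the struct check and forced the sort path. A minimal standalone sketch of that behavior (illustrative only, not part of this patch):

```
#include <algorithm>
#include <iostream>
#include <vector>

int main()
{
  // Stands in for an empty list of aggregation requests.
  std::vector<int> requests;

  // With an empty range there is no element for which the predicate can
  // fail, so std::all_of returns true regardless of the predicate.
  bool const all_match =
    std::all_of(requests.begin(), requests.end(), [](int) { return false; });

  std::cout << std::boolalpha << all_match << std::endl;  // prints "true"
  return 0;
}
```
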
Authors: - Alessandro Bellina (https://github.com/abellina) Approvers: - Jake Hemstad (https://github.com/jrhemstad) - Nghia Truong (https://github.com/ttnghia) - Conor Hoekstra (https://github.com/codereport) URL: https://github.com/rapidsai/cudf/pull/9891 --- cpp/benchmarks/CMakeLists.txt | 8 +- .../groupby/group_no_requests_benchmark.cu | 115 ++++++++++++++++++ cpp/src/groupby/hash/groupby.cu | 27 ++-- 3 files changed, 131 insertions(+), 19 deletions(-) create mode 100644 cpp/benchmarks/groupby/group_no_requests_benchmark.cu diff --git a/cpp/benchmarks/CMakeLists.txt b/cpp/benchmarks/CMakeLists.txt index 71b452a1004..63f6857ee08 100644 --- a/cpp/benchmarks/CMakeLists.txt +++ b/cpp/benchmarks/CMakeLists.txt @@ -171,8 +171,12 @@ ConfigureBench(FILL_BENCH filling/repeat_benchmark.cpp) # ################################################################################################## # * groupby benchmark ----------------------------------------------------------------------------- ConfigureBench( - GROUPBY_BENCH groupby/group_sum_benchmark.cu groupby/group_nth_benchmark.cu - groupby/group_shift_benchmark.cu groupby/group_struct_benchmark.cu + GROUPBY_BENCH + groupby/group_sum_benchmark.cu + groupby/group_nth_benchmark.cu + groupby/group_shift_benchmark.cu + groupby/group_struct_benchmark.cu + groupby/group_no_requests_benchmark.cu ) # ################################################################################################## diff --git a/cpp/benchmarks/groupby/group_no_requests_benchmark.cu b/cpp/benchmarks/groupby/group_no_requests_benchmark.cu new file mode 100644 index 00000000000..7dbe1888cee --- /dev/null +++ b/cpp/benchmarks/groupby/group_no_requests_benchmark.cu @@ -0,0 +1,115 @@ +/* + * Copyright (c) 2021, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +class Groupby : public cudf::benchmark { +}; + +// TODO: put it in a struct so `uniform` can be remade with different min, max +template +T random_int(T min, T max) +{ + static unsigned seed = 13377331; + static std::mt19937 engine{seed}; + static std::uniform_int_distribution uniform{min, max}; + + return uniform(engine); +} + +void BM_basic_no_requests(benchmark::State& state) +{ + using wrapper = cudf::test::fixed_width_column_wrapper; + + const cudf::size_type column_size{(cudf::size_type)state.range(0)}; + + auto data_it = cudf::detail::make_counting_transform_iterator( + 0, [=](cudf::size_type row) { return random_int(0, 100); }); + + wrapper keys(data_it, data_it + column_size); + wrapper vals(data_it, data_it + column_size); + + std::vector requests; + + for (auto _ : state) { + cuda_event_timer timer(state, true); + cudf::groupby::groupby gb_obj(cudf::table_view({keys})); + auto result = gb_obj.aggregate(requests); + } +} + +BENCHMARK_DEFINE_F(Groupby, BasicNoRequest)(::benchmark::State& state) +{ + BM_basic_no_requests(state); +} + +BENCHMARK_REGISTER_F(Groupby, BasicNoRequest) + ->UseManualTime() + ->Unit(benchmark::kMillisecond) + ->Arg(10000) + ->Arg(1000000) + ->Arg(10000000) + ->Arg(100000000); + +void BM_pre_sorted_no_requests(benchmark::State& state) +{ + using wrapper = cudf::test::fixed_width_column_wrapper; + + const cudf::size_type column_size{(cudf::size_type)state.range(0)}; + + auto data_it = cudf::detail::make_counting_transform_iterator( + 0, [=](cudf::size_type row) { return random_int(0, 100); }); + auto valid_it = cudf::detail::make_counting_transform_iterator( + 0, [=](cudf::size_type row) { return random_int(0, 100) < 90; }); + + wrapper keys(data_it, data_it + column_size); + wrapper vals(data_it, data_it + column_size, valid_it); + + auto keys_table = cudf::table_view({keys}); + auto sort_order = cudf::sorted_order(keys_table); + auto sorted_keys = cudf::gather(keys_table, *sort_order); + // No need to sort values using sort_order because they were generated randomly + + std::vector requests; + + for (auto _ : state) { + cuda_event_timer timer(state, true); + cudf::groupby::groupby gb_obj(*sorted_keys, cudf::null_policy::EXCLUDE, cudf::sorted::YES); + auto result = gb_obj.aggregate(requests); + } +} + +BENCHMARK_DEFINE_F(Groupby, PreSortedNoRequests)(::benchmark::State& state) +{ + BM_pre_sorted_no_requests(state); +} + +BENCHMARK_REGISTER_F(Groupby, PreSortedNoRequests) + ->UseManualTime() + ->Unit(benchmark::kMillisecond) + ->Arg(1000000) + ->Arg(10000000) + ->Arg(100000000); diff --git a/cpp/src/groupby/hash/groupby.cu b/cpp/src/groupby/hash/groupby.cu index 1474cfd8a19..4f2cb4de14b 100644 --- a/cpp/src/groupby/hash/groupby.cu +++ b/cpp/src/groupby/hash/groupby.cu @@ -636,23 +636,16 @@ std::unique_ptr groupby(table_view const& keys, */ bool can_use_hash_groupby(table_view const& keys, host_span requests) { - auto const all_hash_aggregations = - std::all_of(requests.begin(), requests.end(), [](aggregation_request const& r) { - return cudf::has_atomic_support(r.values.type()) and - std::all_of(r.aggregations.begin(), r.aggregations.end(), [](auto const& a) { - return is_hash_aggregation(a->kind); - }); - }); - - // Currently, structs are not supported in any of hash-based aggregations. - // Therefore, if any request contains structs then we must fallback to sort-based aggregations. - // TODO: Support structs in hash-based aggregations. 
- auto const has_struct = - std::all_of(requests.begin(), requests.end(), [](aggregation_request const& r) { - return r.values.type().id() == type_id::STRUCT; - }); - - return all_hash_aggregations && !has_struct; + return std::all_of(requests.begin(), requests.end(), [](aggregation_request const& r) { + // Currently, structs are not supported in any of hash-based aggregations. + // Therefore, if any request contains structs then we must fallback to sort-based aggregations. + // TODO: Support structs in hash-based aggregations. + return not(r.values.type().id() == type_id::STRUCT) and + cudf::has_atomic_support(r.values.type()) and + std::all_of(r.aggregations.begin(), r.aggregations.end(), [](auto const& a) { + return is_hash_aggregation(a->kind); + }); + }); } // Hash-based groupby

From b3b299ae22f31adb8f380d7add1ce2bdb726ab26 Mon Sep 17 00:00:00 2001
From: Michael Wang
Date: Mon, 13 Dec 2021 20:00:37 -0800
Subject: [PATCH 2/4] Break tie for `top` categorical columns in `Series.describe` (#9867)

Closes #9825

This PR fixes a bug where, when there are ties among the most frequent categories in a categorical series, the order returned is undefined. We break the tie by sorting the tied categories and taking the first one as `top`.

Also included in this PR: `describe` is refactored to build a dictionary of the items to describe, for better readability, and the test at https://github.com/rapidsai/cudf/blob/024003ca444f9d1a8374a1133337419f22cc880a/python/cudf/cudf/tests/test_series.py#L405 is enabled.

Authors: - Michael Wang (https://github.com/isVoid) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Bradley Dice (https://github.com/bdice) URL: https://github.com/rapidsai/cudf/pull/9867 --- python/cudf/cudf/core/series.py | 144 ++++++++++++++------------ python/cudf/cudf/tests/test_series.py | 37 +++---- 2 files changed, 93 insertions(+), 88 deletions(-) diff --git a/python/cudf/cudf/core/series.py b/python/cudf/cudf/core/series.py index 3ca77105d1b..bbeae1adc5e 100644 --- a/python/cudf/cudf/core/series.py +++ b/python/cudf/cudf/core/series.py @@ -3312,51 +3312,55 @@ def _format_percentile_names(percentiles): return ["{0}%".format(int(x * 100)) for x in percentiles] def _format_stats_values(stats_data): - return list(map(lambda x: round(x, 6), stats_data)) + return map(lambda x: round(x, 6), stats_data) def _describe_numeric(self): # mimicking pandas - index = ( - ["count", "mean", "std", "min"] - + _format_percentile_names(percentiles) - + ["max"] - ) - data = ( - [self.count(), self.mean(), self.std(), self.min()] - + self.quantile(percentiles).to_numpy(na_value=np.nan).tolist() - + [self.max()] - ) - data = _format_stats_values(data) + data = { + "count": self.count(), + "mean": self.mean(), + "std": self.std(), + "min": self.min(), + **dict( + zip( + _format_percentile_names(percentiles), + self.quantile(percentiles) + .to_numpy(na_value=np.nan) + .tolist(), + ) + ), + "max": self.max(), + } return Series( - data=data, index=index, nan_as_null=False, name=self.name, + data=_format_stats_values(data.values()), + index=data.keys(), + nan_as_null=False, + name=self.name, ) def _describe_timedelta(self): # mimicking pandas - index = ( - ["count", "mean", "std", "min"] - + _format_percentile_names(percentiles) - + ["max"] - ) - - data = ( - [ - str(self.count()), - str(self.mean()), - str(self.std()), - str(pd.Timedelta(self.min())), - ] - + self.quantile(percentiles) - .astype("str") - .to_numpy(na_value=None) - .tolist() - + [str(pd.Timedelta(self.max()))] - ) + data = { + "count": 
str(self.count()), + "mean": str(self.mean()), + "std": str(self.std()), + "min": str(pd.Timedelta(self.min())), + **dict( + zip( + _format_percentile_names(percentiles), + self.quantile(percentiles) + .astype("str") + .to_numpy(na_value=np.nan) + .tolist(), + ) + ), + "max": str(pd.Timedelta(self.max())), + } return Series( - data=data, - index=index, + data=data.values(), + index=data.keys(), dtype="str", nan_as_null=False, name=self.name, @@ -3365,51 +3369,55 @@ def _describe_timedelta(self): def _describe_categorical(self): # blocked by StringColumn/DatetimeColumn support for # value_counts/unique - index = ["count", "unique", "top", "freq"] - val_counts = self.value_counts(ascending=False) - data = [self.count(), self.unique().size] - - if data[1] > 0: - top, freq = val_counts.index[0], val_counts.iloc[0] - data += [str(top), freq] - # If the DataFrame is empty, set 'top' and 'freq' to None - # to maintain output shape consistency - else: - data += [None, None] + data = { + "count": self.count(), + "unique": len(self.unique()), + "top": None, + "freq": None, + } + if data["count"] > 0: + # In case there's a tie, break the tie by sorting the index + # and take the top. + val_counts = self.value_counts(ascending=False) + tied_val_counts = val_counts[ + val_counts == val_counts.iloc[0] + ].sort_index() + data.update( + { + "top": tied_val_counts.index[0], + "freq": tied_val_counts.iloc[0], + } + ) return Series( - data=data, + data=data.values(), dtype="str", - index=index, + index=data.keys(), nan_as_null=False, name=self.name, ) def _describe_timestamp(self): - - index = ( - ["count", "mean", "min"] - + _format_percentile_names(percentiles) - + ["max"] - ) - - data = ( - [ - str(self.count()), - str(self.mean().to_numpy().astype("datetime64[ns]")), - str(pd.Timestamp(self.min().astype("datetime64[ns]"))), - ] - + self.quantile(percentiles) - .astype("str") - .to_numpy(na_value=None) - .tolist() - + [str(pd.Timestamp((self.max()).astype("datetime64[ns]")))] - ) + data = { + "count": str(self.count()), + "mean": str(pd.Timestamp(self.mean())), + "min": str(pd.Timestamp(self.min())), + **dict( + zip( + _format_percentile_names(percentiles), + self.quantile(percentiles) + .astype(self.dtype) + .astype("str") + .to_numpy(na_value=np.nan), + ) + ), + "max": str(pd.Timestamp((self.max()))), + } return Series( - data=data, + data=data.values(), dtype="str", - index=index, + index=data.keys(), nan_as_null=False, name=self.name, ) diff --git a/python/cudf/cudf/tests/test_series.py b/python/cudf/cudf/tests/test_series.py index f214e54c57e..1e11e862329 100644 --- a/python/cudf/cudf/tests/test_series.py +++ b/python/cudf/cudf/tests/test_series.py @@ -12,7 +12,6 @@ import cudf from cudf.testing._utils import ( - DATETIME_TYPES, NUMERIC_TYPES, TIMEDELTA_TYPES, assert_eq, @@ -402,30 +401,21 @@ def test_series_describe_numeric(dtype): assert_eq(expected, actual) -@pytest.mark.xfail(reason="https://github.com/rapidsai/cudf/issues/6219") -@pytest.mark.parametrize("dtype", DATETIME_TYPES) +@pytest.mark.parametrize("dtype", ["datetime64[ns]"]) def test_series_describe_datetime(dtype): + # Note that other datetime units are not tested because pandas does not + # support them. When specified coarser units, cuDF datetime columns cannot + # represent fractional time for quantiles of the column, which may require + # interpolation, this differs from pandas which always stay in [ns] unit. 
gs = cudf.Series([0, 1, 2, 3, 1, 2, 3], dtype=dtype) ps = gs.to_pandas() - pdf_results = ps.describe(datetime_is_numeric=True) - gdf_results = gs.describe() - - # Assert count - p_count = pdf_results["count"] - g_count = gdf_results["count"] - - assert_eq(int(g_count), p_count) - - # Assert Index - assert_eq(gdf_results.index, pdf_results.index) + # Treating datetimes as categoricals is deprecated in pandas and will + # be removed in future. Future behavior is treating datetime as numeric. + expected = ps.describe(datetime_is_numeric=True) + actual = gs.describe() - # Assert rest of the element apart from - # the first index('count') - actual = gdf_results.tail(-1).astype("datetime64[ns]") - expected = pdf_results.tail(-1).astype("str").astype("datetime64[ns]") - - assert_eq(expected, actual) + assert_eq(expected.astype("str"), actual) @pytest.mark.parametrize("dtype", TIMEDELTA_TYPES) @@ -446,6 +436,13 @@ def test_series_describe_timedelta(dtype): pd.Series([True, False, True, True, False]), pd.Series([], dtype="str"), pd.Series(["a", "b", "c", "a"], dtype="category"), + pd.Series(["d", "e", "f"], dtype="category"), + pd.Series(pd.Categorical(["d", "e", "f"], categories=["f", "e", "d"])), + pd.Series( + pd.Categorical( + ["d", "e", "f"], categories=["f", "e", "d"], ordered=True ) ), ], ) def test_series_describe_other_types(ps):

From 262715390f739075f0bdac01ff8c92206a1c2fb5 Mon Sep 17 00:00:00 2001
From: GALI PREM SAGAR
Date: Tue, 14 Dec 2021 13:58:28 -0600
Subject: [PATCH 3/4] Change default `dtype` of all nulls column from `float` to `object` (#9803)

Fixes: #9337

- [x] This PR changes the default `dtype` of an all-nulls column from `float64` to `object`.
- [x] To make `np.nan` values be read as a `float` column, `nan_as_null` has to be passed as `False` to the `cudf.DataFrame` constructor.
  - This change is in line with what the `cudf.Series` constructor already supports.
- [x] Added `has_nans` & `nan_count` properties, which are needed for some of the checks.
- [x] Cached `nan_count`, since it is repeatedly used in math operations, and cleared it in the regular `_clear_cache` call.
- [x] Fixes pytests that are going to break due to this breaking change of types.
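A rough sketch of the resulting behavior (a hypothetical session with illustrative values, not code from this PR):

```
import numpy as np
import cudf

# An all-nulls column now defaults to `object` dtype instead of `float64`.
s = cudf.Series([None, None])
print(s.dtype)  # expected: object

# Passing nan_as_null=False keeps np.nan values, so the column stays float.
df = cudf.DataFrame({"a": [np.nan, np.nan]}, nan_as_null=False)
print(df["a"].dtype)  # expected: float64
```
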
Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - https://github.com/brandon-b-miller - Ashwin Srinath (https://github.com/shwina) URL: https://github.com/rapidsai/cudf/pull/9803 --- python/cudf/cudf/_lib/column.pyi | 3 +- python/cudf/cudf/_lib/column.pyx | 3 +- python/cudf/cudf/core/_base_index.py | 2 +- python/cudf/cudf/core/column/column.py | 34 ++++++---- python/cudf/cudf/core/column/datetime.py | 2 +- python/cudf/cudf/core/column/numerical.py | 64 +++++++++++++++++-- .../cudf/cudf/core/column/numerical_base.py | 9 ++- python/cudf/cudf/core/column/string.py | 15 +---- python/cudf/cudf/core/dataframe.py | 44 +++++++++---- python/cudf/cudf/core/frame.py | 2 +- python/cudf/cudf/core/index.py | 2 +- python/cudf/cudf/core/multiindex.py | 2 +- python/cudf/cudf/core/series.py | 2 +- python/cudf/cudf/core/tools/datetimes.py | 2 +- python/cudf/cudf/core/tools/numeric.py | 2 +- python/cudf/cudf/core/window/rolling.py | 2 +- python/cudf/cudf/tests/test_dataframe.py | 54 +++++++++++++--- python/cudf/cudf/tests/test_interpolate.py | 8 ++- python/cudf/cudf/tests/test_list.py | 15 +++-- python/cudf/cudf/tests/test_onehot.py | 12 +++- python/cudf/cudf/tests/test_repr.py | 6 +- python/cudf/cudf/tests/test_series.py | 2 +- python/cudf/cudf/tests/test_stats.py | 5 +- 23 files changed, 210 insertions(+), 82 deletions(-) diff --git a/python/cudf/cudf/_lib/column.pyi b/python/cudf/cudf/_lib/column.pyi index dafaa8f4d1d..235cb4fd973 100644 --- a/python/cudf/cudf/_lib/column.pyi +++ b/python/cudf/cudf/_lib/column.pyi @@ -70,8 +70,7 @@ class Column: def nullable(self) -> bool: ... - @property - def has_nulls(self) -> bool: + def has_nulls(self, include_nan: bool=False) -> bool: ... @property diff --git a/python/cudf/cudf/_lib/column.pyx b/python/cudf/cudf/_lib/column.pyx index ff3f3050e63..5e0ee3136b7 100644 --- a/python/cudf/cudf/_lib/column.pyx +++ b/python/cudf/cudf/_lib/column.pyx @@ -146,8 +146,7 @@ cdef class Column: def nullable(self): return self.base_mask is not None - @property - def has_nulls(self): + def has_nulls(self, include_nan=False): return self.null_count != 0 @property diff --git a/python/cudf/cudf/core/_base_index.py b/python/cudf/cudf/core/_base_index.py index 2fcc976d8e1..ac5e152d011 100644 --- a/python/cudf/cudf/core/_base_index.py +++ b/python/cudf/cudf/core/_base_index.py @@ -147,7 +147,7 @@ def _clean_nulls_from_index(self): methods using this method to replace or handle representation of the actual types correctly. """ - if self._values.has_nulls: + if self._values.has_nulls(): return cudf.Index( self._values.astype("str").fillna(cudf._NA_REP), name=self.name ) diff --git a/python/cudf/cudf/core/column/column.py b/python/cudf/cudf/core/column/column.py index 1d113f6e159..a98052ce906 100644 --- a/python/cudf/cudf/core/column/column.py +++ b/python/cudf/cudf/core/column/column.py @@ -139,7 +139,7 @@ def values_host(self) -> "np.ndarray": if len(self) == 0: return np.array([], dtype=self.dtype) - if self.has_nulls: + if self.has_nulls(): raise ValueError("Column must have no nulls.") return self.data_array_view.copy_to_host() @@ -152,7 +152,7 @@ def values(self) -> "cupy.ndarray": if len(self) == 0: return cupy.array([], dtype=self.dtype) - if self.has_nulls: + if self.has_nulls(): raise ValueError("Column must have no nulls.") return cupy.asarray(self.data_array_view) @@ -193,7 +193,7 @@ def all(self, skipna: bool = True) -> bool: def any(self, skipna: bool = True) -> bool: # Early exit for fast cases. 
result_col = self.nans_to_nulls() if skipna else self - if not skipna and result_col.has_nulls: + if not skipna and result_col.has_nulls(): return True elif skipna and result_col.null_count == result_col.size: return False @@ -786,7 +786,7 @@ def as_mask(self) -> Buffer: Buffer """ - if self.has_nulls: + if self.has_nulls(): raise ValueError("Column must have no nulls.") return bools_to_mask(self) @@ -797,13 +797,13 @@ def is_unique(self) -> bool: @property def is_monotonic_increasing(self) -> bool: - return not self.has_nulls and self.as_frame()._is_sorted( + return not self.has_nulls() and self.as_frame()._is_sorted( ascending=None, null_position=None ) @property def is_monotonic_decreasing(self) -> bool: - return not self.has_nulls and self.as_frame()._is_sorted( + return not self.has_nulls() and self.as_frame()._is_sorted( ascending=[False], null_position=None ) @@ -942,7 +942,7 @@ def as_categorical_column(self, dtype, **kwargs) -> ColumnBase: ) # columns include null index in factorization; remove: - if self.has_nulls: + if self.has_nulls(): cats = cats._column.dropna(drop_nan=False) min_type = min_unsigned_type(len(cats), 8) labels = labels - 1 @@ -1216,10 +1216,10 @@ def _process_for_reduction( if skipna: result_col = self.nans_to_nulls() - if result_col.has_nulls: + if result_col.has_nulls(): result_col = result_col.dropna() else: - if self.has_nulls: + if self.has_nulls(): return cudf.utils.dtypes._get_nan_for_dtype(self.dtype) result_col = self @@ -1766,12 +1766,20 @@ def as_column( "https://issues.apache.org/jira/browse/ARROW-3802" ) col = ColumnBase.from_arrow(arbitrary) + if isinstance(arbitrary, pa.NullArray): - if type(dtype) == str and dtype == "empty": - new_dtype = cudf.dtype(arbitrary.type.to_pandas_dtype()) + new_dtype = cudf.dtype(arbitrary.type.to_pandas_dtype()) + if dtype is not None: + # Cast the column to the `dtype` if specified. + col = col.astype(dtype) + elif len(arbitrary) == 0: + # If the column is empty, it has to be + # a `float64` dtype. + col = col.astype("float64") else: - new_dtype = cudf.dtype(dtype) - col = col.astype(new_dtype) + # If the null column is not empty, it has to + # be of `object` dtype. 
+ col = col.astype(new_dtype) return col diff --git a/python/cudf/cudf/core/column/datetime.py b/python/cudf/cudf/core/column/datetime.py index 08d72f1c6ee..24ec25acbbb 100644 --- a/python/cudf/cudf/core/column/datetime.py +++ b/python/cudf/cudf/core/column/datetime.py @@ -292,7 +292,7 @@ def __cuda_array_interface__(self) -> Mapping[builtins.str, Any]: "version": 1, } - if self.nullable and self.has_nulls: + if self.nullable and self.has_nulls(): # Create a simple Python object that exposes the # `__cuda_array_interface__` attribute here since we need to modify diff --git a/python/cudf/cudf/core/column/numerical.py b/python/cudf/cudf/core/column/numerical.py index becb303feeb..c947440edb1 100644 --- a/python/cudf/cudf/core/column/numerical.py +++ b/python/cudf/cudf/core/column/numerical.py @@ -3,7 +3,16 @@ from __future__ import annotations from types import SimpleNamespace -from typing import Any, Callable, Mapping, Sequence, Tuple, Union, cast +from typing import ( + Any, + Callable, + Mapping, + Optional, + Sequence, + Tuple, + Union, + cast, +) import cupy import numpy as np @@ -47,6 +56,8 @@ class NumericalColumn(NumericalBaseColumn): mask : Buffer, optional """ + _nan_count: Optional[int] + def __init__( self, data: Buffer, @@ -62,7 +73,7 @@ def __init__( raise ValueError("Buffer size must be divisible by element size") if size is None: size = (data.size // dtype.itemsize) - offset - + self._nan_count = None super().__init__( data, size=size, @@ -72,6 +83,10 @@ def __init__( null_count=null_count, ) + def _clear_cache(self): + super()._clear_cache() + self._nan_count = None + def __contains__(self, item: ScalarLike) -> bool: """ Returns True if column contains item, else False. @@ -90,6 +105,11 @@ def __contains__(self, item: ScalarLike) -> bool: self, column.as_column([item], dtype=self.dtype) ).any() + def has_nulls(self, include_nan=False): + return self.null_count != 0 or ( + self.nan_count != 0 if include_nan else False + ) + @property def __cuda_array_interface__(self) -> Mapping[str, Any]: output = { @@ -100,7 +120,7 @@ def __cuda_array_interface__(self) -> Mapping[str, Any]: "version": 1, } - if self.nullable and self.has_nulls: + if self.nullable and self.has_nulls(): # Create a simple Python object that exposes the # `__cuda_array_interface__` attribute here since we need to modify @@ -280,6 +300,15 @@ def as_numerical_column(self, dtype: Dtype, **kwargs) -> NumericalColumn: return self return libcudf.unary.cast(self, dtype) + @property + def nan_count(self) -> int: + if self.dtype.kind != "f": + self._nan_count = 0 + elif self._nan_count is None: + nan_col = libcudf.unary.is_nan(self) + self._nan_count = nan_col.sum() + return self._nan_count + def _process_values_for_isin( self, values: Sequence ) -> Tuple[ColumnBase, ColumnBase]: @@ -296,6 +325,20 @@ def _process_values_for_isin( return lhs, rhs + def _can_return_nan(self, skipna: bool = None) -> bool: + return not skipna and self.has_nulls(include_nan=True) + + def _process_for_reduction( + self, skipna: bool = None, min_count: int = 0 + ) -> Union[ColumnBase, ScalarLike]: + skipna = True if skipna is None else skipna + + if self._can_return_nan(skipna=skipna): + return cudf.utils.dtypes._get_nan_for_dtype(self.dtype) + return super()._process_for_reduction( + skipna=skipna, min_count=min_count + ) + def _default_na_value(self) -> ScalarLike: """Returns the default NA value for this column""" dkind = self.dtype.kind @@ -319,8 +362,19 @@ def find_and_replace( """ Return col with *to_replace* replaced with *value*. 
""" + + # If all of `to_replace`/`replacement` are `None`, + # dtype of `to_replace_col`/`replacement_col` + # is inferred as `string`, but this is a valid + # float64 column too, Hence we will need to type-cast + # to self.dtype. to_replace_col = column.as_column(to_replace) + if to_replace_col.null_count == len(to_replace_col): + to_replace_col = to_replace_col.astype(self.dtype) + replacement_col = column.as_column(replacement) + if replacement_col.null_count == len(replacement_col): + replacement_col = replacement_col.astype(self.dtype) if type(to_replace_col) != type(replacement_col): raise TypeError( @@ -578,7 +632,7 @@ def to_pandas( arrow_array = self.to_arrow() pandas_array = pandas_nullable_dtype.__from_arrow__(arrow_array) pd_series = pd.Series(pandas_array, copy=False) - elif str(self.dtype) in NUMERIC_TYPES and not self.has_nulls: + elif str(self.dtype) in NUMERIC_TYPES and not self.has_nulls(): pd_series = pd.Series(cupy.asnumpy(self.values), copy=False) else: pd_series = self.to_arrow().to_pandas(**kwargs) @@ -597,6 +651,8 @@ def _normalize_find_and_replace_input( ) col_to_normalize_dtype = normalized_column.dtype if isinstance(col_to_normalize, list): + if normalized_column.null_count == len(normalized_column): + normalized_column = normalized_column.astype(input_column_dtype) col_to_normalize_dtype = min_column_type( normalized_column, input_column_dtype ) diff --git a/python/cudf/cudf/core/column/numerical_base.py b/python/cudf/cudf/core/column/numerical_base.py index 853fb360c50..1f84cb88e37 100644 --- a/python/cudf/cudf/core/column/numerical_base.py +++ b/python/cudf/cudf/core/column/numerical_base.py @@ -77,10 +77,13 @@ def sum_of_squares( "sum_of_squares", skipna=skipna, dtype=dtype, min_count=min_count ) + def _can_return_nan(self, skipna: bool = None) -> bool: + return not skipna and self.has_nulls() + def kurtosis(self, skipna: bool = None) -> float: skipna = True if skipna is None else skipna - if len(self) == 0 or (not skipna and self.has_nulls): + if len(self) == 0 or self._can_return_nan(skipna=skipna): return cudf.utils.dtypes._get_nan_for_dtype(self.dtype) self = self.nans_to_nulls().dropna() # type: ignore @@ -105,7 +108,7 @@ def kurtosis(self, skipna: bool = None) -> float: def skew(self, skipna: bool = None) -> ScalarLike: skipna = True if skipna is None else skipna - if len(self) == 0 or (not skipna and self.has_nulls): + if len(self) == 0 or self._can_return_nan(skipna=skipna): return cudf.utils.dtypes._get_nan_for_dtype(self.dtype) self = self.nans_to_nulls().dropna() # type: ignore @@ -148,7 +151,7 @@ def quantile( def median(self, skipna: bool = None) -> NumericalBaseColumn: skipna = True if skipna is None else skipna - if not skipna and self.has_nulls: + if self._can_return_nan(skipna=skipna): return cudf.utils.dtypes._get_nan_for_dtype(self.dtype) # enforce linear in case the default ever changes diff --git a/python/cudf/cudf/core/column/string.py b/python/cudf/cudf/core/column/string.py index 2a91abc5701..1c9a013810a 100644 --- a/python/cudf/cudf/core/column/string.py +++ b/python/cudf/cudf/core/column/string.py @@ -5018,7 +5018,7 @@ def edit_distance_matrix(self) -> ParentType: raise ValueError( "Require size >= 2 to compute edit distance matrix." ) - if self._column.has_nulls: + if self._column.has_nulls(): raise ValueError( "Cannot compute edit distance between null strings. " "Consider removing them using `dropna` or fill with `fillna`." 
@@ -5440,20 +5440,7 @@ def find_and_replace( """ to_replace_col = column.as_column(to_replace) - if to_replace_col.null_count == len(to_replace_col): - # If all of `to_replace` are `None`, dtype of `to_replace_col` - # is inferred as `float64`, but this is a valid - # string column too, Hence we will need to type-cast - # to self.dtype. - to_replace_col = to_replace_col.astype(self.dtype) - replacement_col = column.as_column(replacement) - if replacement_col.null_count == len(replacement_col): - # If all of `replacement` are `None`, dtype of `replacement_col` - # is inferred as `float64`, but this is a valid - # string column too, Hence we will need to type-cast - # to self.dtype. - replacement_col = replacement_col.astype(self.dtype) if type(to_replace_col) != type(replacement_col): raise TypeError( diff --git a/python/cudf/cudf/core/dataframe.py b/python/cudf/cudf/core/dataframe.py index bbe691595e7..88c8aaebd9e 100644 --- a/python/cudf/cudf/core/dataframe.py +++ b/python/cudf/cudf/core/dataframe.py @@ -440,6 +440,11 @@ class DataFrame(IndexedFrame, Serializable, GetAttrGetItemMixin): Data type to force. Only a single dtype is allowed. If None, infer. + nan_as_null : bool, Default True + If ``None``/``True``, converts ``np.nan`` values to + ``null`` values. + If ``False``, leaves ``np.nan`` values as is. + Examples -------- @@ -514,7 +519,9 @@ class DataFrame(IndexedFrame, Serializable, GetAttrGetItemMixin): _iloc_indexer_type = _DataFrameIlocIndexer @annotate("DATAFRAME_INIT", color="blue", domain="cudf_python") - def __init__(self, data=None, index=None, columns=None, dtype=None): + def __init__( + self, data=None, index=None, columns=None, dtype=None, nan_as_null=True + ): super().__init__() @@ -523,7 +530,7 @@ def __init__(self, data=None, index=None, columns=None, dtype=None): if isinstance(data, (DataFrame, pd.DataFrame)): if isinstance(data, pd.DataFrame): - data = self.from_pandas(data) + data = self.from_pandas(data, nan_as_null=nan_as_null) if index is not None: if not data.index.equals(index): @@ -546,11 +553,14 @@ def __init__(self, data=None, index=None, columns=None, dtype=None): self.columns = data.columns elif isinstance(data, (cudf.Series, pd.Series)): if isinstance(data, pd.Series): - data = cudf.Series.from_pandas(data) + data = cudf.Series.from_pandas(data, nan_as_null=nan_as_null) name = data.name or 0 self._init_from_dict_like( - {name: data}, index=index, columns=columns + {name: data}, + index=index, + columns=columns, + nan_as_null=nan_as_null, ) elif data is None: if index is None: @@ -620,7 +630,9 @@ def __init__(self, data=None, index=None, columns=None, dtype=None): if not is_dict_like(data): raise TypeError("data must be list or dict-like") - self._init_from_dict_like(data, index=index, columns=columns) + self._init_from_dict_like( + data, index=index, columns=columns, nan_as_null=nan_as_null + ) if dtype: self._data = self.astype(dtype)._data @@ -759,7 +771,9 @@ def _init_from_list_like(self, data, index=None, columns=None): self.columns = columns - def _init_from_dict_like(self, data, index=None, columns=None): + def _init_from_dict_like( + self, data, index=None, columns=None, nan_as_null=None + ): if columns is not None: # remove all entries in `data` that are # not in `columns` @@ -794,7 +808,9 @@ def _init_from_dict_like(self, data, index=None, columns=None): if is_scalar(data[col_name]): num_rows = num_rows or 1 else: - data[col_name] = column.as_column(data[col_name]) + data[col_name] = column.as_column( + data[col_name], nan_as_null=nan_as_null + ) 
num_rows = len(data[col_name]) self._index = RangeIndex(0, num_rows) else: @@ -806,7 +822,9 @@ def _init_from_dict_like(self, data, index=None, columns=None): self._data.multiindex = self._data.multiindex and isinstance( col_name, tuple ) - self.insert(i, col_name, data[col_name]) + self.insert( + i, col_name, data[col_name], nan_as_null=nan_as_null + ) if columns is not None: self.columns = columns @@ -1747,7 +1765,7 @@ def _clean_nulls_from_dataframe(self, df): if is_list_dtype(df._data[col]) or is_struct_dtype(df._data[col]): # TODO we need to handle this pass - elif df._data[col].has_nulls: + elif df._data[col].has_nulls(): df[col] = df._data[col].astype("str").fillna(cudf._NA_REP) else: df[col] = df._data[col] @@ -2582,7 +2600,7 @@ def take(self, indices, axis=0, keep_index=None): return out @annotate("INSERT", color="green", domain="cudf_python") - def insert(self, loc, name, value): + def insert(self, loc, name, value, nan_as_null=None): """Add a column to DataFrame at the index specified by loc. Parameters @@ -2625,11 +2643,11 @@ def insert(self, loc, name, value): ) self._data = new_data elif isinstance(value, (pd.Series, Series)): - value = Series(value)._align_to_index( + value = Series(value, nan_as_null=nan_as_null)._align_to_index( self._index, how="right", sort=False ) - value = column.as_column(value) + value = column.as_column(value, nan_as_null=nan_as_null) self._data.insert(name, value, loc=loc) @@ -3081,7 +3099,7 @@ def as_gpu_matrix(self, columns=None, order="F"): dtype = find_common_type([col.dtype for col in cols]) for k, c in self._data.items(): - if c.has_nulls: + if c.has_nulls(): raise ValueError( f"column '{k}' has null values. " f"hint: use .fillna() to replace null values" diff --git a/python/cudf/cudf/core/frame.py b/python/cudf/cudf/core/frame.py index 61ce64e7d6b..c85ed0c8555 100644 --- a/python/cudf/cudf/core/frame.py +++ b/python/cudf/cudf/core/frame.py @@ -4755,7 +4755,7 @@ def _scan(self, op, axis=None, skipna=True, cast_to_int=False): result_col = self._data[name].nans_to_nulls() else: result_col = self._data[name].copy() - if result_col.has_nulls: + if result_col.has_nulls(include_nan=True): # Workaround as find_first_value doesn't seem to work # incase of bools. first_index = int( diff --git a/python/cudf/cudf/core/index.py b/python/cudf/cudf/core/index.py index 0002aaf38c5..29e0d17bc39 100644 --- a/python/cudf/cudf/core/index.py +++ b/python/cudf/cudf/core/index.py @@ -2515,7 +2515,7 @@ def _clean_nulls_from_index(self): Convert all na values(if any) in Index object to `` as a preprocessing step to `__repr__` methods. 
""" - if self._values.has_nulls: + if self._values.has_nulls(): return self.fillna(cudf._NA_REP) else: return self diff --git a/python/cudf/cudf/core/multiindex.py b/python/cudf/cudf/core/multiindex.py index 65c79b4cf59..c403c697e3d 100644 --- a/python/cudf/cudf/core/multiindex.py +++ b/python/cudf/cudf/core/multiindex.py @@ -386,7 +386,7 @@ def __repr__(self): else: preprocess = self.copy(deep=False) - if any(col.has_nulls for col in preprocess._data.columns): + if any(col.has_nulls() for col in preprocess._data.columns): preprocess_df = preprocess.to_frame(index=False) for name, col in preprocess._data.items(): if isinstance( diff --git a/python/cudf/cudf/core/series.py b/python/cudf/cudf/core/series.py index bbeae1adc5e..036c8c1ee00 100644 --- a/python/cudf/cudf/core/series.py +++ b/python/cudf/cudf/core/series.py @@ -1474,7 +1474,7 @@ def has_nulls(self): >>> series.dropna().has_nulls False """ - return self._column.has_nulls + return self._column.has_nulls() def dropna(self, axis=0, inplace=False, how=None): """ diff --git a/python/cudf/cudf/core/tools/datetimes.py b/python/cudf/cudf/core/tools/datetimes.py index 34d62ffc048..3efbd982b53 100644 --- a/python/cudf/cudf/core/tools/datetimes.py +++ b/python/cudf/cudf/core/tools/datetimes.py @@ -330,7 +330,7 @@ def _process_col(col, unit, dayfirst, infer_datetime_format, format): col = col.as_datetime_column(dtype=_unit_dtype_map[unit]) elif col.dtype.kind in ("O"): - if unit not in (None, "ns"): + if unit not in (None, "ns") or col.null_count == len(col): try: col = col.astype(dtype="int64") except ValueError: diff --git a/python/cudf/cudf/core/tools/numeric.py b/python/cudf/cudf/core/tools/numeric.py index 7c688b92009..bd1b505c57f 100644 --- a/python/cudf/cudf/core/tools/numeric.py +++ b/python/cudf/cudf/core/tools/numeric.py @@ -165,7 +165,7 @@ def to_numeric(arg, errors="raise", downcast=None): if isinstance(arg, (cudf.Series, pd.Series)): return cudf.Series(col) else: - if col.has_nulls: + if col.has_nulls(): # To match pandas, always return a floating type filled with nan. 
col = col.astype(float).fillna(np.nan) return col.values diff --git a/python/cudf/cudf/core/window/rolling.py b/python/cudf/cudf/core/window/rolling.py index 617dbdeaea5..0f4256e49a6 100644 --- a/python/cudf/cudf/core/window/rolling.py +++ b/python/cudf/cudf/core/window/rolling.py @@ -326,7 +326,7 @@ def apply(self, func, *args, **kwargs): """ has_nulls = False if isinstance(self.obj, cudf.Series): - if self.obj._column.has_nulls: + if self.obj._column.has_nulls(): has_nulls = True else: for col in self.obj._data: diff --git a/python/cudf/cudf/tests/test_dataframe.py b/python/cudf/cudf/tests/test_dataframe.py index c40f9f0b0a5..ab0856fad1e 100644 --- a/python/cudf/cudf/tests/test_dataframe.py +++ b/python/cudf/cudf/tests/test_dataframe.py @@ -246,7 +246,7 @@ def test_series_init_none(): sr1 = cudf.Series() got = sr1.to_string() - expect = "Series([], dtype: float64)" + expect = sr1.to_pandas().__repr__() # values should match despite whitespace difference assert got.split() == expect.split() @@ -254,7 +254,7 @@ def test_series_init_none(): sr2 = cudf.Series(None) got = sr2.to_string() - expect = "Series([], dtype: float64)" + expect = sr2.to_pandas().__repr__() # values should match despite whitespace difference assert got.split() == expect.split() @@ -1308,7 +1308,7 @@ def test_concat_empty_dataframe(df_1, df_2): [ {"a": [1, 2], "b": [1, 2], "c": ["s1", "s2"], "d": [1.0, 2.0]}, {"b": [1.9, 10.9], "c": ["s1", "s2"]}, - {"c": ["s1"], "b": [None], "a": [False]}, + {"c": ["s1"], "b": pd.Series([None], dtype="float"), "a": [False]}, ], ) @pytest.mark.parametrize( @@ -2008,8 +2008,8 @@ def test_dataframe_count_reduction(data, func): {"x": [np.nan, 2, 3, 4, 100, np.nan], "y": [4, 5, 6, 88, 99, np.nan]}, {"x": [1, 2, 3], "y": [4, 5, 6]}, {"x": [np.nan, np.nan, np.nan], "y": [np.nan, np.nan, np.nan]}, - {"x": [], "y": []}, - {"x": []}, + {"x": pd.Series([], dtype="float"), "y": pd.Series([], dtype="float")}, + {"x": pd.Series([], dtype="int")}, ], ) @pytest.mark.parametrize("ops", ["sum", "product", "prod"]) @@ -2017,7 +2017,7 @@ def test_dataframe_count_reduction(data, func): @pytest.mark.parametrize("min_count", [-10, -1, 0, 1, 2, 3, 10]) def test_dataframe_min_count_ops(data, ops, skipna, min_count): psr = pd.DataFrame(data) - gsr = cudf.DataFrame(data) + gsr = cudf.from_pandas(psr) assert_eq( getattr(psr, ops)(skipna=skipna, min_count=min_count), @@ -2498,7 +2498,7 @@ def test_series_all_null(num_elements, null_type): # Typecast Pandas because None will return `object` dtype expect = pd.Series(data, dtype="float64") - got = cudf.Series(data) + got = cudf.Series(data, dtype="float64") assert_eq(expect, got) @@ -8480,10 +8480,10 @@ def test_agg_for_dataframe_with_string_columns(aggs): ) def test_update_for_dataframes(data, data2, join, overwrite, errors): pdf = pd.DataFrame(data) - gdf = cudf.DataFrame(data) + gdf = cudf.DataFrame(data, nan_as_null=False) other_pd = pd.DataFrame(data2) - other_gd = cudf.DataFrame(data2) + other_gd = cudf.DataFrame(data2, nan_as_null=False) pdf.update(other=other_pd, join=join, overwrite=overwrite, errors=errors) gdf.update(other=other_gd, join=join, overwrite=overwrite, errors=errors) @@ -8949,7 +8949,9 @@ def test_frame_series_where_other(data): ( { "id": ["a", "a", "b", "b", "c", "c"], - "val": [None, None, None, None, None, None], + "val": cudf.Series( + [None, None, None, None, None, None], dtype="float64" + ), }, ["id"], ), @@ -9041,6 +9043,38 @@ def test_pearson_corr_multiindex_dataframe(): assert_eq(expected, actual) +@pytest.mark.parametrize( + "data", + 
[ + {"a": [np.nan, 1, 2], "b": [None, None, None]}, + {"a": [1, 2, np.nan, 2], "b": [np.nan, np.nan, np.nan, np.nan]}, + { + "a": [1, 2, np.nan, 2, None], + "b": [np.nan, np.nan, None, np.nan, np.nan], + }, + {"a": [1, 2, 2, None, 1.1], "b": [1, 2.2, 3, None, 5]}, + ], +) +@pytest.mark.parametrize("nan_as_null", [True, False]) +def test_dataframe_constructor_nan_as_null(data, nan_as_null): + actual = cudf.DataFrame(data, nan_as_null=nan_as_null) + + if nan_as_null: + assert ( + not ( + actual.astype("float").replace( + cudf.Series([np.nan], nan_as_null=False), cudf.Series([-1]) + ) + == -1 + ) + .any() + .any() + ) + else: + actual = actual.select_dtypes(exclude=["object"]) + assert (actual.replace(np.nan, -1) == -1).any().any() + + def test_dataframe_add_prefix(): cdf = cudf.DataFrame({"A": [1, 2, 3, 4], "B": [3, 4, 5, 6]}) pdf = cdf.to_pandas() diff --git a/python/cudf/cudf/tests/test_interpolate.py b/python/cudf/cudf/tests/test_interpolate.py index 66556c48828..2c544dfc17c 100644 --- a/python/cudf/cudf/tests/test_interpolate.py +++ b/python/cudf/cudf/tests/test_interpolate.py @@ -50,7 +50,9 @@ def test_interpolate_series(data, method, axis): expect = psr.interpolate(method=method, axis=axis) got = gsr.interpolate(method=method, axis=axis) - assert_eq(expect, got) + assert_eq( + expect, got, check_dtype=False if psr.dtype == "object" else True + ) @pytest.mark.parametrize( @@ -88,7 +90,9 @@ def test_interpolate_series_values_or_index(data, index, method): expect = psr.interpolate(method=method) got = gsr.interpolate(method=method) - assert_eq(expect, got) + assert_eq( + expect, got, check_dtype=False if psr.dtype == "object" else True + ) @pytest.mark.parametrize( diff --git a/python/cudf/cudf/tests/test_list.py b/python/cudf/cudf/tests/test_list.py index 2b71ca7ac36..b898222d7d7 100644 --- a/python/cudf/cudf/tests/test_list.py +++ b/python/cudf/cudf/tests/test_list.py @@ -76,10 +76,14 @@ def test_leaves(data): pa_array = pa.array(data) while hasattr(pa_array, "flatten"): pa_array = pa_array.flatten() - dtype = "int8" if isinstance(pa_array, pa.NullArray) else None - expect = cudf.Series(pa_array, dtype=dtype) + + expect = cudf.Series(pa_array) got = cudf.Series(data).list.leaves - assert_eq(expect, got) + assert_eq( + expect, + got, + check_dtype=False if isinstance(pa_array, pa.NullArray) else True, + ) def test_list_to_pandas_nullable_true(): @@ -269,7 +273,10 @@ def test_get(data, index, expect): sr = cudf.Series(data) expect = cudf.Series(expect) got = sr.list.get(index) - assert_eq(expect, got) + + assert_eq( + expect, got, check_dtype=False if expect.isnull().all() else True + ) def test_get_nested_lists(): diff --git a/python/cudf/cudf/tests/test_onehot.py b/python/cudf/cudf/tests/test_onehot.py index ed55fb86820..f2a20a73b63 100644 --- a/python/cudf/cudf/tests/test_onehot.py +++ b/python/cudf/cudf/tests/test_onehot.py @@ -113,10 +113,18 @@ def test_get_dummies(data): encoded_expected = pd.get_dummies(pdf, prefix="test") encoded_actual = cudf.get_dummies(gdf, prefix="test") - utils.assert_eq(encoded_expected, encoded_actual) + utils.assert_eq( + encoded_expected, + encoded_actual, + check_dtype=False if len(data) == 0 else True, + ) encoded_actual = cudf.get_dummies(gdf, prefix="test", dtype=np.uint8) - utils.assert_eq(encoded_expected, encoded_actual) + utils.assert_eq( + encoded_expected, + encoded_actual, + check_dtype=False if len(data) == 0 else True, + ) @pytest.mark.parametrize("n_cols", [5, 10, 20]) diff --git a/python/cudf/cudf/tests/test_repr.py 
b/python/cudf/cudf/tests/test_repr.py index 736bcf131cc..fe95b2930df 100644 --- a/python/cudf/cudf/tests/test_repr.py +++ b/python/cudf/cudf/tests/test_repr.py @@ -328,10 +328,14 @@ def test_dataframe_sliced(gdf, slice, max_seq_items, max_rows): ), ( cudf.Index([None, None, None], name="hello"), + "StringIndex([None None None], dtype='object', name='hello')", + ), + ( + cudf.Index([None, None, None], dtype="float", name="hello"), "Float64Index([, , ], dtype='float64', name='hello')", ), ( - cudf.Index([None], name="hello"), + cudf.Index([None], dtype="float64", name="hello"), "Float64Index([], dtype='float64', name='hello')", ), ( diff --git a/python/cudf/cudf/tests/test_series.py b/python/cudf/cudf/tests/test_series.py index 1e11e862329..583d2c7a8dd 100644 --- a/python/cudf/cudf/tests/test_series.py +++ b/python/cudf/cudf/tests/test_series.py @@ -971,7 +971,7 @@ def test_series_update(data, other): @pytest.mark.parametrize("nan_as_null", [True, False]) @pytest.mark.parametrize("fill_value", [1.2, 332, np.nan]) def test_fillna_with_nan(data, nan_as_null, fill_value): - gs = cudf.Series(data, nan_as_null=nan_as_null) + gs = cudf.Series(data, dtype="float64", nan_as_null=nan_as_null) ps = gs.to_pandas() expected = ps.fillna(fill_value) diff --git a/python/cudf/cudf/tests/test_stats.py b/python/cudf/cudf/tests/test_stats.py index ebe78d56c3f..142ca6c6831 100644 --- a/python/cudf/cudf/tests/test_stats.py +++ b/python/cudf/cudf/tests/test_stats.py @@ -460,7 +460,8 @@ def test_df_corr(): @pytest.mark.parametrize("skipna", [True, False, None]) def test_nans_stats(data, ops, skipna): psr = cudf.utils.utils._create_pandas_series(data=data) - gsr = cudf.Series(data) + gsr = cudf.Series(data, nan_as_null=False) + assert_eq( getattr(psr, ops)(skipna=skipna), getattr(gsr, ops)(skipna=skipna) ) @@ -486,7 +487,7 @@ def test_nans_stats(data, ops, skipna): @pytest.mark.parametrize("min_count", [-10, -1, 0, 1, 2, 3, 5, 10]) def test_min_count_ops(data, ops, skipna, min_count): psr = pd.Series(data) - gsr = cudf.Series(data) + gsr = cudf.Series(data, nan_as_null=False) assert_eq( getattr(psr, ops)(skipna=skipna, min_count=min_count), From 7a23f1a01547648db7ad684fa3dc0482b7ac813f Mon Sep 17 00:00:00 2001 From: David Wendt <45795991+davidwendt@users.noreply.github.com> Date: Tue, 14 Dec 2021 15:28:59 -0500 Subject: [PATCH 4/4] Add utility to format ninja-log build times (#9631) Reference: https://github.com/rapidsai/ops/issues/1896 Generate build times log from formatted, sorted `.ninja_log` file. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) - Robert Maynard (https://github.com/robertmaynard) - Karthikeyan (https://github.com/karthikeyann) URL: https://github.com/rapidsai/cudf/pull/9631 --- build.sh | 18 +++++ ci/gpu/build.sh | 12 +++- cpp/scripts/sort_ninja_log.py | 121 ++++++++++++++++++++++++++++++++++ 3 files changed, 150 insertions(+), 1 deletion(-) create mode 100755 cpp/scripts/sort_ninja_log.py diff --git a/build.sh b/build.sh index d0ccd4821e0..adf6e220744 100755 --- a/build.sh +++ b/build.sh @@ -172,6 +172,12 @@ if buildAll || hasArg libcudf; then echo "Building for *ALL* supported GPU architectures..." 
fi + # get the current count before the compile starts + FILES_IN_CCACHE="" + if [ -x "$(command -v ccache)" ]; then + FILES_IN_CCACHE=$(ccache -s | grep "files in cache") + fi + cmake -S $REPODIR/cpp -B ${LIB_BUILD_DIR} \ -DCMAKE_INSTALL_PREFIX=${INSTALL_PREFIX} \ ${CUDF_CMAKE_CUDA_ARCHITECTURES} \ @@ -185,7 +191,19 @@ if buildAll || hasArg libcudf; then cd ${LIB_BUILD_DIR} + compile_start=$(date +%s) cmake --build . -j${PARALLEL_LEVEL} ${VERBOSE_FLAG} + compile_end=$(date +%s) + compile_total=$(( compile_end - compile_start )) + + # Record build times + if [[ -f "${LIB_BUILD_DIR}/.ninja_log" ]]; then + echo "Formatting build times" + python ${REPODIR}/cpp/scripts/sort_ninja_log.py ${LIB_BUILD_DIR}/.ninja_log --fmt xml > ${LIB_BUILD_DIR}/ninja_log.xml + message="$FILES_IN_CCACHE

$PARALLEL_LEVEL parallel build time is $compile_total seconds" + echo "$message" + python ${REPODIR}/cpp/scripts/sort_ninja_log.py ${LIB_BUILD_DIR}/.ninja_log --fmt html --msg "$message" > ${LIB_BUILD_DIR}/ninja_log.html + fi if [[ ${INSTALL_TARGET} != "" ]]; then cmake --build . -j${PARALLEL_LEVEL} --target install ${VERBOSE_FLAG} diff --git a/ci/gpu/build.sh b/ci/gpu/build.sh index d8b5cc7ba4c..00ad6bf812d 100755 --- a/ci/gpu/build.sh +++ b/ci/gpu/build.sh @@ -166,16 +166,26 @@ else gpuci_logger "Check GPU usage" nvidia-smi - gpuci_logger "GoogleTests" set -x cd $LIB_BUILD_DIR + gpuci_logger "GoogleTests" + for gt in gtests/* ; do test_name=$(basename ${gt}) echo "Running GoogleTest $test_name" ${gt} --gtest_output=xml:"$WORKSPACE/test-results/" done + # Copy libcudf build time results + echo "Checking for build time log $LIB_BUILD_DIR/ninja_log.html" + if [[ -f "$LIB_BUILD_DIR/ninja_log.html" ]]; then + gpuci_logger "Copying build time results" + cp "$LIB_BUILD_DIR/ninja_log.xml" "$WORKSPACE/test-results/buildtimes-junit.xml" + mkdir -p "$WORKSPACE/build-metrics" + cp "$LIB_BUILD_DIR/ninja_log.html" "$WORKSPACE/build-metrics/BuildMetrics.html" + fi + ################################################################################ # MEMCHECK - Run compute-sanitizer on GoogleTest (only in nightly builds) ################################################################################ diff --git a/cpp/scripts/sort_ninja_log.py b/cpp/scripts/sort_ninja_log.py new file mode 100755 index 00000000000..5eada13aea2 --- /dev/null +++ b/cpp/scripts/sort_ninja_log.py @@ -0,0 +1,121 @@ +# +# Copyright (c) 2021, NVIDIA CORPORATION. +# +import argparse +import os +import sys +import xml.etree.ElementTree as ET +from xml.dom import minidom + +parser = argparse.ArgumentParser() +parser.add_argument( + "log_file", type=str, default=".ninja_log", help=".ninja_log file" +) +parser.add_argument( + "--fmt", + type=str, + default="csv", + choices=["csv", "xml", "html"], + help="output format (to stdout)", +) +parser.add_argument( + "--msg", + type=str, + default=None, + help="optional message to include in html output", +) +args = parser.parse_args() + +log_file = args.log_file +log_path = os.path.dirname(os.path.abspath(log_file)) + +output_fmt = args.fmt + +# build a map of the log entries +entries = {} +with open(log_file, "r") as log: + for line in log: + entry = line.split() + if len(entry) > 4: + elapsed = int(entry[1]) - int(entry[0]) + obj_file = entry[3] + file_size = ( + os.path.getsize(os.path.join(log_path, obj_file)) + if os.path.exists(obj_file) + else 0 + ) + entries[entry[3]] = (elapsed, file_size) + +# check file could be loaded +if len(entries) == 0: + print("Could not parse", log_file) + exit() + +# sort the keys by build time (descending order) +keys = list(entries.keys()) +sl = sorted(keys, key=lambda k: entries[k][0], reverse=True) + +if output_fmt == "xml": + # output results in XML format + root = ET.Element("testsuites") + testsuite = ET.Element( + "testsuite", + attrib={ + "name": "build-time", + "tests": str(len(keys)), + "failures": str(0), + "errors": str(0), + }, + ) + root.append(testsuite) + for key in sl: + entry = entries[key] + elapsed = float(entry[0]) / 1000 + item = ET.Element( + "testcase", + attrib={ + "classname": "BuildTime", + "name": key, + "time": str(elapsed), + }, + ) + testsuite.append(item) + + tree = ET.ElementTree(root) + xmlstr = minidom.parseString(ET.tostring(root)).toprettyxml(indent=" ") + print(xmlstr) + +elif output_fmt == "html": + # output results in 
HTML format + print("Sorted Ninja Build Times") + print("") + print("") + if args.msg is not None: + print("

", args.msg, "

") + print("
") + print( + "", + "", + "", + sep="", + ) + for key in sl: + result = entries[key] + print( + "", + sep="", + ) + print("
FileCompile time (ms)Size (bytes)
", + key, + "", + result[0], + "", + result[1], + "
") + +else: + # output results in CSV format + print("time,size,file") + for key in sl: + result = entries[key] + print(result[0], result[1], key, sep=",")