Reduce memory usage of as_categorical_column
The main culprit was the way the codes returned from _label_encoding
were being ordered. We were generating an int64 column for the order,
gathering it through the left gather map, and then argsorting, before
using that ordering as a gather map for the codes.

We note that gather(y, with=argsort(x)) is equivalent to
sort_by_key(y, with=x), so we use that instead, avoiding an
unnecessary gather. Furthermore, gather([0..n), with=x) is just x
itself, so we can drop a second gather as well.
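
The two equivalences can be checked with a small pure-Python model (list-based stand-ins for the column primitives; illustrative, not the cudf API):

```python
# Pure-Python stand-ins for the column primitives (illustrative, not cudf).

def gather(values, gather_map):
    """gather(y, with=m): pick values[i] for each i in m."""
    return [values[i] for i in gather_map]

def argsort(xs):
    """Indices that would stably sort xs."""
    return sorted(range(len(xs)), key=xs.__getitem__)

def sort_by_key(values, keys):
    """Stably reorder values by the sorted order of keys."""
    return [v for _, v in sorted(zip(keys, values), key=lambda kv: kv[0])]

x = [3, 1, 2, 0]
y = ["a", "b", "c", "d"]

# Equivalence 1: gather(y, with=argsort(x)) == sort_by_key(y, with=x)
assert gather(y, argsort(x)) == sort_by_key(y, x) == ["d", "b", "c", "a"]

# Equivalence 2: gather([0..n), with=x) == x
assert gather(list(range(4)), x) == x
```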

This reduces the peak memory footprint of categorifying a random
column of 500_000_000 int32 values where there are 100 unique values
from 24.75 GiB to 11.67 GiB.
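
A back-of-envelope on the temporaries involved (rough arithmetic assuming 8-byte int64 index columns; not a profile of the actual allocations):

```python
# Rough size accounting for the benchmark column (illustrative only).
n = 500_000_000               # rows in the benchmark column
GiB = 2 ** 30

input_col = n * 4 / GiB       # the int32 input column itself
int64_temp = n * 8 / GiB      # one int64 index column of n rows

print(f"input column:     {input_col:.2f} GiB")   # ~1.86 GiB
print(f"one int64 temp:   {int64_temp:.2f} GiB")  # ~3.73 GiB
# The old path materialized several such index columns (the order column,
# its gathered copy, the argsort result), each a multiple of the input's
# size, which is consistent with the large drop in peak footprint once
# they were eliminated.
```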
wence- committed Sep 20, 2023
1 parent 63d197f commit 2dac975
Showing 1 changed file with 8 additions and 9 deletions.
python/cudf/cudf/core/column/column.py (8 additions, 9 deletions):

@@ -1390,20 +1390,19 @@ def _return_sentinel_column():
         except ValueError:
             return _return_sentinel_column()
 
-        codes = arange(len(cats), dtype=dtype)
         left_gather_map, right_gather_map = cpp_join(
             [self], [cats], how="left"
         )
-        codes = codes.take(
-            right_gather_map, nullify=True, check_bounds=False
-        ).fillna(na_sentinel.value)
-
+        codes = libcudf.copying.gather(
+            [arange(len(cats), dtype=dtype)], right_gather_map, nullify=True
+        )
+        del right_gather_map
         # reorder `codes` so that its values correspond to the
         # values of `self`:
-        order = arange(len(self))
-        order = order.take(left_gather_map, check_bounds=False).argsort()
-        codes = codes.take(order)
-        return codes
+        (codes,) = libcudf.sort.sort_by_key(
+            codes, [left_gather_map], [True], ["last"], stable=True
+        )
+        return codes.fillna(na_sentinel.value)
 
 
 def column_empty_like(
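
As a sanity check, both code paths can be modelled end to end in pure Python (hypothetical list-based helpers standing in for the join/gather/sort primitives, not the cudf code); the old and new orderings produce identical codes:

```python
def left_join(left, right):
    """Model a left equality join: returns (left_map, right_map).

    Unmatched left rows pair with None; the output order is deliberately
    scrambled, since a hash join does not preserve row order.
    """
    pos = {v: i for i, v in enumerate(right)}  # assumes right values unique
    pairs = [(i, pos.get(v)) for i, v in enumerate(left)]
    pairs.reverse()  # mimic a non-order-preserving join
    return [l for l, _ in pairs], [r for _, r in pairs]

def old_path(values, cats, sentinel=-1):
    lmap, rmap = left_join(values, cats)
    # gather of arange(len(cats)) through the right map, nulls -> sentinel
    codes = [r if r is not None else sentinel for r in rmap]
    order = [list(range(len(values)))[i] for i in lmap]       # gather arange
    order = sorted(range(len(order)), key=order.__getitem__)  # argsort
    return [codes[i] for i in order]                          # final gather

def new_path(values, cats, sentinel=-1):
    lmap, rmap = left_join(values, cats)
    codes = [r if r is not None else sentinel for r in rmap]
    # sort_by_key replaces gather-through-argsort; gather(arange, lmap) == lmap
    return [c for _, c in sorted(zip(lmap, codes), key=lambda kv: kv[0])]

values = ["b", "a", "x", "b", "c"]
cats = ["a", "b", "c"]
assert old_path(values, cats) == new_path(values, cats) == [1, 0, -1, 1, 2]
```

Note that the new path performs no gathers after the join: the codes column is reordered directly by the left gather map, which is exactly what the sort_by_key call in the diff does.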
