Reduce memory usage of as_categorical_column
The main culprit was the way the codes returned from _label_encoding
were being ordered. We were generating an int64 column for the order,
gathering it through the left gather map, and then argsorting, before
using that ordering as a gather map for the codes.

We note that gather(y, with=argsort(x)) is equivalent to
sort_by_key(y, with=x), so we use that instead, avoiding an
unnecessary gather. Furthermore, gather([0..n), with=x) is just x
itself, so we can drop a second gather as well.
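
The two equivalences can be checked with a small pure-Python model (list-based stand-ins for the column primitives; illustrative, not the cudf API):

```python
# Pure-Python stand-ins for the column primitives (illustrative, not cudf).

def gather(values, gather_map):
    """gather(y, with=m): pick values[i] for each i in m."""
    return [values[i] for i in gather_map]

def argsort(xs):
    """Indices that would stably sort xs."""
    return sorted(range(len(xs)), key=xs.__getitem__)

def sort_by_key(values, keys):
    """Stably reorder values by the sorted order of keys."""
    return [v for _, v in sorted(zip(keys, values), key=lambda kv: kv[0])]

x = [3, 1, 2, 0]
y = ["a", "b", "c", "d"]

# Equivalence 1: gather(y, with=argsort(x)) == sort_by_key(y, with=x)
assert gather(y, argsort(x)) == sort_by_key(y, x) == ["d", "b", "c", "a"]

# Equivalence 2: gather([0..n), with=x) == x
assert gather(list(range(4)), x) == x
```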

This reduces the peak memory footprint of categorifying a random
column of 500_000_000 int32 values where there are 100 unique values
from 24.75 GiB to 11.67 GiB.
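
A back-of-envelope on the temporaries involved (rough arithmetic assuming 8-byte int64 index columns; not a profile of the actual allocations):

```python
# Rough size accounting for the benchmark column (illustrative only).
n = 500_000_000               # rows in the benchmark column
GiB = 2 ** 30

input_col = n * 4 / GiB       # the int32 input column itself
int64_temp = n * 8 / GiB      # one int64 index column of n rows

print(f"input column:     {input_col:.2f} GiB")   # ~1.86 GiB
print(f"one int64 temp:   {int64_temp:.2f} GiB")  # ~3.73 GiB
# The old path materialized several such index columns (the order column,
# its gathered copy, the argsort result), each a multiple of the input's
# size, which is consistent with the large drop in peak footprint once
# they were eliminated.
```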
wence- committed Sep 20, 2023
1 parent 63d197f commit 2dac975
Showing 1 changed file with 8 additions and 9 deletions.
python/cudf/cudf/core/column/column.py (8 additions, 9 deletions):

@@ -1390,20 +1390,19 @@ def _return_sentinel_column():
         except ValueError:
             return _return_sentinel_column()
 
-        codes = arange(len(cats), dtype=dtype)
         left_gather_map, right_gather_map = cpp_join(
             [self], [cats], how="left"
         )
-        codes = codes.take(
-            right_gather_map, nullify=True, check_bounds=False
-        ).fillna(na_sentinel.value)
-
+        codes = libcudf.copying.gather(
+            [arange(len(cats), dtype=dtype)], right_gather_map, nullify=True
+        )
+        del right_gather_map
         # reorder `codes` so that its values correspond to the
         # values of `self`:
-        order = arange(len(self))
-        order = order.take(left_gather_map, check_bounds=False).argsort()
-        codes = codes.take(order)
-        return codes
+        (codes,) = libcudf.sort.sort_by_key(
+            codes, [left_gather_map], [True], ["last"], stable=True
+        )
+        return codes.fillna(na_sentinel.value)
 
 
 def column_empty_like(
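
As a sanity check, both code paths can be modelled end to end in pure Python (hypothetical list-based helpers standing in for the join/gather/sort primitives, not the cudf code); the old and new orderings produce identical codes:

```python
def left_join(left, right):
    """Model a left equality join: returns (left_map, right_map).

    Unmatched left rows pair with None; the output order is deliberately
    scrambled, since a hash join does not preserve row order.
    """
    pos = {v: i for i, v in enumerate(right)}  # assumes right values unique
    pairs = [(i, pos.get(v)) for i, v in enumerate(left)]
    pairs.reverse()  # mimic a non-order-preserving join
    return [l for l, _ in pairs], [r for _, r in pairs]

def old_path(values, cats, sentinel=-1):
    lmap, rmap = left_join(values, cats)
    # gather of arange(len(cats)) through the right map, nulls -> sentinel
    codes = [r if r is not None else sentinel for r in rmap]
    order = [list(range(len(values)))[i] for i in lmap]       # gather arange
    order = sorted(range(len(order)), key=order.__getitem__)  # argsort
    return [codes[i] for i in order]                          # final gather

def new_path(values, cats, sentinel=-1):
    lmap, rmap = left_join(values, cats)
    codes = [r if r is not None else sentinel for r in rmap]
    # sort_by_key replaces gather-through-argsort; gather(arange, lmap) == lmap
    return [c for _, c in sorted(zip(lmap, codes), key=lambda kv: kv[0])]

values = ["b", "a", "x", "b", "c"]
cats = ["a", "b", "c"]
assert old_path(values, cats) == new_path(values, cats) == [1, 0, -1, 1, 2]
```

Note that the new path performs no gathers after the join: the codes column is reordered directly by the left gather map, which is exactly what the sort_by_key call in the diff does.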
