When I was investigating #43116, I found that the row table consumed more memory than expected. For example, the test case in question in #41336 encodes a little more than 2GB of data, but the actual buffer takes 8GB, where 4GB would be expected (given that the growth strategy doubles the current size).
Component(s)
C++
…s when encoding row table (#43125)
### Rationale for this change
As described in #43129, the current row table occupies more memory than expected: the buffer is twice as large as necessary. The reason is explained below.
When encoding var-length columns into the row table:
https://github.com/apache/arrow/blob/e59832fb05dc40a85fa63297c77c8f134c9ac8e0/cpp/src/arrow/compute/row/encode_internal.cc#L155-L162
We first call `AppendEmpty` to reserve space for `x` rows but `0` bytes. This is to reserve enough size for the underlying fixed-length buffers: null masks and offsets (for var-length columns).
Then we call `GetRowOffsetsSelected` to populate the offsets.
Finally we call `AppendEmpty` again with `0` rows but `y` bytes, where `y` is the last offset element, which is essentially the total size of the var-length columns.
This all sounds reasonable so far.
However, `AppendEmpty` calls `ResizeOptionalVaryingLengthBuffer`, in which:
https://github.com/apache/arrow/blob/e59832fb05dc40a85fa63297c77c8f134c9ac8e0/cpp/src/arrow/compute/row/row_internal.cc#L294-L303
We calculate `bytes_capacity_new` by repeatedly doubling it until it is big enough for `num_bytes + num_extra_bytes`.
Note that at this point `num_bytes == offsets()[num_rows_]` is already `y`, while `num_extra_bytes` is also `y`, so the requested size is `2y` and the buffer ends up twice as large as necessary.
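The arithmetic can be checked with a small standalone sketch. This is not the Arrow code: `GrownCapacity` and `kGiB` are hypothetical names, and the loop only assumes the doubling strategy described above, starting from an empty buffer. It compares the case where `y` is counted twice with the case where it is counted once, for `y` slightly above 2GiB.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper (not part of Arrow): keep doubling the capacity until
// it can hold num_bytes + num_extra_bytes, as described in the text above.
constexpr int64_t GrownCapacity(int64_t bytes_capacity, int64_t num_bytes,
                                int64_t num_extra_bytes) {
  int64_t bytes_capacity_new = std::max<int64_t>(1, 2 * bytes_capacity);
  while (bytes_capacity_new < num_bytes + num_extra_bytes) {
    bytes_capacity_new *= 2;
  }
  return bytes_capacity_new;
}

constexpr int64_t kGiB = int64_t{1} << 30;

// y: total size of the var-length data, a little more than 2GiB.
constexpr int64_t y = 2 * kGiB + 1;

// y counted twice (num_bytes == y and num_extra_bytes == y): 8GiB.
static_assert(GrownCapacity(0, y, y) == 8 * kGiB, "doubled request");

// y counted once: 4GiB, the size one would expect from doubling.
static_assert(GrownCapacity(0, 0, y) == 4 * kGiB, "single request");
```

The two results match the 8GB observed and the 4GB expected in #43129, under the assumption that the buffer grows from an empty state by powers of two.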
### What changes are included in this PR?
Fix the row table buffers being allocated at twice the necessary size, so that half of each buffer is no longer wasted. Also add tests to make sure the buffer size is as expected.
### Are these changes tested?
Yes, unit tests are included.
### Are there any user-facing changes?
None.
* GitHub Issue: #43129
Authored-by: Ruoxi Sun <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>