Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Compute] Row table is consuming more memory than expected #43129

Closed
zanmato1984 opened this issue Jul 3, 2024 · 1 comment
Closed

[C++][Compute] Row table is consuming more memory than expected #43129

zanmato1984 opened this issue Jul 3, 2024 · 1 comment

Comments

@zanmato1984
Copy link
Contributor

Describe the enhancement requested

When I was investigating #43116 , found that row table consumed more memory than expected. For example, the questioning test case in #41336 encodes data a little more than 2GB, but the actual buffer takes 8GB, which is expected to be 4GB (with the growing strategy being double the current size).

Component(s)

C++

@zanmato1984 zanmato1984 self-assigned this Jul 3, 2024
pitrou pushed a commit that referenced this issue Jul 10, 2024
…s when encoding row table (#43125)

### Rationale for this change

As described in #43129 , current row table occupies more memory than expected. The memory consumption is double of necessary. The reason listed below.

When encoding var length columns into into the row table:
https://github.com/apache/arrow/blob/e59832fb05dc40a85fa63297c77c8f134c9ac8e0/cpp/src/arrow/compute/row/encode_internal.cc#L155-L162

We first call `AppendEmpty` to reserve space for `x` rows but `0` bytes. This is to reserve enough size for the underlying fixed-length buffers: null masks and offsets (for var-length columns).

Then we call `GetRowOffsetsSelected` to populate the offsets.

At last we call `AppendEmpty` again with `0` rows but `y` bytes, where `y` is the last offset element which is essentially the whole size of the var-length columns.

Sounds all reasonable so far.

However, `AppendEmpty` calls `ResizeOptionalVaryingLengthBuffer`, in which:
https://github.com/apache/arrow/blob/e59832fb05dc40a85fa63297c77c8f134c9ac8e0/cpp/src/arrow/compute/row/row_internal.cc#L294-L303

We calculate `bytes_capacity_new` by keeping doubling it until it's big enough for `num_bytes + num_extra_bytes`.

Note by the time of this point, `num_bytes == offsets()[num_rows_]` is already `y`, meanwhile `num_extra_bytes` is also `y`, hence the unexpected doubled size than necessary.

### What changes are included in this PR?

Fix the wasted half size for buffers in row table. Also add tests to make sure the buffer size is as expected.

### Are these changes tested?

UT included.

### Are there any user-facing changes?

None.

* GitHub Issue: #43129

Authored-by: Ruoxi Sun <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@pitrou
Copy link
Member

pitrou commented Jul 10, 2024

Issue resolved by pull request 43125
#43125

@pitrou pitrou added this to the 18.0.0 milestone Jul 10, 2024
@pitrou pitrou closed this as completed Jul 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants