
GH-43129: [C++][Compute] Fix the unnecessary allocation of extra bytes when encoding row table #43125

Merged: 7 commits into apache:main from fix-row-table-memory-consumption on Jul 10, 2024

Conversation

@zanmato1984 (Contributor) commented Jul 3, 2024

Rationale for this change

As described in #43129, the current row table occupies more memory than expected: roughly double what is necessary. The reason is explained below.

When encoding var-length columns into the row table:

RETURN_NOT_OK(
    rows->AppendEmpty(static_cast<uint32_t>(num_selected), static_cast<uint32_t>(0)));
EncoderOffsets::GetRowOffsetsSelected(rows, batch_varbinary_cols_, num_selected,
                                      selection);
RETURN_NOT_OK(rows->AppendEmpty(static_cast<uint32_t>(0),
                                static_cast<uint32_t>(rows->offsets()[num_selected])));

We first call AppendEmpty to reserve space for x rows with 0 extra bytes. This sizes the underlying fixed-length buffers: the null masks and the offsets (for var-length columns).

Then we call GetRowOffsetsSelected to populate the offsets.

Finally, we call AppendEmpty again with 0 rows but y extra bytes, where y is the last offset element, which is essentially the total size of the var-length data.

So far, this all sounds reasonable.

However, AppendEmpty calls ResizeOptionalVaryingLengthBuffer, in which:

Status RowTableImpl::ResizeOptionalVaryingLengthBuffer(int64_t num_extra_bytes) {
  int64_t num_bytes = offsets()[num_rows_];
  if (bytes_capacity_ >= num_bytes + num_extra_bytes || metadata_.is_fixed_length) {
    return Status::OK();
  }

  int64_t bytes_capacity_new = std::max(static_cast<int64_t>(1), 2 * bytes_capacity_);
  while (bytes_capacity_new < num_bytes + num_extra_bytes) {
    bytes_capacity_new *= 2;
  }
  // ...

We calculate bytes_capacity_new by repeatedly doubling it until it is big enough for num_bytes + num_extra_bytes.

Note that by this point, num_bytes == offsets()[num_rows_] is already y, while num_extra_bytes is also y, so the buffer is sized for 2 * y, roughly twice what is necessary.
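
To make the over-allocation concrete, here is a minimal standalone sketch of the same doubling loop with num_bytes == num_extra_bytes == y (the harness and the chosen value of y are illustrative only, not Arrow code):

#include <algorithm>
#include <cstdint>
#include <iostream>

int main() {
  // Suppose the var-length data of the encoded batch totals y bytes.
  const int64_t y = 1000;
  const int64_t num_bytes = y;        // offsets()[num_rows_] after GetRowOffsetsSelected
  const int64_t num_extra_bytes = y;  // the second AppendEmpty asks for y bytes again
  int64_t bytes_capacity = 0;         // empty row table

  // Same growth loop as in ResizeOptionalVaryingLengthBuffer above.
  int64_t bytes_capacity_new = std::max(static_cast<int64_t>(1), 2 * bytes_capacity);
  while (bytes_capacity_new < num_bytes + num_extra_bytes) {
    bytes_capacity_new *= 2;
  }

  // Prints 2048: capacity grows to cover 2 * y, although y (rounded up to 1024)
  // would have been enough.
  std::cout << bytes_capacity_new << std::endl;
  return 0;
}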

What changes are included in this PR?

Fix the wasted half of the buffer size in the row table. Also add tests to make sure the buffer sizes are as expected.
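
The gist of the fix (shown verbatim in the review diff further down) is that the second AppendEmpty call no longer re-requests the var-length bytes, because the offsets populated by GetRowOffsetsSelected already determine the required size. A sketch of the resulting sequence:

// Reserve the fixed-length buffers (null masks and offsets) for num_selected rows.
RETURN_NOT_OK(
    rows->AppendEmpty(static_cast<uint32_t>(num_selected), static_cast<uint32_t>(0)));
// Populate the row offsets; offsets()[num_selected] now equals the total
// var-length size y.
EncoderOffsets::GetRowOffsetsSelected(rows, batch_varbinary_cols_, num_selected,
                                      selection);
// The var-length buffer is sized from the offsets themselves, so no extra bytes
// are requested here anymore.
RETURN_NOT_OK(rows->AppendEmpty(static_cast<uint32_t>(0), static_cast<uint32_t>(0)));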

Are these changes tested?

A unit test is included.

Are there any user-facing changes?

None.

@zanmato1984 marked this pull request as draft on July 3, 2024 06:25

github-actions bot commented Jul 3, 2024

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@zanmato1984 (Contributor Author) commented:

@github-actions crossbow submit -g cpp

@zanmato1984 force-pushed the fix-row-table-memory-consumption branch from 7b6682c to 342a91d on July 3, 2024 06:27

github-actions bot commented Jul 3, 2024

Revision: 7b6682c

Submitted crossbow builds: ursacomputing/crossbow @ actions-ea31ebc24b

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

@zanmato1984 (Contributor Author) commented:

@github-actions crossbow submit -g cpp

github-actions bot commented Jul 3, 2024

Revision: 8619d9e

Submitted crossbow builds: ursacomputing/crossbow @ actions-adf44107b5


@zanmato1984 (Contributor Author) commented:

@github-actions crossbow submit -g cpp

github-actions bot commented Jul 3, 2024

Revision: daf6d15

Submitted crossbow builds: ursacomputing/crossbow @ actions-29d3b99634


@zanmato1984 force-pushed the fix-row-table-memory-consumption branch from daf6d15 to b8ff724 on July 3, 2024 07:32
@zanmato1984 changed the title from "[C++][Compute] Fix the unnecessary extra bytes when encoding row table" to "GH-43129: [C++][Compute] Fix the unnecessary allocation of extra bytes when encoding row table" on Jul 3, 2024
@zanmato1984 marked this pull request as ready for review on July 3, 2024 15:24

@zanmato1984 (Contributor Author) commented:

Hi @pitrou @westonpace, mind taking a look? This fix reduces the memory footprint of the row table by almost half, so it could be a major improvement. Thanks.


TEST(RowTableMemoryConsumption, Encode) {
  constexpr int64_t num_rows_max = 8192;
  constexpr int64_t padding_for_vectors = 64;

@zanmato1984 (Contributor Author) commented:

Refers to this:

static constexpr int64_t kPaddingForVectors = 64;

which is always appended to the needed capacity when resizing buffers.

github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label on Jul 3, 2024

@pitrou (Member) left a comment:

Sorry for the delay @zanmato1984!

  1. Can you rebase?
  2. Please find comments below.

@@ -158,8 +158,7 @@ Status RowTableEncoder::EncodeSelected(RowTableImpl* rows, uint32_t num_selected
   EncoderOffsets::GetRowOffsetsSelected(rows, batch_varbinary_cols_, num_selected,
                                         selection);

-  RETURN_NOT_OK(rows->AppendEmpty(static_cast<uint32_t>(0),
-                                  static_cast<uint32_t>(rows->offsets()[num_selected])));
+  RETURN_NOT_OK(rows->AppendEmpty(static_cast<uint32_t>(0), static_cast<uint32_t>(0)));

@pitrou (Member) commented:
Can we add parameter names to make sure we understand what's being passed?

Suggested change:
- RETURN_NOT_OK(rows->AppendEmpty(static_cast<uint32_t>(0), static_cast<uint32_t>(0)));
+ RETURN_NOT_OK(rows->AppendEmpty(/*xxx=*/ static_cast<uint32_t>(0), /*yyy=*/ static_cast<uint32_t>(0)));

@zanmato1984 (Contributor Author) replied:

Yes. Will do.

@zanmato1984 (Contributor Author) replied:

Done.
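
For reference, the annotated call presumably ended up looking something like the following; the parameter names here are an assumption inferred from the two size arguments, not copied from the merged patch:

RETURN_NOT_OK(rows->AppendEmpty(/*num_rows_to_append=*/static_cast<uint32_t>(0),
                                /*num_extra_bytes_to_append=*/static_cast<uint32_t>(0)));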

int64_t actual_null_mask_size =
    num_rows * row_table.metadata().null_masks_bytes_per_row;
ASSERT_GT(actual_null_mask_size * 2,
          row_table.buffer_size(0) - padding_for_vectors);

@pitrou (Member) commented:

Should we also check that the buffer size is large enough? Same for other inequalities below.

@zanmato1984 (Contributor Author) replied:

Good catch, will add them.

@zanmato1984 (Contributor Author) replied:

Done.
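
For reference, a rough sketch of what the added lower-bound check could look like next to the existing upper-bound assertion (the exact expressions are an assumption, not necessarily the merged test):

// Upper bound: the allocated buffer, minus the constant vector padding, should be
// less than twice the logical null-mask size.
ASSERT_GT(actual_null_mask_size * 2,
          row_table.buffer_size(0) - padding_for_vectors);
// Lower bound: the buffer must still be at least as large as the null masks it holds.
ASSERT_LE(actual_null_mask_size, row_table.buffer_size(0));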

} // namespace

TEST(RowTableMemoryConsumption, Encode) {
  constexpr int64_t num_rows_max = 8192;

@pitrou (Member) commented:

Can you add a comment and GH issue reference to explain what this test is checking for?

@zanmato1984 (Contributor Author) replied:

Sure, will do.

@zanmato1984 (Contributor Author) replied:

Done.

@zanmato1984 force-pushed the fix-row-table-memory-consumption branch from f202c1a to 185a73c on July 9, 2024 16:17

@zanmato1984 (Contributor Author) commented:

Rebase done.

@pitrou merged commit 3b7ad9d into apache:main on Jul 10, 2024
39 checks passed

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 3b7ad9d.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 47 possible false positives for unstable benchmarks that are known to sometimes produce them.
