-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-40062: [C++][Python] Conversion of Table to Arrow Tensor #41870
base: main
Are you sure you want to change the base?
Conversation
Benchmarks for
|
…to tensor.cc (#41932) ### Rationale for this change This is a precursor PR to #41870 with the purpose to make the review of #41870 easier (the diff of the code will be visible as it currently isn't because the code was moved to table.cc. I should also live in tensor.cc). ### What changes are included in this PR? The code from `RecordBatch::ToTensor` in record_batch.cc is moved to `RecordBatchToTensor` in tensor.cc. ### Are these changes tested? Existing tests should pass. ### Are there any user-facing changes? No. **This PR does not close the linked issue yet, it is just a precursor!** * GitHub Issue: #40062 Authored-by: AlenkaF <[email protected]> Signed-off-by: AlenkaF <[email protected]>
13c49a7
to
15574f8
Compare
…ch.cc to tensor.cc (apache#41932) ### Rationale for this change This is a precursor PR to apache#41870 with the purpose to make the review of apache#41870 easier (the diff of the code will be visible as it currently isn't because the code was moved to table.cc. I should also live in tensor.cc). ### What changes are included in this PR? The code from `RecordBatch::ToTensor` in record_batch.cc is moved to `RecordBatchToTensor` in tensor.cc. ### Are these changes tested? Existing tests should pass. ### Are there any user-facing changes? No. **This PR does not close the linked issue yet, it is just a precursor!** * GitHub Issue: apache#40062 Authored-by: AlenkaF <[email protected]> Signed-off-by: AlenkaF <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good! Some minor comments, and wondering if we can reduce the duplication in testing a bit
@@ -1219,6 +1219,295 @@ def test_recordbatch_to_tensor_unsupported(): | |||
batch.to_tensor() | |||
|
|||
|
|||
@pytest.mark.parametrize('typ', [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are adding a lot of code to this test file, while in practice it's not really testing anything new. I am wondering if we could either parametrize some existing tests for RecordBatch to run with both RecordBatch and Table (we have some other tests in this file that are parametrized that way, see eg test_table_basics
), or otherwise only test some unique aspects of Table.to_tensor (the fact that the method exists and works, a case with multiple chunks), but for the other aspects (type promotion, null handling, etc) rely on the record batch tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added parametrization with RecordBatch
and Table
for mixed type test, others I left to rely on RecordBatch
only. Then I kept one test for table with multiple chunks (test_table_to_tensor_uniform_type): feab6e1
…rk for Arrow Tables
15574f8
to
4decf7f
Compare
I have researched the benchmark regression a bit and found that:
benchmark diff output
Plan to also try profiling in python ( |
…st for both batch and table
Do you see those regressions of up to 40% for both row major and column major conversions? And both for uniform vs mixed type with casting? |
Benchmarks for RecordBatch only test row-major conversion. The newly added Table benchmarks test both. I think that was due to the fact we were adding features for RecordBatch::ToTensor step by step and we needed one simple benchmark that we could check while adding the features. Row-major conversion was the last to be added. As for the types, we only test uniform types in C++ benchmarks at the moment. ps: haven't been able to find extract any information with neither py-spy nor cProfile. |
Rationale for this change
There is currently no method to convert Arrow Table to Arrow Tensor (conversion from columnar format to a contiguous block of memory). This work is a continuation of
RecordBatch::ToTensor
work, see #40058.What changes are included in this PR?
This PR:
Table::ToTensor
conversionRecordBatch::ToTensor
and uses the Table implementation (RecordBatch::ToTensor
benchmarks checked)Are these changes tested?
Yes, in C++ and Python.
Are there any user-facing changes?
No, it is a new feature.