feat: Add support for run-end encoded array #507

Merged
merged 11 commits into from
Jun 7, 2024

Contributor

@cocoa-xu cocoa-xu commented Jun 4, 2024

Hi, this PR adds support for run-end encoded arrays based on the Arrow spec: https://arrow.apache.org/docs/format/Columnar.html#run-end-encoded-layout.
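For readers unfamiliar with the layout: a run-end encoded array stores a `run_ends` child holding the cumulative logical end of each run and a `values` child holding one value per run. The sketch below is not code from this PR; the helper names are hypothetical, the children are simplified to plain `int32_t` arrays, and the array offset is ignored. It shows how a logical index could be mapped to a run with a binary search.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of nanoarrow): return the physical index of
 * the run that contains logical index i. Run k covers the logical range
 * [run_ends[k - 1], run_ends[k]), so we look for the first run end > i. */
static size_t ree_find_physical_index(const int32_t* run_ends, size_t n_runs,
                                      int64_t i) {
  assert(n_runs > 0 && i >= 0 && i < run_ends[n_runs - 1]);
  size_t lo = 0;
  size_t hi = n_runs - 1;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (run_ends[mid] > i) {
      hi = mid; /* run mid may contain i; keep it in range */
    } else {
      lo = mid + 1; /* run mid ends at or before i */
    }
  }
  return lo;
}

/* Hypothetical helper: read the logical element i of the encoded array. */
static int32_t ree_value_at(const int32_t* run_ends, const int32_t* values,
                            size_t n_runs, int64_t i) {
  return values[ree_find_physical_index(run_ends, n_runs, i)];
}
```

With `run_ends = {3, 5, 9}` and `values = {10, 20, 30}`, logical indices 0 to 2 map to 10, indices 3 and 4 to 20, and indices 5 to 8 to 30.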

Member

@paleolimbot paleolimbot left a comment

This is very clean and I look forward to giving it a closer look tomorrow!

@felipecrv if you have the bandwidth, would you mind taking a look through this for the details of the run-end encoding?

@felipecrv felipecrv left a comment

Initial review with some quick thoughts.

Contributor Author

cocoa-xu commented Jun 5, 2024

Hi @felipecrv, many thanks for the code review and these examples! I've adapted some code from the code snippets in the review, and it should handle the validation properly now. :)

@cocoa-xu cocoa-xu force-pushed the cx-run-end-encoded branch from 037d31a to 89a0f1d Compare June 5, 2024 10:41
Member

@paleolimbot paleolimbot left a comment

Thank you again (and thanks to Felipe for doing a first pass on the details)! I have a few comments here for discussion 🙂

array->length = 0;
struct ArrowBuffer* data_buffer;
data_buffer = ArrowArrayBuffer(array->children[0], 1);
for (int64_t i = 0; i < array->children[0]->length; i++) {
Member

I am not sure that you need a loop here? (i.e., is it possible to just extract the last run end?)

I wonder if this would be a better fit for ArrowArrayFinishBuilding()? Or maybe we need FinishAppending()? I am not sure I would have thought of FinishElement to do these checks/updates since they really only need to happen once per array.

Contributor Author

Maybe ArrowArrayFinishRunEndEncoded would be better, since its name is more suggestive and causes less confusion?

Member

Perhaps, but it is also very similar to the piece of code we have that updates the null_count based on the validity buffer, which I think happens in ArrowArrayFinishBuilding() (although that "feature" has caused me some personal confusion because I sometimes forget that it happens and wonder why the null_count isn't what I set it to be).

The most backward compatible thing to do would be to not do any updating of lengths (i.e., force the caller to set the length of the parent run-end-encoded array). This would not necessarily be difficult to do because they would have to be keeping track of that length to append to the run-ends child (so they would have to keep a counter somewhere anyway). Perhaps for this PR the helper could be omitted and we can learn from the experience of implementing this elsewhere what the best intervention would be?
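The point about the caller already needing a counter can be illustrated with a small sketch. This is not nanoarrow's API: `struct ReeBuilder`, `ree_builder_append_run`, and `MAX_RUNS` are hypothetical names, and the children are simplified to fixed-size `int32_t` arrays. It assumes `int32` run ends, so the running total must stay within `INT32_MAX`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RUNS 16 /* arbitrary capacity chosen for this sketch */

/* Hypothetical caller-side state while building a run-end encoded array. */
struct ReeBuilder {
  int32_t run_ends[MAX_RUNS]; /* cumulative logical end of each run */
  int32_t values[MAX_RUNS];   /* one value per run */
  int64_t n_runs;
  int64_t logical_length; /* the counter the caller has to keep anyway */
};

/* Append one run of run_length copies of value. Returns 0 on success and
 * -1 if the run is empty, the sketch is full, or int32 would overflow. */
static int ree_builder_append_run(struct ReeBuilder* b, int32_t value,
                                  int32_t run_length) {
  assert(b != NULL);
  if (run_length <= 0 || b->n_runs >= MAX_RUNS) return -1;
  if (b->logical_length > INT32_MAX - run_length) return -1;

  b->logical_length += run_length;
  b->run_ends[b->n_runs] = (int32_t)b->logical_length;
  b->values[b->n_runs] = value;
  b->n_runs++;
  return 0;
}
```

When the caller is done appending, `logical_length` is the value it would assign to the parent array's length, which is why no helper is strictly required.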

Contributor Author

> Perhaps for this PR the helper could be omitted and we can learn from the experience of implementing this elsewhere what the best intervention would be?

No problem, I've removed it :)

@codecov-commenter

Codecov Report

Attention: Patch coverage is 69.56522% with 28 lines in your changes missing coverage. Please review.

Project coverage is 88.55%. Comparing base (8894ebf) to head (0ce34e5).
Report is 10 commits behind head on main.

Files | Patch % | Lines
src/nanoarrow/array.c | 55.55% | 28 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #507      +/-   ##
==========================================
- Coverage   89.21%   88.55%   -0.66%     
==========================================
  Files          90       90              
  Lines       16294    16541     +247     
==========================================
+ Hits        14536    14648     +112     
- Misses       1758     1893     +135     


Contributor Author

cocoa-xu commented Jun 5, 2024

I just added more tests to cover more lines, and the memory issue in the schema test should be fixed.

Member

@paleolimbot paleolimbot left a comment

Just a few nits, but this looks good from my end!

@felipecrv are there any more REE-specific details relevant to this PR (or follow-ups that we need to open issues for and tackle before the next release?)

@felipecrv

> Just a few nits, but this looks good from my end!
>
> @felipecrv are there any more REE-specific details relevant to this PR (or follow-ups that we need to open issues for and tackle before the next release?)

Not that I'm aware of right now. I recommend that you and @cocoa-xu take a look at the C++ code around REEs to see all the traps. I don't remember everything that I did, but I left many comments in the C++ implementation.

https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/ree_util.h
and
https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/ree_util.cc
are good sources to look at.

@paleolimbot paleolimbot merged commit c4d20a0 into apache:main Jun 7, 2024
32 checks passed
@cocoa-xu cocoa-xu deleted the cx-run-end-encoded branch June 7, 2024 19:59
paleolimbot added a commit that referenced this pull request Jun 10, 2024
Tests for #507 and
#501 and/or
#503 either used C++17 or
features from Arrow C++ > 12. Our test suite still supports these
(although perhaps parts of this support should be dropped soon).

On Windows, formatting with `%lu` was doing some unexpected formatting.
We could do a better job formatting 64-bit integers in error messages
(e.g., using `PRId64` and the requisite defines to ensure it works on
mingw); however, we probably won't ever be able to support properly
formatting an unsigned 64-bit integer on every platform we support. I
changed the error message (and its test) slightly to reflect that.
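As an illustration of the `PRId64` approach mentioned in the commit message above, here is a minimal sketch of formatting a signed 64-bit integer portably with the `<inttypes.h>` macros instead of `%lu` or `%ld`. The function name and message text are hypothetical and are not the error message used in nanoarrow, and the extra defines the message says mingw needs are not shown.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write "run end: <value>" into out. PRId64 expands to
 * the correct conversion specifier for int64_t on the current platform, so
 * the same format string works whether long is 32 or 64 bits wide.
 * Returns the number of characters snprintf would have written. */
static int format_run_end(char* out, size_t n, int64_t run_end) {
  assert(out != NULL && n > 0);
  return snprintf(out, n, "run end: %" PRId64, run_end);
}
```

`PRIu64` is the unsigned counterpart; the commit message notes that relying on it was not considered workable on every supported platform.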
Contributor

WillAyd commented Jun 11, 2024

I think this PR introduced a regression that valgrind is complaining about. You can see it started back in the weekly Meson build:

https://github.com/apache/arrow-nanoarrow/actions/runs/9432588616

A more detailed error description can be found here:

https://github.com/apache/arrow-nanoarrow/actions/runs/9457996609/job/26052769002?pr=483#step:7:319

Apparently Valgrind does not like something that ultimately hits this codeblock from the RunEndEncoded tests:

if (array_view->storage_type == NANOARROW_TYPE_RUN_END_ENCODED) {

}
last_run_end = run_end;
}
last_run_end = ArrowArrayViewGetIntUnsafe(run_ends_view, run_ends_view->length - 1);
Contributor

last_run_end = ArrowArrayViewGetIntUnsafe(run_ends_view, run_ends_view->length - 1);

Is there any chance of run_ends_view->length being 0 here? If so I think this might invoke UB
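To make the concern concrete: with a length of 0, the index `length - 1` is -1, and reading that element is out of bounds. A minimal sketch of one possible guard follows. It is not the fix that was later submitted; `last_run_end_or_zero` is a hypothetical name, and a plain `int32_t` array stands in for the run-ends view.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for reading the last run end of a run-ends child.
 * Returns 0 for an empty child instead of reading run_ends[-1]. */
static int64_t last_run_end_or_zero(const int32_t* run_ends, int64_t n_runs) {
  assert(n_runs >= 0);
  if (n_runs == 0) {
    return 0; /* no runs, so the encoded array has logical length 0 */
  }
  return run_ends[n_runs - 1];
}
```

Whether an empty run-ends child should be treated as a valid zero-length array or rejected during validation is a design decision this sketch does not settle.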

Contributor Author

Good catch! I'll send a PR
