-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add support for run-end encoded array #507
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very clean and I look forward to giving it a closer look tomorrow!
@felipecrv if you have the bandwidth, would you mind taking a look through this for the details of the run-end encoding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initial review with some quick thoughts.
Hi @felipecrv, many thanks for the code review and these examples! I've adapted some code from the code snippets in the review, and it should handle the validation properly now. :) |
037d31a
to
89a0f1d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you again (and thanks to Felipe for doing a first pass on the details)! I have a few comments here for discussion 🙂
src/nanoarrow/array_inline.h
Outdated
array->length = 0; | ||
struct ArrowBuffer* data_buffer; | ||
data_buffer = ArrowArrayBuffer(array->children[0], 1); | ||
for (int64_t i = 0; i < array->children[0]->length; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure that you need a loop here? (i.e., is it possible to just extract the last run end?)
I wonder if this would be a better fit for ArrowArrayFinishBuilding()
? Or maybe we need FinishAppending()
? I am not sure I would have thought of FinishElement
to do these checks/updates since they really only need to happen once per array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe ArrowArrayFinishRunEndEncoded
might be better as it's more suggestive from its name and less confusion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps, but it is also very similar to the piece of code we have that updates the null_count
based on the validity buffer, which I think happens in ArrowArrayFinishBuilding()
(although that "feature" has caused me some personal confusion because I sometimes forget that it happens and wonder why the null_count
isn't what I set it to be).
The most backward compatible thing to do would be to not do any updating of lengths (i.e., force the caller to set the length of the parent run-end-encoded array). This would not necessarily be difficult to do because they would have to be keeping track of that length to append to the run-ends child (so they would have to keep a counter somewhere anyway). Perhaps for this PR the helper could be omitted and we can learn from the experience of implementing this elsewhere what the best intervention would be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps for this PR the helper could be omitted and we can learn from the experience of implementing this elsewhere what the best intervention would be?
No problem, I've removed it :)
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #507 +/- ##
==========================================
- Coverage 89.21% 88.55% -0.66%
==========================================
Files 90 90
Lines 16294 16541 +247
==========================================
+ Hits 14536 14648 +112
- Misses 1758 1893 +135 ☔ View full report in Codecov by Sentry. |
Just added more tests to cover more lines and the memory issue in the schema test should be fixed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few nits, but this looks good from my end!
@felipecrv are there any more REE-specific details relevant to this PR (or follow-ups that we need to open issues for and tackle before the next release?)
Not that I'm aware right now. I recommend that you and @cocoa-xu take a look at the C++ code around REEs to see all the traps. I don't remember everything that I did, but I left many comments in the C++ implementation. https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/ree_util.h |
Tests for #507 and #501 and/or #503 either used C++17 or features from Arrow C++ > 12. Our test suite still supports these (although perhaps parts of this support should be dropped soon). On Windows, formatting with `%lu` was doing some unexpected formatting. We could do a better job formatting 64-bit integers in error messages (e.g., using `PRId64` and the requisite defines to ensure it works on mingw); however, we probably won't ever be able to support properly formatting an unsigned 64-bit integer on every platform we support. I changed the error message (and its test) slightly to reflect that.
I think this PR introduced a regression that valgrind is complaining about. You can see it started back in the weekly Meson build: https://github.com/apache/arrow-nanoarrow/actions/runs/9432588616 A more detailed error description can be found here: Apparently Valgrind does not like something that ultimately hits this codeblock from the RunEndEncoded tests: arrow-nanoarrow/src/nanoarrow/array.c Line 1247 in f443e46
|
} | ||
last_run_end = run_end; | ||
} | ||
last_run_end = ArrowArrayViewGetIntUnsafe(run_ends_view, run_ends_view->length - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
last_run_end = ArrowArrayViewGetIntUnsafe(run_ends_view, run_ends_view->length - 1);
Is there any chance of run_ends_view->length being 0 here? If so I think this might invoke UB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I'll send a PR
Hi this PR tries to add support for run-end encoded array based on the arrow spec here, https://arrow.apache.org/docs/format/Columnar.html#run-end-encoded-layout.