ARROW-15244: [Format] Clarify that offsets are monotonic for binary like arrays

# Rationale

The question of "what are the values of the offsets for non-valid entries in arrays" came up in arrow-rs: apache/arrow-rs#1071 and the existing [docs](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) seem to be somewhat vague on this issue.

I looked at three implementations of Arrow, and they all seem to assume / validate that the offsets are monotonic:

* C++ implementation (I think) also ensures the offsets are monotonic without first checking the validity array https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/validate.cc#L568-L592
* arrow-rs after apache/arrow-rs#921 (based on the C++) will refuse to create arrays where the array offsets are non-monotonic
* arrow2 also ensures that offsets are always monotonic. https://github.com/jorgecarleitao/arrow2/blob/37a9c758826a92d98dc91e992b2a49ce9724095d/src/array/specification.rs#L102-L119

# Changes

Thus I propose updating the format docs to make the monotonic offsets explicit.

# Background

I think @jorgecarleitao's description on apache/arrow-rs#1071 (comment) explains why having monotonic offsets is a good idea:

> I think that in general the property we seek is: discarding the validity cannot result in UB when accessing the values. This justifies the values buffer of a primitive array is always initialized, and the offsets being valid and in-bounds even in null cases.
>
> The rational for this is that sometimes it is faster to skip validity accesses and only iterate over the values (and clone the validity). I do not recall the benchmark result, but this may explain why string comparison ignores validity and & the bitmaps instead.

Closes #12019 from alamb/alamb/clarify_offsets

Lead-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
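
To illustrate the property described in the commit message, here is a minimal sketch in plain Python of an offsets check that never consults the validity bitmap. It is not taken from the C++, arrow-rs, or arrow2 code linked above; the names `validate_offsets` and `slot_bytes` are hypothetical, and the sketch assumes the offsets buffer holds at least one entry.

```python
from typing import Sequence


def validate_offsets(offsets: Sequence[int], data_length: int) -> None:
    """Check an offsets buffer without looking at validity (hypothetical helper).

    Assumption for this sketch: the buffer has at least one entry.
    """
    if len(offsets) == 0:
        raise ValueError("offsets must contain at least one entry")
    if offsets[0] < 0:
        raise ValueError("first offset must be non-negative")
    # Monotonic here means non-decreasing: equal neighbours give an empty slot.
    for i in range(1, len(offsets)):
        if offsets[i] < offsets[i - 1]:
            raise ValueError(f"offsets decrease at index {i}")
    if offsets[-1] > data_length:
        raise ValueError("last offset exceeds the data buffer length")


def slot_bytes(data: bytes, offsets: Sequence[int], i: int) -> bytes:
    """Return the bytes of slot i, whether or not the slot is null."""
    return data[offsets[i]:offsets[i + 1]]


# Logical values: "ab", null, "cde". The null slot repeats the previous offset.
data = b"abcde"
offsets = [0, 2, 2, 5]
validate_offsets(offsets, len(data))
values = [slot_bytes(data, offsets, i) for i in range(len(offsets) - 1)]
```

Because the null slot is encoded as a zero-length range, every slot can be read without checking validity first, which is the situation the quoted comment describes. An offsets buffer such as `[0, 2, 0, 5]` would be rejected by this check even if the middle slot is marked null.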